6.8

/10

Spotlight5 位审稿人

最低6最高9标准差1.2

3.4

置信度

正确性3.2

贡献度3.2

表达3.2

NeurIPS 2024

Principled Bayesian Optimization in Collaboration with Human Experts

Wenjie Xu,Masaki Adachi,Colin Jones,Michael A Osborne

OpenReview PDF

提交: 2024-04-23更新: 2024-11-06

TL;DR

Robust and efficient Bayesian optimisation algorithm that elicits knowledge from human experts to accelerate optimisation.

摘要

关键词

Bayesian optimisationhuman-AI collaborationknowledge elicitation

评审与讨论

审稿意见

评分: 6置信度: 32024-07-09

This submission discusses the application of human's expertise knowledge into Bayesian Optimization algorithm. A principled approach, COBOL, was proposed which provides two novel guarantees: handover guarantee that ensures queries for expert's label diminishes to zero with iterations; and no-harm guarantee that preserves comparable convergence rate to vanilla BO using a data-driven trust level adjustment. Numerical results show the robustness of COBOL and real world application proves COBOL outperforms state-of-the-art methods.

优点

This paper presents comprehensive theoretical proofs to support the proposed method. The math seems strong to me. Notations and assumptions are clearly stated at the beginning. Follow-up theorems are easy to follow.
I quite like the trust-level adjustment method. The authors present novelty of their work by differentiating themselves from those who requires hand-tuned trust-level functions. I like the idea of treating expert belief as the regularization of the acquisition function and the corresponding primal-dual optimization problem formulation seems reasonable to me.

缺点

My only concern regards the no-harm guarantee. From observation in Figure 3, the no-harm guarantee indeed presents comparable convergence rate as vanilla LCB, but never arrives at the same optimum as vanilla LCB within the given iterations under adversarial cases. By looking at line 5 in Algorithm 1, it seems adversarial expert's label could harm the quality of query point continuously, especially when the constant value of $\eta$ is set to be large so the confidence condition can be easily satisfied. To my understanding, $\lambda_0$ improves the quality of solution from expert-augmented LCB by reducing its value when the data tells expert's feedback is not trustful. Resilience to adversarial label is achieved through updating the primal-dual weight $\lambda_0$ , but I feel like $\eta$ should also vary with iterations since it is a critical hyperparameter in the algorithm that actually controls whether to accept expert-augmented query point $x_t^c$ .

问题

The estimation on the hyperparameter $B_g$ in lines 182-189 seems heuristic to me. Is there any reasons to initialize the norm bound at a small value and double it during the optimization?
Could the authors discuss briefly the possibility of applying such human collaborated BO formulation to different acquisition functions (e.g. Expected Improvement)?
In the plotting of real world Li+ methyl-acetate experiment in Figure 5, how can the curve for [44] KDD 2023 go up after 45th evaluation? Best observed value should be non-increasing with iterations, could the authors explain how they achieve these results?
Could the authors be more specific when referring to proofs and sections? e.g. not using phrases like "will be seen" (line 148), "detailed later" (line 171), "later experimental section" (line 176)

局限性

The authors mention limited capability of the algorithm on high-dimensional problems due to its GP-UCB based nature. The experiments are only conducted up to 4D. See also above for limitations.

作者回复

2024-08-05

Thank you very much for your valuable feedback and helpful suggestions. The following are our detailed responses.

Response to the concern regarding the no-harm guarantee and $\eta$ .

The suggestion for an adaptive trust weight $\eta$ is interesting and could offer better resilience to adversity. Given the limited time, we leave the design of a principled scheme to adapt the weight $\eta$ as future work. However, we want to emphasize that, even without such a scheme, our no-harm guarantee holds, both theoretically and empirically. We have prepared several experiments for further clarification.

Supporting experimental results. To clarify the robustness in the concerned cases, we added experiments (Fig. R2(b) in the global response) with longer iterations ( $T = 200$ ) to confirm the convergence on par with vanilla LCB. We confirmed the no-harm guarantee for both cases of adversarial feedback accuracy and large trust weight $\eta=100$ on adversarial feedback. For the large trust weight $\eta=100$ , we observed saturation behavior, where larger $\eta$ does not significantly change the convergence rate with given increase in $\eta$ .
Adaptation through the posterior standard deviation. Although $\eta$ is set to be a constant in our current design of the algorithm, there is still adaptation on trusting human or the vanilla BO algorithm through the time-varying posterior standard deviation. Intuitively, if originally the expert-augmented solution $x_t^c$ is trusted more, more samples are allocated to human-preferred region and $\sigma_t(x_t^c)$ drops quickly. Intuitively, if we keep sampling $x_t^c$ and $x_t^u\neq x_t^c$ , $\sigma_t(x_t^u)$ would finally be larger than $\eta\sigma_t(x_t^c)$ and we switch to sampling $x_t^u$ .
Choice of $\eta$ does not need to be very large in practice. Intuitively, $\eta$ captures the belief on the expertise level of the human. The more trust we have on the expertise of the human, the larger $\eta$ we can choose. But larger $\eta$ increases the risk of higher regret due to potential over-trust in adversarial human labeler. In our experience, $\eta$ does not need to be very large. Indeed, $\eta=3$ already achieves superior performance in our experiment. And in our updated Figure R2(b), the experimental results become insensitive to $\eta$ when $\eta>10$ .

We have added more discussions in the manuscript around Algorithm 1 to discuss the choice of $\eta$ .

Responses to other comments.

The estimation on the hyperparameter $B_g$ in lines 182-189 seems heuristic to me. Is there any reasons to initialize the norm bound at a small value and double it during the optimization?

Yes, this is a rough estimation. Our algorithm requires knowledge of a valid upper bound on the norm of $g$ , which is unknown a priori. Thus, we empirically check the LL value as a criterion. If we find the current bound $B_g$ to be invalid (that is, smaller than the norm of $g$ ), we double it. We do not use a too big (big enough so that $\geq||g||$ ) initial $B_g$ to avoid over-exploration in $g$ .

We would like to note that the MLE in GP also does not provide a principled method to estimate its hyperparameters (Karvonen et al.). Regret theory typically assumes the true hyperparameters are given. Thus, this is a shared challenge, and we have found heuristics that work in practice, similar to GP. Developing a more principled tuning method is our future work, and we believe a statistical test is a promising line of research.

T. Karvonen et al., Maximum Likelihood Estimation in Gaussian Process is Ill-Posed. JMLR 2023.

Could the authors discuss briefly the possibility of applying such human collaborated BO formulation to different acquisition functions (e.g. Expected Improvement)?

Our algorithm can be easily extended to other acquisition functions. For expected improvement, we can indeed use similar idea to constrained expected improvement to generate $x_t^c$ , $x_t^c \in \arg\max_{x\in\mathcal{X}} \mathbb{P}(x\text{ is accepted by human})\mathrm{EI}(x).$ However, our convergence rate depends on the GP-UCB algorithm, so the convergence rate will change to that of the selected acquisition function.

In the plotting of real world Li+ methyl-acetate experiment in Figure 5, how can the curve for [44] KDD 2023 go up after 45th evaluation? Best observed value should be non-increasing with iterations, could the authors explain how they achieve these results?

Upon checking the raw data, we found that a few results failed at the point of ascent. We were averaging after performing individual minimum operations. We have reviewed all results and rerun the experiments. Please review the new Figure for the confirmation (Fig. R3(b) in global response).

Could the authors be more specific when referring to proofs and sections? e.g. not using phrases like "will be seen" (line 148), "detailed later" (line 171), "later experimental section" (line 176)

Yes, we updated the manuscript to be more specific. Both lines 148 and 171 refers to Theorem 4.1 and Appendix B, and line 176 refers to Figure 3.

The authors mention limited capability of the algorithm on high-dimensional problems due to its GP-UCB based nature. The experiments are only conducted up to 4D. See also above for limitations.

Scalability to high dimensions is a common challenge for BO. In practice, existing generic techniques, such as decomposed kernels, can be applied in our algorithm to choose kernel functions and achieve scalability in high-dimensional spaces. We realized one minor mistake in the explanation of the Michalewicz function in Appendix F.2.1, which was written as 2D but is actually a 5D function. Thus, our results are up to 5D. However, we are not claiming our results are scalable to high dimensions without the above-mentioned techniques.

We have added similar discussions around Table 1 in the manuscript.

评论- Any further questions or suggestions?

2024-08-09

Dear reviewer iJqg,

  Your comments have helped us improve the quality of the manuscript a lot. And we hope our responses have addressed your comments adequately. Otherwise, please let us know if you have any further questions or suggestions. We are more than happy to provide further responses.

Best,

All the authors.

2024-08-13

Thank the authors for the response. My concerns have been well answered and the new results from rerunning the experiments seem convincing to me now. I have raised my score.

评论- Many thanks

2024-08-13

Dear reviewer iJqg,

 Many thanks for acknowledging our rebuttal and raising the score! 
 We are glad that your concerns are addressed.

Best,

All the authors

审稿意见

评分: 9置信度: 42024-07-11

The paper introduces a Bayesian optimization algorithm designed to include human expert knowledge. Expert input comes in the form of simple accept/reject feedback (as in "this experiment is / is not worth doing"). To incorporate feedback two models are maintained, (1) a standard GP model of the objective, and (2) a likelihood ratio model of human expert feedback. Theoretical performance guarantees are proven, precisely a so-called handover guarantee demonstrating sub-linearity in the number of expert queries, and a convergence-rate guarantee to ensure sub-linear performance. Experiments include a practical example on li-ion battery design.

优点

I am very impressed by this paper! The problem it considers is important but under-studied. The algorithm and its justification is novel and well-motivated, and the experimental results are impressive.

The core idea of using expert feedback in simple accept/reject form is certainly attractive compared to the more usual ranking approach and avoids many of the problems therein.
While I'm not entirely familiar with the model used for human feedback (likelihood ratio model) it seems to be a sensible choice.
The paper is reasonably easy to follow and has a clear flow (for the most part).
The theoretical analysis is interesting - I particularly appreciate the handover guarantee, which (as far as I know) is unique in the literature.
The results clearly demonstrate the potential of the algorithm.

缺点

Small points:

does assumption 2.3 really need a justification? This is a standard assumption.
line 114: is it usual to call $r$ a regularization term? Would it be easier to simply use $\xi$ as per assumption 2.6?
equation (4): use of $\xi$ here is a potential notational clash with assumption 3.7.

问题

are you assuming that the expert labeling model is time-invariant here? I am curious what might happen as the expert learns over the course of the optimization, particularly if the BO turns up new and unexpected results (which, in my experience, is not uncommon).
if I'm reading the regret-bound (6a) correctly this simply shows that, in the worst-case, the algorithm converges at the usual rate for BO. In the optimistic case - that is, assuming an expert providing non-adversarial feedback - do you have any thoughts on how/if this might be used to improve the regret bound? This would be non-trivial I suspect, but it would be good to better understand/bound the potential acceleration from human expert feedback.

局限性

n/a

作者回复

2024-08-05

Thank you very much for your positive feedback and the helpful suggestions. The following are our detailed responses.

P1

does assumption 2.3 really need a justification? This is a standard assumption.

We do some justification because it is also an important assumption and we want to differentiate from gradient-based optimization (where compact set assumption is often not needed).

P2

line 114: is it usual to call $r$ a regularization term? Would it be easier to simply use $\xi$ as per assumption 2.6?

Yes, this is common in theoretical studies. We are aware that $r$ is typically referred to as the Gaussian noise variance. Nevertheless, we follow the definition in [16], because even in the noiseless case, we need the regularisation term $rI$ for the invertibility of $K_{\mathcal{Q}_t^f}$ to avoid numerical instability.

P3

equation (4): use of $\xi$ here is a potential notational clash with assumption 3.7.

We have replaced $\xi$ with $\zeta$ to represent the dual update step size in the manuscript.

P4

are you assuming that the expert labeling model is time-invariant here? I am curious what might happen as the expert learns over the course of the optimization, particularly if the BO turns up new and unexpected results (which, in my experience, is not uncommon).

Here are our thoughts on extending our results to time-varying human behavior model:

Simple extension, yet not promising performance gain. We have added the experiments (Fig. R1 in global response). The most naïve approach for non-stationary model is windowing, i.e., forgetting the previous queried dataset. This can be very easy to apply to our setting, as it simply removes the old data from $g_t$ using the predefined iteration window. This increases the predictive uncertainty of $g_t$ , and therefore requires more queries to the human per iteration, yet our framework still works. As shown in the Fig. R1, this non-stationary approach can offer a slight performance gain when the human learning rate is very quick ( $\alpha_{lr} = 1$ ). However, we do not know if this is always the case, and it also requires more labeling effort from humans.
More sophisticated extension. Another more sophisticated approach is modelling the dynamics of behavioural change. A potential idea is modelling the human behaviour change as an implicit online learning process of the latent function $g$ . That is, $g_{t+1}=F(g_t, x_t, y_t),$ where $g_t$ is the human latent function at step $t$ . The forward dynamics $F$ captures the update of human latent function $g$ when observing the new data point $(x_t, y_t)$ . One potential $F$ is gradient ascent of log-likelihood as shown in $g_{t+1}=g_t+\lambda\nabla_g\log p_g(x_t, y_t)$ , where $p_g(x_t, y_t)$ is the probability of observing $y_t$ at the input $x_t$ given the black-box objective function is $g$ . We can then combine this dynamic with our likelihood ratio model. Since this part requires significantly different analysis and experiments, we leave it as future work.

We have added similar discussions to the appendix of the manuscript.

P5

if I'm reading the regret-bound (6a) correctly this simply shows that, in the worst-case, the algorithm converges at the usual rate for BO. In the optimistic case - that is, assuming an expert providing non-adversarial feedback - do you have any thoughts on how/if this might be used to improve the regret bound? This would be non-trivial I suspect, but it would be good to better understand/bound the potential acceleration from human expert feedback.

Thanks for the suggestion. Under our current mild assumption on the latent expert function $g$ , an order-wise convergence improvement is not attainable. Indeed, assumption becomes unrealistic if we really want an order-wise improvement. Therefore, we only demonstrate the empirical superiority of expert-assisted BO in the experiments. The following are the more specific discussions.

Order-wise improvement can not be attained under current mild assumption. $g$ may contain no information (e.g., $g=0$ ) or even adversarial. Even if human expertise is helpful, we can not guarantee an order-wise improvement either. For example, consider the following $g$ , $g(x)= f(x^\star)+c, \text{if } f(x)-f(x^\star)\leq c, \text{ and } f(x) \text{ otherwise},$ where $c>0$ is a positive constant. In practice, such a scenario means the human expert has some rough idea in a near-optimal region but not exactly sure where the exact optimum is. This is common in practice. In this case, human expert is helpful in identifying the region with $f(x)\leq f(x^\star)+c$ but no longer helpful for further optimization inside the region { $x\in\mathcal{X}|f(x)\leq f(x^\star)+c$ }. However, convergence rate is defined in the asymptotic sense. Hence, an order-wise improvement can not be guaranteed.
Assumption becomes unrealistic if we really want it. Some papers that show theoretical superiority [2, 6], yet the assumptions are unrealistic. For example, [6] assumed that the human knows the true kernel hyperparameters while GP is misspecified, and [2] assumed the human belief function $g$ has better and tighter confidence intervals over the entire domain. We can derive the better convergence rate of our algorithm than AI-only ones if we use [2] assumption, but this is unlikely to be true in reality. In fact, our method outperforms these method empirically (see Figure 5). This supports the superiority based on unrealistic conditions is not meaningful in practice.

We have updated the manuscript by adding some discussion near Theorem 4.1 and in the Appendix.

2024-08-11

I would like to thank the authors for their response and will leave my assessment unchanged.

评论- Many thanks

2024-08-12

Dear reviewer sMdb,

 Many thanks for acknowledging our response and again, for the very positive review!

Best,

All the authors

审稿意见

评分: 6置信度: 42024-07-15

The paper proposes to use human rejection as a feedback in human-AI collaborated Bayesian optimization. The proposed solution uses constrained optimization to explore the regions humans think might be beneficial. The authors offered guarantee that 1) the proposed algorithm has the same regret as the vanilla BO algorithm and 2) the number of queries needed from humans converges to 0. Empirically, the authors show that the proposed method outperforms existing method and has a human study with real human experts.

优点

The human-AI collaboration in bayesian optimization is important. The authors show an interesting human-AI collaboration system that humans provide acceptance-rejection feedback to help the BO process. The empirical results are good.

缺点

Insufficient theoretical results - what is human's value in the proposed Human-AI system: The authors showed two theoretical results, 1) the human-AI system is no worse than the AI system and 2) humans will be out of the system in the long run. But the key theoretical results are missing. The most important results should be how humans improve the proposed system compared to AI-only systems. Moreover, the authors should show how humans' expertise affect the convergence rate. In my opinion, the current theoretical results are not meaningful. This is the most important point the authors should address.

Feedback form: the authors use 'acceptance-rejection' feedback, which seems limited. Intuitively, humans can directly provide $x_t^c$ in this problem setting. Why the authors choose this form of feedback instead of a richer form of feedback.

Human choice model: humans are only required to choose from $x_t^c$ and reject in the paper. The vanilla LCB is never presented to the humans. Can authors explain why it is the case, maybe humans can also reject bad vanilla LCB suggestions.

Changing human behaviors: in practice, humans' behavior is often changing, especially in such interactive systems that humans observe new feedback at each time step. How can the proposed method can adapted into such settings?

Human study - No IRB approval: my understanding is for any human study. Investigators are required to obtain either an exempt determination or IRB approval before these activities can be initiated. While authors seem don't have either based on the checklist.

问题

See weakness.

局限性

Yes

作者回复

2024-08-05

Thank you for the valuable feedback. The following are our responses.

Insufficient theoretical results

We agree that acceleration through expert knowledge is central to our contribution. However, we do not believe that "theoretical" acceleration is essential for the completeness of our work. Our main contribution is the algorithm, which outperforms 6 baselines on 5 synthetic and 4 real-world tasks, which should be sufficient for empirical validation. While theoretical validation is nice to have for showing its effectiveness, it is not the only form of validation. Notably, our setting has the gap between what theory can offer and what users want.

In the expert-guided BO, there is a near consensus that the expert is a warm starter [34, 36, 37, 42, 44]. It is natural that experts can guide AI to more promising regions in the beginning. However, in the later stages, with a larger dataset, GP is more accurate in finding the global optimum, as confirmed in various literature [34, 36, 37, 42, 44]. This means that convergence in the later stage will not be very different from an AI-only system. Still, early-stage acceleration is of paramount importance for high-stakes optimization, such as battery design, where engineers strive to launch better products than competitors as early as possible.

However, theoretical analysis focuses on worst-case asymptotic convergence rates. This means that the warm-starter assumption does not significantly affect the theoretical convergence rate. This is why theoretical acceleration is not crucial in our setting. We believe the value of "conservative" theory lies more in providing safeguard guarantees, assuring we can safely and effectively combine experts and AI suggestions for early-stage acceleration. Indeed, ours is the first algorithm to enjoy a full guarantee (see Table 2).

We are constrained by the space to explain theoretical details, but please read the P5 for sMdb. In short, order-wise improvement is unattainable under current mild assumption and assumption can become unrealistic if we really want it. The baselines [2, 6] have theoretical acceleration under unrealistic assumptions. Figure 5 shows they do not outperform ours, highlighting such theoretical acceleration do not always translate to empirical success.

Feedback form

We choose `acceptance-rejection' feedback for two reasons:

Pinpoint form feedback does not necessarily perform better empirically. We added an experiment (Fig. R3(a) in the global response) comparing pinpoint and primal-dual approaches. The results show that our primal-dual approach performs better across varying feedback accuracy, at least within our framework. Existing works [6, 44] employ pinpoint feedback in different frameworks. However, Figure 5 clearly shows that neither [6] nor [44] performs better than ours. This fact is confirmed by several studies from different institutions, such as [2] and [63], both of which demonstrate that [6] performs even worse than vanilla LCB. Similarly, [33] shows that this type of feedback only works better if the expert's manual sampling is consistently superior to vanilla LCB. Such cases are rare in our examples (e.g., Rosenbrock), and [2, 63] confirm the same conclusion.

Acceptance-Rejection feedback is efficient and accurate in extracting human knowledge for BO. From the BO perspective, there are two reasons why we adopt the acceptance-rejection feedback. (1) Functional estimation approaches like ours can exploit more information. They can query multiple locations to experts to elicit their belief function, and reflects on the next query. (2) The precision of human pinpointing can be low. Humans excels at qualitative comparison than absolute quantities in feedback [41]. This is a widely accepted concept in human preference learning and economics, which has a long history of moving from cardinal utility (absolute) to ordinal utility (comparison) for robust estimation. Furthermore, acceptance-rejection feedback is easier for human to give.

Human choice model

We do not for two reasons.

The value of presenting the vanilla LCB suggestion is low. Rejecting LCB suggestions may be helpful to elicit human knowledge, but using $g_t$ distributional information to select the most informative points would be more query-efficient. More specifically, as the logic in line 5 of Algorithm 1 indicates, the vanilla LCB solution $x_t^u$ is used when either human augmented solution $x_t^c$ can not be the optimal solution for sure or the uncertainty of $x_t^c$ is already very low as compared to $x_t^u$ . In both cases, we would sample the vanilla LCB solution $x_t^u$ regardless of the acceptance/reject feedback from human. So it is not necessary to query human for this point.

Not presenting the vanilla LCB suggestion reduces the human labeling effort significantly. If we presented the vanilla LCB solution to the human, we would have suffered from a linear growth of cumulative queries to the human labeler. Instead, by not presenting it, we show a small sublinear bound on the cumulative human label queries. This is meaningful since it reduces the human labeling effort in a provable order-wise way.

Changing human behaviors

This is the same with P4 of sMdb, please read them as our response. In short, yes, our algorithm can do and we ran experiments in Fig. 1 in global response.

No IRB approval

Although we do not have IRB approval, our institution has reviewed and approved our study as low-risk since all experiments involve running software on open-source datasets. According to the NeurIPS 2024 ethics guidelines, adhering to existing protocols at authors' institution is required, and IRB approval is just one form of such protocols. Due to anonymity, we cannot provide evidence now, but we promise to attach it if accepted.

Thank you again for your valuable comments and suggestions. We have incorporated all of our discussions into the relevant sections of the manuscript.

2024-08-08

Thank authors for the feedback. My fundamental reason for rejection is the current paper does not provide an improvement guarantee, so theoretically this algorithm is no better than the vanilla LCB algorithm. I believe a finite-sample analysis in the early stage would greatly improve the paper. Currently, the theoretical analysis in the paper suggests the same regret as LCB and worse regret than UCB (see below), and both LCB and UCB do not have to query humans at all, so I should never use the proposed system from a theoretical perspective. In this sense, this paper feels incomplete to me.

Also I realize the ``no-harm guarantee'' is a little over-claimed, since the constant in the regret bound should be the constant of the LCB bound, not the UCB's bound. so the proposed human-AI system has a worse regret (in the constant) compared to the vanilla UCB algorithm even when there are a lot of samples. It is not clear when the proposed human-AI system can outperform the UCB algorithm.

评论- Reply for the comment (1/2)

2024-08-08

Thank you for your feedback and timely response. As requested, we demonstrate the theoretical acceleration if the expert is helpful.

Theoretical convergence acceleration

We begin by a "helpfulness" assumption on $g$ . We assume $g(x^\star)\leq0$ , meaning the expert accepts the ground-truth optimal solution with probability at least $0.5$ , i.e., better than random. The primal-dual algorithm implicitly solves the constrained problem $\min \underline{f}_t(x) \text{ subject to } \underline{g}_t(x)\leq0$ . It can be seen that we implicitly restrict the sample in the set $\mathcal{X}_t^g= \lbrace x\in\mathcal{X} \mid \underline{g}_t(x)\leq0 \rbrace$ (otherwise, we can run primal-dual dynamics for multiple steps until $x_t^c$ satisfies $\underline{g}_t(x_t^c)\leq0$ ). By the assumption, $x^* \in \mathcal{X}_t^g$ is guaranteed.

As we already assume $\underline{g}_t(x^\star)\leq g(x^\star)\leq0$ , we can trust the human more. To reflect this trust in the algorithm, we add a filter that requires each sample $x_t$ to satisfy,

\underline{g}_t (x_t) \leq 0 \ \text{and} \ \bar{g}_t(x_t) - \underline{g}_t(x_t) \leq g _\mathrm{thr}

Intuitively, this condition means that the human is quite certain that $x_t$ is worth sampling. Otherwise, we can query human for the "acceptance-rejection" feedback at the point $x_t$ without evaluating on the black-box objective function $f$ , and continue the loop. Interestingly, our original algorithm can be seen as a soft implementation of the above constraints since we do not assume helpful human a priori. With this filter, we restrict all the samples to the set $\mathcal{X}^g = \lbrace x\in\mathcal{X} \mid {g}(x)\leq g_\mathrm{thr} \rbrace$ (since all samples on the black-box objective function satisfies $\underline{g}\_t(x_t)\leq0$ and $g(x_t)\leq\bar{g}\_t(x\_t)\leq \underline{g}\_t(x_t)+g\_\mathrm{thr}$ , which implies $g(x_t)\leq g_\mathrm{thr}$ ).

Let $\text{vol}(\mathcal{X})$ be the volume of the domain. Then, we have $\text{vol}(\mathcal{X}_g) \leq \text{vol}(\mathcal{X})$ since $\mathcal{X}_g\subset\mathcal{X}$ . This is illustrated in Example 2.2, where accuracy coefficient $a$ is how effectively the expert can identify the global optimal $x^*$ . As shown in Kandasamy, et al., the maximum information gain (MIG) depends on the domain volume. Let $\Psi_t(\mathcal{X})$ be the MIG at $|\mathcal{Q}^f_t|$ -th iteration over the domain $\mathcal{X}$ . For instance, with the squared exponential kernel, MIG can be expressed as $\Psi_t(\mathcal{X}) \propto \text{vol}(\mathcal{X}) \log(t)^{d+1}$ . The simple regret of the GP-UCB algorithm has a bound written as $\Psi_t(\mathcal{X}) / \sqrt{|\mathcal{Q}^f_t|}$ . By incorporating our additional assumption with the adapted algorithm, the simple regret of our algorithm improves the vanilla GP-UCB by a factor of $\Psi_t(\mathcal{X}^g) / \Psi_t(\mathcal{X}) = \text{vol}(\mathcal{X}^g) / \text{vol}(\mathcal{X})$ . Thus, if expert suggestions reduce the domain volume, for example, by a factor of 10, the theoretical regret bound of our algorithm could be reduced by a factor of 10. This is broadly consistent with the experiments shown in Figure 4: when expert sampling is strong (i.e., the expert’s belief is closely centred around $x^*$ ), the acceleration becomes more pronounced.

No-Harm Guarantee

We are not entirely sure of your intent regarding "the constant of the LCB bound, not the UCB’s bound." Please clarify if our understanding is incorrect. Below is our understanding of your statement and our response.

To avoid confusion, let us define LCB and UCB: we consider UCB as the well-known GP-UCB algorithm [75] and LCB as our approach. Although our objective is minimisation (Eq. (1)), $x^* = \text{argmin} f(x)$ , while GP-UCB uses maximisation, $x^* = \text{argmax} f(x)$ , these are equivalent by negating the objective: $\text{argmin} f(x) = \text{argmax} -f(x)$ . Thus, vanilla LCB is the same as the GP-UCB algorithm, and we will use LCB and GP-UCB interchangeably.

We interpret "worse regret (in the constant)" to mean that our convergence rate in Appendix B.5 includes an additional constant $(2 + \eta)$ compared to the GP-UCB convergence rate. Setting $\eta = 0$ aligns our rate with GP-UCB, while $\eta > 0$ can worsen the rate only if experts provide adversarial feedback. The constant factor $\eta$ can be adjusted by the user to balance robustness against adversity with potential acceleration from beneficial feedback, reflecting the principle of 'no risk, no gain.' Please note that adversarial feedback is not intentional; human experts aim to help but may unintentionally provide ineffective guidance compared to an AI-only system. If experts are uncertain, setting $\eta=0$ can recover the same rate with GP-UCB. Thus, choosing $\eta > 0$ means that the users believe that our algorithm with their own advice offers greater value than GP-UCB.

CONTINUE TO THE NEXT REPLY

评论- Reply for the comment (2/2)

2024-08-08

Experts seek to be involved in decision-making due to their trustworthiness, self-efficacy, and responsibility for success—qualities that current BO algorithms, such as GP-UCB, do not address. At the very least, our algorithm is theoretically superior to a manual search, which lacks convergence guarantees, especially as users might avoid using vanilla BO due to concerns about its trustworthiness.

The term "no-harm guarantee" refers to the invariance of the order-wise convergence rate in the asymptotic sense. Existing works, such as [2], Mikkola et al., use this term for adversarial cases, where the convergence rate is slightly worse by a constant factor. If this terminology seems overstated, we can use "robustness guarantee" instead.

Clarification of our contribution

We view the acceleration analysis as a valuable but supplementary aspect of the paper. Our core contribution—the algorithm—remains unchanged regardless of this analysis. Our primary contribution includes the algorithm, which is supported by extensive empirical validation, and providing both theoretical no-harm and handover guarantees, the first-of-its-kind in the literature. We hope this clarification resolves your concerns: our algorithm indeed outperforms GP-UCB both theoretically and empirically when expert advice is helpful. We have added the above discussion in Section 4 in the manuscript.

Citations

K. Kandasamy, et al., Multi-fidelity Bayesian Optimisation with Continuous Approximations, ICML 2017
P. Mikkola, et al., Multi-Fidelity Bayesian Optimization with Unreliable Information Sources, AISTATS 2023.

评论- Any further questions or suggestions?

2024-08-09

Dear reviewer dzPz,

 Your review and comments have helped us improve the quality of the manuscript a lot. And we hope our rebuttal and further responses have addressed your concerns adequately. Otherwise, please let us know if you have any further questions or suggestions, especially on the theoretical convergence improvement. We are more than happy to provide further responses.

Best,

All the authors.

2024-08-10

Thank authors for the response.

For the improvement guarantee, I think the argument for the new algorithm is correct. Now my main concern is the algorithm in the paper/experiment does not match the algorithm that enjoys the improvement guarantee authors offered in the previous response. The algorithm presented in the paper does not enjoy this improvement guarantee. As I suggested earlier, it seems the paper is not complete and should go through another iteration.

For the no-harm guarantee, I was using the notation in the reward maximizing case (sorry for the confusion). Let me be more specific - I will use reward maximizing notation, so UCB chooses reward upper bound and LCB chooses reward lower bound.

$x^c$ : expert augmented selection

$x^*$ : optimal x

$x^u$ : UCB selection

For a vanilla UCB algorithm: the regret is $f(x^*)-f(x^u)=f(x^*)-UCB(f(x^u))+UCB(f(x^u))-f(x^u) \leq UCB(f(x^u)) - f(x^u)$

For the authors' proposed human-AI system (authors' proof above equation 41): the regret is $f(x^*)-f(x^c)=f(x^*)-UCB(f(x^c))+UCB(f(x^c))-f(x^c)$

$\leq UCB(f(x^*))-UCB(f(x^c))+UCB(f(x^c))-f(x^c)$

$= UCB(f(x^*))-UCB(f(x^u))+UCB(f(x^u))-UCB(f(x^c))+UCB(f(x^c))-f(x^c)$

$\leq UCB(f(x^u))-UCB(f(x^c))+UCB(f(x^c))-f(x^c)$

$\leq UCB(f(x^u))-LCB(f(x^u)) + UCB(f(x^c))-f(x^c)$

By comparing these two bounds, the authors' proposed system seem three times worse than the vanilla UCB algorithm, so not really no harm. It seems to me the harm is hidden in the constant.

(I am using reward maximizing notation, so the UCB corresponds to the LCB in the paper. the above proof is a translation of authors' proof above eq 41)

评论- Response on the no-harm guarantee

2024-08-10

On the no-harm guarantee

Thank you for your clarification! Now it is clear that we are on the same page. As you read further, on line 742, you will notice that the bound reaches $2(2 + \eta) \beta_{f_T} \sqrt{4(|Q^f_T|+2) \gamma^f_{|Q^f_T|}}$ . In comparison, the bound given by the GP-UCB algorithm [75] is $4 \beta_{f_T} \sqrt{4(|Q^f_T|+2) \gamma^f_{|Q^f_T|}}$ . The only difference lies in the constant term $(2 + \eta)$ , and our previous response provides its justification. In short, this happens only when the users' advice is unintentionally ineffective, and the decision to trust their advice is left to the users themselves. If they are aware that their advice is not reliable a priori, setting $\eta=0$ can align the convergence rate with that of GP-UCB. To benefit from human expertise, such a potential harm of constant-wise worse regret bound can not be avoided when the human expert turns out to be adversarial or ineffective. Although the term "no-harm guarantee" is standard in the literature in this asymptotic order-wise sense, we understand your concern. Therefore, we have changed it to "robustness guarantee."

2024-08-12

Thank authors for confirming my concerns.

For the improvement guarantee, my original and current concern is the algorithm described in the submitted manuscript should have an improvement guarantee. I think assuming some version of helpfulness is okay as long as it satisfies the robustness guarantee. The key is authors should show the proposed algorithm helps when humans are good, and does not hurt too much when humans are bad. So far, I don't think authors gave such an improvement for the algorithm in the manuscript.

For the original called 'no-harm guarantee' in the paper, authors confirmed that when humans are bad, there is harm in the constant (only no harm in the rate of T). I think such harm is natural since there is no free lunch and there is a price to pay when humans are bad. But in the initial submitted version, the presentation does not reflect it and the harm is hidden, so the presentation of the paper is not very satisfactory to me. I think authors should discuss this point in detail in the future version of the paper.

评论- Response to the new concern

2024-08-10

Many thanks for your engagement and constructiveness. We are pleased that you acknowledge our theoretical improvement guarantee.

Now my main concern is the algorithm in the paper/experiment does not match the algorithm that enjoys the improvement guarantee authors offered in the previous response. The algorithm presented in the paper does not enjoy this improvement guarantee. As I suggested earlier, it seems the paper is not complete and should go through another iteration.

Regarding your new concern, we want to clarify that the algorithm 1 shown in our submission and the slightly adapted version indeed match from a higher-level algorithmic perspective, with the difference being a slight implementation choice. The key algorithmic idea is the same: using human function $g$ as a constraint when selecting the new point. The difference is, the adapted version implements this as a hard constraint, while in our Algorithm 1, it is implemented as primal-dual dynamics, which is a popular method to solve a constrained problem, as shown in Eq. (4). The active learning constraint $\bar{g}_t(x)-\underline{g}_t(x)\leq g_\mathrm{thr}$ is also implemented in line 7 of our algorithm 1 as a soft constraint. As such, algorithm 1 can be seen as a relaxed implementation of the hard-constrained algorithm.

Actually, the hard-constrained version was our initial implementation, but we finally choose to relax the hard constraints in the submission due to the following reasons:

Relaxed version is more robust to adversity. The theoretical improvement guarantee highly relies on the "helpfulness" assumption $g(x^\star)\leq0$ . However, in practice, whether or not this assumption holds is unknown a priori. If $g(x^\star)>g_\mathrm{thr}$ instead, which can happen in practice if the human is adversarial or just not so experienced, adding the hard constraint would lead to the missing of global optimum since the hard constraint restricts all samples to satisfy $g(x)\leq g_\mathrm{thr}$ .
Relaxed version empirically performs better. We indeed tested the hard constrained version before submission, but we found that performance significantly improved when we relaxed the hard constraints, as in our current Algorithm 1. The practical superiority of the primal-dual method is well-known in general (S. Boyd, et al., 2004), even though it relaxes the hard constraint. The success lies in its flexibility; it can handle problems that may not be strictly feasible, which can occur during the learning process of both $f$ and $g$ .

Therefore, this is a deliberate design choice, not an indication of incompleteness. We prioritised practicality and the worst-case guarantee over conditioned guarantees, considering the high-stakes nature of many expert-in-the-loop applications. Achieving both simultaneously is challenging in general because it has the trade-off between robustness guarantee and theoretical performance guarantee, see e.g., Tsipras et al., ICLR 2019.

Practical relaxation of constraints is a common approach in many works. For instance, the well-known paper "Proximal Policy Optimization" (Schulman et al., 2017) proposes a practical implementation that employs a relaxed version of the "Trust Region Policy Optimization" (Schulman et al., 2015). As such, relaxing hard constraints at the implementation level is common, and we believe this new concern may not be grounds for rejection. Indeed, the key point of theory here is to understand its effectiveness in an ideal setting. We do not believe our manuscript requires another iteration, as this algorithm design choice is intentional based on robustness and practical considerations, and the main content does not need a major revision. Still, we have included our fruitful discussion and experimental comparison in the manuscript.

Citations

S. Boyd, et al., Convex Optimization, 2004
D. Tsipras, et al., Robustness May Be at Odds with Accuracy, ICLR 2019
J. Schulman., et al., Proximal Policy Optimization Algorithms, arXiv 2017
J. Schulman., et al., Trust Region Policy Optimization, ICML 2015

评论- Responses and summarization

2024-08-12

Thank you for investing your time in reviewing our paper. To avoid confusion, let us first summarise the current status to confirm our mutual understanding:

Summary of Our Discussion

We would like to emphasize that the core of our contribution is a novel algorithm with both theoretical robustness and handover guarantees, and empirical improvement, for which we have a relevant theorem and compelling empirical results. Additionally, we have, in the course of rebuttal, brought up an adapted novel algorithm (which was indeed our initial design), with an additional theoretical improvement guarantee --- we view this additional guarantee as supplementary to our core contribution, beyond its scope, not as a replacement. Our view is that our core contributions are significant and potentially impactful, just with what we have agreed (see below), as also appreciated by all the other reviewers.

What we have agreed

Our algorithm is the first-of-its-kind one that has two theoretical contributions of both robustness and handover guarantees. Numerous experimental results show that, human experts help when they are good and do not hurt much when they are not.
Our hard-constrained algorithm enjoys the third theoretical improvement guarantee, while our implemented algorithm in the original submission is a relaxed version of it.
Our robustness guarantee may be slightly worse at the constant level compared to GP-UCB, but the constant $\eta$ is controllable by users and not worse in an order-wise sense, as is commonly accepted in the literature [2, 36, Mikkola, et al., AISTATS 2023]. Nevertheless, this is worth noting for readers to understand the risk involving setting $\eta$ , so we have included it as a minor revision (which has been completed).

What we disagree for now

Whether it is even theoretically possible that the robustness-guaranteed algorithm also has an improvement guarantee, and whether it must have both for completeness.
Whether we should prioritize the non-robust algorithm with the conditional improvement guarantee. This depends on the hidden assumption that strict adherence to the improvement guarantee is superior to its relaxed counterpart; otherwise, strict adherence may not be justifiable.

Our responses.

Guaranteeing Both Robustness and Improvement May Be Incompatible in Theory.

To begin with, we want to emphasize that guaranteeing both improvement and robustness may be incompatible theoretically. From a theoretical perspective, they are in a trade-off relationship [D. Tsipras, et al., ICLR 2019]. This can be intuitively explained by the no-free-lunch theorem [Wolpert, et al. IEEE TEVC 1997]: if algorithm A outperforms B, it does so by exploiting 'biased' information. The 'bias' inherent in the improvement is contradictory to robustness. Our setting is unbiased, meaning we do not have prior knowledge of helpful or adversarial human expert. Therefore, we must make a design choice between prioritizing robustness or improvement as a theoretical contribution, depending on whether we assume that expert input can be adversarial (weak bias) or that it will always be helpful (strong bias). Indeed, there are lower bound results for the average-case regret of Bayesian optimization in the literature (e.g., see [J. Scarlett, et al., COLT 2017]). GP-UCB is alreadly nearly rate-optimal in achieving this lower bound. This means theoretical improvement is obtained in the price of worse robustness.

We Need to Balance between Improvement and Robustness.

Table R1: Comparison of cumulative regret bound.

Our algorithms	human	order	constant ratio over GP-UCB
Hard-constrained (Improvement-guaranteed)	adversarial	linear	$\to\infty$ (linear / sublinear)
Hard-constrained (Improvement-guaranteed)	helpful	sublinear	$\text{vol}(\mathcal{X}_g) / \text{vol}(\mathcal{X})$ ( $\leq 1$ )
Relaxed (Robustness-guaranteed)	adversarial	sublinear	$2(2+\eta) / 4$
Relaxed (Robustness-guaranteed)	helpful	sublinear	Empirically comparable to $\text{vol}(\mathcal{X}_g) / \text{vol}(\mathcal{X})$

We have two options: a robustness-guaranteed relaxed algorithm, which is our submitted version, and an improvement-guaranteed algorithm, which is the hard-constrained but essentially the same algorithmic idea. Refer to Table R1 for the theoretical comparison. In the case of helpful feedback, both algorithms achieve constant-wise improvement. However, when the human expert is adversarial, the improvement-guaranteed algorithm performs infinitely worse at the constant level than GP-UCB, as its cumulative regret growth becomes linear due to restricting sampling in a subset and missing global optimum, while the relaxed algorithm still achieves sublinear growth. Therefore, our current relaxed version is clearly more balanced than the hard-constrained one.

评论- Responses and summarization (cont.)

2024-08-12

While we agree that this is an interesting discussion for readers, it is supplementary content. The main paper should focus on clearly presenting the contributions of the main algorithm, which is already complex enough for a 9-page paper. Including a comparison with another algorithm in the main text could disrupt the smooth flow of the narrative. We appreciate your contribution to the discussion on algorithmic comparison and have respectfully included this content in the Appendix.

Citations

D.H., Wolpert, et al., "No Free Lunch Theorems for Optimization". IEEE TEVC 1997
J. Scarlett, et al., "Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization", COLT 2017.

Regarding your concern on "no-harm guarantee".

Regarding your concern on "no-harm guarantee", we understand it and have made the regret bound in Theorem 4.1 explicit in the parameter $\eta$ to reflect this constant-wise harm when the human expert is bad. Since this only requires revising the name of "no-harm guarantee", making the regret bound explicit in $\eta$ , and adding some relevant discussions, this issue can be addressed with minor revision (we have done so) and our paper does not need a new iteration because of this.

2024-08-13

Thank authors for the responses. I think authors did a good job addressing my concerns. I understand you cannot edit the manuscript this year so I cannot verify these changes in the manuscript, but I hope authors can discuss in detail about the improvement guarantee and the trade-offs of involving humans in the systems. I personally feel this is a more interesting part to me as a reader. I raised my score.

I only other concern is the IRB but I will defer that to AC and ethic reviewers.

评论- Many Thanks

2024-08-13

Dear reviewer dzPz,

 Many thanks for acknowledging our responses and raising the score! 
 For sure, we will discuss in detail about the improvement guarantee and the trade-offs of involving humans in the systems in the final version if our submission is accepted. (Indeed, we have been doing so.)

Best,

All the authors

审稿意见

评分: 7置信度: 32024-07-30

This paper introduces a robust Bayesian optimization algorithm that incorporates human expert knowledge to accelerate the optimization process. The key theoretic contributions include a handover guarantee, ensuring the number of expert labels required decreases over time, and a no-harm guarantee, ensuring the optimization performance is not adversely affected even with unreliable expert advice. The algorithm combines Gaussian processes for objective functions and likelihood ratio models for expert beliefs, using a primal-dual approach to mix these two surrogates, thereby enhancing the efficiency of handling expert inputs. Empirical validation on synthetic benchmarks and real-world tasks, such as lithium-ion battery design, shows the method's effectiveness and robustness, outperforming existing baselines.

优点

This paper introduces a first-of-its-kind principled approach to Bayesian optimization by integrating human expert knowledge. The authors provide solid theoretical guarantees, including the handover and no-harm guarantees, which are well-justified and supported by detailed proofs.

The empirical experiments are comprehensive, covering five synthetic benchmarks and one real-world application, as well as robustness and sensitivity analysis. The experiment results show the algorithm's improvement over existing baselines.

The paper is overall well-written and structured, with a clear presentation of its contributions, theoretical foundations, and experimental results. The inclusion of visual aids and detailed plots enhances understanding and reproducibility, though there is a lack of explanations on notations and terminologies, which I will elaborate in the Weakness section.

This paper's contribution has the potential to advance the state of the art in Bayesian optimization and inspire further research in human-AI collaboration.

缺点

The main weakness of this paper lies in its clarity.

The term "cost function" is misleading as it typically refers to black-box objective function evaluation costs in cost-aware Bayesian optimization, whereas this paper uses it to denote expert belief.
Despite having a notation paragraph at the end of Section 2 and an algorithm paragraph at the beginning of Section 4.2, it is still difficult to locate the definitions of the notations used in Algorithm 1. For instance, the notations in Line 1 are not clearly defined until Line 9, where it becomes apparent they represent the prior confidence set. Including a line of parameters at the beginning of the pseudocode would be helpful.
This issue extends to Figure 2. The caption could be more detailed to clarify the notations used in the figure, as it currently requires effort to understand.
Both LCB and UCB are defined, but only LCB is used in their algorithm, and the authors claim it is still based on the GP-UCB algorithm. It needs more clarification on when to use LCB and when to use UCB.
The term "LL values" appears multiple times but lacks a clear explanation. It seems to be associated with the LL maximizer mentioned somewhere in the paper but needs more clarification.
The x-axis of Figure 3 is confusing. For example, the number of function evaluations seems to go up to 50, yet the number "50" does not appear.
There is a lack of ablation study on the kernel choice and the number of dimensions. While I understand these may not be the primary focus of this paper, I am interested in the authors' thoughts on these aspects.
Additionally, it takes time to figure out how the synthetic agent response (including feedback accuracy and expert belief distribution) is modeled in the synthetic experiments. The authors should refer to Example 2.2 again or directly restate the expression in the synthetic dataset paragraph.

问题

Some suggestions to improve clarity have been mentioned in the Weakness section. I have two additional questions:

I would like to see a justification for the linear scaling in the synthetic agent response model in Example 2.2, as it is used throughout the synthetic experiment as well as the robustness and sensitivity analysis. Is this scaling general enough to represent typical synthetic agent responses?
I am interested in learning about the actual human labor efforts involved in the battery design task to understand the applicability of this work in real-world applications. The experiment results include the cumulative queries, but I would like to know more about the actual time spent on labeling. This can be a rough estimate, like a comparison in magnitude to the computational time and the function evaluation time.

局限性

The authors do have adequately addressed the limitations, even though I still have concerns in how costly the human labors in a real-world task (e.g., the battery design) can be.

作者回复

2024-08-05

Thank you very much for the positive feedback and the helpful comments. The following are our detailed responses.

Point-by-Point Responses to Weaknesses

To avoid any confusion, we have changed it to 'expert function' or just 'function' instead.
We have added the notations with explanations and some initializations at the beginning of the Algorithm 1.
We have added more details to clarify the notations used in Figure 2.
In the GP-UCB paper [75], maximization formulation is adopted. However, we consider the minimization formulation (as common for many applications in engineering) in our paper. So GP-UCB is modified to LCB in our paper. Vanilla LCB in this paper is essentially the same as GP-UCB in [75]. To clarify this issue, we have inserted a footnote in line 321.
We have added an explanation clearly stating that 'LL value' refers to the log likelihood value over the historical data up to step $t$ with a function $\hat{g}\in\mathcal{B}_g$ .
We have updated Figure 3 to avoid any confusion.
Kernel choice and scalability to high dimension are common challenges for BO. Theoretically, Table 1 shows the impact of the kernel function and dimension on both the cumulative regret and cumulative queries. This is similar to the GP-UCB algorithm. In practice, existing generic techniques, such as decomposed kernels, can be applied in our algorithm to choose kernel functions and achieve scalability in high-dimensional spaces. We have added similar discussions around Table 1 in the manuscript.
We have added some discussions to state it again at line 271.

Point-by-Point Responses to Questions

I would like to see a justification for the linear scaling in the synthetic agent response model in Example 2.2, as it is used throughout the synthetic experiment as well as the robustness and sensitivity analysis. Is this scaling general enough to represent typical synthetic agent responses?

It is challenging to encompass all possible agent responses in the experiments. In the synthetic response model $p_{x\succ_g0} = S(a \rho(f(x)))$ , our linear scaling function, combined with the accuracy coefficient, can cover the three important and common scenarios,

Helpful human expert: $a$ is significantly positive (e.g., $a=1$ ). In this case, $x^\star$ is mapped to a high probability of being accepted, and the point with large $f(x)$ value is mapped to a low probability of being accepted.
Purely random response: $a=0$ , every point is accepted with probability of $0.5$ .
Adversarial human expert: $a$ is significantly negative (e.g., $a=-1$ ). In this case, $x^\star$ is mapped to a low probability of being accepted.

While Theorem 4.1 assures the versatility of our algorithm for any response function, we also wanted to empirically confirm its reliability with another possible response function. Thus, we added experiments in Fig. R2(a) in the global response, selecting [37] as a representative example from the literature. [37] adopted human belief as a multivariate normal distribution (MVN), where the mean and variance of the MVN correspond to the most promising location of the global optimum and the confidence in the belief, respectively. We have tested the strong, weak, and wrong models by following [37]. Fig. R2(a) show that the outcomes do not differ significantly from those in Example 2.2. In particular, the “wrong” case is centred at the point farthest from $x^*$ within the domain, representing another adversarial response and demonstrating the efficacy of the no-harm guarantee.

I am interested in learning about the actual human labor efforts involved in the battery design task to understand the applicability of this work in real-world applications. The experiment results include the cumulative queries, but I would like to know more about the actual time spent on labeling. This can be a rough estimate, like a comparison in magnitude to the computational time and the function evaluation time.

Human labelling costs vary among experts but typically range from a few seconds to several minutes. As shown in Figures 6 and 7, the BO overhead ranges from a few seconds to tens of seconds, which is comparable. However, we assume that the function evaluation time is very expensive. The battery design tasks in Figure 5 are demonstration purposes using open datasets. In the real-world development, creating a prototype battery requires at least three days for manufacturing and one week for testing. This experiment costs at least $10,000 when outsourced, making the labelling cost negligible by comparison. This high expense is why experts are reluctant to rely solely on opaque AI algorithms for decision-making. They prefer to be involved in the design process for trustworthiness, despite being uncertain of their effectiveness of aiding BO in advance. From this human perspective, our approach can be seen as augmenting human ability through BO with a provable convergence guarantee. Thus, human involvement is more of a prerequisite in our setting rather than an additional cost. In fact, the handover guarantee can reduce their effort compared to a manual search (see Figure 7, where the cumulative queries of ours are smaller than those of a manual search).

However, even though the above cases are our main target and perspective, one may wish to extend to other scenarios. For example, users might want to be involved in scenarios with cheap feedback, such as hyperparameter tuning of lightweight ML models. In this setting, we agree that the labelling cost would matter.

We have added this limitation and similar discussion to the manuscript.

Response to Limitations

The authors do have adequately addressed the limitations, even though I still have concerns in how costly the human labors in a real-world task (e.g., the battery design) can be.

Please see our response to the last comment.

Thank you again for valuable feedback. We hope our rebuttal clarifies our points.

2024-08-11

Thanks for your detailed responses. The discussions on the response model and the human labeling costs now appear more comprehensive to me. However, I still have further concerns regarding the practicability of your approach. Although the human labeling times might be comparable to the BO overheads, involving human experts introduces additional labor costs. This means that merely comparable theoretical guarantees and experiment results might not be sufficient to justify the use of human experts over simply running a vanilla LCB algorithm. Nonetheless, this paper opens a door to further exploration into BO with human-AI collaboration, so I have increased my score.

评论- Many thanks

2024-08-11

Dear reviewer Ct1E,

 Many thanks for acknowledging our rebuttal and raising the score! 
 We understand your further concern and will try to discuss it further in the paper if accepted.

Best,

All the authors

作者回复

2024-08-04

Global response and added experiments

The authors would like to thank all reviewers for their effort. As per the reviewers' requests, we have added five additional plots to the rebuttal PDF and incorporated these figures into our manuscript.

Fig. R1: Nonstationary Human Accuracy (dzPz, sMdb)

We tested the scenario where the accuracy of human experts’ labeling improves over time. Our algorithm can extend to such scenarios, but the performance gain is limited to certain conditions. Further work is needed for the future work.

Fig. R2(a): Variants of Human Belief Model (Ct1E)

We tested another human belief model proposed by [37]. Our algorithm works for the strong, weak, and wrong belief models introduced in [37].

Fig. R2(b): Confirming No-Harm Guarantee (iJqg)

To confirm that our algorithm converges on par with vanilla LCB in adversarial scenarios, we extended the iterations to 200. Our algorithm converges to the same regret as vanilla LCB for both feedback accuracy adversity and strong trust weights on adversarial feedback.

Fig. R3(a): Pinpoint Form (dzPz)

We tested the efficacy of pinpoint feedback suggested by dzPz. The results do not show a promising performance gain over our primal-dual approach.

Fig. R3(b): Bug in Figure 5 (iJqg)

Thank you for pointing this out. We have rerun the experiments, and the bug has been resolved.

We would like to thank all reviewers again for their comments and suggestions, which have significantly improved the quality of our manuscript. We have already updated our manuscript to reflect our fruitful discussion. We hope this rebuttal clarifies any initial confusions and concerns. We are happy to discuss further if more clarification is needed.

审稿意见

评分: 6置信度: 32024-08-10

The paper studies the integration of human expert knowledge into Bayesian Optimization (BO) processes through binary accept/reject recommendations. The authors introduce an approach that ensures two key guarantees: Handover Guarantee: The approach proves a sublinear bound on the cumulative number of binary labels needed from experts. Over time, the need for expert labels decreases, thereby saving effort and computation time. No-Harm Guarantee: The algorithm adapts the trust level in expert advice based on data, ensuring that the convergence rate will not be worse than using BO without expert advice, even in cases where the expert's advice may be adversarial. The proposed method is empirically validated in real-world tasks, particularly in the design of lithium-ion batteries, demonstrating its robustness and superiority over existing methods.

优点

Originality: The paper presents an innovative approach by integrating human expert knowledge into BO with theoretical guarantees. The introduction of the "handover" and "no-harm" guarantees is novel, addressing a significant gap in existing literature where human-in-the-loop methods often lack formal theoretical guarantees.

Quality: The authors provide clear theoretical formulations with proofs, that establish the validity of their approach. The empirical evaluations uses both synthetic and real-world datasets, and the results are compelling, showing the practical effectiveness of the method.

Clarity: The paper is well-organized and clearly written. The concepts, while complex, are explained in a step-by-step manner, making the paper accessible to readers. Given this is 41-page long paper, I especially like the table of contents in Appendix.

Significance: The approach could have broad implications for how expert knowledge is integrated into automated systems, making it a valuable addition to the literature.

缺点

Limited Generalizability: While the paper provides strong theoretical guarantees, the method's effectiveness may be limited to specific types of expert interactions (binary labeling). The approach might not generalize well to scenarios where expert input is more complex or continuous.

Computational Complexity: The algorithm's dependence on GP and the primal-dual method could be computationally expensive, particularly in high-dimensional settings. This could limit the practicality of the approach in real-time applications or for problems with very large datasets.

问题

Given the potential computational overhead of the proposed method, have the authors compared the computational time of your proposed method with other baselines?

Moreover, particularly in high-dimensional spaces, have the authors considered any strategies for reducing computational complexity? For example, could dimensionality reduction techniques be integrated into the process?

局限性

评论- Responses on the computational complexity (cont.)

2024-08-11

Moreover, particularly in high-dimensional spaces, have the authors considered any strategies for reducing computational complexity? For example, could dimensionality reduction techniques be integrated into the process?

Thanks for the suggestion. Besides the scalable GP techniques mentioned in our previous response, existing dimensionality reduction techniques for Bayesian optimization can also be integrated into our process. For example, line BO [Kirschner, et al., 2019] and random embedding techniques [Wang, et al., 2016] can also be integrated to our process by iteratively restricting the search space to a line or applying similar random embedding techniques.

Citations

Kirschner, Johannes, et al. "Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces." ICML 2019.
Wang, Ziyu, et al. "Bayesian optimization in a billion dimensions via random embeddings." JAIR 55 (2016): 361-387.

We have added similar discussions above to the relevant places in the manuscript.

评论- Responses to the review (cont.)

2024-08-11

On 'Computational Complexity'

The algorithm's dependence on GP and the primal-dual method could be computationally expensive, particularly in high-dimensional settings. This could limit the practicality of the approach in real-time applications or for problems with very large datasets.

Scalability in Surrogate Models.

Computational complexity is a common challenge for Bayesian optimization due to the usage of Gaussian process, which requires $\mathcal{O}(n^3)$ ( $n$ is the number of data points) computational time for inference. In the literature, there are many works on computationally scalable Gaussian process [Liu, et al.]. Indeed, our method only requires access to the lower/upper confidence bounds and posterior uncertainty of the unknown functions. These posterior confidence bounds and uncertainty can be derived using the computationally scalable methods in the literature.

For example, sparse GPs have been proposed to approximate the true posterior using inducing points [Titsias, 2009]. Another line of work (e.g., [Salimbeni, 2017]) uses stochastic variational inference for scalable GP. Kernel approximation methods, such as [Rahimi, et al., 2007], are also popular scalable methods. Recent work [Lederer, et al., 2021] also proposes a tree-based efficient GP inference method. Beyond Gaussian process, we expect that combining the recent works on neural process [Garnelo, et al.], a computationally efficient stochastic process learning method, with our method is also a promising direction to scale up to real-time or high dimensional problems.

Scalability in Primal-Dual Method.

In each step $t$ , we only require one step of primal update and one step of dual update. The main computational load is on the primal update, where an unconstrained nonlinear optimization problem is required to be solved as shown in Prob. (5). In low dimensions, this can be solved using grid search. In medium to high dimensions, we can use existing fast nonlinear programming solvers, such as Ipopt (which is highly scalable), to solve the primal update problem.

Given the potential computational overhead of the proposed method, have the authors compared the computational time of your proposed method with other baselines?

We understand it may be difficult to check all appendices, but we want to highlight that we provided a computational time comparison in Figure 7 of Appendix F.4.1. Our method is not the slowest and is consistently on par with other baseline methods. While the algorithmic overhead ranges from a few seconds to tens of seconds across experiments, the human labeling time varies from a few seconds to several minutes. Thus, the algorithmic overhead is comparable to the labeling time.

We would also like to draw your attention to the fact that we assume the function evaluation time is very expensive. The battery design tasks shown in Fig. 5 are for demonstration purposes using open datasets. In real-world development, creating a prototype battery requires at least three days for manufacturing and one week for testing. This experiment costs at least $10,000 when outsourced, making the computational overhead negligible by comparison. This high expense is why experts are reluctant to rely solely on opaque AI algorithms for decision-making; they prefer to be involved in the design process for trustworthiness, even if they are uncertain about the effectiveness of aiding BO in advance.

However, we acknowledge that, while the above cases are our primary focus, there may be other scenarios to consider. For instance, users might want to apply this approach in contexts with cheap feedback or even real-time feedback, such as hyperparameter tuning of lightweight ML models. In such cases, we agree that the computational overhead would be comparably more significant.

Citations

R.A. Bradley, et al. "Rank analysis of incomplete block designs: I. the method of paired comparisons." Biometrika 1952
M. Titsias, "Variational learning of inducing variables in sparse Gaussian processes." AISTATS 2009
M. Garnelo, et al. "Conditional neural processes." ICML 2018
H. Liu, et al. "When Gaussian process meets big data: A review of scalable GPs." IEEE Trans. Neural Netw. Learn. Syst. 2020
H. Salimbeni, et al. "Doubly stochastic variational inference for deep Gaussian processes." NeurIPS 2017
A. Rahimi, et al. "Random features for large-scale kernel machines." NeurIPS 2007
A. Lederer, et al. "Gaussian process-based real-time learning for safety critical applications." ICML 2021.

We have added similar discussions above to the relevant places in the manuscript.

评论- Responses to the review

2024-08-11

Thank you very much for the positive feedback and the helpful comments. The following are our detailed responses.

On 'Limited Generalizability'

In fact, our rebuttal experiments (Figures R2, R3 in the global response) demonstrated that our approach can be extended to accommodate a broader range of human feedback. We chose the "binary labeling" format primarily due to its empirical success, as our experiments showed that our approach outperformed other baselines that adapted to different forms of feedback. The details are as follows:

Other feedback forms.

We understand your reference to "more complex" as pertaining to forms (b) and (d), and "continuous" as relating to forms (a) and (c).

[(a)] Pinpoint form [6, 33, 44] adopt this form that the algorithm asks the humans to directly pinpoint the next query location.
[(b)] Pairwise comparison [2] adopts this form that the algorithm presents paired candidates, and the human selects the preferred one.
[(c)] Ranking [7] adopts this form that the algorithm proposes a list of candidates, and the human provides a preferential ranking.
[(d)] Belief function [36, 37] adopt a Gaussian distribution as expert input. Unlike the others, this form assumes an offline setting where the input is defined at the beginning and remains unchanged during the optimization. Human experts must specify the mean and variance of the Gaussian, which represent their belief in the location of the global optimum and their confidence in this estimation, respectively.

Slight modification can adapt these forms to our method.

[(a)] Pinpoint form We can simply replace the expert-augmented LCB $x^c_t$ in line 3 of Algorithm 1 with the pinpointed candidate (see Figure R3 of the global response).
[(b)] Pairwise comparison By adopting the Bradley-Terry-Luce (BTL) model (Bradley et al., 1952), we can extend our likelihood ratio model to incorporate preferential feedback. This allows us to obtain the surrogate $g$ , while the other parts of our algorithm remain unchanged.
[(c)] Ranking Ranking feedback can be decomposed into multiple pairwise comparisons. Therefore, we can apply the same method as in the pairwise comparison.
[(d)] Belief function We can use this Gaussian distribution model as the surrogate $g$ (see Fig. R2 in the global response for details and results).

Binary labelling empirically performs best in our experiments.

The primary reason we adopted binary labeling is due to its empirical success, as demonstrated in Fig. 5. None of the other formats, including (a) pinpoint form [6, 44] and (b) pairwise comparison [2], outperforms our method. In the experiments by [2], the authors showed that (a) pairwise comparison outperforms both (d) belief form [36] and (c) preferential ranking. Therefore, it logically follows that our binary labeling format yields the best performance. Additionally, Fig. R3 in the global response reaffirms that the pinpoint form performs worse than our binary labeling format.

The main reasons why the binary format works better are as follows:

[(a)] Pinpoint form The accuracy of pinpointing is generally lower than that of kernel-based models. Humans excel at qualitative comparison rather than estimating absolute quantities [41]. Numerous studies [2, 42, 44, 63] have confirmed that manual search (pinpointing) by human experts only outperforms in the initial stages, with standard BO with GP performing better in later rounds. [33] shows that this type of feedback only outperforms when the expert's manual sampling is consistently superior to the standard BO. However, such cases are rare in our examples (e.g., Rosenbrock), and [2, 63] corroborate this conclusion.
[(b)] Pairwise comparison This format relies on two critical assumptions: transitivity and completeness. Transitivity assumes no inconsistencies, which are often referred to as a "rock-paper-scissors" relationship. However, real-world human preferences frequently exhibit this issue [14]. Completeness assumes that humans can always rank their preferences at any given points. In practice, when a user is unsure which option is better, this assumption does not hold. Our imprecise probability approach avoids these issues by not relying on an absolute ranking structure [5, 35].
[(c)] Ranking Ranking is an extension of pairwise comparison and has been classically researched as the Borda count, which is known not to satisfy all rational axioms. Theoretically, the Condorcet winner in pairwise comparison is the only method that is known to identify the global maximum of ordinal utility.
[(d)] Belief function This is another form of absolute quantity, which humans are generally not proficient at estimating. Additionally, the offline nature of this method does not allow for knowledge updates.

We have added discussions in the manuscript after introducing the 'acceptance-rejection' feedback and in the Appendix.

最终决定Accept (spotlight)

2024-09-25

This paper addresses a very interesting problem, has solid empirical and theoretical results, and was very well-received by the reviewers. I also commend the authors on how they engaged with the reviewers. Excellent work! But, as always, please take care to incorporate feedback on doing the final edits for camera-ready.

Principled Bayesian Optimization in Collaboration with Human Experts

摘要

评审与讨论

优点

缺点

问题

局限性

Response to the concern regarding the no-harm guarantee and η\etaη.

Responses to other comments.

优点

缺点

问题

局限性

P1

P2

P3

P4

P5

优点

缺点

问题

局限性

Insufficient theoretical results

Feedback form

Human choice model

Changing human behaviors

No IRB approval

Theoretical convergence acceleration

No-Harm Guarantee

Clarification of our contribution

Citations

On the no-harm guarantee

Summary of Our Discussion

What we have agreed

What we disagree for now

Our responses.

Regarding your concern on "no-harm guarantee".

优点

缺点

问题

局限性

Point-by-Point Responses to Weaknesses

Point-by-Point Responses to Questions

Response to Limitations

Global response and added experiments

优点

缺点

问题

局限性

Citations

On 'Computational Complexity'

Citations

On 'Limited Generalizability'

Other feedback forms.

Slight modification can adapt these forms to our method.

Binary labelling empirically performs best in our experiments.

Response to the concern regarding the no-harm guarantee and $\eta$ .