PaperHub

Rating: 6.5 / 10 · Poster · 4 reviewers
Individual ratings: 6, 7, 6, 7 (min 6, max 7, std 0.5)
Confidence: 3.3 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 3.0

NeurIPS 2024

Bandits with Preference Feedback: A Stackelberg Game Perspective

Submitted: 2024-05-13 · Updated: 2025-01-13


Keywords: Preference Feedback, Duelling Bandits, Reproducing Kernel Hilbert Space, Stackelberg Games, Exploration-exploitation dilemma

Reviews and Discussion

Review (Rating: 6)

This paper considers bandits with preference feedback. It first constructs a novel confidence set that covers the ground truth with high probability. Then, from a Stackelberg game perspective, it proposes an efficient algorithm that enjoys a tighter regret bound than the state of the art.

Strengths

  1. The technique used to construct the confidence set is interesting. The resulting confidence set is tighter.
  2. The Stackelberg game perspective is interesting, and allows the author(s) to design the algorithm with better exploration-exploitation trade-off as demonstrated in the experiment.

Weaknesses

The major concern is the practical applicability of the algorithm. It seems that the proposed algorithm can hardly scale to higher dimensions (e.g., dimension equal to 7). In the proposed algorithm, a complicated sequential optimization problem needs to be solved. Notably, the experiments only consider a two-dimensional problem, in sharp contrast to recent works (e.g., the 12-dimensional problem considered in Xu et al. [2024]).

Questions

  1. In the abstract, the paper claims that the regret bound is 'rate-optimal'. Is there a matching lower bound?
  2. Can the results be generalized to other link functions?
  3. In line 168, should it be $\sup_{|a|\leq B}$ instead of $\sup_{a\leq B}$?
  4. Could the authors explain the linear growth of multiSBM in Fig. 1 (b)?

Limitations

Some major comments:

  1. The paper discusses the comparisons to [Xu et al. 2024]. It seems that the theoretical gain of $T^{1/4}$ mainly comes from the tighter confidence set rather than the design of the algorithm. If the algorithm POP-BO in [Xu et al. 2024] were equipped with this tighter confidence set, it could also achieve the same regret bound. To see which algorithm is empirically better, it would be interesting to compare the proposed algorithm with POP-BO equipped with the tighter confidence set from this paper.

Besides the points mentioned above, here are some minor comments:

  1. In line 109, I guess it should be $\{0, 1\}$ instead of $[0, 1]$.
  2. In line 178 to 179, typo in "ridge estimator estimator".
  3. In line 574, abuse of the notation $s$.
Author Response

Thank you for your time and the points you raised. We have addressed them in the paper and believe this will help the paper reach a broader community. Our responses to your questions follow.

The major concern is the practical applicability of the algorithm. It seems that the proposed algorithm can hardly scale to higher dimensions (e.g., dimension equal to 7). In the proposed algorithm, a complicated sequential optimization problem needs to be solved. Notably, the experiments only consider a two-dimensional problem, in sharp contrast to recent works (e.g., the 12-dimensional problem considered in Xu et al. [2024]).

We acknowledge that scalability is a crucial point for practicability. Bilevel optimization problems are known to be hard; however, due to their common large-scale applications (e.g., data re-weighting for tuning LLMs [4]), efficient approximation algorithms exist. This is mentioned in Line 275, and given the technical tools for vectorizing, parallelizing, and utilizing multiple machines, we do not consider this to be a major concern for applicability.

To address your concern, we are adding an experiment on a 32-dimensional problem using restaurant reviews from the Yelp open dataset. We believe this further demonstrates the scalability of our approach. For further discussion, please see our general rebuttal and the uploaded PDF, which contains the cumulative regret plot [Fig. A-b] as a proof-of-concept. We highlight that we have not tuned or modified the algorithms for this experiment; they are used off-the-shelf. The regret curve is best viewed as a proof-of-concept rather than an algorithmic benchmark.

In the abstract, the paper claims that the regret bound is 'rate-optimal'. Is there a matching lower bound?

Thank you for raising this question. We realize that this claim is not accurate enough, and we have now updated the abstract.

There are no lower bounds, for dueling bandits or bandits with direct feedback, under the RKHS assumption. This has been an open problem [1] and is only resolved under a GP assumption [2] or for time-continuous bandits [3]. These papers hint that the existing upper bounds are not improvable.

Our regret rates in $T$ match the tightest existing upper bounds for cumulative regret in kernelized bandits, that is, $\tilde{\mathcal{O}}(\gamma_T \sqrt{T})$, up to the choice of kernel. We updated the abstract and specifically removed the word “rate-optimal” to avoid confusion. We have also included this remark about the rates, updating the paragraph on Lines 309-313.

Can the results be generalized to other link functions?

Our theoretical analysis does in fact hold under a broader set of link functions $s: \mathbb{R} \to [0, 1]$, as long as $s$ is

  • monotonically increasing and symmetric around $0$, i.e., $s(x) = 1 - s(-x)$ and $s(0) = 0.5$,
  • differentiable, such that $\kappa = \sup_{a \leq B} 1/\dot{s}(a)$ exists.

We have kept the problem setting for the sigmoid function, as it has a simple probabilistic interpretation (the feedback $y$ becomes a Bernoulli random variable) and connects to the common Bradley-Terry model. However, we have added a remark to Section 3 pointing out this extension.

In line 168, should it be $\sup_{|a| \leq B}$ instead of $\sup_{a \leq B}$?

The sigmoid function is symmetric, therefore the two suprema are equivalent.

Could the authors explain the linear growth of multiSBM in Fig. 1 (b)?

Thank you for raising this point. The linear regret of MultiSBM was the result of a minor error in our implementation. We corrected it and updated our results. Fig. A and Table B in the uploaded PDF include the corrected performance of MultiSBM. Due to the single-page limit of the PDF, we included only the Ackley function's cumulative regret plot, but we have updated all figures in the paper. MultiSBM is now a competitor to MaxMinLCB for smooth functions (e.g., Branin and Matyas); however, for the more challenging problems (e.g., Ackley and Eggholder) our algorithm maintains its performance advantage. MaxMinLCB remains the top-performing algorithm across the problem instances.

The paper discusses the comparisons to [Xu et al. 2024]. It seems that the theoretical gain of $T^{1/4}$ mainly comes from the tighter confidence set rather than the design of the algorithm. If the algorithm POP-BO in [Xu et al. 2024] were equipped with this tighter confidence set, it could also achieve the same regret bound. To see which algorithm is empirically better, it would be interesting to compare the proposed algorithm with POP-BO equipped with the tighter confidence set from this paper.

The POP-BO algorithm is built on the same logic as the MultiSBM algorithm that we included in our experimental results. We have updated the paper to specify this connection. MultiSBM (POP-BO) performs competitively on smooth and simpler reward functions; however, our algorithm maintains its performance advantage on the more challenging functions and the Yelp experiment. We kindly refer the reviewer to the general rebuttal for a further discussion of the similarities between MultiSBM, POP-BO, and MaxMinLCB.


References:

[1] Vakili, Sattar, Jonathan Scarlett, and Tara Javidi. "Open problem: Tight online confidence intervals for RKHS elements." Conference on Learning Theory. PMLR, 2021.

[2] Scarlett, Jonathan, Ilija Bogunovic, and Volkan Cevher. "Lower bounds on regret for noisy gaussian process bandit optimization." Conference on Learning Theory. PMLR, 2017.

[3] Lattimore, Tor. "A lower bound for linear and kernel regression with adaptive covariates." The Thirty Sixth Annual Conference on Learning Theory. PMLR, 2023.

[4] Pan, Rui, et al. "ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting." arXiv preprint arXiv:2406.19976 (2024).

Comment

I would like to thank the authors for the detailed responses. Most of my concerns have been addressed, and I would like to maintain my rating.

Review (Rating: 7)

This paper considers a novel game-theoretic acquisition function for pairwise action selection with preference feedback. It is tailored to the setting with infinite domains and nonlinear kernelized rewards. The preference-based confidence sequences for kernelized utility functions are shown to be tight and anytime valid. The proposed algorithm MAXMINLCB is shown to achieve sublinear regret. Various simulations were conducted to showcase the advantage of the proposed method.

Strengths

  1. Although preference-based bandit optimization with linear utility functions is fairly well understood, such approaches cannot capture real-world problems with complex nonlinear utility functions. This paper aims to close this gap. The considered problem is timely and interesting.

  2. The technical contribution is non-trivial. Although there have been attempts to prove convergence of kernelized algorithms for preference-based bandits, such works employ a regression likelihood model which requires them to assume that both the utility and the probability of preference lie in an RKHS. Moreover, a sample-efficient algorithm is lacking for such approaches. In contrast, this work uses a kernelized logistic negative log-likelihood loss to infer the utility function, and provides confidence sets for its minimizer.

  3. Some theoretical results, like the Kernelized Logistic Confidence Sequences in Theorem 2, are also of independent interest.

  4. Despite being a theoretical paper, it is well written and easy to follow.

Weaknesses

  1. In practice, how does one determine the hyper-parameters like $\gamma_t$, $L$, and $B$ in (5)? Is there any data-driven way to select them?

  2. In the main Theorem 6, the regret bound is $\gamma_T^{D}\sqrt{T}$. The term $\gamma_T^{D}$ is the $T$-step information gain of the kernel, which is also a function of $T$. The authors claim that this rate improves that of Xu et al. (2024) by a factor of $T^{1/4}$. However, the cumulative regret bound in Theorem 5.2 of Xu et al. (2024) is of a similar order. Xu et al. (2024) also provided explicit regret upper bounds for various common kernels in Theorem 5.5. Hence, it would also be interesting to provide an explicit form of $\gamma_T^{D}$ for some common kernels, and to compare these regret upper bounds in a fair way.

  3. In Figure 1, the authors presented the result for the Ackley function, which shows a clear advantage of the proposed method. However, in more extensive simulations in the appendix (e.g., the Matyas function in Figure 6), the proposed method is outperformed by the competitors. It would be helpful to provide some discussion of these results and offer some insights on when the proposed method would work well. This would be helpful for practitioners.

Questions

See Weaknesses.

Limitations

N/A

Author Response

Thank you for your time and comments. We are glad that you find our contributions to be timely and strong. We have reflected your suggestions in the paper, making it clearer for future readers.

In practice, how does one determine the hyper-parameters like $\gamma_t$, $L$, and $B$ in (5)? Is there any data-driven way to select them?

There are two common ways:

  • If there's enough budget for data acquisition, these can be selected via cross-validation on a small sampled pool.
  • Another approach is to create a second bandit instance on top, to optimize for these parameters [1, 2].

In practice, we directly tune $\beta_t$, rather than separately tuning its elements (e.g., no need to tune $\gamma_t$). In our experiments, we just set $\beta_t = 1.0$ without tuning it.

In the main Theorem 6, the regret bound is $\gamma_T\sqrt{T}$. The term $\gamma_T$ is the $T$-step information gain of the kernel, which is also a function of $T$. The authors claim that this rate improves that of Xu et al. (2024) by a factor of $T^{1/4}$. However, the cumulative regret bound in Theorem 5.2 of Xu et al. (2024) is of a similar order. Xu et al. (2024) also provided explicit regret upper bounds for various common kernels in Theorem 5.5. Hence, it would also be interesting to provide an explicit form of $\gamma_T^{(D)}$ for some common kernels, and to compare these regret upper bounds in a fair way.

On the regret rate. Please note that the definition of $\beta_T$ in [Xu et al.] is looser by a factor of $T^{1/4}$ than ours. In their Theorem 5.2, the regret bound is $\sqrt{\beta_T}\sqrt{T} = T^{3/4}\gamma_T$. This looseness by a factor of $T^{1/4}$ can also be seen in Theorem 5.5, where the regret of a linear bandit is reported to be $O(T^{3/4})$.

Regarding the dueling information gain. Our regret bounds are indeed in terms of $\gamma_T^{(D)}$, and Xu et al. present their result for $\gamma_T$. We have now updated the text, refraining from directly comparing the tightness of the rates, so that the comparison is more rigorous. However, we expect $\gamma_T^{(D)}$ and $\gamma_T$ to exhibit identical rates in $T$, which is the case for linear kernels. We are working on extending this result to more complex kernels. Our intuition is that summing two kernels (to create the dueling kernel) does not change the kernel complexity. Thank you for this suggestion!
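For reference, the standard maximum-information-gain rates from the kernelized bandit literature (well-known results, not new claims from our paper or this rebuttal) indicate what an explicit bound would look like if $\gamma_T^{(D)}$ indeed matches $\gamma_T$:

  • linear kernel: $\gamma_T = \mathcal{O}(d \log T)$,
  • squared-exponential kernel: $\gamma_T = \mathcal{O}((\log T)^{d+1})$,
  • Matérn kernel with smoothness $\nu$: $\gamma_T = \tilde{\mathcal{O}}(T^{d/(2\nu + d)})$.

Substituting these into $\tilde{\mathcal{O}}(\gamma_T^{(D)} \sqrt{T})$ would, for example, give $\tilde{\mathcal{O}}(d\sqrt{T})$ in the linear case.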

In Figure 1, the authors presented the result for the Ackley function, which shows a clear advantage of the proposed method. However, in more extensive simulations in the appendix (e.g., the Matyas function in Figure 6), the proposed method is outperformed by the competitors. It would be helpful to provide some discussion of these results and offer some insights on when the proposed method would work well. This would be helpful for practitioners.

We appreciate the observations and agree that a discussion on these results is beneficial for practitioners. Overall, we see MaxMinLCB as a robust choice demonstrated by its consistent performance across the wide range of optimization problems considered in our experimental results.

Branin, Matyas, and Rosenbrock are relatively smooth functions that present easier optimization problems. We observe less differentiated performance between the algorithms. For example, MultiSBM excels and outperforms MaxMinLCB for Branin and Matyas.

In contrast, functions such as Ackley, Eggholder, Hoelder, and Michalewicz present more challenging optimization problems characterized by larger gradients, valleys, ridges, and numerous local optima. In these cases, MaxMinLCB demonstrates a clear advantage.

Figure C in the uploaded PDF visualizes the Ackley, Matyas, and Branin functions to highlight these differences.

We will include a detailed discussion in the revised manuscript to provide practitioners with insights into when the proposed method is most effective and how it compares to other algorithms under various conditions.

References

[1] Berkenkamp, Felix, Angela P. Schoellig, and Andreas Krause. "No-regret Bayesian optimization with unknown hyperparameters." Journal of Machine Learning Research 20.50 (2019): 1-24.

[2] Sessa, Pier Giuseppe, et al. "Multitask learning with no regret: from improved confidence bounds to active learning." Advances in Neural Information Processing Systems 36 (2023): 6770-6781.

Comment

The authors' clarification is helpful. I do not have any additional questions and would like to maintain my rating.

Review (Rating: 6)

The paper examines the problem of bandit optimization with preference feedback in large domains and nonlinear (kernelized) rewards. It introduces MAXMINLCB, which adopts a game-theoretic approach to action selection under comparative feedback. Additionally, it proposes kernelized preference-based confidence sets, which can be utilized in related problems.

Strengths

(1) Rather than selecting the arms separately in dueling bandits, the proposed method jointly optimizes both actions by choosing them as the equilibrium of a two-player zero-sum Stackelberg game. This approach enables a more efficient exploration/exploitation trade-off.

(2) The regret guarantee presented in this paper is tighter by a factor of $T^{1/4}$ compared to Xu et al. (2024).

Weaknesses

(1) Although the paper uses a kernelized logistic model to approximate the rewards, this approach may remain too simplistic for capturing the complexity of rewards in real-world applications.

(2) The paper lacks a comparison in the experiments with the related work by Xu et al. (2024).

(3) In real applications, it is more common to rank between two state-action pairs. However, the paper does not consider contextual information and solely focuses on the multi-armed setting, which is less interesting and useful.

Questions

(1) Can you implement the algorithms compared in Section 6.2 using the original confidence sets from the references?

(2) Can this work be extended to contextual bandits with preference feedback?

Limitations

Please see weaknesses.

Author Response

Thank you for your time and the points you raised. We have addressed them in the paper and believe this will help the paper reach a broader community. In light of these updates, we would appreciate it if you could reconsider your assessment of the scope and relevance of the paper.

Although the paper uses a kernelized logistic model to approximate the rewards, this approach may remain too simplistic for capturing the complexity of rewards in real-world applications.

We would first like to highlight that the MaxMinLCB acquisition function is not tied to the kernelized model in this paper and may be used together with any calibrated model, e.g. a Bayesian Neural Network with valid uncertainty estimates.

We believe that a kernelized approach for adaptive learning is still relevant in real-world problems, e.g. if vector embeddings of the data are available. For complex data modalities, finding an accurate feature map is beyond the computational budget of many practitioners, and we believe there is value in developing computationally light methodology which can be set up on top of available vector embeddings. It is in fact common to use kernelized and linear models on top of LLM embeddings in Active Preference Optimization [3, 4].

We are conducting a new experiment to demonstrate this claim, given vector embeddings of Yelp reviews for restaurants [Fig. A, Tab. B - uploaded PDF]. We can find the preferred choices based on the comparative feedback of a user and observe that the algorithm still works on a text-based domain, even though it is designed for Euclidean spaces. Please refer to the general response for details of this experiment. Note that none of the algorithms have been modified or tuned for this new domain; therefore, the regret curves act as a proof-of-concept rather than a benchmark.

While the key contributions of this paper are intended to be primarily theoretical, we aim to add this experiment to demonstrate the practical applicability of our problem setting. We hope that this experiment can make the paper more accessible to a broader community.

Can this work be extended to contextual bandits with preference feedback? The paper does not consider contextual information and solely focuses on the multi-armed setting, which is less interesting and useful.

Thank you for the interesting question! We have been thinking about contextual extensions as future work. Here’s our take:

  1. If the context is stochastic and given by the environment, our approach almost immediately applies.

    Contextual bandits and linear/kernelized bandits are equivalent up to the choice of the feature map/kernel. This is discussed extensively in Sections 19.1 and 19.2 of the bandit book [1] for the linear setting, and analyzed in [2] for the kernelized setting. We updated Section 3 and added this remark.

  2. If the context can be actively chosen by the agent (e.g. choosing a prompt from a fine-tuning dataset), we require a non-trivial extension of our algorithm.

    This is an interesting scenario, in which the agent is optimizing both for the context and action pairs. We are in fact working on a follow-up project, which uses our novel MaxMinLCB acquisition function to solve this problem, with an eye to active preference optimization [3, 4]. Here, the aim is to give a worst-case sample complexity bound. We now mention this in the Conclusion section.

The paper lacks a comparison in the experiments with the related work by Xu et al. (2024).

We would like to highlight that the POP-BO algorithm proposed by Xu et al. (2024) uses the same action selection logic as the MultiSBM algorithm, which is already included in our benchmarks. We have updated section 6, specifying this connection. Our experimental results indicate that MultiSBM performs comparably or even slightly better than MaxMinLCB on smooth functions. However, MaxMinLCB outperforms MultiSBM on more challenging functions such as Ackley and Eggholder. Figure A and Table B in the uploaded PDF compare performance on smooth and wilder functions.

Can you implement the algorithms compared in Section 6.2 using the original confidence sets from the references?

Most baselines are designed for linear settings, which is misspecified for the complex test functions that we consider [c.f. Fig. C in the rebuttal PDF]. This would result in linear regret curves [c.f. Fig. 1a of the paper]. Therefore, we implemented the baselines using our improved confidence sets to make the regret curves more informative. This allows the reader to compare the acquisition functions more fairly.

The IDS algorithm and POP-BO have kernelized confidence sets but are (theoretically) expected to be looser [c.f. Fig. 1a of the paper]. We will update the paper with a comparison to these algorithms (using their original confidence sets). Thank you for this suggestion!


References:

[1] Lattimore, Tor, and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

[2] Krause, Andreas, and Cheng Ong. "Contextual gaussian process bandit optimization." Advances in neural information processing systems 24 (2011).

[3] Das, Nirjhar, et al. "Active Preference Optimization for Sample Efficient RLHF." ICML 2024 Workshop on Theoretical Foundations of Foundation Models.

[4] Mehta, V., Das, V., Neopane, O., Dai, Y., Bogunovic, I., Schneider, J., and Neiswanger, W. "Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration." (2023).

Comment

Thanks for the rebuttal and most of my questions have been addressed. I have updated the score accordingly.

Review (Rating: 7)

This paper considers bandit optimization with preference feedback over continuous action spaces and kernelized reward function. The goal in this problem is to minimize the dueling regret against an optimal action over a finite time-horizon. Previous works on this problem are either restricted to finite action spaces or linear reward functions. The proposed algorithm casts the problem as kernalized logistic regression and designs confidence sets for the relative preference between two actions. It then proposes an action selection strategy based on a game theoretic Leader-Follower formulation that utilizes these confidence intervals. The paper provides a regret bound as well as empirical evaluation for the proposed algorithm.

Strengths

The main contributions of the paper are two-fold:

  1. Expanding the existing literature on dueling bandits by studying kernelized reward functions under infinite and continuous action sets. This requires new techniques to bound the confidence intervals.

  2. Proposing a principled game-theoretic approach to action selection in dueling bandits that can be of further interest.

In my opinion these are two important contributions to the literature. Since these ideas are likely to be relevant to other learning problems with preference feedback such as RLHF, I think that the results in this paper have a good scope. The paper is well-written and the contributions are clear.

Weaknesses

Experimental evaluation can include other algorithms that are known to perform better than RUCB such as RMED (Komiyama et al., 2015) and Double Thompson Sampling (Wu and Liu, 2016).

Questions

It would be interesting to see if the current approach can be extended to other link functions beyond the sigmoid such as probit.

Limitations

The paper could include a section on the limitations of the current work in terms of societal impact.

Author Response

Thank you for the valuable review and constructive suggestions. Please find below our comments on the raised questions.

Experimental evaluation can include other algorithms that are known to perform better than RUCB such as RMED (Komiyama et al., 2015) and Double Thompson Sampling (Wu and Liu, 2016).

We appreciate the recommendation and acknowledge that RMED and Double Thompson Sampling (D-TS) perform better than some of our benchmark algorithms on the finite arm dueling bandit problem. However, our paper focuses on the continuous domain, making these algorithms less comparable for our study. We have added the references to the related works section.

In our experimental setup, we intentionally separated the confidence set estimation and action selection to evaluate these problems independently. Although many of the action selection algorithms we used are designed for the finite-arm problem, they can be adapted to the continuous domain with minor modifications, as detailed in Appendices D and E. In contrast, for RMED and D-TS we have not found straightforward extensions to the continuous domain. For instance, D-TS depends on counting the number of times arm $i$ beats arm $j$, which is challenging to interpret in a continuous setting. Nonetheless, we are keen to hear if you have a suggestion on how to incorporate these algorithms into our comparisons.

It would be interesting to see if the current approach can be extended to other link functions beyond the sigmoid such as probit.

Thank you for this suggestion! Our theoretical analysis does in fact hold under a broader set of link functions $s: \mathbb{R} \to [0, 1]$, as long as $s$ is

  • monotonically increasing and symmetric around $0$, i.e., $s(x) = 1 - s(-x)$ and $s(0) = 0.5$,
  • differentiable, such that $\kappa = \sup_{a \leq B} 1/\dot{s}(a)$ exists.

We have kept the problem setting for the sigmoid function, as it has a simple probabilistic interpretation (the feedback $y$ becomes a Bernoulli random variable) and connects to the common Bradley-Terry model. However, we have added a remark to Section 3 pointing out this extension.
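For concreteness, a minimal numerical sanity check (an illustrative sketch, not code from the paper; the bound $B = 5$ is a placeholder value) that both the sigmoid and the probit link $s(a) = \Phi(a)$ satisfy the symmetry condition and admit a finite $\kappa$ on $[-B, B]$:

```python
import numpy as np
from scipy.stats import norm

B = 5.0  # placeholder bound on utility differences, not a value from the paper
a = np.linspace(-B, B, 10_001)

links = {
    # name: (link s(a), derivative s'(a))
    "sigmoid": (lambda x: 1.0 / (1.0 + np.exp(-x)),
                lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2),
    "probit":  (norm.cdf, norm.pdf),  # s(a) = Phi(a), s'(a) = phi(a)
}

for name, (s, s_dot) in links.items():
    symmetric = np.allclose(s(a), 1.0 - s(-a)) and np.isclose(s(0.0), 0.5)
    kappa = float((1.0 / s_dot(a)).max())  # sup of 1/s'(a) over [-B, B]
    print(f"{name}: symmetric={symmetric}, kappa={kappa:.2f}")
```

Both links pass the check, so the analysis would carry over to the probit case; only the Bernoulli/Bradley-Terry interpretation is specific to the sigmoid.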

Comment

Thank you for your response to my questions! I would like to keep my original score.

Author Response (General Rebuttal)

We thank all reviewers and chairs for their work reviewing our paper.

We are delighted to receive high-quality constructive feedback highlighting that our contributions are “clear” and our “ideas are likely to be relevant to other learning problems with preference feedback such as RLHF”. We are certain that these inputs are going to improve our work and enhance the paper’s contribution to the community.

In the following, we address questions that appear in more than one review and that could be valuable in a broader context.

Connection between MaxMinLCB, MultiSBM, and POP-BO

We would like to highlight that both POP-BO [Xu et al., 2024] and MultiSBM [Ailon et al., 2014] build on the same ideas. These algorithms select one action $\mathbf{x}_t$ in each time step by maximizing an optimistic estimate of the winning probability against the action chosen in the previous time step, $\mathbf{x}_{t-1}$. The algorithm then compares the actions $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$.

Even though this connection is not mentioned in [Xu et al. 2024], the two algorithms are equivalent if used with the same confidence sets. Consequently, the results we present for MultiSBM in Section 6 carry over to POP-BO.

Additionally, we would like to note that the MaxMinLCB acquisition function can be viewed as a generalization of the above-mentioned strategies. One could formulate both POP-BO and MultiSBM as follows:

$\max_{\mathbf{x} \in \{\mathbf{x}'_{t-1}\}} LCB_t(\mathbf{x}, \mathbf{x}'(\mathbf{x})) \quad \text{s.t.} \quad \mathbf{x}'(\mathbf{x}) = \arg\min_{\mathbf{x}' \in \mathcal{M}_t} LCB_t(\mathbf{x}, \mathbf{x}')$.

The main difference between this optimization problem and MaxMinLCB as defined in Eq. (7) is that the domain for $\mathbf{x}_t$ is restricted to the singleton $\{\mathbf{x}'_{t-1}\}$ instead of $\mathcal{M}_t$; therefore, MaxMinLCB can be considered a more general acquisition function than the other two.
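To make the comparison concrete, here is a minimal sketch over a finite candidate set (an illustration, not code from the paper; `lcb` and the toy values below are placeholders standing in for the lower confidence bound $LCB_t(\mathbf{x}, \mathbf{x}')$):

```python
def maxminlcb(candidates, lcb):
    # MaxMinLCB (Eq. 7): the leader x ranges over the whole candidate set,
    # the follower x' best-responds by minimizing LCB(x, x').
    def follower(x):
        return min(candidates, key=lambda xp: lcb(x, xp))
    x = max(candidates, key=lambda x: lcb(x, follower(x)))
    return x, follower(x)

def multisbm_popbo(candidates, lcb, x_prev):
    # MultiSBM / POP-BO as the restricted instance above: the leader is fixed
    # to the previously chosen action, only the follower is optimized.
    return x_prev, min(candidates, key=lambda xp: lcb(x_prev, xp))

# Toy usage with a hypothetical LCB on scalar actions (illustration only):
cand = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_lcb = lambda x, xp: 0.5 + 0.4 * (x - xp) - 0.05
print(maxminlcb(cand, toy_lcb))             # e.g. (1.0, 1.0)
print(multisbm_popbo(cand, toy_lcb, 0.25))  # e.g. (0.25, 1.0)
```

In this form, the only difference is the leader's feasible set, which matches the claim that MaxMinLCB generalizes the other two acquisition rules.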

Further experiments on applicability

To further demonstrate the scalability and applicability of our approach, we included preliminary results of an ongoing experiment in the uploaded PDF [Fig A-b, Tab B - Uploaded PDF] that provide evidence for scaling our approach to complex real-world problems.

In this setting, we use the Yelp open dataset of restaurant reviews to learn the preferences of users and recommend restaurants. We subset the data for restaurants in Philadelphia, USA with at least 500 reviews and users who reviewed at least 90 restaurants. The final dataset includes 275 restaurants, 20 users, and a total of 2563 reviews.

We consider the action space of vector embeddings of the reviews for each restaurant, for which we use the text-embedding-3-large OpenAI embedding model with a dimensionality of 32. We consider each user separately and estimate their utilities for restaurants not yet reviewed using collaborative filtering.
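A minimal preprocessing sketch of the subsetting step described above (an illustration, not the code used for the experiment; the file names and the columns `city`, `categories`, `business_id`, and `user_id` are assumptions about the Yelp open dataset schema):

```python
import pandas as pd

# Assumed Yelp open dataset files (JSON lines).
business = pd.read_json("yelp_academic_dataset_business.json", lines=True)
reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)

# Restaurants in Philadelphia with at least 500 reviews.
philly = business[(business["city"] == "Philadelphia")
                  & business["categories"].str.contains("Restaurants", na=False)]
reviews = reviews[reviews["business_id"].isin(philly["business_id"])]
counts = reviews["business_id"].value_counts()
reviews = reviews[reviews["business_id"].isin(counts[counts >= 500].index)]

# Users who reviewed at least 90 of the remaining restaurants.
per_user = reviews.groupby("user_id")["business_id"].nunique()
reviews = reviews[reviews["user_id"].isin(per_user[per_user >= 90].index)]

print(reviews["business_id"].nunique(), "restaurants,",
      reviews["user_id"].nunique(), "users,", len(reviews), "reviews")

# The 32-dimensional action space would then come from embedding each
# restaurant's review text (e.g. with OpenAI's text-embedding-3-large model).
```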

Please note that none of the algorithms were tuned or modified for this experiment. The intention in releasing preliminary results is to demonstrate that 1) the computations scale easily, and 2) the kernelized approach is still applicable to a text-based domain. The regret results on the Yelp data should be viewed as a proof-of-concept rather than a benchmark.

In the uploaded PDF, Figure A-b (right) shows that the results of this larger problem align with previous conclusions. MaxMinLCB remains the best-performing algorithm with MultiSBM following second. We included the cumulative regret in Table B (final row) as well.

Correction

Since the initial submission, we identified a minor error in the code for generating Table 1 in our paper. The conclusions drawn from these results remain unchanged. The correction slightly affected numerical values in the table but not qualitative relationships. We refer reviewers to Table B in the uploaded PDF for the corrected version of Table 1.

Final Decision

The paper studies multi-armed bandits with preference feedback under a continuous action domain and a kernelized reward function. All reviewers acknowledge the novelty and technical contributions of the paper and recommend acceptance. Congratulations to the authors; please revise the paper to address the issues raised by the reviewers, such as enriching the experimental evaluations.