PaperHub
Overall: 6.1/10
Poster · 4 reviewers
Ratings: 4, 4, 2, 3 (lowest 2, highest 4, standard deviation 0.8)
ICML 2025

Fusing Reward and Dueling Feedback in Stochastic Bandits

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

Effective algorithms fusing reward and dueling feedback in stochastic bandits.

Abstract

Keywords
Relative Feedback · Stochastic Bandits · Dueling Bandits

Reviews and Discussion

Official Review
Rating: 4

This paper investigates the fusion of numerical and preference feedback in stochastic bandits, where both feedback types are gathered at each timestep. The authors derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller of the reward-based and dueling-based regret for each arm. The authors propose two fusion algorithms: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies the collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback type to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion incurs a suboptimal multiplicative factor of the number of arms in its regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments validate the efficacy of the provided algorithms and theoretical results.

Questions for Authors

Please see the weaknesses above.

Claims and Evidence

The claims made in this paper are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense.

Theoretical Claims

The theoretical results look reasonable, but I didn’t go through every proof.

Experimental Design and Analysis

The experiments look reasonable.

Supplementary Material

I didn’t read the supplementary material.

Relation to Prior Literature

This paper is relevant to the literature.

Missing Essential References

None

Other Strengths and Weaknesses

Strengths:

  1. This paper is very well-written and the considered problem is well-motivated from real-world decision-making scenarios where numerical feedback and preference feedback are simultaneously collected, e.g., recommendation systems.
  2. The algorithm design and theoretical results are complete and well executed. The authors design two fusion algorithms, one of which achieves an optimal regret. Rigorous regret upper and lower bounds are provided. The presented table and figure for result summary (Table 1 and Figure 1) are clear.
  3. Experimental results are presented to validate the empirical effectiveness of the proposed algorithms and the provided theoretical findings.

Weaknesses:

  1. (Not a weakness, just a comment) In the formulation, this paper only considers that the preference probability of choosing arm a over arm b is greater than 0.5 if arm a has a higher expected reward than arm b. Besides this ordering relation, this paper does not assume or utilize any other relation between the preference probability and expected reward.

What if the preference probability is generated according to the expected reward, like the Bradley-Terry model used in the RLHF literature? Will this additional assumption allow a better regret result? More discussion is needed here.

  2. What is the motivation of the regret definition (the equation above Section 2.1), which is a linear combination between the regret in classic bandits and the regret in dueling bandits? It seems that this is not very natural. Isn’t it more natural to define the regret as the loss in the expected reward due to not always selecting the optimal arm, which may need an additional assumption that the preference probability is generated according to the expected reward, like the Bradley-Terry model?

Other Comments or Suggestions

Please see the weaknesses above.

Author Response

W1. [...] other relation between the preference probability and expected reward [...] like the Bradley-Terry model

A1: We thank the reviewer for raising this intriguing question. A relation between the preference probability and expected reward, like the Bradley-Terry model $\nu_{1,2} = \frac{\exp(\mu_1)}{\exp(\mu_1) + \exp(\mu_2)} = \frac{\exp(\mu_1-\mu_2)}{\exp(\mu_1-\mu_2) + 1}$ or utility dueling bandits $\nu_{1,2} = \frac{1 + (\mu_1 - \mu_2)}{2}$ (Ailon et al., 2014), is indeed stronger than the one in our paper and will lead to a better regret result.

Specifically, with this parametric relation, one only needs to focus on one set of parameters: the reward means or the dueling probabilities. Let us take the reward means $(\mu_1, \mu_2, \dots, \mu_K)$ as the basic parameters to estimate; then all observations from dueling feedback can be translated into reward feedback via the parametric relation. Under this point of view, the reward feedback provides direct observations of the reward mean parameters, and the dueling (preference) feedback between arms $k$ and $\ell$ can be considered as a "parametric reward feedback" depending on the reward mean parameters. For example, with the Bradley-Terry relation, the parametric form is a logistic function, which is similar to logistic bandits (Faury et al., 2020), while with the utility dueling relation, the parametric form is a linear function, which is similar to linear bandits (Abbasi-Yadkori et al., 2011).
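To make this translation concrete, below is a minimal Python sketch (our own illustration, not the paper's algorithm; the function names and structure are assumptions) of how a dueling observation between two arms carries information about the reward-mean gap under the two parametric relations mentioned above.

```python
import numpy as np

def preference_prob(mu_k: float, mu_l: float, model: str = "bradley-terry") -> float:
    """Preference probability nu_{k,l} implied by the reward means under a parametric relation."""
    if model == "bradley-terry":
        # Logistic link: nu_{k,l} = exp(mu_k - mu_l) / (exp(mu_k - mu_l) + 1)
        return 1.0 / (1.0 + np.exp(-(mu_k - mu_l)))
    if model == "utility":
        # Linear link (Ailon et al., 2014); valid when |mu_k - mu_l| <= 1
        return 0.5 * (1.0 + (mu_k - mu_l))
    raise ValueError(f"unknown model: {model}")

def sample_duel(mu_k: float, mu_l: float, rng: np.random.Generator, model: str = "bradley-terry") -> int:
    """One dueling observation: 1 if arm k beats arm l, else 0.
    With a known parametric relation, this Bernoulli outcome is informative about
    the gap mu_k - mu_l, so it can be folded into reward-mean estimation
    (logistic-regression-style for Bradley-Terry, linear for the utility model)."""
    return int(rng.random() < preference_prob(mu_k, mu_l, model))

rng = np.random.default_rng(0)
wins = sum(sample_duel(0.8, 0.5, rng) for _ in range(1000))
print(f"empirical win rate of arm 1 over arm 2: {wins / 1000:.3f}")  # close to 1/(1+e^(-0.3)) ~ 0.574
```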

With this interpretation, the authors believe that the final regret bound would depend on the reward gaps $\Delta_k^{(R)}$ and have no explicit dependence on the dueling gaps $\Delta_k^{(D)}$ (or the other way round if we consider the dueling probabilities as the basic parameters), and the regret improvement may lie in the actual dependence on the reward gaps $\Delta_k^{(R)}$ (e.g., the prefactor or the exponent of this gap would be strictly smaller than the best one achievable without the dueling feedback option). Rigorously investigating the improvement in regret bounds under these stronger assumptions (out of the scope of the current paper) is an interesting research direction. We will add this discussion to the final version of this paper to clarify the potential improvement in regret bounds under these stronger assumptions.


W2. [...] motivation of the regret definition [...] regret as the loss in the expected reward due to not always selecting the optimal arm

A2: We thank the reviewer again for raising this interesting question.

Motivation: The current linear-combination definition, especially the parameter $\alpha\in[0,1]$, is motivated by the cost difference between querying reward feedback and dueling feedback, as the dueling (relative) feedback is usually more cost-efficient (Ouyang et al., 2022). Furthermore, this flexible definition covers many interesting scenarios: the case $\alpha = 1$ is the regret from reward feedback only, the case $\alpha = 0$ is the regret from dueling feedback only, and the case $\alpha = \frac{1}{2}$ weights the regrets from the two types of feedback equally. We will add this discussion to the final version of this paper to clarify the motivation of the regret definition.
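For readers skimming the thread, one plausible per-round instantiation of this linear-combination regret, consistent with the reviewers' descriptions above (the exact definition, and whether the dueling term is the average or the maximum of the two selected arms' dueling gaps, is given in the paper), is

$$
R_T \;=\; \sum_{t=1}^{T}\Big[\,\alpha\,\Delta^{(R)}_{k_t} \;+\; (1-\alpha)\,\tfrac{1}{2}\big(\Delta^{(D)}_{i_t}+\Delta^{(D)}_{j_t}\big)\Big],
$$

where $k_t$ is the arm pulled for reward feedback and $(i_t, j_t)$ is the pair selected for dueling feedback at round $t$.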

Other regret definitions: Below, we provide three perspectives on changing the definition of regret in our paper.

First, if the reviewer means to only consider the regret from reward feedback ("define the regret as the loss in the expected reward"), and to treat the dueling feedback as free side observations, then this regret reduces to the case of setting $\alpha = 1$ in our regret definition. In this scenario, our `DecoFusion` algorithm achieves constant regret, as discussed in Lines 409--423 (left column).

Second, if the reviewer also means to consider the regret cost due to dueling feedback, but to measure this part of the regret in terms of the expected reward instead of the dueling (preference) probability in the current definition, then the new regret cost in each decision round would be the sum of the reward gaps of the three pulled arms (one for reward, a pair for dueling). For this new regret definition, our algorithm and analysis still work. The only modification is in the regret upper bound results, where one needs to change the dueling gap $\Delta_k^{(D)}$ in the numerator of Eq. (4) in Theorem 3.1 and Eq. (5) in Theorem 4.1 to the reward gap $2\Delta_k^{(R)}$.
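In our notation (a paraphrase of the response above, not a formula from the paper), the per-round cost under this alternative definition would be

$$
c_t \;=\; \Delta^{(R)}_{k_t} \;+\; \Delta^{(R)}_{i_t} \;+\; \Delta^{(R)}_{j_t},
$$

where $k_t$ is the reward arm and $(i_t, j_t)$ the dueling pair; each duel charges the reward gaps of both of its arms, which is consistent with replacing $\Delta_k^{(D)}$ by $2\Delta_k^{(R)}$ in the upper bounds.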

Third, if one further assumes the Bradley-Terry relation between reward means and dueling probabilities on top of this new regret definition, this would lead to a problem similar to logistic bandits (Faury et al., 2020), as discussed in the previous response, which is an interesting research direction.


  • Faury, L., Abeille, M., Calauzènes, C., and Fercoq, O. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052–3060. PMLR, 2020.
Official Review
Rating: 4

This paper looks into the problem of stochastic bandits fusing reward and dueling feedback, where the regret is defined as a linear combination of the standard reward regret and the averaged dueling regret.

Questions for Authors

  • In the reward-explore, dueling-exploit stage, is there a reason to define the dueling action as both arms being $\hat{k}_t^{(R)}$?
  • Lines 6 and 8 of the algorithm do not exclude the reward set from the dueling set as in line 264 (right). Is there an intuition for why this kind of non-approximate design works?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

  • I checked the lower bound proof but did not review the rest, as I am not familiar with the proof techniques for dueling bandits.

  • I have a question regarding the proof of the lower bound. In the construction of the original and alternative instances, the proof seems to define the reward means and dueling probabilities separately. My question is: when the reward mean is defined, aren’t the dueling probabilities automatically determined?

Experimental Design and Analysis

Yes, looks good to me.

Supplementary Material

N/A. There is no supplementary material.

Relation to Prior Literature

This problem is related to general bandit problems beyond standard reward feedback and has applications in real-world systems such as recommendation systems.

Missing Essential References

N/A

Other Strengths and Weaknesses

The paper proposes the fusing problem and presents two algorithms to solve it. The algorithms are well-designed, and the achievable upper bounds match the order of the lower bounds. I did not find any significant weaknesses in this paper, as it is self-contained and inspiring.

Other Comments or Suggestions

  • The regret bound for the No Fusion algorithm seems to differ between Table 1 and line 224 (right).
  • The empirical log-likelihoods seem to be defined in Algorithm 2, not Algorithm 1 or 4.
Author Response

Q1: The regret bound for No Fusion algorithm seems to be different in the Table 1 and line 224 (right)

A1: Indeed, they are different. The regret bound in Table 1 combines two separate (optimal) regret bounds for the reward and dueling bandits, respectively. Therefore, the denominator is $\min\{(\Delta_k^{(R)})^2, (\Delta_k^{(D)})^2\}$, as both are in their best dependence. Nevertheless, the regret bound in line 224 is obtained by combining two elimination algorithms for reward and dueling bandits, respectively (for comparison with our fusion elimination algorithm). As the elimination algorithm in dueling bandits is suboptimal, the denominator becomes $\min\{(\Delta_k^{(R)})^2, (\Delta_k^{(D)})^2 / K\}$.
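Schematically, keeping only the denominators discussed in the response and leaving the arm-dependent numerator abstract (the exact expressions are in Table 1 and at line 224 of the paper), the comparison is

$$
\text{Table 1: } \sum_{k>1}\frac{(\text{cost of arm } k)\,\log T}{\min\{(\Delta_k^{(R)})^2,\,(\Delta_k^{(D)})^2\}}
\qquad \text{vs.} \qquad
\text{line 224: } \sum_{k>1}\frac{(\text{cost of arm } k)\,\log T}{\min\{(\Delta_k^{(R)})^2,\,(\Delta_k^{(D)})^2/K\}},
$$

where the extra $1/K$ inside the second minimum reflects the suboptimality of dueling elimination.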

We thank the reviewer for raising this concern. We will clarify this in the final version of this paper.


Q2: The empirical log-likelihoods seems to be defined in algorithm 2, not algorithm 1 or 4.

A2: Yes, it is the case because the empirical log-likelihoods are only used in Algorithm 2 (also updated in Algorithm 2 at Lines 29-30). Algorithms 1 and 4 do not need the empirical log-likelihoods.


Q3: In the reward-explore, dueling-exploit stage, is there a reason to define the dueling action as both arms being $\hat k_t^{(R)}$?

A3: Yes. As the arm $\hat k_t^{(R)}$ is the current empirical best arm, we choose the empirically best arm pair $(\hat k_t^{(R)}, \hat k_t^{(R)})$ as the dueling action to exploit the empirically good arms (for the sake of minimizing the dueling regret, since we only explore via reward feedback in that case).


Q4: Lines 6 and 8 of the algorithm do not exclude the reward set from the dueling set as in line 264 (right). Is there an intuition for why this kind of non-approximate design works?

A4: We thank the reviewer for catching this delicate but important algorithm design technique.

In Line 264 (right), as the ground-truth decomposition is known, one should exclude the reward set from the dueling set and vice versa. However, as the actual decomposition is unknown a priori in `DecoFusion`, we need to be conservative (i.e., allow both sets $\hat{\mathcal K}_t^{(D)}$ and $\hat{\mathcal K}_t^{(R)}$ to have some overlap instead of taking the exclusion) to ensure enough exploration from both feedback sides for uncertain arms, so as to learn the ground-truth decomposition in the end. We will add this discussion in the final version of this paper to clarify the intuition behind this design.
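As a purely illustrative toy (our own sketch, not the paper's algorithm; the set-construction rules and names below are assumptions), the design choice reads as follows: with a known decomposition the two candidate sets would be made disjoint, whereas with an unknown decomposition, arms whose better feedback type is still uncertain are kept in both sets so they keep receiving exploration from both feedback types.

```python
def candidate_sets(arms, reward_better, uncertain, known_decomposition: bool):
    """Toy illustration of exclusion vs. conservative overlap of candidate sets.

    arms          : all arm indices
    reward_better : arms currently believed to be handled better by reward feedback
    uncertain     : arms whose better feedback type is not yet resolved
    """
    if known_decomposition:
        # Oracle case (line 264, right): disjoint assignment, one set excluded from the other.
        reward_set = set(reward_better)
        dueling_set = set(arms) - reward_set
    else:
        # Conservative case (DecoFusion-style): uncertain arms stay in BOTH sets,
        # ensuring enough exploration from both feedback sides until the
        # ground-truth decomposition is learned.
        reward_set = set(reward_better) | set(uncertain)
        dueling_set = (set(arms) - set(reward_better)) | set(uncertain)
    return reward_set, dueling_set

arms = range(5)
print(candidate_sets(arms, reward_better=[0, 1], uncertain=[1, 2], known_decomposition=True))
print(candidate_sets(arms, reward_better=[0, 1], uncertain=[1, 2], known_decomposition=False))
```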

Reviewer Comment

Thank you for your response. I will hold my rating for now and see if the other reviewers' concerns are addressed.

Official Review
Rating: 2

The paper investigates a K-armed bandit problem in which, in addition to the standard observations of arm rewards (modeled by a Bernoulli distribution), the learner has the option to select two additional arms at each time step. The feedback received includes comparisons between the selected arms, which are also based on a Bernoulli feedback mechanism, with preferences that align with the mean rewards. I.e., the paper "fuses" standard bandits with duelling bandits.

Questions for Authors

"The estimations of reward means $\mu_k$ and dueling probabilities $\nu_{k,l}$ are orthogonal as their observations are independently sampled from distributions with different parameters. This orthogonality makes it difficult to directly combine these two types of feedback in online learning."

The statement appears to be problematic. In what sense are these estimations considered orthogonal? Does "orthogonal" here imply independence? The reward feedback can clearly be translated into relative feedback, which suggests a relationship between the two types of feedback rather than a complete orthogonality. It may be beneficial to clarify this terminology to avoid confusion.

Claims and Evidence

Yes, the results are supported by proofs and appear to be correct to me.

Methods and Evaluation Criteria

The theoretical presentation is consistent with standard practices in the literature. The experiments utilize synthetic data. The paper leans more towards a theoretical approach.

Theoretical Claims

Yes, Theorem 2.3 on the lower bound, as well as Theorems 3.1 and 4.1 on the regret performance of algorithms, make sense and appear to be correct to me. I have reviewed the proofs.

Experimental Design and Analysis

Yes, the proofs appear correct. The experiments are standard and use synthetic data.

Supplementary Material

I checked the proofs of the theorems.

Relation to Prior Literature

The paper mainly derives theoretical regret bounds on fusion of standard and duelling bandits and also presents some basic synthetic experiments.

Missing Essential References

Other Strengths and Weaknesses

The fusion of reward-based and preference-based feedback is an intriguing problem, and I assume many practitioners employ heuristics in their algorithm deployments. This paper adopts a theoretical approach and models the problem mathematically using the multi-armed bandit (MAB) framework, which could potentially be interesting.

However, I have some reservations regarding the scope and contribution of the paper.

First, the practical value of the paper is not clear. The setting and framework are too simplistic and niche, making them inapplicable to motivating examples such as large language models (LLMs), which utilize inherently different algorithms and techniques.

Second, the theoretical results appear to be a combination of existing tools and findings, which makes them somewhat expected. While the lower bound seems complete and logical, the algorithms do not appear to align with this lower bound. The only aspect that seems specific to this paper is the method of fusing these two feedback models and eliminating arms based on the best statistics (relative feedback or reward), which remains an open question. Specifically, the results align with the lower bound only when the best relative arm for eliminating a suboptimal arm is the one with the highest reward. This limitation also constrains the theoretical contribution, which seems to be the primary focus of the paper.

Other Comments or Suggestions

Ethics Review Concerns

Author Response

Q1: the practical value of the paper is not clear. [...] simplistic and niche

A1: Practical value: While the authors agree that the bandit setting of this paper (a framework in favor of theoretical analysis) is indeed simplified, we believe that the proposed algorithms and theoretical results are still valuable for practice, for the following reasons:

The standard LLM training pipeline usually has two separate steps: (1) use the responses from human experts (absolute feedback) to conduct supervised learning, called supervised fine-tuning (SFT), and (2) use human preferences (relative feedback) among multiple responses generated from the SFT model to conduct preference-based training (either direct preference optimization (DPO) or reinforcement learning from human feedback (RLHF)).

Although not exactly the same, these two steps correspond to the reward (absolute) and dueling (relative) feedback in our bandits setting. Our algorithm suggests a training approach to conduct these two steps in parallel, potentially saving the human labeling cost and improving the training efficiency (as our regret upper bounds improve over the existing ones). Investigating this potential approach in LLM training is an interesting direction.

As our ultimate goal is to effectively combine the absolute and relative feedback (e.g., in realistic RLHF alignment for LLM), the proposed algorithms and theoretical results in the bandits framework are meaningful and concrete steps toward this goal. We thank the reviewer for raising this concern. We will discuss this point in the final version of this paper.


Q2: [...] the theoretical results appear to be a combination of existing tools and findings [...] the algorithms do not appear to align with this lower bound.

A2: Novelty: First, we clarify that our paper focuses on fusing both reward and dueling feedback, each drawn from distributions with a different set of parameters, and the information in either type of feedback cannot be translated into the other type. Due to this separation, it is intrinsically difficult to combine them to optimize performance on the same bandit task. To address this difficulty, we propose two novel algorithmic frameworks: elimination fusion and decomposition fusion, neither of which is known in prior literature. The elimination fusion is simple and more straightforward to extend to other applications. In contrast, the decomposition fusion is optimized for the stochastic bandit scenario and achieves near-optimal regret performance.

Open Problem: Secondly, we agree with the reviewer that removing the condition that "the best relative arm for eliminating a suboptimal arm is the one with the highest reward" (i.e., $\ell_k^*=1$) is indeed an interesting direction, as we discussed in the paper (Lines 382--384 (right column) and Lines 396--408 (left column)). However, this required condition is reminiscent of the dueling bandits setting, and it is still a partially open problem there (Komiyama et al., 2015). Once this problem is fully addressed in basic dueling bandits, one can apply our two fusion techniques to the new dueling bandit algorithms to remove this condition in our setting.


Q3: [...] The statement appears to be problematic. Does "orthogonal" here imply independence? [...]

A3: We thank the reviewer for suggesting the improvement of this phrase. Although there is indeed an ordering relation that "the dueling probability of arm $k$ winning over arm $\ell$ is greater than $0.5$ if and only if arm $k$ has a higher expected reward than arm $\ell$," we clarify that the reward feedback cannot be directly translated into dueling feedback in the general case, unless some stronger relations are assumed between the reward means and dueling probabilities (e.g., the Bradley-Terry model suggested by Reviewer qwFm).
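In symbols, the only relation assumed between the reward means and the dueling probabilities is the ordering

$$
\nu_{k,\ell} > \tfrac{1}{2} \iff \mu_k > \mu_\ell,
$$

with no further parametric link (such as the Bradley-Terry model) between the two parameter sets.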

Rephrase: In the final version of this paper, we will clarify the terminology and rewrite this sentence to avoid confusion as follows:

"As the reward means and dueling probabilities only have a weak ordering relation (the dueling probability of arm $k$ winning over arm $\ell$ is greater than $0.5$ if and only if arm $k$ has a higher reward mean than arm $\ell$), and their observations are independently sampled from distributions with different parameters, the estimations of the reward means $\mu_k$ and dueling probabilities $\nu_{k,\ell}$ are intrinsically separated. This separation makes it difficult to directly combine these two types of feedback in online learning."

Official Review
Rating: 3

This manuscript considers a new bandit problem where the learner, at each step, simultaneously chooses any of the K arms and any pair of the K arms, and observes an absolute reward as well as a dueling reward. The incurred regret is a convex combination (with parameter $\alpha$) of the absolute regret and the strong dueling regret. Assuming that the ordering of the expected rewards is compatible with the expected dueling outcomes, the authors prove an asymptotic lower bound on the cumulative regret and introduce two new algorithms, the second one asymptotically matching the lower bound. Interestingly, for $\alpha$ close to zero or $\alpha$ close to one, this corresponds to a bandit problem with side information, in which case their algorithm DecoFusion achieves a constant regret.

update after rebuttal

I understand the authors' viewpoint that earlier papers on bandits (e.g., Auer et al., 2002) only focused on purely asymptotic regimes. However, there has been a lot of literature in the last ten years (in particular in the duelling literature) aiming to weaken assumptions of the form $T > \exp(cK)$; see again the introduction of Saha and Gaillard. In any case, I admit that this is perhaps a personal bias regarding the objective of being tightly asymptotically optimal versus non-asymptotically optimal, but there is a huge gap between $T > \exp(cK)$ and weaker conditions such as $T > K^2$.

Questions for Authors

Q1) In Section 2, you write "Later in this paper, we will show that a sublogarithmic $o(\log T)$ regret is actually achievable (in fact, $T$-independent constant regret) in these two scenarios, revealing a unique phenomenon of DR-MAB". To my knowledge, this phenomenon is not unique and in fact arises for bandit problems with side information (e.g., the player plays several arms and only incurs the regret of a single arm).

Q2) Following W1, I am wondering to what extent using more recent approaches in duelling bandits allows one to bypass the limitation that we need $T\geq \exp(c K^{1+\zeta})$ for optimality.

Claims and Evidence

I feel that it is a bit overclaimed that "DecoFusion achieves the optimal regret". While the main theorem (Theorem 4.1) seems valid, the result is so asymptotic that optimality is not achieved as soon as the number of arms K is not constant (see W1).

Methods and Evaluation Criteria

Yes

Theoretical Claims

I checked the proofs of the main theorem (Theorem 4.1).

Experimental Design and Analysis

Yes, but the numerical experiments are only here for illustration.

Supplementary Material

I mainly reviewed Appendix A (bibliography) and D.2 (main proof).

Relation to Prior Literature

This manuscript considers a new bandit problem where the learner both plays an arm and a duel of arms. As such, I am not aware of any previous work on this topic. However, the authors build upon previous works on K-armed bandits (mostly Honda & Takemura (2010)) and duelling bandits (Komiyama et al. (2015)). That being said, the combination of the above approaches in DecoFusion is clearly new and original.

Missing Essential References

I am not aware of any, although the authors could further discuss the recent literature on duelling bandits (see e.g. the references in Saha and Gaillard or the survey of Bengs et al., 2018).

Other Strengths and Weaknesses

S1) This is the first manuscript that deals with this problem, although the motivations are somewhat allusive. The authors establish almost matching asymptotic lower and upper bounds for the problem, but see W1.

S2) The manuscript is mostly well written and the authors manage to provide the intuition behind their algorithms and the different rates.

W1) In my view, the regret bounds are of a highly asymptotic nature. For instance, the regret bound in Theorem 4.1 matches the lower bound only in the regime where $T \geq \exp(cK)$, which is quite disappointing.

Other Comments or Suggestions

C1) $\epsilon$ is not defined in the statement of Theorem 4.1.

Author Response

W1): In my view, the regret bounds are of a highly asymptotic nature. [...] where $T \geq \exp(cK)$, which is quite disappointing.

A1: We respectfully disagree with the reviewer's characterization of our results as "disappointing," particularly regarding the fact that our upper bound matches the lower bound asymptotically. As detailed below, achieving asymptotic optimality in this context is a well-established practice in bandit literature. We hope the reviewer will reconsider the contributions of our paper in light of these points.

Firstly, we clarify that our (gap-dependent) upper bound alone is non-asymptotic for any $T$, and it is only expected to match the lower bound asymptotically: the (gap-dependent) lower bounds in the bandits literature (as well as our Theorem 2.3) are usually asymptotic (they hold as $T\to\infty$), and thus upper bounds can only match them in an asymptotic manner. This is common practice for optimal bandit algorithms. For example, the regret upper bounds of the known optimal stochastic bandit algorithms, e.g., KL-UCB (Cappé et al., 2013, Theorem 1) and DMED (Honda & Takemura, 2010, Theorem 4), also match the lower bound only asymptotically.

Secondly, the condition $T\ge \exp(cK)$ for matching lower bounds is common and needed in most bandit algorithms (it is usually omitted by default, so it often goes unnoticed). For example, the regret upper bound of the seminal UCB1 algorithm (Auer et al., 2002a, Theorem 1) can be expressed as $R_T \le O\left( \sum_{k>1}\frac{\log T}{\Delta_k} + K \right)$. To match the $\Omega\left( \sum_{k>1}\frac{\log T}{\Delta_k}\right)$ lower bound, it needs $T \ge \exp(cK)$ as well. This condition is required for any bandit algorithm whose finite-time (gap-dependent) regret upper bound has an additive $O(K)$ term, which is a common upper-bound term in the literature. Therefore, the required condition $T\ge \exp(cK^{1+\xi})$ in our Theorem 4.1 is very mild, as $\xi$ can be taken arbitrarily close to zero.

  • Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002a.

C1) $\epsilon$ is not defined in the statement of Theorem 4.1

A2: We thank the reviewer for pointing out this typo. The $\xi$ and $\epsilon$ in the statement of Theorem 4.1 are the same quantity, and we will correct this in the final version.


Q1) [...] this phenomenon is not unique and in fact arises for bandit problems with side information [...]

A3: We thank the reviewer for mentioning this reference. We will add it to the final version of this paper. We will also clarify that the uniqueness of this phenomenon is in the context of fusing reward and dueling feedback, in comparison to bandits with only reward or only dueling feedback.


Q2) Following W1, [...] to what extent using more recent approaches in duelling bandits allows one to bypass the limitation [... of ...] $T\ge \exp\left( cK^{1+\xi} \right)$ [...]

A4: Although there is indeed some recent progress on dueling bandits, e.g., the best-of-both-worlds algorithm by Saha & Gaillard (2022, Theorem 3) (whose stochastic regret upper bound has a suboptimal prefactor of $4$ in the dominating $\log T$ term), to the knowledge of the authors, the best-known regret bounds for stochastic dueling bandits are still those of the RMED algorithm by Komiyama et al. (2015, Theorem 3), which also relies on the $T\ge \exp\left( cK^{1+\xi} \right)$ condition. We will add this discussion to the final version of this paper and raise it as an open question for future research on dueling bandits.

Final Decision

The paper presents a setting of stochastic multi-armed bandits where the feedback contains both absolute and relative values. The authors provide an algorithm to solve this problem, along with a rigorous analysis of its performance.

According to the reviews, the paper is well written and easy to follow. The proofs are sound, and the results of the algorithm match the lower bound up to constants (for a constant number of arms). The two major issues raised were about (1) the guarantees of the algorithm, and (2) the motivation.

For (1), 6nVK mentions that for non-constant K (number of arms) the regret bound is no longer tight. YVjo mentions that the assumption that the best arm for elimination w.r.t relative feedback has the best absolute reward is too strict. For (2), the problem presented is new. The authors did not provide a concrete application using it as-is, but reviews mentioned motivation for it from recommendation systems.

In both cases, the flaw mentioned is subjective. For (1) the solution for the limitations does not seem to be trivial, so this is not a clear incompleteness of the paper, rather an open problem. For (2), the question of whether a new problem is or isn’t interesting is clearly subjective. Since 2 out of the 4 reviewers did find the problem interesting (e.g. qwFm: "the considered problem is well-motivated from real-world decision-making scenarios .."), it is likely that more people in ICML will find it interesting as well.

To conclude, in my opinion the weaknesses pointed out are mild and the paper provides a well written and sufficiently thorough analysis of a new problem.