PaperHub
Overall rating: 5.5/10 · Poster · 4 reviewers
Individual ratings: 6, 6, 6, 4 (lowest 4, highest 6, standard deviation 0.9)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.3
NeurIPS 2024

An Adaptive Approach for Infinitely Many-armed Bandits under Generalized Rotting Constraints

OpenReview · PDF
Submitted: 2024-05-07 · Updated: 2025-01-16
TL;DR

Infinitely many-armed bandits with rotting rewards under generalized rotting constraints.

Abstract

In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged. We explore two scenarios regarding the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and the other in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards. Our proposed algorithm achieves tight regret bounds for both slow and abrupt rotting scenarios. Lastly, we demonstrate the performance of our algorithm using numerical experiments.
Keywords
bandits · rotting rewards · infinitely many arms

Reviews and Discussion

Review
Rating: 6

This paper studies an extension of the multi-armed bandit problem by introducing infinitely many arms and generalized rotting constraints. It provides explicit regret lower bounds and proposes an algorithm with a regret upper bound matching the lower bound when $\beta \geq 1$. It has been claimed that closing the gap between the lower and upper bounds when $\beta \in (0,1)$ remains an open problem.

Strengths

  • Extending the multi-armed bandit problem to infinitely many arms with rotting mean rewards is relevant to many real-world applications. Solid theoretical analyses and empirical results are presented to justify the proposed solution.

  • The paper in general is well-written and easy to follow. Since I have not checked the supplementary material step by step, I cannot guarantee the correctness of the proofs, but the theoretical results make sense to me.

Weaknesses

  • The paper could be viewed as an extension of [1], and the impact of the paper might not be extremely significant. As mentioned in the paper, when $\beta \in (0,1)$, the proposed algorithm cannot be proved near-optimal at the current stage. The theoretical result could be more impactful if this issue were solved.

[1] Jung-hun Kim, Milan Vojnovic, and Se-Young Yun. Rotting infinitely many-armed bandits. ICML, pages 11229–11254. PMLR, 2022.

  • When the environment parameters $\beta$, $V_T$, and $S_T$ are unknown, a significant amount of additional regret is incurred by the proposed Algorithm 2.

  • The paper could benefit from a more extensive experimental study. For example, the algorithm performances could be compared under different rotting processes (perhaps by varying the rotting rate and adding randomness).

Questions

The following questions are raised simply for discussion. There is no need to address them in the paper.

  • Why are the rotting budgets $V_T$ and $S_T$ defined over all selected arms? I would imagine it could better fit real-world problems to treat the rotting process of each arm independently; say, if an arm is selected $N$ times, its rotting budget is $f(N)$, where $f$ is a nondecreasing function.

  • Corresponding to my second point in Weaknesses, is it possible to address the unknown parameters without incurring large additional regret? Can the methodology in the following paper [2] be applied in this paper to address the nonstationary rotting rewards? Can $\beta$ be estimated by selecting multiple new arms?

[2] Chen, Yifang, et al. "A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free." Conference on Learning Theory. PMLR, 2019.

Limitations

The authors adequately addressed the limitations.

Author Response

We appreciate your feedback and positive evaluation. Below, we address each comment.

The paper is an extension of [1]; when $\beta\in(0,1)$, the proposed algorithm cannot be proved optimal:

First of all, we highlight that we consider rotting constraints with $V_T$ or $S_T$ and initial mean rewards with $\beta>0$, both of which are not considered in [1]. As mentioned in Remark 2.2, the rotting constraint with $V_T$ ($\sum_t \rho_t \le V_T$) is more general than the maximum rotting constraint in [1]. Furthermore, the constraint with $S_T$ ($1+\sum_t 1(\rho_t \neq 0) \le S_T$) is fundamentally different from that in [1] because $S_T$ considers the total number of rotting instances rather than the magnitude of the rotting rate.

While, in our work, we have demonstrated optimality only for $\beta \ge 1$, we believe that the suboptimality for $\beta \in (0,1)$ arises from the looseness of our lower bounds rather than the regret upper bounds. The gap between lower and upper bounds comes from the fact that as $\beta\,(<1)$ decreases, the regret lower bounds (in Theorems 4.1, 4.2) also decrease while the upper bounds (in Theorems 3.1, 3.3) remain the same. As discussed in Appendix A.1, the phenomenon where the regret upper bound remains the same as $\beta$ decreases has also been observed in many infinitely many-armed bandits [5, 24, 10]. This is because, as mentioned in [10], although there are likely many good arms when $\beta$ is small, it is not possible to avoid a certain amount of regret from estimating mean rewards. Therefore, we believe that our regret upper bounds are near-optimal across the entire range of $\beta$, and achieving tighter lower bounds for $\beta < 1$, considering such unavoidable regret, is left for future research. Notably, optimality proven only for $\beta \ge 1$ has also been observed in stationary infinitely many-armed bandits [5, 24].

When $\beta$, $V_T$, $S_T$ are unknown, additional regret is generated with Algorithm 2:

While our algorithm (Algorithm 2) incurs additional regret to estimate the threshold parameter when parameters are unknown, if $V_T$ and $S_T$ are large enough (e.g. $V_T\ge T^{1/4}$, $S_T\ge \sqrt{T}$ for $\beta=1$), then the regret bound of Algorithm 2 matches that of Algorithm 1 in Corollary 3.5 (for known parameters). This is because the additional term becomes negligible compared to the main term involving $V_T$ and $S_T$. We leave it as future work to obtain a tighter regret bound for the entire range of $V_T$ and $S_T$ when the parameters are unknown.

Extensive experimental study by varying the rotting rate with randomness:

The main purpose of our experiments is to validate some claims of our theoretical analysis, such as Remark 3.2. Based on your comments, we have conducted an additional experiment with random rotting rates, shown in Figure 3 of the attached pdf. Following the rotting rates used in the experiment section of our paper, which are convenient for comparing the theoretical results of the algorithms, we introduced a random rotting rate $\rho_t$ uniformly sampled from $[0, 3/(t\log(T))]$ at each time $t$. In Figure 3, our algorithms outperform the other benchmarks. We will include further experiments regarding rotting rates in our final version.
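As a side note on reproducibility, the sampling scheme above is simple to simulate. The sketch below is only an illustration (not our experiment code; the horizon and random seed are arbitrary choices): it draws such random rotting rates and reports the realized cumulative rotting $\sum_t \rho_t$.

```python
import numpy as np

# Illustrative sketch only (not the experiment code from the paper):
# sample rho_t ~ Uniform[0, 3/(t*log(T))] for each round t and report
# the realized cumulative rotting, which plays the role of sum_t rho_t.
T = 10_000                                    # assumed horizon
rng = np.random.default_rng(0)                # arbitrary seed
t = np.arange(1, T + 1)
rho = rng.uniform(0.0, 3.0 / (t * np.log(T))) # one rotting rate per round
print("realized cumulative rotting:", rho.sum())
```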

The rotting budgets $V_T$ and $S_T$ are defined over all selected arms. How about treating the rotting process of each arm independently, with a per-arm budget $f(N)$ after $N$ selections:

In our setting, the constraints on $V_T$ and $S_T$ quantify the rotting rates of all selected arms over all $t$, as you mentioned. We note that similar quantities related to overall nonstationarity across all arms have also been considered in standard non-stationary bandits [7, 19, 4]. These quantities help quantify the overall difficulty of the problem in our regret analysis. As you suggested, addressing individual budgets for each arm would be an interesting research question.

We believe that our algorithm and analysis provide helpful insights for solving such a problem. This is because the core of our algorithm involves determining whether a selected arm is good or bad based on a threshold parameter tuned by the value of the constraint budget. If we know $f(N)$ for each arm, we can use our method to determine the status of the arm using a threshold adjusted by the budget information.
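To make the threshold idea concrete, a minimal sketch is given below. The confidence radius and the way the threshold would be tuned from a per-arm budget $f(N)$ are hypothetical placeholders, not the exact rule used in our algorithm.

```python
import math

def arm_looks_bad(recent_rewards, threshold, sigma=1.0, horizon=10_000):
    """Hypothetical sketch: flag a rested arm as 'bad' when its windowed mean,
    even with an optimistic confidence bonus, stays below a threshold (which
    would be tuned from the rotting budget, e.g. from f(N) in a per-arm setting)."""
    n = len(recent_rewards)
    if n == 0:
        return False                     # no evidence yet; keep sampling this arm
    mean = sum(recent_rewards) / n
    bonus = sigma * math.sqrt(2.0 * math.log(horizon) / n)
    return mean + bonus < threshold      # optimistic estimate below threshold -> discard
```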

Addressing the unknown parameters without incurring large additional regret. Can the methodology in [2] be applied in this paper? Can $\beta$ be estimated?

Here, we provide our thoughts regarding the applicability of [2]. In our problem, it is required to determine whether a selected arm is bad or not. In our algorithm, using a threshold, we can determine this with some confidence. Even if we assume that nonstationarity can be detected using techniques from [2], it may not be clear how to determine the status of an arm whose initial mean reward is bad (small). In our setting, which includes infinitely many arms and many near-optimal ones, an absolute threshold (rather than a relative value between arms) may be required, as used in our algorithm to determine the status of arms. Without knowledge of the parameters $V_T$, $S_T$, and $\beta$, how to determine such a threshold combined with the detection method is an unresolved problem.

Another issue is that our setting allows for potentially large negative rewards from rotting with sub-Gaussian noise, whereas [2] considers rewards within $[0,1]$. This difference makes it challenging to apply the concentration inequality of Lemma 12 in [2] to our problem. As you suggested, applying the detection-based algorithm of [2] to our setting while addressing those issues would be an interesting direction for future work.

Regarding the unknown $\beta$, estimating $\beta$ directly was considered in Algorithm 3 of [10]. However, it requires additional information on a (non-zero) lower bound of $\beta$, and the performance depends on the tightness of this lower bound.

Comment

Thank you to the authors for their reply. I am writing to confirm that I've read the author's rebuttal and my scores remain unchanged.

Comment

Thank you for maintaining the positive rating of our paper. We will incorporate the discussion, including additional experimental results, into our final version. If there are any other questions, please let us know.

Review
Rating: 6

This paper considers the extension of the classic stochastic multi-armed bandit problem to the case where there is a) an infinite set of arms and b) rotting of the arm means. Specifically, the rotting behaviour is of the 'rested bandit' variety, where the mean reward of an arm may fall as an immediate consequence of playing it, but the mean of an unplayed arm will not change, in contrast to the restless bandit. The infinite set of arms has i.i.d. mean rewards (no structural assumptions on the set of arms or existence of a reward function over them, etc.) and reward observations are sub-Gaussian.

This paper considers a more general rotting behaviour than Kim et al. (2022, ICML). In Kim et al. (2022) the magnitude of rotting in each round was bounded; the present paper considers the more general setting where the total amount of rotting over $T$ rounds is bounded, as well as the somewhat different setting where the number of rounds in which rotting occurs is bounded. These two settings are referred to as the 'slow' and 'abrupt' rotting cases respectively.

An important parameter in the infinitely many-armed bandit problem is $\beta>0$, which controls the probability of an arm being $\delta$-near optimal. All regret bounds in the paper are for mean rewards in $[0,1]$ and consider dependence on the number of rounds $T$ and the parameter $\beta$. $\beta=1$ corresponds to a uniform distribution on arm means.

The paper derives a sliding-window-UCB-based approach, which can be tuned to both the slow and abrupt rotting cases when the bounds on rotting behaviour are known, and an adaptive version which aims to learn the rotting behaviour when these parameters are not known. The paper shows tight (in terms of order) regret guarantees on these algorithms, and improved empirical performance over sensible competitors.
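For readers unfamiliar with the building block: a plain (non-adaptive) sliding-window UCB index can be sketched as below. This is only my own generic illustration; the paper's WUCB index and its adaptive window choice may differ in constants and details.

```python
import math
from collections import deque

class SlidingWindowArm:
    """Generic sliding-window UCB index for one rested arm (illustration only)."""

    def __init__(self, window_size, sigma=1.0, horizon=10_000):
        self.rewards = deque(maxlen=window_size)  # keep only the most recent pulls
        self.sigma = sigma
        self.horizon = horizon

    def update(self, reward):
        self.rewards.append(reward)

    def index(self):
        n = len(self.rewards)
        if n == 0:
            return float("inf")                   # force initial exploration
        mean = sum(self.rewards) / n              # windowed mean tracks rotting
        bonus = self.sigma * math.sqrt(2.0 * math.log(self.horizon) / n)
        return mean + bonus
```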

Strengths

The paper studies an interesting extension of prior work to allow for more general rotting constraints. This is likely to be of interest to the multi-armed bandit community and both the methodological and theoretical work present some novelty and careful algorithmic design that merit publication.

I was pleased to see a treatment of the case where key problem parameters were not known, and that largely comparable bounds were achievable in this setting.

Generally, the paper is written well and concisely, with useful clarifications made around the most important details and a clear relationship to the most closely related prior work.

Weaknesses

There are three main weaknesses which I would like to see commented on the rebuttal phase. I think the first is difficult to address fully in a rebuttal phase, but I would like to be reassured that potential issues do not limit the scope of the contribution.

  1. All of the theoretical results are ultimately order results only, without identification of the constants in the appendices, and some being somewhat vaguely described as needing to be 'large enough'. The paper would be stronger if these constants (or upper bounds on them) could be identified, to guarantee for which values of $T$ the bounds are meaningful. It would be helpful if the bounds could thus be realised for the cases considered in the experiments and compared to the actual regret of Algorithms 1 and 2.

  2. If I am not mistaken, the experiments only consider $\beta=1$? It would be interesting to see if the convergence of the algorithm is replicated for various $\beta \neq 1$.

  3. The motivation in terms of real applications is not especially strong. It is difficult to imagine a setting where the precise assumptions of the paper are realistic. In a recommendation setting, is it likely that there will be sufficient information to be confident that the click through rate of an item is always non-increasing, yet there is no contextual information available to supplement the model or exploit similar click through rates across similar items? In the clinical setting, is it likely that the rested assumption is realistic (i.e. any loss in efficacy is only determined by the actions of a single decision maker) alongside the immediacy of feedback and time horizons being of a scale such that the algorithms actually exhibit non-linear regret? If it is the case that the benefit of the paper is more fundamental - it presents an understanding of a simpler problem and its insights serve as a foundation to tackling these more complex real-world challenges, then perhaps it would be more helpful to make that clear?

Questions

  1. Can you provide clarity on whether your constants in the regret analysis are identifiable, and suggest the values of these for some simple instance of the problem? How would the resultant bounds compare to the empirical performance of Algorithms 1 and 2?

  2. What would the experimental results look like for $\beta \neq 1$?

  3. Is Prop 2.3 missing some kind of additional condition, or clarity over what is meant by worst-case in line 453? It would seem that you could construct an example where $\rho_1=T+1$ and all other $\rho=0$, and one could achieve $\sqrt{T}$ regret while satisfying $\sum_{t=1}^{T-1}\rho_t > T$?

  4. Can you put the WUCB definition in an equation display and reference its equation number in the statement of Algorithm 1?

Limitations

Yes

Author Response

We appreciate your feedback and positive evaluation of our work. Below, we address each comment.

All of the theoretical results are ultimately order results only, without identification of the constants in the appendices. It would be helpful if the bounds could thus be realised for the cases considered in the experiments and compared to the actual regret of Algorithms 1 and 2:

In our work, we provide regret bounds in terms of the horizon time $T$ and the rotting constraint parameters $V_T$ and $S_T$ (with terms raised to powers involving $\beta$), which hold up to constant factors whose values are not asserted in the statements of our theorems.

Based on your feedback, we will include more details regarding the values of these constant factors in the final version of our paper. For instance, in Lemma A.5, instead of using $m^{\mathcal{G}} = C_3$ for some sufficiently large constant $C_3 > 0$, we can specify $m^{\mathcal{G}}=3$ with $V_T\le \max\{1/T^{1/(\beta+1)},1/\sqrt{T}\}$ and $\delta=\max\{1/T^{1/(\beta+1)},1/\sqrt{T}\}$. Then, we have $m^{\mathcal{G}}\min\{\delta/2,V_T\}=3\min\{\delta/2,V_T\}>V_T$, which concludes the lemma.

We will also provide additional experimental results related to this. For now, we present an experiment in Figure 1 in the attached pdf that compares the performance of our algorithms with theoretical regret upper bounds with a constant of 1. We observe that the performance of our algorithms (blue and green solid lines) is better than the theoretical regret upper bounds (light blue and light green dashed lines), respectively, because the theoretical bounds represent worst-case regret regarding rotting rates, while the experiment is conducted under a specific instance of rotting rates.

It is noteworthy that providing regret bounds that hold up to constant factors is an important first step to establishing fundamental bounds for the underlying learning problem. Note that previous work on stationary infinitely-armed bandits (e.g. [24, 10]) also focused on providing regret bounds up to constant factors, whose values are not identified explicitly.

What would the experimental results look like for $\beta\neq 1$:

We appreciate your suggestion. We have included additional experiments in which we varied the value of $\beta$ in Figure 2 of the attached pdf. In Figure 2, our algorithms outperform the other benchmarks for various $\beta$. We will incorporate this into our final version.

The motivation in terms of real applications is not especially strong:

We appreciate your comments and suggestions. We highlight that (rested) rotting rewards in bandits have been studied [16, 20, 21, 13], motivated by real-world applications such as recommendations and clinical trials. For instance, in recommender systems, the click rate for each item may diminish due to user boredom with repeated exposure to the same content. Similarly, in clinical trials, the efficacy of a medication can decline due to drug tolerance induced by repeated administration. However, previous work regarding rotting bandits, except for [13], focused on MABs with finitely many arms, which have limitations when the number of items is large, as in recommender systems. In contrast, we consider the case with infinitely many arms without contextual information.

We believe our algorithm, as you mentioned, provides insights into handling rotting cases in real-world scenarios. Specifically, our adaptive SW-UCB and threshold approach offers valuable insights for designing bandit algorithms regarding when to explore new arms and how to estimate rotting mean rewards and determine the status (good or bad) of each arm when rotting occurs. Furthermore, our study on optimizing the threshold parameter provides practical guidance for setting threshold values in real-world applications.

Is Prop 2.3 missing some kind of additional condition, or clarity over what is meant by worst-case in line 453?:

Our regret analysis focuses on the worst-case regret concerning rotting rates. As outlined in Assumption 2.1, we consider an adaptive adversary that determines the rotting rate $\rho_t$ arbitrarily immediately after the agent pulls an action $a_t$ at each time step $t$. In Proposition 2.3, we show that there always exists a rotting adversary (or instances of rotting rates over $T$) with $\sum_t\rho_t>T$ such that it incurs at least $T$ regret for any algorithm, implying an $\Omega(T)$ regret lower bound. This does not imply that every rotting adversary (or every rotting instance) incurs $T$ regret. This worst-case lower bound can be demonstrated by providing a rotting adversary (or rotting rate instances) under which any algorithm will incur at least $T$ regret. Additionally, we show that this lower bound can be achieved by a simple algorithm that pulls a new arm every round. Therefore, Proposition 2.3, which concerns the worst case, does not contradict the example you mentioned. We appreciate your helpful comments to improve clarity. We will include an explanation to clarify Proposition 2.3 in our final version.

Can you put the WUCB definition in an equation display and reference its equation number in the statement of Algorithm 1?:

We appreciate your helpful comments for improving the clarity of our paper. We will display the WUCB definition in an equation and reference its number in Algorithm 1.

Comment

Thank you for your detailed reply to my comments. I appreciate the commitment to make modifications and am in agreement with your response. I have also read the other reviews and your responses and consider these to be suitably well addressed. As such I will retain my score and increase my confidence. (Justification: I agree with the remarks that finding results without constants and algorithms that can lead to further application-specific work are important fundamentals, but I feel for a score of 7+ the paper would have made more progress in one of these areas.)

Comment

Thank you for maintaining a positive rating on our paper and increasing your confidence score. We appreciate your detailed feedback. We will incorporate the discussion into our final version, including more details regarding constant factors, a detailed explanation of Prop 2.3, and additional experimental results.

Review
Rating: 6

The authors investigate an adaptive approach for the rotting bandits problem under the infinitely many-armed assumption.

Strengths

The authors introduce a new UCB-like policy for the mentioned problem; additionally, a lower bound analysis is carried out.

Weaknesses

No clear weaknesses

Questions

I wonder how the regret bound behaves with respect to the effective rotting instead of $V_T$ or $S_T$. Assuming their values to be large and far from the real rotting, would it be possible to extend this analysis and solution to propose an adaptive result? Additionally, I wonder whether the regret lower bound would also be tight in that case.

Similarly, I was wondering how good these results would be for the infinitely many-armed bandit with no rotting.

Limitations

No clear limitations have been found.

Author Response

We appreciate your feedback and positive evaluation of our work. Below, we address each comment.

I wonder how the regret bound behaves with respect to the effective rotting instead of $V_T$ or $S_T$:

In this problem, as mentioned in Assumption 2.1, we consider an adaptive adversary who determines $\rho_t$ arbitrarily, subject to the constraint $\sum_t \rho_t \le V_T$ or $1 + \sum_t 1(\rho_t \neq 0) \le S_T$ for given $V_T$ or $S_T$. As a side remark, these constraints are more general than the stricter conditions $\sum_t \rho_t = V_T$ or $1 + \sum_t 1(\rho_t \neq 0) = S_T$. Our regret analysis, for both upper and lower bounds, addresses the worst-case scenario concerning the rotting rates under the general constraint of $V_T$ or $S_T$. Hence, our regret bounds are expressed in terms of $V_T$ or $S_T$. We also note that standard nonstationary bandits have been studied under a similar concept of a nonstationarity budget upper bounded by $V_T$ [7, 19].

We can consider a scenario where an adversary determines $\rho_t$ under the stricter constraints $\sum_t \rho_t = V_T$ or $1 + \sum_t 1(\rho_t \neq 0) = S_T$. Our (upper and lower) regret bounds apply with these values of $V_T$ and $S_T$, respectively corresponding to the effective rotting amounts $\sum_t\rho_t$ and $1 + \sum_t 1(\rho_t \neq 0)$. When there is no rotting, so that $V_T=0$ and $S_T=1$, our bounds are the same as the bounds for the stationary case (e.g. $\sqrt{T}$ for $\beta=1$ as in [24, 5]). Furthermore, if $V_T$ and $S_T$ are unknown, Algorithm 2 (in Appendix A.6) can achieve regret bounds that depend on the values of $V_T$ and $S_T$ (Theorem A.15).

Comment

Thank you to the authors for their reply. I am writing to confirm that I've read the author's rebuttal and my scores remain unchanged.

Comment

Thank you for maintaining a positive rating on our paper. We will incorporate the discussion into our final version.

Review
Rating: 4

The paper studies infinite-armed bandits with rotting rewards. They show a lower bound on regret in terms of the total-variation and number of abrupt changepoints in the change in rewards. They also provide regret upper bounds: (1) a UCB-like algorithm tuned with knowledge of problem parameters gets optimal regret in some regimes and (2) a parameter-free bandit-over-bandit algorithm can attain optimal regret in some regimes when the level of non-stationarity is large.

Strengths

  • The infinite-armed bandit problem is well motivated and the problem and results are clearly presented.
  • There are matching upper and lower bounds on regret, at least for the better-understood $\beta \geq 1$ setting.

Weaknesses

  • The definitions of $V_T, S_T$ are a bit unclear and may not be fully rigorous. They are bounds on the total amount and count of rotting, but the rotting $\rho_t$ depends on the chosen arm and is thus random (in fact, determined by an adaptive adversary according to Assumption 2.1). But, meanwhile, $V_T, S_T$ seem to be treated as deterministic constants in the whole paper, given that they are, for example, used to tune Algorithm 1 to get the best regret bounds. In fact, it would not even make sense for $V_T, S_T$ to appear in the expected regret bounds if they were random. So, the only way the results of this work can rigorously hold is if $V_T, S_T$ bound the respective random quantities $\sum_{t=1}^{T-1} \rho_t$ and $1 + \sum_{t=1}^{T-1} 1\{\rho_t\neq 0\}$ for all realizations of the randomness, which then means there are some missing assumptions for the results of this paper. For instance, even for a "nice" algorithm, $\sum_{t=1}^{T-1} \rho_t$ could be very large with some probability, in which case $V_T$ is forced to be large as well because of a "bad realization". The only other way around this is to make further assumptions about the adversary/design of non-stationarity. In either case, there should be more discussion, as the rigor of the results presented versus the given assumptions seems unclear to me.
  • There are no optimal regret upper bounds without parameter knowledge. Also, the bandit-over-bandit result for the parameter-free case requires further assumptions on the non-stationarity and so does not apply as generally as the other results of this paper.

Questions

Please refer to strengths and weaknesses above.

Limitations

No broader impact concerns.

Author Response

We appreciate your time to review our paper and comments. Below, we address each comment.

The definitions of $V_T$, $S_T$ are a bit unclear and may not be fully rigorous. They are bounds on the total amount and count of rotting, but the rotting depends on the chosen arm and is thus random:

We appreciate your detailed comments. However, we argue that the definitions of $V_T$ and $S_T$ are rigorous in our context, as they serve as constraints on the rotting adversary. These quantities are not random outcomes based on the adversary's behavior but rather imposed conditions. To clarify this, we restate our Assumption 2.1 concerning the adaptive adversary as it is.


Assumption 2.1. At each time $t\in[T]$, the value of the rotting rate $\rho_t>0$ is arbitrarily determined immediately after the agent pulls $a_t$, subject to the constraint of either slow rotting for a given $V_T$ or abrupt rotting for a given $S_T$.


According to this assumption, we consider $V_T$ and $S_T$ to be given a priori to the adversary, and the adversary determines $\rho_t$ subject to these constraints. $V_T$ and $S_T$ are deterministic quantities that act as constraints, determined before the game begins. This is why we refer to the two scenarios as the slow rotting constraint ($V_T$) and the abrupt rotting constraint ($S_T$). In our setting, if $V_T$ or $S_T$ is large, then, as we have shown in our regret lower bounds, any algorithm naturally cannot avoid a large regret bound in the worst case.

There are no optimal regret upper bounds without parameter knowledge. Also, the bandit-over-bandit result for the parameter-free case requires further assumptions on the non-stationarity and so does not apply as generally as the other results of this paper:

In the case when the parameters are unknown, there is additional regret for Algorithm 2, which stems from learning the threshold parameter. However, if $V_T$ and $S_T$ are large enough (e.g. $V_T\ge T^{1/4}$, $S_T\ge \sqrt{T}$ for $\beta=1$), then the regret bound in Theorem A.15 for Algorithm 2 matches that in Corollary 3.5 for Algorithm 1 (for known parameters). It is an open problem to achieve tight regret bounds for the entire ranges of $V_T$ and $S_T$.

Now we discuss the further assumptions. For the unknown parameter case, we consider Assumptions A.10 and A.12 rather than Assumptions 2.1 and 2.4 (for the known parameter case). In Assumption A.10, we still consider an adaptive adversary for the rotting rates, such that $0 \le \rho_t \le \varrho_t$ for $\varrho_t$ given to the adversary, where $\sum_t \varrho_t \le V_T$ and $1 + \sum_t 1(\varrho_t \neq 0) \le S_T$. Here the $\varrho_t$'s are assumed to be determined before the algorithm is run. As we describe in Remark A.11, this assumption is more general than that in [13], where $\varrho_t = \rho$ with a maximum rotting rate $\rho$ for all $t$. In the special case where $\rho_t = \varrho_t$, the assumption represents an oblivious adversary.

In Assumption A.12, we consider $\sum_{t \in \mathcal{T}_i} \rho_t \le H$ for all $i \in [\lceil T/H \rceil]$, where $\mathcal{T}_i=[(i-1)H+1, iH]$ represents the $i$-th block of times of length $H$ within the horizon $T$. As we mention in Remark A.13, this assumption is satisfied when mean rewards are constrained by $0 \le \mu_t(a_t) \le 1$ for all $t$, because then $0\le\rho_t\le1$. The positive mean reward constraint is frequently encountered in real-world applications such as click rates in recommendation systems. We also note that, as mentioned in Remark A.14, the assumption is still more general than the one with a maximum rotting rate constraint of $\rho_t \le \rho = o(1)$ in [13]. This is because, in our setting, each $\rho_t$ is not necessarily bounded by $o(1)$, but only $\sum_{t \in \mathcal{T}_i} \rho_t \le H$.
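For intuition, the sketch below (illustrative horizon and block-length values, not our code) partitions the horizon into the blocks $\mathcal{T}_i$ of Assumption A.12 and checks the per-block budget; the check trivially passes whenever $0 \le \rho_t \le 1$, which is the point of Remark A.13.

```python
import numpy as np

# Illustrative check of Assumption A.12 (assumed values, not the paper's code):
# partition rounds into blocks T_i = [(i-1)H+1, iH] and verify sum_{t in T_i} rho_t <= H.
T, H = 10_000, 100
rng = np.random.default_rng(1)
rho = rng.uniform(0.0, 1.0, size=T)             # any rates with 0 <= rho_t <= 1
per_block = rho.reshape(T // H, H).sum(axis=1)  # one sum per block T_i
assert np.all(per_block <= H)                   # holds since each rho_t <= 1
```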

Lastly, we emphasize that our study is the first work to consider $V_T$, $S_T$, and $\beta$ in the context of rotting bandits with infinite arms, which is fundamentally different from finite-armed bandits due to the necessity of exploring new arms (details are described in lines 58–70). Notably, when the parameters are known, we achieve tight results using a novel approach and provide regret lower bounds.

Comment

Thank you for your comment. We would like to address a few points regarding your comments:

On $V_T$, $S_T$: Based on your comment, we believe your initial concern regarding $V_T$ and $S_T$ has been resolved. However, we would like to provide additional explanation for clarity. We strongly believe that the adaptive adversary under the slow ($V_T$) or abrupt rotting constraint ($S_T$) is a natural and general assumption in our adversarial rotting scenario. The adaptive adversary determines an arbitrary rotting rate at each time, immediately after the agent's action is determined, under either $V_T$ or $S_T$. From the values of $S_T$ and $V_T$, we can appropriately quantify the difficulty of our problems, as demonstrated by our regret lower bounds.

Without such constraints (i.e., fully adaptive according to your comment), the adaptive adversary could easily make the problem trivial in the 'worst-case' scenario of our setting, because whenever an algorithm finds a good arm, the adversary could adaptively cause that arm to rot and become a bad arm, or even have a negative mean reward, without any constraint. We also note that similar quantities $V_T$ and $S_T$ have also been considered in the standard nonstationary bandit literature [7, 19, 4], where they are treated as determined quantities, not random variables, as in our setting. If necessary, we are more than happy to clarify this further in our final version to prevent any potential confusion.

Assumption A.10 for parameter-free: The additional constraint in Assumption A.10 is required due to our adaptive adversary, which selects the rotting rate arbitrarily and adaptively in response to the selected action at each time, within the bandit-over-bandit framework. If we consider an oblivious adversary instead of an adaptive one, where the values of the rotting rates $\rho_t$ are predetermined such that $\sum_t \rho_t\le V_T$ and $1+\sum_t 1(\rho_t\neq 0)\le S_T$ before the game begins, then this satisfies Assumption A.10 with $\rho_t=\varrho_t$. In other words, as we mentioned in Remark A.11, Assumption A.10 is more general than this oblivious adversary and than the assumption in [13].

As mentioned in lines 831–834, the well-known black-box framework proposed for addressing nonstationarity [25] is not applicable to this problem, and attaining the optimal regret bound under a parameter-free algorithm for all ranges of $V_T$ and $S_T$ remains an open problem. However, we respectfully disagree with your comment that the results remain limited. We again highlight that our study is the first work to examine $V_T$, $S_T$, and $\beta$ in the context of rotting bandits with infinite arms, which is fundamentally different from finite-armed bandits (details are described in lines 58–70). Notably, we achieve tight results through a novel approach when the parameters are known, and we establish regret lower bounds. We also examine the case of unknown parameters. We believe our work is crucial for the community.

Comment

Dear Reviewer g8GP,

Thank you again for taking the time to review our paper.

We sincerely hope our responses have adequately addressed your questions and comments. If they have, we appreciate it if you could reconsider your evaluation. If you have any last-minute questions, please let us know.

Sincerely,

Authors

Comment
  • On $V_T/S_T$: thank you for the clarification. It seems one assumes the adversary is not fully adaptive then, and there are limitations on how much rotting it can force on the rewards. I would say wording this as an "adaptive adversary" can be a bit misleading then, as it is really a constrained adversary.

  • Optimal regret upper bounds without parameter knowledge: I would carefully explain in the presentation why the further Assumption A.10 is needed. It seems like it is an artifact of the bandit-over-bandit approach, as you cannot have the adversary respond in an overly strong way to the master's choice of base over $H$ rounds.

Even under the broader above-mentioned constrained adversary, there seem to be no parameter-free results without this Assumption A.10. As such, I feel the scope of results remains limited in this work, even while being the first to study this particular setting/parametrization.

Author Response

We appreciate you taking the time to review our paper. We are encouraged by the feedback indicating that our problem is well motivated by many real-world applications, solid theoretical analyses and empirical results are presented, and the paper is written well and concisely. We have attached a pdf with additional experimental results to address the reviewers' comments. In the following, we provide detailed responses to each comment from the reviewers.

Final Decision

This paper considers the many-armed bandit problem where the arms' rewards may drop as one draws them. The paper considers two regimes (slow and abrupt rotting). The problem setting of this paper generalizes Kim et al. [13] to general beta (top-arm ratio) and variation budget. They propose sliding-window-UCB-based algorithms and analyze them for the two regimes. They also provide a lower bound that is tight for $\beta > 1$. The main strength of this paper is its adaptivity against a general setting with a single algorithm. Regarding the motivation based on real-world applications, sentiment is mixed (generally positive). The weaknesses of this paper, suggested by the reviewers, mainly concern the requirement of knowing the variation size ($S_T$ or $V_T$), which is partially addressed by Algorithm 2 with additional assumptions. The score of this paper is on the margin. The overall sentiment about the paper is positive. In the discussion phase, some of the reviewers suggest consideration of existing work (such as the replay-phase method by Chen et al. [2019] or Auer et al. [2018]), even though the complexity of the method of Chen et al. is relatively high.