PaperHub
Overall score: 7.8/10
Poster · 4 reviewers
Scores: 4, 5, 3, 4 (min 3, max 5, std 0.7)
ICML 2025

Geometric Resampling in Nearly Linear Time for Follow-the-Perturbed-Leader with Best-of-Both-Worlds Guarantee in Bandit Problems

Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

The authors propose a novel technique with which the Follow-the-Perturbed-Leader policy achieves nearly linear time complexity without sacrificing the Best-of-Both-Worlds regret guarantee for bandit problems.

Abstract

This paper studies the complexity and optimality of the Follow-the-Perturbed-Leader (FTPL) policy in $K$-armed bandit problems. FTPL is a promising policy that achieves the Best-of-Both-Worlds (BOBW) guarantee without solving an optimization problem, unlike Follow-the-Regularized-Leader (FTRL). However, FTPL needs a procedure called geometric resampling to estimate the loss, which requires $O(K^2)$ per-round average complexity, usually worse than that of FTRL. To address this issue, we propose a novel technique, which we call Conditional Geometric Resampling (CGR), for unbiased loss estimation applicable to general perturbation distributions. CGR reduces the average complexity to $O(K\log K)$ without sacrificing the regret bounds. We also propose a biased version of CGR that can control the worst-case complexity while keeping the BOBW guarantee for a certain perturbation distribution. We confirm through experiments that CGR not only significantly improves the average and worst-case runtime but also achieves better regret thanks to the stable loss estimation.
Keywords
multi-armed bandit · follow-the-perturbed-leader · best-of-both-worlds · geometric resampling
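For intuition, the geometric resampling step that the abstract says costs $O(K^2)$ on average can be sketched as follows. This is a minimal illustration, not the authors' implementation; the exponential perturbation law and all names here are assumptions for the sake of the sketch.

```python
import numpy as np

def ftpl_choice(cum_loss, rng):
    """One FTPL draw: perturb the cumulative losses and take the argmin arm.
    Each draw costs O(K).  (Exponential perturbations are illustrative.)"""
    z = rng.exponential(size=cum_loss.shape)
    return int(np.argmin(cum_loss - z))

def geometric_resampling(cum_loss, played_arm, rng, cap=100_000):
    """Estimate w_{t,i}^{-1} for the played arm i: count fresh FTPL draws
    until arm i is drawn again.  The count M is Geometric(w_{t,i}), so
    E[M] = 1/w_{t,i}, giving an unbiased estimate of the importance weight.
    The cap only guards against pathological runs in this sketch."""
    for m in range(1, cap + 1):
        if ftpl_choice(cum_loss, rng) == played_arm:
            return m
    return cap
```

Since each of the expected $1/w_{t,i}$ draws itself costs $O(K)$, averaging over the played arm gives $\sum_i w_{t,i} \cdot (1/w_{t,i}) \cdot O(K) = O(K^2)$ per round, which is the inefficiency CGR targets.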

Reviews and Discussion

Review (Score: 4)

The paper proposes useful variants of the Geometric Resampling (GR) algorithm, called Conditional Geometric Resampling (CGR), which improve the per-round sample complexity from $O(K^2)$ to $O(K\log K)$ while keeping the BOBW guarantee for a certain perturbation distribution. The experimental results show that the proposed algorithms perform well in sample complexity and runtime while maintaining low regret.

update after rebuttal

I thank the authors for the response. Nothing major has changed, and I keep my score.

Questions for Authors

Are there any disadvantages of CGR relative to other algorithms (such as Tsallis-INF)?

Claims and Evidence

Yes. The authors claim that the CGR algorithm is a substantial improvement over the GR algorithm, and the experiments do show the improvement in runtime and sampling time.

Methods and Evaluation Criteria

Yes. The paper attributes CGR's success to the fact that it resamples the perturbation only from those satisfying a necessary condition for termination, which makes sense.
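The paper's actual condition and perturbation law are not reproduced in this review, but the mechanism can be illustrated with a toy example (all names hypothetical): drawing an exponential perturbation conditioned on clearing a threshold, a stand-in for a "necessary condition for termination". By memorylessness, the conditional draw takes O(1) samples, whereas naive rejection wastes $e^a$ draws in expectation.

```python
import numpy as np

def naive_draw_above(a, rng):
    """Rejection sampling: redraw until the perturbation clears the
    threshold a.  Expected number of draws is 1 / P(Z > a) = e^a."""
    n = 0
    while True:
        n += 1
        z = rng.exponential()
        if z > a:
            return z, n

def conditional_draw_above(a, rng):
    """Direct draw from the conditional law Z | Z > a: by memorylessness
    of the exponential, a + Exp(1) has exactly this distribution, so a
    single draw always satisfies the condition."""
    return a + rng.exponential(), 1
```

Both functions return samples with the same conditional law; only the number of wasted draws differs, which is the spirit of restricting resampling to perturbations that can actually lead to termination.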

Theoretical Claims

Yes, such as Theorem 6.

Experimental Design and Analysis

Yes, I checked Section 5.

Supplementary Material

Yes, I checked the proof of the theorem stated in the main paper.

Relation to Prior Literature

The paper proposes CGR, an algorithm with demonstrated empirical benefits, and its technique may also be applicable to other computationally expensive algorithms.

Missing Important References

Not to my knowledge.

Other Strengths and Weaknesses

The proposed algorithms greatly improve the sample complexity and runtime over the previous GR algorithm.

Other Comments or Suggestions

Could you discuss the novelty of the proofs of the regret theorems in more detail? It is unclear what technical contributions this paper adds.

Author Response

Thank you for taking the time to review our paper. We have addressed your question and comment below.

Q1. Are there any disadvantages of CGR to the other algorithms (such as Tsallis-INF)?
A1. Generally, we believe that CGR has no disadvantages over the conventional GR: it intuitively just cuts the redundant part of GR, and thus CGR is more effective on every metric.

Although CGR overcomes the computational inefficiency of GR, it still inherits the general disadvantages of FTPL. These are mainly difficulties in the theoretical analysis, such as analyzing Reduced-Variance (RV) estimators and extending the results to other settings.

Q2. Novelty of the regret analysis.
A2. Thanks for your comment. The main novelty in the regret analysis lies in the analysis of CGR II-B. The key difference in the regret analysis stems from the biased estimator, where the relevant term involves both $P(A_t)$ and $w_t$, differently from the original GR. By deriving a lower bound on $w_t/P(A_t)$, we obtained the desired result even with a smaller maximum number of resampling steps. While our current analysis is specific to the chosen definition of $A_t$ and the expression of $w_t$, we expect that similar results can be achieved in settings beyond MAB by appropriately designing $A_t$ and selecting perturbations.

Review (Score: 5)

This paper studies FTPL for the MAB problem. The authors first propose a general recipe for designing loss-estimation procedures. They then give several concrete variants under this framework and show that these enjoy improved time complexity while maintaining the optimal BOBW guarantee. These theoretical findings are complemented by comprehensive numerical results. The proposed CGR I and II both bring improved empirical time complexity and regret compared to conventional GR.

Questions for Authors

  1. Any intuitions for why the newly proposed scheme would give better empirical regret?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

I checked the main body which looks good to me.

Experimental Design and Analysis

Yes

Supplementary Material

Yes, the additional experimental results.

Relation to Prior Literature

Potential audience: people studying bandits/RL/statistics.

Missing Important References

n/a

Other Strengths and Weaknesses

Strengths

  1. Gives a new framework for loss estimate in FTPL.

  2. Supporting empirical results.

Other Comments or Suggestions

n/a

Author Response

Thank you for taking the time to review our paper. We have addressed your question below.

Q1. Any intuitions for why the new proposed scheme would give better empirical regret?
A1. Generally, the variance of the estimator $\widehat{w_{t,I_t}^{-1}}$ introduces an additional regret term in the analysis, which is strongly related to the expected number of resampling steps in GR and CGR. Since the expected number of resampling steps is significantly reduced in CGR, the variance of $\widehat{w_{t,I_t}^{-1}}$ is also effectively reduced compared with that of GR (as explained in Remark 2). Therefore, the regret from the estimation error of $w_{t,I_t}^{-1}$ improves effectively, which also leads to the improved theoretical guarantee in Theorem 5.
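The link between the expected resampling count and the estimator's variance follows from the geometric law of the count: for $M \sim \mathrm{Geometric}(w)$ one has $E[M] = 1/w$ and $\mathrm{Var}[M] = (1-w)/w^2 \approx (E[M])^2$ for small $w$, so reducing the expected number of resampling steps shrinks the variance quadratically. A quick numerical check of this identity (illustrative only; function name is ours):

```python
import numpy as np

def gr_estimate_stats(w, n, rng):
    """Mean and variance of n independent GR-style estimates of 1/w,
    each being a Geometric(w) resampling count."""
    counts = rng.geometric(w, size=n)
    return counts.mean(), counts.var()
```

For w = 0.1 this gives mean close to 10 and variance close to (1 - 0.1)/0.1^2 = 90, i.e. the variance is already on the order of the squared expected count.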

Reviewer Comment

I've read the response from the authors and also the communication between the authors and the other reviewers. While this work is limited to MAB, I am personally still impressed by a new recipe for the design and analysis of FTPL, which extends the theoretical understanding of FTPL, so I would still like to support this paper and recommend acceptance.

Review (Score: 3)

The paper proposes Conditional Geometric Resampling (CGR) to improve the computational efficiency of the Follow-the-Perturbed-Leader (FTPL) algorithm in the multi-armed bandit problem. By introducing a carefully selected, necessary stopping condition in the resampling process, CGR reduces the expected computational complexity of standard Geometric Resampling from the quadratic $O(K^2)$ expected per-round time to the near-linear $O(K \log K)$, while guaranteeing the same best-of-both-worlds regret guarantees from Honda et al. (2023) and Lee et al. (2024). More precisely, three variants of CGR are introduced: CGR I, CGR II-unbiased, and CGR II-biased. These variants offer different trade-offs between computational efficiency and estimation bias. The authors provide both theoretical regret bounds and experimental results that demonstrate lower runtime and improved regret performance compared to previous approaches.

Questions for Authors

  • What is the main technical difficulty in designing a variant of GR that is a sort of counterpart of Tsallis-INF with the Reduced-Variance estimator, especially compared to using importance weighting?
  • Did you actually run all the other experiments for FTPL CGR II-B on your own? If so, was there no meaningful difference compared to FTPL CGR II-U?

Claims and Evidence

The theoretical claims are clearly stated with complete proofs. The claims on the empirical performance are also validated by appropriate experiments.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes, with more focus on the proofs of claims relative to the number of resampling iterations.

Experimental Design and Analysis

My main concern about the experiments is the omission of FTPL CGR II-B from all experiments other than the number of resampling steps. The reason given by the authors is that its number of resampling steps is essentially indistinguishable from that of FTPL CGR II-U, so the other experiments only consider the latter algorithm. However, it would still be interesting to observe whether the performance is indistinguishable with respect to the other performance metrics too (at least for the experiments performed in this work). For instance, the indistinguishability seems to appear for large enough $K$, e.g. $K \ge 32$, but the first experiment measuring the cumulative regret considers $K = 8$.

Supplementary Material

Yes, with more focus on the proofs of claims relative to the number of resampling iterations.

Relation to Prior Literature

The adoption of FTPL with geometric resampling to achieve best-of-both-worlds regret guarantees for multi-armed bandits has already been considered in previous work. The main contribution of this paper lies in the design of a more clever conditional geometric resampling procedure that employs a carefully chosen necessary stopping condition. This enables improved computational complexity with respect to other algorithms that achieve similar regret performance. The authors are clear in the comparison with the main related work.

Missing Important References

The authors seem to have covered the essential related work in their discussion.

Other Strengths and Weaknesses

The main technical contribution is the design of variants of GR and the consequent improvement in the computational complexity of FTPL while preserving BOBW guarantees. Regarding the regret analysis per se, there seems to be no significant difference compared to prior work. Nevertheless, time performance is also a relevant factor in practice, and the contribution in this direction is interesting.

Other Comments or Suggestions

  • In the Problem Setup section, you define the losses $\ell_t$ in the adversarial setting as being possibly functions of the history of losses and chosen arms, but it seems that you may assume they depend on $(I_1, \dots, I_{t-1})$ w.l.o.g.
  • Starting from Section 3, $\sigma_i$ is introduced but it seems to also depend on the round $t$; making this dependence explicit (e.g., writing $\sigma_{i,t}$) is probably clearer.
  • Theorem 5 seems more of a remark than an actual theorem.
  • In Section 5, it would be clearer to also specify how the confidence intervals around the curves were chosen (e.g., standard deviation?).
Author Response

Thank you for your thorough review and valuable feedback on our work. We have addressed your questions and comments below.

Q1. What is the main technical difficulty in designing a variant of GR that is a sort of counterpart of Tsallis-INF with the Reduced-Variance estimator, especially compared to using importance weighting?
A1. The main technical difficulties in constructing the Reduced-Variance (RV) estimator for FTPL with GR are as follows:

The RV estimators can take negative values, which makes the analysis significantly more difficult. Owing to this issue, even after Zimmert & Seldin (2021) proposed the RV estimator, some papers on FTRL still consider only the Importance-Weighted estimator (Jin et al., 2023; Tsuchiya et al., 2023).

Moreover, for FTPL with GR, since the importance weight $w_{t,i}^{-1}$ is replaced with $\widehat{w_{t,i}^{-1}}$, the design and analysis of the RV estimator is more challenging.

Julian Zimmert and Yevgeny Seldin. "Tsallis-INF: an optimal algorithm for stochastic and adversarial bandits.'' JMLR, 2021.

Tiancheng Jin, Junyan Liu, and Haipeng Luo. "Improved best-of-both-worlds guarantees for multi-armed bandits: FTRL with general regularizers and multiple optimal arms.'' NeurIPS, 2023.

Taira Tsuchiya, Shinji Ito, and Junya Honda. "Best-of-Both-Worlds Algorithms for Partial Monitoring.'' ALT, 2023.

Q2. Did you actually run all the other experiments for FTPL CGR II-B on your own? If so, was there no meaningful difference compared to FTPL CGR II-U?
A2. We actually ran all the experiments for FTPL CGR II-B, including the pseudo-regret for $K=8$ and $K=32$, running time for $K$ varying from $2$ to $128$, and the number of resampling steps for $K=8$ and $K=32$. We confirmed that there is no meaningful difference compared to FTPL CGR II-U. The detailed results can be found at this link: https://anonymous.4open.science/api/repo/ICML2025_CGRII_Experiments-9807/file/additional_experiments.pdf?v=86d6493c

Q3. Assumption on the losses in the adversarial setting: we may assume they depend on $(I_1, \ldots, I_{t-1})$.
A3. Thank you for the comment. While we mentioned that the loss may depend on both $\ell$ and $I$, this phrasing was introduced primarily to clarify that we consider an adaptive adversary rather than an oblivious one. Still, we understand that, as pointed out, it is indeed sufficient to consider the loss as a function of the $I$'s, as in the literature considering an adaptive adversary (Arora et al., 2012; Zimmert & Seldin, 2021). We will modify the writing accordingly.

Raman Arora, Ofer Dekel, and Ambuj Tewari. "Online bandit learning against an adaptive adversary: from regret to policy regret.'' ICML, 2012.

Julian Zimmert and Yevgeny Seldin. "Tsallis-INF: an optimal algorithm for stochastic and adversarial bandits.'' JMLR, 2021.

Q4.

  • Starting from Section 3, $\sigma_i$ is introduced but it seems to also depend on the round $t$; making this dependence explicit (e.g., writing $\sigma_{i,t}$) is probably clearer.
  • Theorem 5 seems more of a remark than an actual theorem.

A4. We sincerely appreciate your helpful suggestion. We will modify the notation accordingly to improve clarity.

Q5. In Section 5, it would be clearer to also specify how the confidence intervals around the curves were chosen (e.g., standard deviation?).
A5. Thank you for the valuable suggestion. The confidence intervals in the experimental results represent the standard deviation. We will modify the writing to clarify it.

Review (Score: 4)

The paper proposes a new estimation procedure based on conditional resampling to improve both theoretically and empirically the sample efficiency of FTPL while maintaining the regret guarantees.

Questions for Authors

How would the analysis change in the proposed sampling approach when extending from bandits to semi-bandits?

Claims and Evidence

Claims are supported by clear evidence.

Methods and Evaluation Criteria

Evaluation criteria and methods make sense.

Theoretical Claims

I have not checked the correctness of proofs.

Experimental Design and Analysis

I checked all experimental designs and they look valid to me.

Supplementary Material

I have not reviewed the supplementary.

Relation to Prior Literature

The paper presents a significant result in FTPL literature.

Missing Important References

Prior work discussion seems extensive.

Other Strengths and Weaknesses

The conditional resampling is a refreshing idea to overcome potential computational issue in FTPL.

Other Comments or Suggestions

N/A

Author Response

Thank you for taking the time to review our paper. We have addressed your question below.

Q1. How would the analysis change in the proposed sampling approach when extending from bandits to semi-bandits?
A1. The analysis would not change much if we are just interested in extending our sampling approach to derive results similar to Neu and Bartók (2016) for semi-bandits. Still, their results give only a near-optimal regret bound in the adversarial setting, and extending the BOBW analysis to semi-bandits is significantly difficult. In general, the key to the BOBW analysis in FTPL and FTRL is obtaining a uniform bound on $-w_{t,i}'/w_{t,i}^{3/2}$. Thus, the expression of $w_{t,i}$ significantly impacts the regret analysis. In the semi-bandit setting, the arm-selection probability vector $w_t$ no longer lies in the probability simplex, since multiple arms can be selected in one round. As a result, several techniques used in Honda et al. (2023) and Lee et al. (2024) are no longer applicable. In addition, to obtain the desired results for CGR II-B, the events $A_t$ would need to be modified according to the behavior of $w_t$ in semi-bandits.

Final Decision

This paper studies a novel loss estimation procedure for FTPL in multi-armed bandits called Conditional Geometric Resampling (CGR).

While FTPL offers no direct advantage in terms of memory or computational complexity for multi-armed bandits, the techniques of this paper likely apply to problem settings where FTPL is computationally more attractive than FTRL. All reviewers agree that this is a solid contribution which is of interest to the community.