Optimal Algorithm for Max-Min Fair Bandit
Abstract
Reviews and Discussion
The paper presents a theoretical study of the Multi-Player Multi-Armed Bandit (MP-MAB) problem with a max-min fairness objective. In this scenario, multiple players choose from a set of arms, and collisions between players result in zero rewards. The goal is to maximize the reward of the player with the lowest reward. The authors propose a decentralized algorithm called Decentralized Fair Elimination (DFE), which improves the existing regret upper bound from $O(\log T \log\log T)$ to $O(\log T)$. Additionally, a matching lower bound is provided, demonstrating the optimality of the proposed algorithm. The effectiveness of the algorithm is also verified through numerical experiments.
Strengths
- The DFE algorithm addresses fairness without requiring a central coordinator, making it scalable to larger systems and addressing practical concerns in decentralized settings like wireless networks.
- The algorithm proposes a novel exploration assignment strategy that ensures even exploration across player-arm pairs, which leads to efficient elimination of sub-optimal arms and reduces overall regret.
- The authors provide both a theoretical analysis (including regret upper and lower bounds) and empirical validation through simulations.
Weaknesses
- In Remark 1, the authors claim that "we could still use collisions to transmit information bit by bit, resulting in an additional constant length of information bits without the communication assumption." However, to use collisions to transmit information, players should quantize the UCB/LCB estimates to avoid a potentially unbounded communication length, as these estimates are real numbers. To ensure that the elimination phase still works despite the quantization errors, the required communication length would need to be on the order of $\log(1/\Delta)$ bits per estimate. Given that the number of epochs is $O(\log T)$, this results in an additional term of order $\log(1/\Delta)\cdot\log T$ in the regret bound.
- Could the authors elaborate on why the inequality in line 698 holds? The paper does not provide any explanation for this, and additional details would be greatly appreciated. Additionally, I am unsure whether the definition of the gap is meaningful, as this value can be non-positive.
Questions
See above.
- Additional term in communication cost.
We thank the reviewer for pointing out this additional term in the communication cost. It is correct if we do not make the communication assumption. We will add this to the discussion of the communication phase. The resulting total communication cost still does not affect the leading term of the regret.
- Could the authors elaborate on why the inequality in line 98 holds?
We assume you are referring to the inequality in line 698. First, we analyze a pair $(i, k)$ that is not eliminated at the current epoch, which means it has been selected at least the required number of times. Then, conditioned on the good event, its empirical mean concentrates around the true mean within the confidence radius.
Moreover, since the optimal player-arm pair (i', k') attaining the max-min reward is not eliminated, it has also been selected at least this number of times, so its empirical mean concentrates within the same radius.
Since the sub-optimal pair (i, k) is not eliminated, its upper confidence bound is still at least the lower confidence bound of the max-min value; otherwise (i, k) would have been eliminated after this epoch. Rearranging these terms gives the inequality in line 698.
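For completeness in this thread, here is a sketch of that chain of inequalities. The notation ($n$ for the number of pulls, the $\sqrt{2\log T/n}$ radius, and the constants) is ours and purely illustrative; it is not copied from the paper. On the good event, for both the surviving pair $(i, k)$ and the optimal pair $(i', k')$,
$$\big|\hat{\mu}_{i,k}-\mu_{i,k}\big|\le\sqrt{\tfrac{2\log T}{n}},\qquad \big|\hat{\mu}_{i',k'}-\mu_{i',k'}\big|\le\sqrt{\tfrac{2\log T}{n}}.$$
Survival of $(i, k)$ means its UCB is at least the LCB of $(i', k')$, i.e. $\hat{\mu}_{i,k}+\sqrt{2\log T/n}\ \ge\ \hat{\mu}_{i',k'}-\sqrt{2\log T/n}$, so
$$\Delta_{i,k}\;=\;\mu_{i',k'}-\mu_{i,k}\;\le\;4\sqrt{\tfrac{2\log T}{n}}\quad\Longrightarrow\quad n\;\le\;\frac{32\log T}{\Delta_{i,k}^{2}},$$
which is the type of bound asserted at line 698 (up to constants).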
The definition of the gap is meaningful because we only analyze sub-optimal pairs (i, k) whose reward lies below the max-min value, which guarantees that the gap is always positive. Recall that, by the definition of the max-min value, the minimum reward in any matching can never be greater than it; thus we only need to count the number of times such sub-optimal pairs are selected.
Thank you for your response. It effectively addresses all of my concerns, and I have adjusted my score accordingly. I recommend that the authors also incorporate the clarification regarding the inequality mentioned in line 698 into the revised paper.
This paper addresses the multi-player multi-armed bandit (MP-MAB) problem, with the goal of finding a max-min fairness matching that maximizes the reward of the player receiving the lowest reward. The authors propose a decentralized fair elimination (DFE) algorithm to handle heterogeneous reward distributions and collisions between players. The algorithm improves the regret upper bound from $O(\log T \log\log T)$ to $O(\log T)$, with dependence on the minimum reward gap $\Delta$. Additionally, they provide a regret lower bound, demonstrating that the algorithm is optimal with respect to key parameters.
Strengths
- The authors design a new phased elimination algorithm that improves max-min fairness by adaptively eliminating suboptimal arms and exploring the remaining ones. The algorithm achieves an $O(\log T)$ regret upper bound, which outperforms existing results.
- The paper derives a tighter regret lower bound that accounts for the number of players, the number of arms, and the minimum reward gap, improving upon prior work.
- Numerical experiments confirm the effectiveness of the DFE algorithm across different settings.
Weaknesses
- I appreciate the contribution of this work, which presents the first optimal result for the MP-MAB problem, matching the lower bound established here. However, compared to earlier studies, particularly those by Bistritz et al. (2020) and Leshem (2023), this study, which employs a classic elimination-based strategy, only improves the regret results by a $\log\log T$ factor. This improvement may not be particularly significant for smaller values of $T$. Does this work demonstrate advantages over others in terms of other factors, such as the number of players and arms or the reward gap?
- Recently, several studies have reported intriguing results on reducing communication costs in the MP-MAB problem for both competitive and cooperative settings. I believe the authors could strengthen this study by incorporating a communication-efficient algorithm. Can the authors provide a rigorous upper bound for the communication cost in this work and discuss the possibility of making it optimal? Did the work improve the communication cost compared to previous work?
Questions
see weaknesses
We thank the reviewer for the valuable and detailed comments. Please see our response below.
- Does this work demonstrate advantages over others in terms of other factors, such as the number of players and arms or the reward gap?
Regarding the dependence on these factors in previous works, we note that Leshem (2023) attains its regret bound only under an additional assumption, and Bistritz et al. (2020) can only obtain a weaker regret bound. Therefore, our regret bound not only improves on the $\log\log T$ term, but also improves the dependence on the other problem parameters.
Additionally, we highlight that the improvement over previous works is not only reflected in removing the $\log\log T$ term, but also in removing a very large constant in front of the leading term, which could be exponentially large when the reward gap $\Delta$ is small. This is because previous works (Bistritz et al. (2020); Leshem (2023)) use an explore-then-commit (ETC) method in each epoch. Specifically, they let each player explore each arm a prescribed number of times at the beginning of each epoch, and then compute the max-min matching based on the historical observations from the exploration phase. After that, each player follows this matching for the remaining rounds of the epoch. Both of their algorithms only obtain an $O(\log T \log\log T)$ regret bound, since they have to use an increasing exploration length in each epoch. This design ensures that the probability of computing a wrong max-min matching is small once the epoch index is sufficiently large, so that the regret in the exploitation phase can be bounded. However, it also introduces the problem of a large constant: guaranteeing this condition requires a number of initial warm-up rounds that can be very large when $\Delta$ is small. We handle this problem by applying the elimination method, which removes sub-optimal player-arm pairs efficiently and ensures that no forced explorations happen in later epochs.
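To make the warm-up issue concrete, here is a rough calculation in our own illustrative notation; it is not the exact schedule of Bistritz et al. (2020) or Leshem (2023). Committing to a matching after $\ell$ exploration pulls per player-arm pair is only safe once a Hoeffding-type union bound such as
$$\Pr[\text{committed matching is wrong}]\;\le\;2NK\exp\!\Big(-\frac{\ell\,\Delta^{2}}{2}\Big)$$
becomes small, i.e. once $\ell\gtrsim\frac{2}{\Delta^{2}}\log(NK/\delta)$ up to constants. Since $\Delta$ is unknown, an ETC-style schedule has to keep increasing $\ell$ until this threshold is eventually crossed, and every epoch before that point contributes forced-exploration regret; this is where the large $\Delta$-dependent warm-up constant comes from. Elimination sidesteps the issue because a pair stops being explored as soon as its own confidence interval separates it from the current max-min value.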
2. Can the authors provide a rigorous upper bound for the communication cost in this work and discuss the possibility of making it optimal? Did the work improve the communication cost compared to previous work?
Thanks for pointing out the communication cost of our algorithm. Here we give a rigorous analysis of it. If the minimum reward gap with respect to the max-min value is $\Delta$, then the length of each communication phase is bounded by the number of transmitted estimates times the bit length per estimate, where each reward estimate is transmitted bit by bit through collisions. A bit length of $O(\log(1/\Delta))$ suffices, since it is enough to distinguish two pairs whose gap is larger than $\Delta$.
The total communication cost then follows by multiplying this bit length by the number of communication phases, and it is smaller than the communication costs of Leshem (2023) and Bistritz et al. (2020). Additionally, we note that in their works the gap is the minimum reward gap among all player-arm pairs, whereas in our work it is only the minimum reward gap with respect to the max-min value. Thus we also significantly improve the communication cost compared with previous works.
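As a concrete illustration of the bit-by-bit transmission through collisions, here is a minimal sketch with hypothetical numbers; the actual protocol, scheduling, and arm indices used by our algorithm are not reproduced.

```python
import math

# Minimal sketch (hypothetical numbers): quantize a reward estimate in [0, 1]
# to a few bits and send it bit by bit, where a "1" is signalled by
# deliberately colliding with the receiving player and a "0" by staying silent.

def quantize(value, n_bits):
    levels = (1 << n_bits) - 1
    return round(value * levels)            # fixed-point code in {0, ..., levels}

def encode_bits(code, n_bits):
    # Most-significant bit first; each entry is one collision/no-collision round.
    return [(code >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]

def decode_bits(bits):
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

delta = 0.03                                 # hypothetical minimum reward gap
n_bits = math.ceil(math.log2(1.0 / delta))   # resolution fine enough to separate a gap of delta
estimate = 0.6174                            # hypothetical reward estimate to transmit

code = quantize(estimate, n_bits)
recovered = decode_bits(encode_bits(code, n_bits)) / ((1 << n_bits) - 1)
assert abs(recovered - estimate) <= 1.0 / (1 << n_bits)   # quantization error below delta
print(n_bits, "bits per estimate; recovered value:", recovered)
```

With $\lceil\log_2(1/\Delta)\rceil$ bits per estimate, the quantization error stays below $\Delta$, so two pairs whose rewards differ by more than $\Delta$ remain distinguishable after transmission.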
We believe that if the algorithm has to convey information about reward estimates, then our algorithm is optimal, since players communicate with each other bit by bit through collisions, which is the most efficient way to communicate as far as we know. We leave designing an algorithm with minimum communication cost as interesting future work.
Thanks for the detailed explanation. I have raised my score.
Thank you for your feedback and for updating us on the score. We genuinely appreciate your insightful and constructive comments. Please let us know if you have any further concerns or questions; we will be delighted to clarify them.
This paper considers the max-min fair bandit, an important variant of the multi-player multi-armed bandit problem where fairness means maximizing the reward of the player who receives the lowest reward. Existing work on max-min fair bandits suffers from large regret and heavy assumptions. The authors give a tight regret bound for the bandit problem that is optimal with respect to all parameters. A special case for the lower bound is provided. The work closes the gap for the max-min fair bandit problem.
Strengths
This paper fills an important gap in the existing max-min bandit literature. A tight regret bound is proved and a special case is provided to demonstrate the lower bound. The improvement is significant compared to existing work. The regret bound is tight in all parameters. The algorithm is split into three phases with detailed algorithmic and graphical illustrations.
Weaknesses
The main section on the decentralized fair elimination algorithm is a bit hard to read. It would be good if the authors could highlight the novel part of the algorithm and clearly demonstrate how the three steps of the main algorithm contribute to the regret and which dominates it. The elimination phase is a commonly used strategy. The exploration phase is new, but it does not seem to directly contribute to the improvement of the overall regret. The communication phase is also seen in the matching bandit literature.
Questions
• Line 90 What is the doubling trick?
• Line 242 about the second type of elimination – unclear. How to determine the assertion of “with high probability”?
• If there is no elimination phase, how is the regret affected? Line 310 is confusing.
• Can you give an intuitive idea / proof sketch for Thm 1, which is the key result of the whole paper, to explain how the three steps contribute to the regret bound?
-------------------------After Rebuttal----------------------------- Thank you for the response. This helps me better understand the technical contribution of the paper. I see it's a theoretically interesting paper. I increased my confidence in my assessment.
We thank the reviewer for the valuable and detailed comments. Please see our response below.
- Highlight the novel part of the algorithm and clearly demonstrate how the three steps of the main algorithm contribute to the regret and which dominates it. Give an intuitive idea / proof sketch for Thm 1, which is the key result of the whole paper, to explain how the three steps contribute to the regret bound.
We highlight that the novel part of our proposed algorithm is the exploration phase, with a carefully designed set of exploring matchings over the non-eliminated player-arm pairs. More specifically, we improve the total exploration time in one cycle (exploring all non-eliminated player-arm pairs) from a naive bound to the optimal one, which is also the key to matching the lower bound. The elimination phase guarantees that no sub-optimal player-arm pair is selected too many times, and the optimal design of the exploration set guarantees that the difference in the number of times pairs in the exploration set are selected is not too large. These two designs lead to the final optimal regret bound.
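For illustration, here is a minimal sketch of the kind of round-robin matching construction that achieves this when no pairs have been eliminated. The function name and the simplifying assumption $N \le K$ are ours; the paper's actual exploration-set construction additionally has to respect eliminated pairs, which this sketch does not attempt.

```python
from itertools import product

def shifted_matchings(n_players, n_arms):
    """One cycle of cyclically shifted matchings: in round `shift`, player p
    pulls arm (p + shift) mod n_arms, so no two players collide."""
    assert n_players <= n_arms
    return [{p: (p + shift) % n_arms for p in range(n_players)}
            for shift in range(n_arms)]

N, K = 3, 5
cycle = shifted_matchings(N, K)

# Collision-free within every round, and every player-arm pair is explored
# exactly once per cycle of K rounds (instead of N*K one-pair-at-a-time rounds).
for matching in cycle:
    assert len(set(matching.values())) == N
covered = {(p, matching[p]) for matching in cycle for p in range(N)}
assert covered == set(product(range(N), range(K)))
print(f"{len(cycle)} rounds cover all {N * K} player-arm pairs")
```

With $N$ players and $K \ge N$ arms, one cycle of $K$ shifted matchings covers all $NK$ pairs collision-free, rather than the $NK$ rounds a one-pair-at-a-time schedule would need.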
2. Line 90 What is the doubling trick?
"Doubling trick" means that the length of the exploration phase doubles each time. By this design we can keep the total number of communication phases to $O(\log T)$, and ensure that the number of additional explorations does not exceed twice the necessary number of explorations.
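A minimal sketch of the counting argument behind this (the constants and horizon are illustrative):

```python
import math

def num_epochs(horizon):
    """Count epochs until the total length reaches the horizon when the
    per-epoch length doubles each time."""
    t, length, epochs = 0, 1, 0
    while t < horizon:
        t += length
        length *= 2       # doubling trick
        epochs += 1
    return epochs

T = 1_000_000
# One communication phase per epoch, hence only O(log T) communication phases.
print(num_epochs(T), "epochs; log2(T) ~", round(math.log2(T), 1))
```

Since the epoch lengths sum geometrically, the most recent epoch accounts for at least half of the rounds played so far, which is also why the extra exploration never exceeds twice what is necessary.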
3. Line 242, about the second type of elimination – unclear. How to determine the assertion of "with high probability"?
The second type of elimination means that (j, k) cannot occur in the optimal matching if (j, k) does not appear in any matching whose UCB (on the minimum reward) exceeds the LCB of the current max-min value, and thus it will be eliminated. Here "with high probability" means that if a matching's UCB is below this threshold, then with high probability the minimum reward of that matching is smaller than the optimal max-min reward.
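One way to write this rule formally, in our own notation (which may differ from the paper's): let $\mathcal{M}$ be the set of valid matchings over non-eliminated pairs and let UCB/LCB denote the per-pair confidence bounds. Then
$$\widehat{\gamma}\;=\;\max_{m\in\mathcal{M}}\;\min_{(i,k)\in m}\mathrm{LCB}_{i,k},\qquad \text{eliminate } (j,k)\ \text{ if }\ \max_{\substack{m\in\mathcal{M}\\ (j,k)\in m}}\;\min_{(i,k')\in m}\mathrm{UCB}_{i,k'}\;<\;\widehat{\gamma}.$$
In words: if every matching containing $(j,k)$ has an optimistic max-min value below the pessimistic estimate $\widehat{\gamma}$ of the achievable max-min value, then with high probability $(j,k)$ cannot belong to an optimal matching.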
4. If there is no elimination phase, how is the regret affected? Line 310 is confusing.
If there is no elimination phase, all player-arm pairs will be selected the same number of times. Denote the minimum reward gap between any pair and the max-min reward as $\Delta$, and the gap between the max-min reward and a sub-optimal pair $(i, k)$ as $\Delta_{i,k}$. Without an elimination phase, the number of times $(i, k)$ is selected is dictated by the smallest gap $\Delta$, so the regret caused by selecting $(i, k)$ can only be bounded in terms of $\Delta$ (since $\Delta_{i,k} \le 1$). With the elimination phase, however, the number of times $(i, k)$ is selected is governed by its own gap $\Delta_{i,k}$, so the regret caused by selecting $(i, k)$ is correspondingly smaller. In general, the elimination phase improves the regret bound; a rough calculation is sketched below.
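A back-of-the-envelope version of this comparison, using the standard heuristic that resolving a gap of size $g$ takes on the order of $\log T/g^{2}$ pulls (the constants and exact notation here are ours, not the paper's):
$$\text{no elimination:}\quad n_{i,k}\approx\frac{\log T}{\Delta^{2}}\;\Rightarrow\;\text{regret from }(i,k)\approx\Delta_{i,k}\cdot\frac{\log T}{\Delta^{2}}\;\le\;\frac{\log T}{\Delta^{2}},$$
$$\text{with elimination:}\quad n_{i,k}\lesssim\frac{\log T}{\Delta_{i,k}^{2}}\;\Rightarrow\;\text{regret from }(i,k)\lesssim\frac{\log T}{\Delta_{i,k}}\;\le\;\frac{\log T}{\Delta},$$
so the per-pair regret improves from order $\log T/\Delta^{2}$ to order $\log T/\Delta$, i.e. roughly by a factor of $1/\Delta$.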
This work studies a multi-player multi-armed bandit problem with heterogeneous rewards and collisions. The paper aims to find a fair bandit algorithm that matches each player to a distinct arm while maximizing the reward of the player who receives the smallest reward. A max-min regret lower bound is provided. The authors propose the decentralized fair elimination (DFE) algorithm, which guarantees the exploration of all valid player-arm pairs by constructing matchings, controls the number of communications with the doubling trick, and eliminates player-arm pairs whose upper confidence bounds are smaller than the lower confidence bound of the current max-min value. The authors show that DFE achieves $O(\log T)$ max-min regret. There is also an empirical study of the performance of the proposed algorithm compared to prior decentralized competitive MP-MAB algorithms.
Strengths
- The problem studied (fairness in MP-MAB) is important and interesting.
- The algorithmic designs of exploration and elimination procedures are interesting.
- This paper conducts both theoretical and empirical studies.
- I appreciate the diagrams for algorithmic design illustration.
Weaknesses
- The claim that the proposed algorithm is optimal concerns me because the upper bound does not exactly match the lower bound in every parameter. The proposed algorithm is definitely near-optimal, but I am concerned about claiming it as optimal.
- The readability of this paper can be further improved. Notations can be introduced more clearly. For example:
- A matching is first introduced as a "matching set", but is later mostly referred to simply as a matching or by other notation.
- The definition of regret in this paper is very different from that in multi-armed bandit literature. I would suggest the authors make this point clearer in the paper.
- The last sentence on the first page is too long.
- The same notation always denotes the player-arm set in the rest of the paper; however, it is used differently in Section 4.
- Claim 1 only holds for the instance described in Section 4. This point should be made clear, as in Lemma 1.
- Figure 5 is not very color-blind friendly or black-white printable.
Questions
- Is the regret bound of Bistritz et al. (2020) analyzed against the same max-min regret as defined in Section 2? If not, it seems unfair to compare with it in the introduction.
- Is there any hyperparameter for the proposed algorithm or the benchmarks the authors set in the experiments? If so, please add them to the paper to increase replicability.
We thank the reviewer for the valuable and detailed comments. Please see our response below.
- Does the upper bound match the lower bound?
In this paper, we propose an algorithm attaining the stated upper bound and provide an analysis of the lower bound. We confirm that these two bounds are of exactly the same order, which shows that the upper bound exactly matches the lower bound with respect to the relevant parameters. Therefore, we indeed design an optimal algorithm for this problem.
- Is the regret bound of Bistritz et al. (2020) analyzed against the max-min regret the same as defined in Section 2?
We study exactly the same setting as in the work of Bistritz et al. (2020). Moreover, unlike Bistritz et al. (2020), we do not assume that each player must have different rewards over the arms. Thus we analyze a more general setting, and the definition of regret is the same as in the compared works.
- Is there any hyperparameter for the proposed algorithm or the benchmarks the authors set in the experiments?
Indeed, our algorithm is parameter-free, so it is easy to follow the algorithm's description and reproduce the same performance. The hyperparameters of the benchmarks are the same as stated in their papers, and we will restate them in the updated version.
- Thanks for pointing out these unclear notations; we will fix them in the updated version.
The authors' response addressed my concern. I have raised the rating. I would encourage the authors to add the above discussion to the paper to enhance clarity.
Dear Reviewers:
We thank you once again for your careful reading of our paper and your constructive comments and suggestions. We would appreciate it if you could let us know whether all your concerns are addressed. We are also happy to answer any further questions in the remaining discussion period.
Best, Authors.
Dear Reviewers:
We appreciate your insightful thoughts, and we will definitely integrate these comments and discussions into our next version. Thanks!
Best, Authors
This paper looks at a variant of the multi-player multi-armed bandit setting where the objective is to minimize a max-min regret instead of the usual cumulative one.
The reviewers and I are not that thrilled by the results and the algorithm: while not straightforward, they are relatively similar to the existing ones in the literature. It is true that this paper improves a log(T)loglog(T) bound into a log(T) regret bound, but this does not necessarily say that achieving it was difficult (it may rather call into question the quality of the first paper).
All in all, I do not think this paper reaches the ICLR bar, even though it is, as far as I can see, correct and mildly interesting.
Additional Comments from Reviewer Discussion
All reviewers were lukewarm, and none of them decided to champion this paper for acceptance. As I was not thrilled either, the conclusion was clear.
Reject