Faster Rates for Private Adversarial Bandits
We design new differentially private algorithms with improved privacy and utility tradeoffs for adversarial bandits and bandits with expert advice.
Abstract
Reviews and Discussion
The authors study the problem of adversarial multi-armed bandits under central differential privacy constraints. They propose a new algorithm and establish regret bounds that improve over previous results. They then consider the problem of learning from expert advice under privacy constraints, again providing algorithms and regret bounds.
Strengths
The problem studied is indeed interesting, and the regret rates that the authors claim to achieve would improve over previous rates if the authors could fix the proofs.
Edit: The authors propose a natural reduction scheme to adapt classical adversarial bandit algorithms to scenarios with batched losses augmented by Laplace noise. Their results improve upon previous regret bounds and shed light on an interesting distinction between the central and local differential privacy settings.
Weaknesses
I believe that some of the proofs are incorrect.
The most important mistake I found is perhaps the following: in Theorem 1, the authors assume that they have access to an algorithm with a bounded regret on losses perturbed by i.i.d. Laplace variables, where the first expectation is taken with respect to all randomness. Note that the right-hand side of the expression is a random variable, and since the Laplace variables are unbounded, it seems unlikely to me that there exists an algorithm with a finite bound on this random quantity that holds a.s.
However, when applying Theorem 1 to prove Corollary 1, the authors redefine this regret quantity differently.
Note that, in general, the first quantity can be strictly larger than the second, so bounding the second definition of the regret is not enough to bound the first one, and thus not enough to apply Theorem 1.
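For concreteness, the generic fact behind this remark, stated in our own notation (the symbols $X_a$ are ours), is that for any random variables $X_1, \dots, X_K$,

$$\mathbb{E}\Big[\min_{a \in [K]} X_a\Big] \;\le\; \min_{a \in [K]} \mathbb{E}\big[X_a\big],$$

so a bound on a regret notion with the minimum outside the expectation does not imply a bound on a notion with the minimum inside it.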
In general, the authors exchange "min" and "E" with too little caution. While some of the inequalities hold, it would be helpful to provide more details on these steps (rather than just saying "this holds because the noise is centered"), to prevent mistakes. For example, in the proof of Theorem 3, the minimizer is ambiguously defined: in Line 1209 it minimizes one quantity, but in Line 1220, to bound the regret compared to the best expert, the minimizer on the right side of the equation should be taken over a different quantity.
Some notation is not defined (for example, in the proof of Theorem 1).
Edit: The authors have corrected their claims, which now hold.
The paper somewhat lacks clarity in both its presentation and its proofs. The authors introduce a general reduction scheme, followed by a collection of results and algorithms, some of which differ only by minor factors. It is somewhat difficult to identify the assumptions and scenarios under which one algorithm should be preferred over another. I encourage the authors to more clearly highlight the main results for the two frameworks considered (with and without expert advice).
Questions
If the authors could provide a rigorous proof that the algorithms presented in the corollaries verify the assumptions of Theorem 1 with the first definition of the regret bound, or prove Theorem 1 under the assumptions on the regret bound verified by the algorithms studied in the corollaries, I would be happy to raise my score.
I also strongly encourage the authors to write out all the steps when exchanging expectations and minima.
We thank the reviewer for noting that the problem setting is interesting and that our results would improve upon previous rates. We address the reviewer's concerns below.
The most important mistake I found is perhaps the following: in Theorem 1...
We had a typo in the definition of the regret quantity on lines 251-252. The corrected definition is the expected regret of the bandit algorithm on the original losses when it only observes noisy losses, where the amount of noise is captured by the scale of the Laplace variables. The equalities used to rewrite this quantity follow from the facts that the noise added to the losses is zero-mean and that the true losses are not random. In Appendix D, we establish bounds on this quantity when the base algorithm is taken to be EXP3 and FTPL, respectively (see Lines 895 and 1064 in the updated submission). Line 826 in the proof of Theorem 1 now follows by definition, since the actions in question are those taken by Algorithm 1 when updated on the noisy losses (as guaranteed by Line 9 in Algorithm 1). We have made this change to the manuscript and have highlighted it in blue. While this may be an unusual definition, it enables us to provide a generic conversion of a non-private bandit algorithm into a private bandit algorithm. We believe this generality is useful, as it allows one to construct private bandit algorithms from off-the-shelf bandit algorithms in a black-box way.
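For concreteness, the following is a minimal sketch, in our own notation (the symbols $\mathcal{A}$, $\ell_t$, $Z_t$, $a_t$ are ours and may differ from the manuscript's), of a definition consistent with the description above:

$$\tilde{R}_{\mathcal{A}}(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(a_t)\right] \;-\; \min_{a \in [K]} \sum_{t=1}^{T} \ell_t(a),$$

where $a_t$ is the action selected by $\mathcal{A}$ after being updated on the noisy losses $\ell_s + Z_s$, $s < t$, with $Z_s$ i.i.d. Laplace. The expectation is over both the algorithm's internal randomness and the noise, while the comparator term is deterministic for an oblivious adversary.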
For example, in the proof of Theorem 3, the minimizer is ambiguously defined...
We thank the reviewer for noting that the minimizer in the proof of Theorem 3 is ambiguously defined. We have fixed this in the manuscript and highlighted the change in blue.
Some notations are not defined...
We thank the reviewer for pointing out the undefined notation in the proof of Theorem 1. This is a typo, and we have fixed it in the final version.
In general, the authors exchange "min" and "E" with too little caution...
With regards to swapping infimums with expectations, we have checked all expectations in the paper and confirmed that every time an infimum is inside an expectation, the object that is being minimized is not random. Accordingly, there should be no problem swapping the infimum and expectation. Actually, we can even drop the expectation surrounding the infimum term since it is not random. We will make sure to do this in the final version.
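Concretely, the identity in question is of the form (in our notation):

$$\mathbb{E}\left[\min_{a \in [K]} \sum_{t=1}^{T} \ell_t(a)\right] \;=\; \min_{a \in [K]} \sum_{t=1}^{T} \ell_t(a),$$

which holds because, for an oblivious adversary, the losses $\ell_t$ are fixed in advance, so the quantity inside the expectation is deterministic.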
I thank the authors for clarifying some points in their proofs.
In particular, I thank them for correcting their definition of the regret. It seems that Theorem 1 remains correct with this new definition, and that the proofs of the corollaries now hold.
I still find that the proofs overall lack rigor: for example, in the proof of Corollary 1, on Line 927, to apply Lemma 12 the expectation should first be taken with respect to the randomness of the algorithm, conditionally on the Laplace noise (Lemma 12 applies to deterministic losses) and on the relevant event, before taking the expectation with respect to the noise itself. To make the proofs easier to read, I strongly suggest introducing specific notation to highlight the randomness with respect to which each expectation is taken.
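For instance, one could decompose each expectation via the tower property (the subscripts below are our suggested notation, not the paper's):

$$\mathbb{E}\big[\,\cdot\,\big] \;=\; \mathbb{E}_{Z}\Big[\mathbb{E}_{\mathcal{A}}\big[\,\cdot\,\big|\, Z\big]\Big],$$

where $\mathbb{E}_{\mathcal{A}}[\,\cdot\,|\,Z]$ is taken over the algorithm's internal randomness with the Laplace noise $Z$ held fixed; Lemma 12 would then be applied inside the inner conditional expectation, where the noisy losses are deterministic.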
Overall, I think that the reduction scheme used to apply smaller noise to batched losses is natural in a central DP setting. I am not convinced that the proof techniques are particularly new; however, they improve on previous rates and shed light on an interesting separation between central and local DP.
As a further remark, I find the presentation of the results as a collection of algorithms and bounds somewhat challenging to follow. In particular, it is unclear why the authors apply their reduction scheme to such a wide range of algorithms, some of which yield suboptimal rates. Perhaps algorithms and results that are either suboptimal or differ only by minor variations in absolute numerical constants or logarithmic factors could be moved to the Appendix for clarity?
We thank the reviewer for noting that "Theorem 1 remains correct with this new definition, and that the proofs of the corollaries now hold." We will make sure to be rigorous about all expectations in the final version. In addition, we agree with the reviewer and will move the algorithms obtaining suboptimal guarantees and their results to the Appendix in the final version.
This paper studies adversarial multi-armed bandits (adv. MAB) (with expert advice) under the central Differential Privacy (DP) constraint.
From the upper bound side, for adv. MAB, it proposes a generic conversion framework that achieves a regret bound improving on the best-known one (although the corresponding algorithm ensures the stronger local DP guarantee) when $\varepsilon$ is small, where $K$ is the number of arms, $T$ is the interaction horizon, and $\varepsilon$ is the privacy budget (smaller $\varepsilon$ means stronger privacy protection). This is achieved through the careful use of tricks including local DP and batching.
For adv. MAB w/ expert advice, this paper gives the first set of regret upper bounds, each of which dominates for a different range of $\varepsilon$; the bounds also depend on the number of experts.
On the lower bound side, in the 2-arm case ($K=2$), it is shown that a certain class of algorithms (characterized by their behavior on a particular loss sequence) suffers a regret lower bound that matches the upper bound.
Strengths
- Improved/first upper bounds in this setup are achieved through neat tricks.
- A matching lower bound is shown for a certain class of algorithms. This is an initial and important step towards characterizing the minimax regret in adv. MAB with central DP, which (if true) potentially implies a separation between stochastic and adv. MAB.
Weaknesses
While I'm excited and happy to see that the authors are trying to approach the lower bound, in this paper it is only shown for certain algorithms (which satisfy conditions (1) and (2) defined around Line 2108). It's hard to tell how broad a class of algorithms these two conditions cover, and it seems that we still have a large gap towards fully characterizing the regret.
Questions
Could the authors please comment on any potential inspirations from Lemma 13? For example, is it possible to show the same lower bound using the same loss sequence while dropping those two conditions, or are they tightly coupled? Or would we need to construct completely different loss sequences?
We thank the reviewer for noting that our upper bounds are "achieved through neat tricks" and that our lower bounds are an "initial and important step towards characterizing the minimax regret in adv. MAB with central DP." Before we address the reviewer's concerns, we would like to point out that the main contribution/focus of our paper is the improved upper bounds and their implications. For example, our upper bounds in the adversarial bandit setting imply that:
- There exists a separation in the achievable regret bounds between oblivious and adaptive adversaries. In particular, while we show that sublinear regret bounds are possible for all choices of $\varepsilon$ under an oblivious adversary, this is not the case for an adaptive adversary, where one cannot achieve sublinear regret when $\varepsilon$ is sufficiently small [Asi et al. 2023].
- This is the first work to show that sublinear regret bounds are possible for all choices of $\varepsilon$. This establishes a separation between the achievable regret bounds under central and local differential privacy, since for the latter it is well known that sublinear regret is not possible when $\varepsilon$ is too small.
Previous results on private adversarial bandits do not shed light on whether such separations existed. In addition, we give the first upper bounds for private adversarial bandits with expert advice. We now address the reviewer's concerns.
...in this paper, it is only shown for certain algorithms (which satisfy conditions (1) and (2) defined around Line 2108)...
We acknowledge the limitations of our lower bound in the sense that it is algorithm-specific. However, we want to highlight that this is the first attempt at a lower bound for private adversarial bandits, and that existing techniques in private online learning do not give non-trivial lower bounds. As such, we believe our algorithm-specific lower bound is non-trivial progress.
It's hard to tell how broad algorithm classes...
We give some intuition about what the two conditions mean around lines 2110-2119, and will add more discussion of them in the final version. At a high level, our lower bounds apply to algorithms which, when run on the hard loss sequence, drop the probability of the bad action reasonably quickly and maintain it at a low level afterward. Condition (1) is not really an assumption, as one should set the relevant parameter to the expected regret of the algorithm when run on this sequence; condition (1) is then trivially satisfied, while condition (2) roughly states that the algorithm drops the probability of playing the bad action to around that level by some round, and keeps it there. In other words, after that round, the algorithm plays the bad action only rarely on average over the remaining rounds. The latter property is reasonable for bandit algorithms, given that any low-regret bandit algorithm should not play the bad action often. Indeed, EXP3 drops the probabilities of bad arms fast enough that this condition immediately holds. We give a proof sketch below that our lower bound holds for any batched EXP3 algorithm (i.e., with any parameters), and we will add this discussion to the final version.
Batched EXP3 on the hard loss sequence has regret at least a quantity depending on the batch size $N$. Assume, towards a contradiction, that for some setting of its parameters, the regret were smaller than our lower bound; this pins down the order of both the batch size and the learning rate. EXP3 drops the probability of a bad arm quickly, and the batched version will do so within a proportionally larger number of rounds. Thus condition (2) is satisfied with the corresponding parameters (ignoring logarithmic factors). This then yields, via Lemma 13, a lower bound that contradicts the assumption. Thus the regret has to be at least our lower bound, up to logarithmic factors.
Could the authors please comment on any potential inspirations from Lemma 13?
We do not believe that the same construction can lead to fully general lower bounds. The lower bound rules out a large class of algorithms, and this may help lead us to a better algorithm. If there isn't a better algorithm, we suspect that we will need to construct the hard loss sequences in an algorithm-specific way, e.g. by simulating the bandit algorithm and using its internal state. This sort of technique was recently used to establish lower bounds for private online counting [Lyu et al. 2024].
I thank the authors for their response. I've read the interactions with the other reviewers as well. I would like to maintain my current rating. Overall, I put this submission at the borderline. I will engage in the discussion with the other reviewers and the AC in the next stage.
This paper studies the problem of adversarial MAB with a differential privacy guarantee. Specifically, the authors propose a reduction framework which updates the classic MAB strategy every $N$ steps, with the feedback being the aggregated losses plus Laplace noise. With this framework, they obtain improved regret bounds when combining it with EXP3, heavy-tailed Tsallis-INF, and FTPL. The authors also consider bandits with expert advice (or the contextual case) using this framework and obtain various bounds using different base algorithms.
Strengths
- This paper is the first to propose an algorithm achieving the improved regret bound with a DP guarantee, improving upon the previous SOTA algorithm. This reduction is applicable to both MAB and its contextual variant.
- The proposed framework is simple and easy to implement.
Weaknesses
- This algorithmic framework, which updates the strategy only after a fixed $N$ rounds, is not the first to be used in achieving DP for MAB. [Tossou & Dimitrakakis 2017] also consider this batched update algorithm, but with EXP3 as the subroutine, while the authors claim that those results are buggy. From a technical perspective, the proofs seem to be adapted from existing bandit analyses without much challenge. Maybe the authors can highlight more the technical challenges in achieving the DP guarantee with this reduction, since handling negative and heavy-tailed losses does not seem to be critical with these base algorithms.
- There are some places (actually many) where the authors claim results are "not too hard" to obtain, which I think require more explanation or at least some derivations.
- Line 2077 and below: please provide more details on why the given sequences lead to very different selection probabilities. It is hard for me to understand why a single change in the loss at the first round will make the probability close to 1 in the first case but close to 0 in the second case.
- Line 2187: can the authors provide more explanation on this?
Questions
See "Weakness" section. In addition, I wonder whether the lower bound mentioned in the last section (or Appendix G.2) applies to Algorithm 1 with EXP3 and other base algorithm choices?
Details of Ethics Concerns
None.
We thank the reviewer for their comments and address their concerns below.
Maybe the authors can highlight more on the technical challenge...
Batching is an old technique in the history of bandit algorithms and, more generally, online learning. [Tossou and Dimitrakakis 2017] considered the batched version of EXP3 and used the internal randomness of EXP3 alone to argue about its privacy. Unfortunately, their privacy analysis is flawed, as privately releasing only the selected action is not enough to ensure the overall privacy of EXP3, due to its use of Inverse Propensity Weighted estimators. In fact, in Appendix G, we provide evidence that EXP3 is actually inherently non-private. To overcome this issue, we additionally add noise to the batched losses, and only use the randomness of the batched losses to argue about privacy. However, by using noisy batched losses, we require our base bandit algorithms to be able to handle negative and heavy-tailed losses. This is a non-trivial requirement, and more care needs to be taken when deriving regret bounds for adversarial bandits with negative losses. In fact, the standard regret analysis for adversarial bandit algorithms does not work if the losses can be negative and potentially unbounded. This is why we make sure to derive the regret bounds of EXP3, FTPL, and FTRL under negative losses explicitly in the Appendix. In addition, by considering noisy batched losses, we are able to establish a connection between privacy and the vast literature on heavy-tailed bandits. In particular, our reduction shows that any heavy-tailed bandit algorithm, like the ones we presented, can be converted into a private bandit algorithm. This eventually allowed us to remove the extra logarithmic factors that EXP3 would incur with our reduction.
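To make the reduction concrete, here is a minimal Python sketch of the batching-plus-noise idea described above. The wrapper class and the interface names (`select`, `update`, `act`, `observe`) are our own illustration, not the paper's Algorithm 1, and the privacy remark in the comments is informal:

```python
import numpy as np

class PrivateBatchedBandit:
    """Sketch of the batching-plus-noise reduction described above.

    The base algorithm is any adversarial bandit algorithm exposing
    select() and update(arm, loss); these method names are our own
    illustration, not the paper's actual interface.
    """

    def __init__(self, base_algo, batch_size, epsilon, rng=None):
        self.base = base_algo
        self.N = batch_size               # rounds per batch
        self.eps = epsilon                # privacy budget
        self.rng = rng or np.random.default_rng()
        self._arm = None                  # arm committed to for this batch
        self._loss_sum = 0.0              # accumulated loss of that arm
        self._count = 0

    def act(self):
        # Commit to a single arm for the entire batch.
        if self._count == 0:
            self._arm = self.base.select()
        return self._arm

    def observe(self, loss):
        # Bandit feedback: only the played arm's loss (assumed in [0, 1]).
        self._loss_sum += loss
        self._count += 1
        if self._count == self.N:
            # A single round changes the batch sum by at most 1, so
            # Laplace(1/eps) noise on the sum is the standard mechanism;
            # the paper's exact privacy accounting may differ.
            noisy = self._loss_sum + self.rng.laplace(scale=1.0 / self.eps)
            # The noisy sum can be negative or much larger than N, which
            # is why the base algorithm must handle negative and
            # heavy-tailed losses, as discussed above.
            self.base.update(self._arm, noisy)
            self._loss_sum, self._count = 0.0, 0
```

The key point matching the discussion above is in the final comments: privacy is argued only from the noise on the released batch sums, and the price is that the base algorithm's regret analysis must tolerate negative, heavy-tailed losses.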
There are some places that the author claims "not too hard" to obtain...
We will remove our wording of "not too hard" and "easy" throughout the text and be more explicit in the final version.
Line 2077 and below: please provide...
With regard to Line 2077 and the claim about the divergence of the selection probabilities, we provide a sketch of the proof below and will elaborate more in the final version. The argument proceeds inductively over rounds: under the first loss sequence, each update pushes the probability of selecting action 2 further upward, and repeating the analysis eventually shows that this probability approaches 1. A symmetric argument under the second loss sequence shows that the probability of selecting action 2 approaches 0. Since the two loss sequences are identical after the first time point, the entire divergence is driven by the single differing loss at round 1, whose effect is amplified over time by the importance-weighted updates.
We have also verified this claim experimentally and included a Figure displaying the results of our simulation and the code used. These are highlighted in blue and exhibit the rapid divergence in the probabilities of selecting action 2 between the two loss sequences.
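For readers following along without the updated submission, a minimal harness for this kind of simulation is sketched below. The function is our own illustration: the two loss sequences from Appendix G are not reproduced here and must be supplied, and the learning-rate default is arbitrary:

```python
import numpy as np

def exp3_arm2_prob(losses, eta=0.05, seed=0):
    """Run EXP3 with importance-weighted loss estimates on a two-arm
    loss sequence `losses` (array of shape [T, 2]) and return the
    trajectory of the probability assigned to action 2 (index 1)."""
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(2)              # cumulative IPW loss estimates
    trajectory = np.empty(len(losses))
    for t in range(len(losses)):
        # Shift by the minimum for numerical stability before exponentiating.
        weights = np.exp(-eta * (cum_est - cum_est.min()))
        probs = weights / weights.sum()
        trajectory[t] = probs[1]
        arm = rng.choice(2, p=probs)
        # Unbiased importance-weighted estimate of the observed loss.
        cum_est[arm] += losses[t, arm] / probs[arm]
    return trajectory

# Usage sketch: construct the two neighboring loss sequences from
# Appendix G (identical except at round 1), call exp3_arm2_prob on each,
# and average the trajectories over many seeds to see the divergence.
```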
Line 2187: can the authors provide...
We thank the reviewer for catching this typo on line 2187. We have corrected the expression there.
I wonder whether the lower bound mentioned in the last section (or Appendix G.2) applies...
We give a proof sketch below that our lower bound holds for any batched EXP3 algorithm (i.e., with any parameters). We will add this discussion to the final version.
Batched EXP3 on the hard loss sequence has regret at least a quantity depending on the batch size $N$. Assume, towards a contradiction, that for some setting of its parameters, the regret were smaller than our lower bound; this pins down the order of both the batch size and the learning rate. EXP3 drops the probability of a bad arm quickly, and the batched version will do so within a proportionally larger number of rounds. Thus condition (2) is satisfied with the corresponding parameters (ignoring logarithmic factors). This then yields, via Lemma 13, a lower bound that contradicts the assumption. Thus the regret has to be at least our lower bound, up to log factors.
I thank the authors for their detailed response and effort during the rebuttal. My concerns are addressed adequately, and I keep my positive score.
This paper investigates adversarial bandit problems and bandit problems based on expert advice, proposing a differentially private algorithm with improved regret bounds compared to prior work. The primary concern with this paper lies in its lack of clarity and rigor in both the writing and the mathematical proofs. Given that the contributions of the paper are almost entirely theoretical, these issues of clarity and rigor must be taken seriously.
While the authors' responses and partial revisions have addressed some of the concerns, verifying the correctness of the claims would still require substantial revisions and additional time for thorough review. The reviewers agree on the significant impact and technical interest of the results, and they believe that this could become a strong paper if it were made clear and rigorous. A revised submission is strongly encouraged.
Additional Comments from Reviewer Discussion
The reviewers agree on the significant impact and technical interest of the results. However, concerns were raised regarding the lack of clarity and rigor in the writing and mathematical proofs, as well as the limited applicability of the lower bounds. While the authors' responses and partial revisions have addressed some of these concerns, verifying the correctness of the claims would still require substantial revisions and additional time for a thorough review.
Reject