Fixed-Budget Differentially Private Best Arm Identification
Abstract
Reviews and Discussion
This paper studies the problem of identifying the best arm in a differentially private manner, focusing on the pure DP setting. The privacy model is as follows. There are K arms and a horizon of T time steps; at each time step t, each arm k has a potential reward x_{t, k}. The algorithm is DP if changing only one of the x_{t, k} causes the trajectory of selected arms to satisfy the usual eps-DP property.
The authors prove tight upper and lower bounds for this problem. In particular, they determine a parameter, similar to the parameter in the non-private setting, which essentially governs the error rate of the algorithm.
Strengths
This paper studies an interesting problem and will be of interest to researchers working on DP and bandits. The authors also prove tight results, so I believe this is a significant contribution to the literature. The writing quality and clarity are good. The lower bound is a nice technical contribution.
Weaknesses
Unless I misunderstood, two datasets are neighboring if the set of possible rewards differ in only one location. I'm curious what the motivation for this is. For example, if we use clinical trials as the motivating example then each arm may correspond to a different treatment. In this case, I would view two datasets as neighboring if the set of observations differ at one step which could mean that all K arms at a single time step have different rewards. I am curious how this impacts the results of the paper.
Questions
See weaknesses above.
We are grateful for your positive feedback on our paper and that you find value in our work. We present our response to your comments below.
Q1: "Unless I misunderstood, two datasets are neighboring if the set of possible rewards differ in only one location. I'm curious what the motivation for this is. For example, if we use clinical trials as the motivating example then each arm may correspond to a different treatment. In this case, I would view two datasets as neighboring if the set of observations differ at one step which could mean that all arms at a single time step have different rewards. I am curious how this impacts the results of the paper."
A1: We thank the reviewer for this insightful question. In the revised version of our paper, we have created a new Appendix C to reply to this question. Briefly, our definition of DP is equivalent to that of table-DP appearing in [1] (by making a correspondence between the notations), and the latter precisely satisfies the reviewer's expectation of ``I would view two datasets as neighboring if the set of observations differ at one step which could mean that all arms at a single time step have different rewards.'' See Definition C.2 as well as the subsequent analyses for the details.
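For the reader's convenience, here is a schematic statement of the table-DP notion we refer to (our paraphrase with generic notation and bounded rewards assumed for concreteness; the precise statement is Definition C.2 of the revised paper): two reward tables $x, x' \in ([0,1]^{K})^{T}$ are neighbouring if they differ in at most one time step (possibly in all $K$ entries of that step), and a policy $\pi$ is $\epsilon$-table-DP if, for every such pair and every sequence of arm pulls $a_{1:T}$,
$$\mathbb{P}_{\pi}\big(A_{1:T} = a_{1:T} \mid x\big) \;\le\; e^{\epsilon}\, \mathbb{P}_{\pi}\big(A_{1:T} = a_{1:T} \mid x'\big).$$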
Reference
[1] Azize, A. and Basu, D. (2023). Interactive and concentrated differential privacy for bandits. In Sixteenth European Workshop on Reinforcement Learning.
Thanks for the response! I have no further questions.
This work studies how to make a best arm identification (BAI) policy differentially private. The main technique is to perturb each arm's empirical mean with appropriately calibrated Laplace noise.
Strengths
The work provides comprehensive details on establishing its theoretical claims.
Weaknesses
The empirical evaluation is limited to a particular synthetic data instance.
Questions
- Motivation for considering a DP-BAI algorithm. To my understanding, DP is for preventing potential privacy risks when sharing statistics of a dataset. It would be great for the authors to provide motivation for considering DP in the bandit problem, especially for the best arm identification task.
- Intuition on scheme (7). I had difficulty making sense of scheme (7) and would be grateful if the authors could provide explanations.
We thank the reviewer for engaging with our manuscript and providing useful feedback. We summarise your comments/questions and respond to them below.
Q1: The empirical evaluation is limited to a particular synthetic data instance.
A1: We acknowledge the limitation pointed out by the reviewer; this is indeed an aspect we are mindful of. The focus of our paper is primarily theoretical, aiming to establish and explore conceptual frameworks for differentially private BAI. Although the key intent of our work is to provide a strong theoretical foundation for future research and discussion, keeping in mind the importance of empirical validation, we benchmark the performance of our algorithm against existing non-private algorithms and private adaptations thereof. We hope the reviewer views this stance favourably.
Q2: ``It would be great for the authors to provide motivation to consider DP in bandit problem, especially on the best arm identification task.''
A2: Our conceptualization of differential privacy within the context of bandit problems aligns closely with established literature [1,2,3,4,5,6], particularly drawing inspiration from the work of Nikolakakis et al. [3]. While the primary objective of our work is fixed-budget BAI, it is noteworthy that the objectives in other relevant studies center around either regret minimization or fixed-confidence BAI. Importantly, these alternative settings necessitate unique analytical methodologies, yielding results distinct in nature from our own.
The underlying principle of privacy in the aforementioned body of literature, including our work, may be articulated as follows: an alteration in one of the rewards received by the learning agent should not largely influence subsequent decisions regarding the pulling of arms. To illustrate this point further, certain works posit a scenario where each time step is associated with an independent "user"; pulling arm $k$ at time $t$ provides the agent with user $t$'s "score" of arm $k$. If the score from a preceding user significantly impacts subsequent arm selections by the agent, later users could potentially infer the (private) scores of their predecessors by observing the arm selections. Such privacy leakage can be mitigated by applying differential privacy mechanisms tailored to bandit scenarios. In the context of the BAI task, the "agent" can be viewed as a surveyor seeking to identify the optimal item from a set of items in a market survey, while the "users" can be likened to customers participating in the survey.
Q3: "I had difficulty making sense of scheme (7) and would be thankful if the authors can provide explanations."
A3: We thank the reviewer for the comment. Scheme (7) defines the pre-configured number of active arms in each phase. Briefly, assuming the number of arms $K$ exceeds the dimension $d$, it works as follows. All $K$ arms are active in phase 1 (the default initialisation). As long as the number of active arms still exceeds $d$, the agent eliminates a fixed fraction of the current active arms at the end of each phase, so that the number of active arms shrinks from $K$ down to $d$ at an approximately constant rate; the number of phases devoted to this stage is chosen in (7) so that exactly $d$ active arms remain at its end. In each subsequent phase, the agent eliminates approximately half of the active arms, and the total number of phases is chosen so that only one active arm remains at the end. For clarity, we provide a numerical illustration of these specifications in the newly created Appendix B of the revised manuscript.
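To make the above description concrete, here is a small illustrative sketch (in Python) of a phase schedule of this flavour. It is not the paper's exact scheme (7): the shrinkage rate and the number of shrinking phases (set here to a hypothetical $\lceil \log_2(K/d) \rceil$ purely for illustration) are stand-ins for the paper's specifications.

```python
import math

def active_arm_schedule(K, d):
    """Illustrative phase schedule (NOT the paper's exact scheme (7)):
    shrink the active set from K down to d at a roughly constant rate,
    then halve it each phase until a single arm remains."""
    schedule = [K]                                   # phase 1: all K arms are active
    if K > d:
        p1 = max(1, math.ceil(math.log2(K / d)))     # hypothetical number of shrinking phases
        rate = (d / K) ** (1.0 / p1)                 # constant per-phase shrinkage factor
        for p in range(1, p1):
            schedule.append(max(d, math.ceil(K * rate ** p)))
        schedule.append(d)                           # exactly d active arms after this stage
    while schedule[-1] > 1:                          # thereafter, eliminate roughly half per phase
        schedule.append(max(1, math.ceil(schedule[-1] / 2)))
    return schedule

print(active_arm_schedule(K=32, d=4))   # [32, 16, 8, 4, 2, 1]
```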
References
[1] Mishra, N. and Thakurta, A. (2015, July). (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (pp. 592-601).
[2] Sajed, T. and Sheffet, O. (2019, May). An optimal private stochastic-mab algorithm based on optimal private stopping rule. In International Conference on Machine Learning (pp. 5579-5588). PMLR.
[3] Nikolakakis, K. E., Kalogerias, D. S., Sheffet, O., Sarwate, A. D. (2021). Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2(2), 534-548.
[4] Azize, A. and Basu, D. (2022). When privacy meets partial information: A refined analysis of differentially private bandits. Advances in Neural Information Processing Systems, 35, 32199-32210.
[5] Solanki, S., Kanaparthy, S., Damle, S. and Gujar, S. (2022, September). Differentially Private Federated Combinatorial Bandits with Constraints. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 620-637). Cham: Springer Nature Switzerland.
[6] Azize, A. and Basu, D. (2023). Interactive and concentrated differential privacy for bandits. In Sixteenth European Workshop on Reinforcement Learning.
Dear Reviewer mZMs,
We hope this message finds you well. The rebuttal period is scheduled to conclude on November 22nd, which is fast approaching. We are committed to engaging with all reviewers in discussion. We hope that our responses are satisfactory; if you need further clarification, please let us know. Thank you for your time and attention.
Best regards,
Authors of 3619
The authors propose and analyze an algorithm for the fixed-budget best-arm identification problem under differential privacy constraints. The algorithm is based on a differentially private version of the MAX-DET principle rather than on adapting established fixed-budget BAI algorithms. In the appendix, the authors provide a general analysis of a natural extension of one such algorithm and clearly benchmark their algorithm against this extension. The lower bounds also depend on new notions of complexity that incorporate differential privacy as a constraint in policy design.
Strengths
I think the authors provide a fundamental analysis of the problem in terms of the upper and lower bounds. The main strength of the contribution is demonstrated by proving that the algorithm matches instance-optimal bounds. Further, the algorithm idea is new in itself and clearly outperforms existing benchmarks (and straightforward adaptations thereof).
Weaknesses
I think the paper can be written more intuitively given that it has a lot of parameters. For example, while the algorithm is stated clearly, I am unsure why it works intuitively. What makes the apparent dimension go down for the first few rounds? How does decreasing the span of the basis vectors of the arm space lead to convergence to the optimal arm?
Questions
-- I am skeptical about their definition of DP since it is defined over length-$T$ reward sequences. I would expect that in the online setting this definition would only be defined over the sample sequence starting from the time when the reward sequences differ, as is done in joint differential privacy.
-- I would like an intuitive explanation of why their algorithm works on a small example somewhere in the paper (maybe, in an appendix).
-- Finally, the central idea of the paper is to use D-optimal design rather than G-optimal design. These ideas are theoretically equivalent (see Proposition 3 in Soare et al., NeurIPS 2014), but this paper demonstrates a dramatic performance improvement in their numerical experiments. What is the intuitive explanation for this?
We thank the reviewer for recognizing the value of our work and for your insightful comments. Below are our responses to your questions.
Q1: ``I am skeptical about their definition of DP since it is defined over length-$T$ reward sequences. I would expect that in the online setting this definition would only be defined over the sample sequence starting from the time when the reward sequences differ, as is done in joint differential privacy.''
A1: We thank the reviewer for this question. Our definition of DP indeed aligns with your expectation of ``over a sample of sequence starting from the time when the reward sequences differ''. Intuitively, this is because the future rewards have no impact on the past sequence of arm pulls. Below are further details.
Our definition of DP does not explicitly involve time steps; note that in Section 2, rewards are indexed by the arm and the number of times that arm has been pulled (i.e., the $i$-th reward from arm $k$), rather than by the classical time-step indexing (the reward from arm $k$ at time step $t$). In spite of this minor difference in the representation of rewards, our definition of DP is, in fact, equivalent to the notion of DP arising from the classical representation of rewards (which we call table-DP, borrowing the terminology from Azize et al. [1]). We prove this formally in the newly created Appendix C of the revised manuscript by introducing a dual decision maker who observes rewards indexed by time steps instead of by per-arm pull counts as in our work. We then show that for any DP policy of the decision maker as defined in our work, its dual decision maker's policy (or simply the dual policy) meets the table-DP criterion, and vice versa (see Propositions C.4 and C.5 in the revised paper).
We note here that table-DP is based on ``a sample of sequence starting from the time when the reward sequences differ'' highlighted by the reviewer; see Corollary C.3 of the newly created Appendix C for further details.
Q2: ``I would like an intuitive explanation of why their algorithm works on a small example somewhere in the paper (maybe, in an appendix). What makes the apparent dimension go down for the first few rounds? How does decreasing the span of the basis vectors of the arm space lead to convergence to the optimal arm?''
A2: We thank the reviewer for the suggestion. We have now provided additional explanations in the newly created Appendix B of our revised paper (highlighted in blue). Firstly, we would like to clarify that in the first few phases, the dimension may not go down, since the number of active arms is still at least the dimension $d$. Secondly, as pointed out by the reviewer, "decreasing the span" is indeed critical in identifying the best arm: it increases the probability that the private empirical mean of the best arm surpasses that of the others, as shown in Lemma E.2 of the original manuscript (Lemma G.2 of the revised version).
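As a side note, the following minimal sketch (not the DP-BAI algorithm itself, and with illustrative parameter choices) shows the Laplace-mechanism ingredient behind the private empirical means referred to above: changing a single reward in $[0,1]$ changes an $n$-sample empirical mean by at most $1/n$, so Laplace noise of scale $1/(n\epsilon)$ suffices for $\epsilon$-DP of that mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_empirical_mean(rewards, eps):
    """Laplace-noised empirical mean of rewards in [0, 1]; sensitivity is 1/n."""
    n = len(rewards)
    return float(np.mean(rewards)) + rng.laplace(scale=1.0 / (n * eps))

# Toy elimination step: keep the arms with the largest private empirical means.
eps = 1.0
samples = {k: rng.binomial(1, p, size=50) for k, p in enumerate([0.9, 0.6, 0.5, 0.4])}
scores = {k: private_empirical_mean(x, eps) for k, x in samples.items()}
survivors = sorted(scores, key=scores.get, reverse=True)[: len(scores) // 2]
print(survivors)   # with high probability contains arm 0, the best arm
```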
We remain open to elucidating any particular facet of our algorithm should the reviewer express a desire for further explanations.
Q3: ``Finally, the central idea of the paper is to use D-optimal design rather than G-optimal design. These ideas are theoretically equivalent (see Proposition 3 in Soare et al., NeurIPS 2014), but this paper demonstrates a dramatic performance improvement in their numerical experiments. What is the intuitive explanation for this?''
A3: As pointed out by the reviewer, D-optimal design and G-optimal design are indeed theoretically equivalent. However, our Max-Det principle is rather different from D-optimal design. Recall that a probability distribution $\lambda$ over a vector set $\mathcal{A} \subset \mathbb{R}^d$ is a D-optimal design of $\mathcal{A}$ if
$$\lambda \in \arg\max_{\lambda' \in \mathcal{P}(\mathcal{A})} \det\Big(\sum_{a \in \mathcal{A}} \lambda'(a)\, a a^\top\Big).$$
There are two primary differences between a D-optimal design and the Max-Det collection (defined in Definition 3.1). Firstly, the matrices whose determinants are computed are different. In D-optimal design, the matrix is a weighted sum of rank-one matrices of the form $a a^\top$ for $a \in \mathcal{A}$, whereas in Definition 3.1, the matrix is formed by stacking a subset of vectors from $\mathcal{A}$ as its columns. Secondly, in D-optimal design, the weight of each vector (i.e., $\lambda(a)$ for $a \in \mathcal{A}$) may be any value in $[0,1]$. In contrast, in Definition 3.1, the ``weight'' of each vector in $\mathcal{A}$ can be regarded as being either $0$ or $1$, depending on whether or not it belongs to the Max-Det collection.
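To illustrate this structural difference in code, here is a toy sketch with generic notation; neither routine is the paper's implementation, and the Frank-Wolfe iteration below is just one standard way to approximate a D-optimal design.

```python
import itertools
import numpy as np

def max_det_subset(A):
    """Max-Det-style selection (cf. Definition 3.1): choose d vectors from A whose
    d x d matrix (vectors stacked as columns) has maximal |det|; weights are 0/1."""
    d = A.shape[1]
    return max(itertools.combinations(range(len(A)), d),
               key=lambda idx: abs(np.linalg.det(A[list(idx)].T)))

def d_optimal_design(A, iters=500):
    """D-optimal design via Frank-Wolfe: fractional weights lam over all vectors,
    maximizing det(sum_a lam(a) * a a^T)."""
    n, d = A.shape
    lam = np.full(n, 1.0 / n)
    for t in range(iters):
        V = A.T @ (A * lam[:, None])                            # sum_a lam(a) a a^T
        g = np.einsum('ij,jk,ik->i', A, np.linalg.inv(V), A)    # a^T V^{-1} a per vector
        step = 1.0 / (t + 2)
        lam *= (1 - step)
        lam[int(np.argmax(g))] += step
    return lam

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.9], [0.5, -0.2]])
print(max_det_subset(A))               # a subset of d indices, i.e. 0/1 "weights"
print(d_optimal_design(A).round(3))    # fractional weights spread over all vectors
```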
We appreciate the reviewer's question for providing us with the opportunity to clarify the differences between D-optimal design and our Max-Det principle.
Reference
[1] Azize, A. and Basu, D. (2023). Interactive and concentrated differential privacy for bandits. In Sixteenth European Workshop on Reinforcement Learning.
This paper investigates the best arm identification (BAI) problem in multi-armed bandits with differential privacy (DP) guarantees. It focuses on the fixed-budget BAI setting, aiming to minimize the error probability of selecting a suboptimal arm within a set number of arm pulls (budget). Notably, the paper introduces DP-BAI, the first algorithm to achieve DP guarantees in the fixed-budget BAI context. This algorithm outperforms a naive approach that directly incorporates a DP mechanism into the existing state-of-the-art non-private BAI algorithm.
Strengths
This paper introduces the first algorithm to solve the BAI problem within a fixed-budget constraint and under DP guarantees. The paper also offers an extensive theoretical analysis, establishing an upper bound on the error probability of the new algorithm that adapts to the hardness of the problem instance. Furthermore, it provides a matching lower bound, demonstrating that the algorithm attains optimal performance in this specific setting.
Weaknesses
The paper lacks a detailed comparison with existing non-private fixed-budget BAI works. For example, a natural question is: does the error probability of DP-BAI converge to that of the state-of-the-art non-private counterpart as $\epsilon \to \infty$? Such an analysis would be valuable for understanding the trade-offs between privacy and performance.
Moreover, I personally believe that the writing of this paper, especially the algorithm description section, could be improved. Currently, the presentation uses a lot of notation, much of it without adequate explanation. A more detailed explanation of each notation and some high-level insights would significantly improve the paper's readability and effectively convey the core ideas.
Questions
- See Weakness 1: How does DP-BAI compare to OD-LinBAI, especially as $\epsilon \to \infty$?
- Page 4, Definition 3.1: Why a Max-Det collection is always linearly independent raises questions. Does this implicitly assume that the arm set spans $\mathbb{R}^d$?
- Regarding Algorithm 1:
- Lines 8, 14: The phrasing in these lines is confusing and seems inconsistent with earlier descriptions; the term "pull each arm XX times" is ambiguous. From the description, it appears that in line 8 every arm in the active set is pulled the stated number of times (so the total number of pulls is that number multiplied by the number of active arms), whereas in line 14 the stated quantity appears to be the total number of arm pulls, with each pull randomly choosing an arm from the active set.
- Can the authors provide some high-level intuition behind the choice of the parameters in equations (5) and (6)?
We extend our sincere gratitude for your careful review of our paper. We present our response to your comments below.
Q1: ``How does DP-BAI compare to OD-LinBAI, especially as $\epsilon \to \infty$?''
A1: As $\epsilon \to \infty$, the upper bound on the error probability of our DP-BAI algorithm and the corresponding upper bound for OD-LinBAI take the same exponential form and differ only through their respective hardness parameters. The hardness parameter appearing in our bound is at least as large as that of OD-LinBAI, so OD-LinBAI outperforms DP-BAI in the non-private setting (i.e., as $\epsilon \to \infty$). However, DP-BAI outperforms DP-OD (a private adaptation of OD-LinBAI), as demonstrated in Appendix C of our original manuscript (Appendix E of the revised version). These results are not surprising, as OD-LinBAI is specifically designed for non-private settings, whereas DP-BAI is specifically designed for private settings. Nonetheless, we would like to emphasize that our hardness parameter shares the same form as that of OD-LinBAI in that it only depends on the first $d$ sub-optimality gaps (as is the case for OD-LinBAI).
Q2: ``Does this implicitly assume that spans ?''
Q3: ``The phrasing in these lines is confusing and seems inconsistent with earlier descriptions.''
Response to Q2 and Q3: The reviewer is right on both counts. We apologize for the typos and have revised the highlighted sentences in Section 3 accordingly (please see the parts in blue). Briefly: (1) Yes, we only consider the case in which the arm set spans $\mathbb{R}^d$. (2) Yes, your understanding of Line 8 is correct, and "randomly" was a typo in the earlier description of Line 14; we have now rectified this in the revised version. We express our sincere gratitude to the reviewer for their careful reading of our manuscript.
Q4: ``Can the authors provide some high-level intuition behind the choice of the parameters in equations (5) and (6)?''
A4: The rationale behind these choices is rooted in their role as auxiliary parameters governing the number of active arms in each phase, as outlined in (7). Adhering to the principle of successive elimination (SE) [1,2], our algorithm aims to eliminate a specific proportion of active arms in each phase. The determination of this proportion in each phase is crucial for achieving the overall performance guarantee. In essence, when the number of active arms in a given phase significantly exceeds the dimension $d$, the agent eliminates a fixed proportion of active arms; otherwise, it eliminates roughly half of the active arms. The threshold $d$ is pivotal in the analysis of the Max-Det principle. The parameter in (4) is chosen to ensure that sufficiently many phases transpire before the number of active arms reduces to $d$ in the case where the number of arms exceeds $d$. This leads to the expressions in equations (5) and (6). These specifications are numerically illustrated in the newly created Appendix B for clarity.
References
[1] Karnin, Z., Koren, T., and Somekh, O. (2013, May). Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238-1246. PMLR.
[2] Yang, J., and Tan, V. (2022). Minimax Optimal Fixed-Budget Best Arm Identification in Linear Bandits. Advances in Neural Information Processing Systems, 35, 12253-12266.
Thank you for the detailed and adequate response.
This paper is the first to study best arm identification (BAI) in linear bandits with a fixed budget under differential privacy (DP) constraints. With extensive research focusing on DP for MAB, this paper approaches the problem through the different lens of "pure exploration". This work serves as a valuable complement to the current body of literature.
Strengths
To this end, the paper proposes a policy satisfying $\epsilon$-DP and provides an upper bound on the decay rate of the error probability. The paper also provides an almost-matching lower bound. An empirical evaluation is also provided to show the effectiveness of the algorithm.
Weaknesses
Although this is a nice work, I still suggest the paper provide more discussion on the connections between this problem and 1) BAI in the fixed-confidence regime, and 2) more generally, MAB under DP. I understand that, superficially speaking, these are different problems, but it is not very clear (to me) whether or not deeper connections exist. For example, there might be a simple adaptation of previous algorithms to suit this setting.
Questions
Please refer to the weaknesses.
We are grateful for your valuable feedback. We are truly honored and encouraged by the positive evaluation and constructive feedback you provided. Below is our response.
Q: ``I still suggest the paper provide more discussion on the connections between this problem to 1) BAI in the fixed-confidence regime, 2) and generally, MAB under DP. I understand superficially speaking they are different problems, but it is not very clear (to me) whether or not there exist some connections deeper. For example, there might be a simple adaptation of previous algorithms to suit this setting.''
A: Thank you for the suggestion. Below are some comparisons between the different bandit settings.
(1) Fixed-budget BAI and regret minimization: While regret minimization inherently entails an exploration vs. exploitation dilemma, BAI (encompassing fixed-budget and fixed-confidence regimes) on the other hand is centered around pure exploration. As astutely demonstrated in the work of Zhong et al. [1], no algorithm can perform optimally for both of the above objectives simultaneously. Thus, for instance, an algorithm achieving the minimum probability of error in identifying the best arm with a fixed budget is inevitably accompanied by a larger regret value, and vice versa; for more details, see [1, Theorem 5.1]. Consequently, in the nuanced context of our work, which involves additional considerations such as differential privacy, a mere transposition of ideas from regret minimization proves insufficient to devise algorithms with near-optimal performance for the rather distinct objective of BAI.
(2) Fixed-budget BAI and fixed-confidence BAI: In the domain of pure exploration problems, two pivotal considerations, namely (a) the probability of error and (b) the stopping time required to ascertain the best arm, compete with one another. In the fixed-budget regime, a deterministic time $T$ (a.k.a. the budget) is fixed, and the goal is to minimise the probability of error in determining the best arm after $T$ time steps. On the other hand, in the fixed-confidence regime, given a threshold $\delta$ (a.k.a. the confidence level), the goal is to minimise the expected stopping time subject to the error probability not exceeding $\delta$. A notable challenge in fixed-confidence scenarios lies in the determination of the expected stopping time, which is typically unknown to the agent. This is because the upper bound on the expected stopping time is, in general, a function of an unknown problem-specific ``hardness'' parameter, as evidenced in various works such as [2,3,4,5]. Consequently, establishing a straightforward relationship between the confidence level $\delta$ in the fixed-confidence regime and the budget $T$ in the fixed-budget regime is substantially intricate.
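In standard form (using generic notation, not necessarily that of our paper), the two objectives can be written as
$$\text{fixed budget:}\quad \min_{\pi}\ \mathbb{P}_{\pi}\big(\hat{a}_{T} \neq a^{*}\big) \quad \text{with the horizon } T \text{ fixed in advance,}$$
$$\text{fixed confidence:}\quad \min_{\pi}\ \mathbb{E}_{\pi}[\tau] \quad \text{subject to}\quad \mathbb{P}_{\pi}\big(\hat{a}_{\tau} \neq a^{*}\big) \le \delta,$$
where $a^{*}$ denotes the best arm, $\hat{a}$ the recommended arm, and $\tau$ the (random) stopping time.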
Furthermore, it is noteworthy that while the complexity of fixed-confidence BAI has been comprehensively characterised in the asymptotic limit as $\delta \to 0$ by Garivier and Kaufmann [2], a corresponding characterisation for fixed-budget BAI (where the asymptotics are as $T \to \infty$) remains an open problem, as articulated by Qin [6]. Hence, the transposition of ideas from fixed-confidence BAI into the realm of fixed-budget BAI appears nontrivial due to these fundamental disparities.
Finally, we would like to emphasize that fixed-budget BAI remains a considerable challenge in multi-armed bandits, and continues to be actively investigated; see, for instance, the recent works [6,7,8,9,10]. To the best of our knowledge, our work is the first attempt towards incorporating DP considerations within the framework of fixed-budget BAI. The amalgamation of these two aspects, namely fixed-budget BAI and DP, introduces numerous analytical challenges. Despite these challenges, we provide a complete characterization of the exponent of the probability of error up to universal constants.
[1] Zhong, Z., Cheung, W. C., and Tan, V. (2023). Achieving the Pareto frontier of regret minimization and best arm identification in multi-armed bandits. Transactions on Machine Learning Research.
[2] Garivier, A., and Kaufmann, E. (2016, June). Optimal best arm identification with fixed confidence. In Conference on Learning Theory (pp. 998-1027). PMLR.
[3] Wang, P. A., Tzeng, R. C., and Proutiere, A. (2021). Fast pure exploration via Frank--Wolfe. Advances in Neural Information Processing Systems, 34, 5810-5821.
[4] Nikolakakis, K. E., Kalogerias, D. S., Sheffet, O., Sarwate, A. D. (2021). Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2(2), 534-548.
[5] Hou, Y., Tan, V. Y., and Zhong, Z. (2022). Almost optimal variance-constrained best arm identification. IEEE Transactions on Information Theory, 69(4), 2603-2634.
[6] Qin, C. (2022, September). Open Problem: Optimal Best Arm Identification with Fixed-Budget. In Conference on Learning Theory (pp. 5650-5654). PMLR.
[7] Yang, J., and Tan, V. (2022). Minimax Optimal Fixed-Budget Best Arm Identification in Linear Bandits. Advances in Neural Information Processing Systems, 35, 12253-12266.
[8] Komiyama, J., Tsuchiya, T., and Honda, J. (2022). Minimax optimal algorithms for fixed-budget best arm identification. Advances in Neural Information Processing Systems, 35, 10393-10404.
[9] Barrier, A., Garivier, A., and Stoltz, G. (2023, February). On best-arm identification with a fixed budget in non-parametric multi-armed bandits. In International Conference on Algorithmic Learning Theory (pp. 136-181). PMLR.
[10] Lalitha, A. L., Kalantari, K., Ma, Y., Deoras, A., and Kveton, B. (2023, July). Fixed-Budget Best-Arm Identification with Heterogeneous Reward Variances. In Uncertainty in Artificial Intelligence (pp. 1164-1173). PMLR.
This paper studies best arm identification (BAI) in linear bandits with a differential privacy (DP) constraint. Here there are $K$ arms with feature vectors in $\mathbb{R}^d$, where the reward in each round is a noisy observation of the inner product between the pulled arm's feature vector and an unknown parameter vector; the differential privacy constraint is with respect to changing a single reward value. The goal is to output the arm with the largest (expected) reward with as high a probability as possible.
The paper provides an algorithm and a lower bound for the problem. As usual, the algorithm keeps a list of active arms and gradually reduces this list. The main innovation is that, instead of querying all arms (and adding noise to their rewards to achieve DP), the authors show that, when the number of active arms is larger than the dimension $d$, it suffices to query only a small subset of just $d$ arms; this subset is selected by maximizing the determinant of the submatrix corresponding to the selected arms. The rewards of the other arms can be deduced from these by taking appropriate linear combinations. The authors give an analysis of this algorithm in which the probabilistic error bound depends on an appropriate notion of (instance-wise) gap. The authors then give a lower bound that, for the regime of constant $\epsilon$, matches the upper bound to within roughly a squared factor in the exponent of the error probability. Finally, the authors provide an empirical evaluation on synthetic data; the algorithm is shown to be much superior to trivially adding noise to the best known non-private algorithm (Yang and Tan, NeurIPS 2022), and it nearly matches the non-private algorithm already for moderate values of $\epsilon$.
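To illustrate the "deduce by linear combinations" idea in the summary above, here is a minimal sketch (with made-up feature vectors and a made-up parameter vector for simulation; not the paper's estimator): the mean reward of any arm is a fixed linear combination of the mean rewards of $d$ arms whose feature vectors span $\mathbb{R}^d$, so only those $d$ arms need to be pulled.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]])   # three arms, dimension d = 2
theta = np.array([0.8, 0.2])                         # unknown parameter (simulation only)

basis_idx = [0, 1]                                   # a spanning (e.g. Max-Det) subset
B = A[basis_idx].T                                   # d x d matrix, columns = basis arms
coeffs = np.linalg.solve(B, A[2])                    # arm 3's vector equals B @ coeffs
basis_means_hat = A[basis_idx] @ theta + 0.01 * np.random.randn(2)  # noisy estimates
arm3_mean_hat = coeffs @ basis_means_hat             # deduced without pulling arm 3
print(arm3_mean_hat, A[2] @ theta)                   # estimate vs. true mean 0.62
```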
Strengths
- BAI is a natural setting and this paper is the first paper to study BAI with DP constraints. (Previous DP bandit papers focus on regret minimization.)
- As far as I can tell, the idea of using determinant maximization is novel in the context of BAI.
- The paper provides rigorous analysis of upper and lower bounds that nearly match for the most interesting regime of constant $\epsilon$.
- The paper also provides a clear explanation and analysis of why trivially adding noise to the best known non-private algorithm (Yang and Tan, NeurIPS 2022) leads to a worse algorithm than the one proposed here.
- The algorithm is simple and practical; this is shown by the empirical results in the paper.
Weaknesses
- The performance of the algorithm still does not match the best known non-private algorithm as we take $\epsilon \to \infty$.
- The writing can be improved. E.g., the algorithms are not explained in words, and most analyses are given in the appendix without much intuition in the main body.
Why Not a Higher Score
Given that the performance of the algorithm still does not match the best known non-private algorithm as we take $\epsilon \to \infty$, it is likely that this is not yet the best approach to tackle the problem.
Why Not a Lower Score
As stated above, this is a natural and well-studied setting in bandits literature that is somehow overlooked by the DP community so far. I think this is a good first paper for the topic of DP-BAI, which provides some initial (and innovative) results on the problem. The analyses are rigorous and, as far as I can tell, seem correct.
Accept (poster)