Non-asymptotic Analysis of Stochastic Gradient Descent under Local Differential Privacy Guarantee
We provide a comprehensive non-asymptotic analysis of the convergence of the DP-SGD algorithm, where individual users retain the autonomy to specify their differential privacy budgets.
Abstract
Reviews and Discussion
This paper analyzes the benefit of weight averaging (specifically, Polyak-Ruppert averaging) in private optimization. Specifically, it considers the strongly/general convex + smooth setting along with Hessian smoothness albeit only w.r.t. the optimum (i.e., Condition 3). Under these assumptions, it is shown that the (non-asymptotic) convergence bound of the Polyak-Ruppert averaging scheme is better than that of the last iterate with suitable step-size choices. Some small-scale experiments are shown to corroborate the theoretical findings.
Strengths
Solid theoretical analysis showing the improvement offered by the averaged iterate over the last iterate in private optimization. I'm not up to speed on all of the recent papers in private optimization, but as far as I know, there are no results like the ones in this paper showing the benefits of weight averaging in private optimization. The paper just makes the cut for me because of this.
Weaknesses
1. The presentation of the theoretical results needs to be improved and simplified. For example, Theorem 4 is very hard to parse and overloaded with too many symbols. I'd recommend deferring the full versions to the appendix and presenting abridged versions, containing only the important terms, in the main paper. I would also have liked to see a remark explicitly comparing the convergence bounds of the averaged iterate and the last iterate side by side (and therefore explaining why the averaged iterate does better), rather than leaving this to the reader.
2. Looking at Theorem 4, the initialization-dependent term decays more slowly than the corresponding term in Theorem 3. If our initialization is very far from the optimum, this may make the averaged iterate converge worse than the last iterate. This point should be discussed in the paper.
3. I understand that this is a theoretical paper, so I don't want to complain too much about the scale of the experiments, but for this proposal to be more convincing and practically useful, the authors should consider performing some larger experiments -- beyond linear/logistic regression and definitely at larger scale.
4. It would be nice to also provide some intuitive explanation of why averaging helps, instead of just the math.
Suggestion: the expectation notation is a bit non-standard.
Questions
1. Is Condition 3 being used in Theorem 3? I don't see any term from it in the statement of Theorem 3.
2. Why are there expectations over the initial conditions in the results? These quantities are deterministic, right? Or are you taking the expectation w.r.t. something else here?
3. What are the unspecified constants in Theorem 4? Overall, this theorem needs to be simplified and cleaned up, as I mentioned in the Weaknesses.
4. Can the results be extended to regular (ε, δ)-DP? Or is there a reason the authors are considering GDP?
Thank you for the constructive feedback, which has been very helpful in enhancing the quality of our paper. We respond below to your questions and concerns:
W1: We acknowledge the complexity of the theoretical expressions in our work and appreciate the reviewer's observation. Non-asymptotic analysis is notably challenging, particularly when it comes to deriving tractable theoretical results from complex estimation bounds, and the resulting expressions can make the convergence behavior of the estimators hard to follow. Our results nevertheless provide a clear and accurate bound on the estimation error: they identify the leading rate of convergence, the diminishing influence of the initial conditions, and how factors such as the dimensionality and the privacy budget affect the bounds. To facilitate understanding, we have included detailed explanatory remarks, and in the revised version we will further simplify these expressions. For a more comprehensive overview, we have also added Table 1, which summarizes the leading convergence rates for each situation addressed in our work and offers a clearer comparison with existing work.
W2: We appreciate the reviewer's concern and apologize for any confusion. Indeed, there is a notable difference in how quickly the influence of the initial conditions diminishes in the two scenarios, particularly with smaller step sizes: as outlined in Remarks 1 and 2, the averaged version 'forgets' the initial values more slowly than the last-step estimator. For a more comprehensive comparison of the convergence rates of the two methods, we direct readers to the detailed results and discussion in Paragraph 1 on Page 9 of our paper, and to Table 1 for a summary of the convergence rates in the different situations.
W3: Thank you for your feedback. Our choice to utilize linear and logistic regression models is grounded in their widespread use and representativeness in regression and classification problems, respectively. These models also exemplify the two categories of loss functions our paper explores: strongly convex and non-strongly convex. Furthermore, we have conducted simulations with higher-dimensional data, the details of which are available in the supplementary materials.
W4: Intuitively, our procedure injects a zero-mean noise term in each iteration. This leads to considerable variance when the last iterate is used as the estimator. By averaging the iterates, however, we significantly reduce the variance induced by this noise; the averaged estimator is therefore both more stable and of smaller variance.
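To make this intuition concrete, here is a toy simulation (our illustration, not code or notation from the paper) of a one-dimensional noisy recursion: over repeated runs, the last iterate fluctuates far more around the optimum than the Polyak-Ruppert average.

```python
import random
import statistics

def noisy_sgd(n_steps=2000, sigma=1.0, step=0.1, seed=0):
    """Toy noisy recursion theta <- theta - step * (theta + noise).

    The minimizer of the underlying quadratic loss is 0, so all
    fluctuation in the iterates comes from the injected Gaussian noise.
    Returns (last iterate, Polyak-Ruppert average of the iterates).
    """
    rng = random.Random(seed)
    theta, avg = 0.0, 0.0
    for t in range(1, n_steps + 1):
        theta -= step * (theta + rng.gauss(0.0, sigma))
        avg += (theta - avg) / t  # running mean of the iterates
    return theta, avg

# Over many independent runs, the averaged iterate concentrates
# much more tightly around the optimum (0) than the last iterate.
lasts, avgs = zip(*(noisy_sgd(seed=s) for s in range(200)))
print(statistics.pstdev(lasts), statistics.pstdev(avgs))
```

With these (arbitrary) settings the spread of the last iterates is roughly an order of magnitude larger than that of the averaged iterates, matching the variance-reduction argument above.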
Q1: We thank the reviewer for reading our paper carefully. Indeed, Condition 3 is not used in the proof of Theorem 3; we have corrected this in the revision.
Q2: We apologize for the misleading notation. To clarify, the initial conditions referred to are fixed numbers.
Q3: These constants are associated with the loss function, as detailed on Page 4, Paragraph 2 of the supplementary material. We will present our results in a more simplified manner in the revised version.
Q4: As noted in Section 2.4 of [1], f-DP is a generalization of (ε, δ)-DP, so this result can be extended to the regular DP setting. Meanwhile, in our online setting we use the parallel composition approach rather than relying on the composition property of GDP; in this case, incorporating the classical definition of LDP seems more appropriate. Theoretically, it is possible to go beyond Gaussian noise by imposing boundedness assumptions on the noise moments when considering general LDP. In our revised version, we will thoroughly integrate the classical LDP notion alongside GDP.
[1] Dong, Jinshuo, Aaron Roth, and Weijie J. Su. "Gaussian differential privacy." Journal of the Royal Statistical Society Series B: Statistical Methodology 84.1 (2022): 3-37.
Thanks for the rebuttal. I'll keep my score. But since I can't strongly endorse this paper, I'll defer to the opinion of the other reviewers.
This paper provides convergence results for SGD on Lipschitz functions under local differential privacy, in both the strongly convex and the non-strongly convex setting.
Strengths
The paper provides detailed proofs of its theoretical results, along with several numerical simulations.
Weaknesses
- I think the authors may not have provided a good explanation of their motivation for using GDP instead of the classical LDP notion; see also Question 1.
- Both the results and the proof techniques in this paper are similar to those of [1] in the non-private setting. It appears that the introduction of additional Gaussian noise does not pose significant challenges to the analysis. However, the authors do not explicitly compare their paper to [1] or to other non-private work in either the proofs or the results.
[1] Moulines & Bach, Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, 2011
Questions
I have several questions on the paper's motivation and theoretical results:
- Why did the authors choose to employ the GDP notion rather than LDP in the paper? As far as I understand, GDP's primary advantage is its tighter composition guarantee. However, in the setting of this paper, where each user's data passes through the algorithm only once, composition does not come into play. Consequently, I am struggling to understand the underlying motivation for and advantages of adopting GDP.
- Concerning the theoretical results under a 1/t step size: in the convergence-rate section of Remark 1, the authors assert that convergence of the LDP-SGD estimator is not assured in this scenario. Nevertheless, to my knowledge, at least in the strongly convex setting, a convergence analysis of LDP-SGD with an O(1/t) step size seems readily attainable by adapting the proof provided in [1]. Have I possibly overlooked any technical complexities?
[1] Alexander Rakhlin, Ohad Shamir, Karthik Sridharan, "Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization," 2011.
Thank you for the constructive feedback, which has been very helpful in enhancing the quality of our paper. We respond below to your questions and concerns:
W1 & Q1: Thank you for your comments on the use of GDP rather than classical DP in our work. We recognize that classical DP is indeed a more general framework: as outlined in Section 2.4 of [1], f-DP, which includes GDP, is an extension of (ε, δ)-DP, which allows us to extend our results to classical DP scenarios. It is also theoretically feasible to expand our analysis to noise distributions beyond Gaussian, which is part of our ongoing research.
Our choice of GDP in this work is driven by two main factors. First, there is growing interest in GDP, with numerous proposed algorithms that lack theoretical convergence guarantees; our work aims to bridge this gap by providing a foundation for the non-asymptotic analysis of these algorithms. Second, GDP enjoys advantageous theoretical properties in asymptotic analysis, as detailed in [2], prompting us to offer a corresponding non-asymptotic analysis for GDP. In conclusion, while classical DP offers a broader context, our focus on GDP is intentional and aligned with the specific aims and scope of the paper.
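As a concrete illustration of how a GDP guarantee translates to classical DP, the μ-GDP to (ε, δ)-DP duality stated in [1] can be evaluated numerically; a small self-contained sketch (our code, assuming the formula δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2) from [1]):

```python
from math import erf, exp, sqrt

def Phi(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gdp_to_dp_delta(mu, eps):
    """delta(eps) such that a mu-GDP mechanism satisfies
    (eps, delta(eps))-DP for every eps >= 0 (duality in [1])."""
    return Phi(-eps / mu + mu / 2) - exp(eps) * Phi(-eps / mu - mu / 2)

# A 1-GDP mechanism, read off at eps = 1:
print(gdp_to_dp_delta(1.0, 1.0))  # ~ 0.1269
```

The conversion is monotone: for a fixed μ, the achievable δ shrinks as ε grows, which is why a single GDP parameter summarizes a whole curve of classical DP guarantees.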
W2: Regarding the technical details of the online SGD algorithm, we recognize that specialists in the field may note structural similarities between proofs across various studies. Such similarities often stem from the use of standard methods and tools within the discipline. Specifically, in our convergence analysis of the DP-SGD estimator, the proof structure mirrors that of the non-private version, a similarity we only discerned after completing our analysis; it arises from the inherent recursive nature of the SGD algorithm.
Also, we wish to clarify that our work does not claim to introduce novel proof techniques. Rather, our contribution lies in applying established methodologies from the extensive literature to the new and distinct challenges posed by online LDP problems. Our focus is on solving these new problems with existing tools, not on the novelty of the proof techniques themselves. Indeed, the non-asymptotic analysis of online LDP is inherently challenging, and tractable theoretical results were not available before we developed these theorems.
Q2: In our exploration of the theoretical outcomes under a 1/t step size, we found that the convergence of the algorithm hinges on specific conditions. These conditions are derived mathematically and involve, among other factors, the relationship between the eigenvalues of the loss function's Hessian matrix and the initial step size. Additionally, our simulation studies highlight that these conditions often go unmet in practice, leading to situations where convergence is not guaranteed.
This distinctive convergence behavior in this step-size regime for the private versus the non-private case further underscores an important aspect of our work. It demonstrates that even though the structures and tools of the proofs may be similar, the outcomes in private settings can differ significantly from those in non-private ones. This distinction not only highlights the uniqueness of our results but also reinforces the value of extending non-asymptotic analysis to the private version of the online SGD algorithm.
[1] Dong, Jinshuo, Aaron Roth, and Weijie J. Su. "Gaussian differential privacy." Journal of the Royal Statistical Society Series B: Statistical Methodology 84.1 (2022): 3-37.
[2] Avella-Medina, Marco, Casey Bradshaw, and Po-Ling Loh. "Differentially private inference via noisy optimization." arXiv preprint arXiv:2103.11003 (2021).
This paper focuses on the non-asymptotic analysis of the convergence of the DP-SGD (Differentially Private Stochastic Gradient Descent) algorithm and its variants. The authors analyze the convergence of the DP-SGD algorithm and provide practical guidelines on the effect of various hyperparameters such as step size, parameter dimensions, and privacy budgets on convergence rates. The paper provides theoretical bounds on the expected distance between the estimators and the global optimum for strongly convex loss functions. For non-strongly convex functions, the paper analyzes the difference between the loss incurred by the estimators and the optimal loss.
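For reference, the kind of one-pass LDP-SGD procedure under discussion can be sketched as follows. This is a minimal illustration under assumed notation (a scalar quadratic per-user loss and Gaussian privacy noise), not the paper's actual algorithm or code:

```python
import random

def ldp_sgd(samples, noise_scale, step, theta0=0.0):
    """One-pass noisy SGD for the scalar quadratic loss
    l(theta; x) = (theta - x)^2 / 2.

    Each user's sample is visited exactly once, and zero-mean Gaussian
    noise of scale noise_scale is added to every gradient as the local
    privacy perturbation.  Returns both estimators discussed here:
    the last iterate and the Polyak-Ruppert average.
    """
    theta, avg = theta0, 0.0
    for t, x in enumerate(samples, start=1):
        grad = theta - x                        # gradient of the loss
        noise = random.gauss(0.0, noise_scale)  # privacy noise
        theta -= step(t) * (grad + noise)
        avg += (theta - avg) / t                # running average
    return theta, avg
```

With `noise_scale = 0` this reduces to plain one-pass SGD; the hyperparameters the paper analyzes (step-size schedule `step`, noise scale tied to the privacy budget, and the choice between the two returned estimators) are all visible in this sketch.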
Strengths
- The paper provides a comprehensive understanding of the convergence behavior of DP-SGD and helps practitioners choose appropriate hyperparameters for their specific use cases.
- The paper contains a fine-grained analysis of both the strongly convex and the non-strongly convex case.
Weaknesses
- The paper does not provide any lower bounds, so it is hard to evaluate the tightness of the results.
- This paper lacks comprehensive comparisons to prior works. There are many works on stochastic gradient descent under a local differential privacy guarantee, for example, Algorithm 4 in Wang, Teng, et al., "Local differential privacy for data collection and analysis," Neurocomputing 426 (2021): 114-133, and Algorithm 1 in Liu, Ruixuan, et al., "FedSel: Federated SGD under local differential privacy with top-k dimension selection," DASFAA 2020. From my point of view, this is not the first work on LDP-SGD, and therefore a comprehensive comparison is expected in this paper. I suggest the authors add a table including previous results and this work. Also, the upper bounds in the paper are unnecessarily long; for simplicity and better rendering I suggest keeping only the leading asymptotic term.
- The technical contribution is relatively weak, as I could not find sufficient novelty in the proof techniques.
- The presentation of this work could be enhanced. Specifically, there appears to be an issue with the visibility of equations 3 and 4, which I presume are the upper bounds in Theorems 3 and 4. This could potentially be the result of a compiling error.
Questions
- There is an obvious typo in the proof: on page 1 of the supplementary material, "For the second term of (1), by Condition ??, we have…"; here Condition ?? should be Condition 2.
- Why do the authors particularly favor the Polyak-Ruppert averaging estimator in the theoretical analysis? It would also be interesting to see results for other variants of SGD that give better performance.
We appreciate the reviewer's constructive suggestion.
W1: We agree with the reviewer on the importance of a lower bound for evaluating and understanding the estimator's behavior. However, establishing a lower bound for parameter estimation in our specific context, and in similar ones, is exceptionally challenging and remains an open problem. To our knowledge, there is a substantial gap in the literature and methodology directly addressing this challenge; the existing non-asymptotic analyses [1, 2] provide only upper bounds, which indirectly highlights the inherent difficulty of establishing a lower bound.
Indeed, theoretical analysis of LDP algorithms, particularly of their convergence, is generally sparse. Our study distinguishes itself from existing work in LDP, which primarily concentrates on developing algorithms that achieve set privacy benchmarks but often overlooks their convergence behavior. In contrast, we not only examine but also quantify the rate and manner in which these DP algorithms approach the optimal solution, which we view as a significant step forward in the realm of DP.
W2: We thank the reviewer for pointing us to additional work on LDP. However, the referenced papers address different settings within local differential privacy. For instance, 'Local differential privacy for data collection and analysis' focuses on mean estimation, in contrast to the general parameter-estimation problem our study addresses. Similarly, 'FedSel: Federated SGD under local differential privacy with top-k dimension selection' considers private dimension-selection mechanisms and adapts a gradient-accumulation technique to stabilize learning with noisy updates in a distributed framework, whereas our work is framed as online parameter estimation under LDP. Given these different scenarios, direct comparisons within a uniform framework are difficult. Nonetheless, for a clearer comparison, we have introduced a table in our revision that details the leading convergence rates for each situation addressed in our study, offering a more comprehensive view of our contributions relative to existing work.
W3: Regarding the technical details of the online SGD algorithm, we recognize that specialists in the field may note structural similarities between proofs across various studies. Such similarities often stem from the use of standard methods and tools within the discipline. Specifically, in our convergence analysis of the DP-SGD estimator, the proof structure mirrors that of the non-private version, a similarity we only discerned after completing our analysis; it arises from the inherent recursive nature of the SGD algorithm.
Also, we wish to clarify that our work does not claim to introduce novel proof techniques. Rather, our contribution lies in applying established methodologies from the extensive literature to the new and distinct challenges posed by online LDP problems. Our focus is on solving these new problems with existing tools, not on the novelty of the proof techniques themselves. Indeed, the non-asymptotic analysis of online LDP is inherently challenging, and tractable theoretical results were not available before we developed these theorems.
Finally, in our exploration of the theoretical outcomes under a 1/t step size, we found that the convergence of the algorithm hinges on specific conditions, which differs from the non-private case. This demonstrates that even though the structures and tools of the proofs may be similar, the outcomes in private settings can differ significantly from those in non-private ones. This distinction not only highlights the uniqueness of our results but also reinforces the value of extending non-asymptotic analysis to the private version of the online SGD algorithm.
W4: We express our sincere gratitude to the reviewer for identifying the editing error in the cross-reference links. Our main emphasis has been on the thorough derivation of the theoretical results to ensure their accuracy; we acknowledge the importance of clear and precise presentation and will improve this aspect in the final manuscript.
Continued:
Q1: Thank you for spotting the error. In this revision, following your suggestion, we have revised it.
Q2: The Polyak-Ruppert averaging method greatly improves the stability and efficiency of stochastic approximation algorithms by averaging the iterate sequence. It has been shown to achieve the information-theoretically optimal asymptotic variance, particularly when slowly decaying learning rates are coupled with uniform averaging. Meanwhile, Polyak-Ruppert averaging remains the most popular and user-friendly variant of online SGD; its widespread use underscores its efficiency and ease of application. While this method is our primary focus, we recognize the value of investigating other variants, such as weighted averaging, in future research.
[1] Korba, Anna, et al. "A non-asymptotic analysis for Stein variational gradient descent." Advances in Neural Information Processing Systems 33 (2020): 4672-4682.
[2] Zhou, Lijia, et al. "A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models." Advances in Neural Information Processing Systems 35 (2022): 21286-21299.
Thanks for your response. However, I will not change my score, for the following reasons. (1) I am not sure about the motivation of the paper: it focuses only on LDP-SGD where each iteration uses a single sample. However, in practice, to the best of my knowledge, there is no application using one-pass SGD, so why does the paper need to study this algorithm? (2) The assumptions in the paper are quite strong compared to many works on DP-SCO: strong convexity, smoothness, and the additional Condition 3. Studying DP-SGD for strongly convex functions is less interesting now.
The paper claims to perform a comprehensive non-asymptotic analysis of the convergence guarantees of the DP-SGD algorithm. Rightly so: if done correctly, this provides guidelines for practitioners on how best to choose various hyperparameters to achieve certain convergence rates. Considering that hyperparameter tuning is one of the important problems in practice, such a study is definitely worth pursuing.
Strengths
The strength of the paper is in carrying out the tedious calculations behind the non-asymptotic analysis of the convergence rate.
Weaknesses
While the major motivation of the paper is to provide practitioners with a guideline for choosing hyperparameters, I really (like, really) do not see how one could work out these hyperparameters from the expressions in the theorem statements, which I find extremely hard to parse. One benefit of asymptotic analysis is that it makes expressions easier to parse. If the authors insist on stating exact non-asymptotic bounds, I would suggest they work a bit more on simplifying the expressions. Having experience with practitioners, I can guarantee that no one would take these theorems and try to work out hyperparameters from them, at least not analytically. Nesterov and Nemirovski's results are not influential because they give asymptotic bounds, but because they give simple-to-state bounds that, for certain loss functions, are not far from what one gets in practice.
I also find the assumption on the Hessian a bit too much.
The paper has several typos. One glaring one is citing the same set of authors twice on page 2 (second paragraph). Another that comes to mind is the missing cross-reference on the first page of the supplementary material. The one that really annoyed me is the references in the Remarks: it does not take a lot of effort to write "Theorem 3" instead of "(3)". It shows a lack of diligence in the writing of the paper.
Questions
None. Please read the weakness section.
We thank the reviewer for the constructive comments. First, we want to clarify that "guiding practitioners in hyperparameter selection" is not the motivation of our work, although the theorems we present can address this issue. The aim of this paper is to study the non-asymptotic behavior of the Differentially Private Stochastic Gradient Descent (DP-SGD) estimator and its variants under local differential privacy with finite samples. The main motivations of our paper are twofold.
- Bounds for finite samples: Unlike traditional asymptotic analysis, in which a bound is only valid once the sample size exceeds a certain threshold, our non-asymptotic bound is valid for every finite sample size, which makes it more relevant for practitioners seeking to understand parameter convergence. Examining finite-sample behavior is an important task [1, 2, 3], as sample sizes in practical problems are always finite. Obtaining such results usually requires more mathematical effort than obtaining asymptotic results and typically involves more restrictive assumptions.
- In-depth exploration of the convergence rate: Our study provides precise bounds and convergence rates of DP-SGD estimators to the optimum with finite samples. Unlike existing studies in local differential privacy, which primarily focus on developing algorithms that satisfy the desired privacy guarantees but often overlook their convergence properties and rates, our study fills this gap. We examine how these algorithms converge to the optimum and, importantly, how fast this convergence occurs, providing a more comprehensive understanding of their performance.
We acknowledge the complexity of the theoretical expressions in our work and appreciate the reviewer's observation. Non-asymptotic analysis is notably challenging, particularly when it comes to deriving tractable theoretical results from complex estimation bounds, and such complexity can make the convergence behavior of the estimators hard to follow. Our results nevertheless provide a clear and accurate bound on the estimation error: they identify the leading rate of convergence, the diminishing influence of the initial conditions, and how factors such as the dimensionality and the privacy budget affect the bounds. To facilitate understanding, we have included detailed explanatory remarks, and in the revised version we will further simplify these expressions. For a more comprehensive overview, we have also added Table 1, which summarizes the leading convergence rates for each situation addressed in our work and offers a clearer comparison with existing work.
We sincerely accept the reviewer's criticism of the Hessian condition. Requirements on the Hessian matrix are commonly used in analyzing the convergence of SGD algorithms; see [4]. We note that this condition is easy to verify for the two classical examples, the quadratic loss and the logistic loss. Indeed, the condition can be relaxed by imposing assumptions on the martingale difference terms, as in Assumption 3.2 of [4], which we are currently working on. Moreover, this condition underscores the difficulty of obtaining non-asymptotic results, thereby reinforcing the significance of our contribution.
Finally, we are grateful to the reviewer for pointing out the editing error in the cross-reference links. While we regret any confusion caused, we believe it does not reflect a lack of diligence in our work: our primary focus has been on the meticulous derivation of the theoretical results and on ensuring their correctness. We therefore hope the compilation error will not overshadow the contribution of our work.
[1] Korba, Anna, et al. "A non-asymptotic analysis for Stein variational gradient descent." Advances in Neural Information Processing Systems 33 (2020): 4672-4682.
[2] Zhou, Lijia, et al. "A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models." Advances in Neural Information Processing Systems 35 (2022): 21286-21299.
[3] Chen, Sitan, Giannis Daras, and Alex Dimakis. "Restoration-degradation beyond linear diffusions: A non-asymptotic analysis for ddim-type samplers." International Conference on Machine Learning. PMLR, 2023.
[4] Chen, Xi, et al. "Statistical inference for model parameters in stochastic gradient descent." Annals of Statistics 48.1 (2020): 251-273.
The paper gives detailed non-asymptotic bounds on the convergence of a one-pass noisy SGD algorithm with individual privacy guarantees. It considers two variants of the algorithm, one returning the last iterate and another an appropriate average. The stated motivation of the work is to give practitioners tools to pick hyperparameters. However, as pointed out by the reviewers, there is no actual practitioner-friendly advice, but rather cumbersome, non-human-interpretable bounds. The empirical part discusses the effect of some parameter choices on the rates rather than actual suboptimality bounds. As a result, additional evaluation and a major revision would be needed to make a more convincing case for the significance of this contribution.
Why not a higher score
as above
Why not a lower score
n/a
Reject