PaperHub
ICLR 2024 · Poster · 3 reviewers
Average rating: 6.7/10 (ratings 8 / 6 / 6; min 6, max 8, std 0.9) · Average confidence: 3.3

Tractable MCMC for Private Learning with Pure and Gaussian Differential Privacy

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-04-15
TL;DR

We propose an approximate Monte Carlo sampling method to achieve pure and Gaussian differential privacy.

Abstract

Posterior sampling, i.e., the exponential mechanism applied to sampling from the posterior distribution, provides $\varepsilon$-pure differential privacy (DP) guarantees and does not suffer from the potentially unbounded privacy breach introduced by $(\varepsilon,\delta)$-approximate DP. In practice, however, one needs to apply approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus re-introducing the unappealing $\delta$-approximation error into the privacy guarantees. To bridge this gap, we propose the Approximate SAmple Perturbation (abbr. ASAP) algorithm, which perturbs an MCMC sample with noise proportional to its Wasserstein-infinity ($W_\infty$) distance from a reference distribution that satisfies pure DP or pure Gaussian DP (i.e., $\delta=0$). We then leverage a Metropolis-Hastings algorithm to generate the sample and prove that the algorithm converges in $W_\infty$ distance. We show that by combining our new techniques with a localization step, we obtain the first nearly linear-time algorithm that achieves the optimal rates in the DP-ERM problem with strongly convex and smooth losses.
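For orientation, here is a minimal sketch of the perturbation step described in the abstract. It is an illustration under assumptions, not the paper's exact algorithm: `asap_release` and its noise calibration are hypothetical, and the spherical noise with density proportional to $\exp(-\varepsilon\lVert z\rVert_2/\Delta)$ is one standard pure-DP choice used here purely for concreteness.

```python
import numpy as np

def asap_release(theta_mcmc, w_inf_bound, eps, rng=None):
    """Sketch: perturb an approximate MCMC sample with noise proportional
    to its W_infinity distance (w_inf_bound) from a pure-DP reference."""
    if rng is None:
        rng = np.random.default_rng()
    d = theta_mcmc.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                           # uniform direction on the sphere
    r = rng.gamma(shape=d, scale=w_inf_bound / eps)  # radius density ~ r^{d-1} e^{-eps r / Delta}
    return theta_mcmc + r * u

print(asap_release(np.zeros(3), w_inf_bound=1e-3, eps=1.0))
```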
Keywords
Pure Differential Privacy, Monte Carlo sampling, Gaussian Differential Privacy, Exponential Mechanism

Reviews & Discussion

Review (Rating: 8)

The paper proposes an algorithm for pure DP or GDP DP-ERM that achieves nearly linear time complexity with an optimal error rate under regularity assumptions on the loss. The algorithm uses the exponential mechanism with MCMC sampling, and uses a novel analysis to show that the approximate sample from the MCMC can still meet pure DP or GDP with some additional privacy loss.

Strengths

The paper is fairly easy to understand, despite containing a lot of theory. The MCMC technique of obtaining a pure DP or GDP guarantee from an approximate exponential mechanism is novel to my knowledge and could find applications elsewhere, although implementing the mechanism in practice doesn't currently seem possible. While the performance guarantee of the proposed DP-ERM algorithm is purely theoretical, it advances the current knowledge on what is possible under DP.

Weaknesses

The biggest weakness of the paper is the purely theoretical nature of the method in its current form. Implementing it would be difficult, as the required number of MALA iterations is only given as an asymptotic expression, so it would not be possible to know the required number of iterations for any given problem. If possible, it would be great to have a non-asymptotic expression for the number of iterations in the Appendix, or some other way to compute it.

Minor comments:

  • Lemma 14 should be mentioned in the proof of Theorem 2 when referring to the adaptive composition theorems. In the proof, I think it is possible that for some $y$, the sensitivity of $f_{u,y}$ is greater than the bound with probability greater than 0 under $\zeta(\cdot \mid y) \times \zeta(\cdot \mid y)$, but this set of $y$ would have 0 measure under $\zeta$ or $\zeta'$. I think the proof still works due to Lemma 14 allowing the second mechanism to be almost surely DP, but this should be explicitly mentioned.
  • In footnote 1, both sides of the DP inequality have $D'$.
  • In the $P(A)$ expression in the GDP part of the proof of Lemma 14, line 3 is a repeat of line 2. In the expression of $H_\alpha(\mathcal{M}(D) \| \mathcal{M}(D'))$, there's an extra $\frac{d P_1}{d Q_1}(x)$ in the first indicator on line 7. When that expression continues on the next page, the Gaussian Radon-Nikodym derivatives involving $\mu_2$ at the end of the integral are the wrong way around on lines 2-5.

Questions

  • Would it be possible to use ASAP outside ERM, with a generic utility function $F(\theta)$ for the exponential mechanism? What conditions would $F$ need to meet?
  • When discussing related work, you mentioned that an existing algorithm could solve DP-SCO in nearly linear time, but adapting their algorithm to DP-ERM would not result in linear time. What are the differences between DP-ERM and DP-SCO, and how do they lead to this difference?
  • Isn't Lemma 5 effectively assuming that the loss is bounded, as Lipschitzness of the loss implies continuity, and $\Theta$ must be bounded due to the upper bound on $\gamma$?
  • In the proof of Theorem 1, shouldn't the expressions for $K$ have $\Omega$, not $\mathcal{O}$, so they would be lower bounds instead of upper bounds?
Comment
  1. Would it be possible to use ASAP outside ERM, with a generic utility function $F(\theta)$ for the exponential mechanism? What conditions would $F$ need to meet?

Yes, our results can be extended to non-convex settings by incorporating techniques from [Ma et al. 2019]. The results of our paper also directly apply to any utility function $F$ with smoothness and strong convexity.

  2. When discussing related work, you mentioned that an existing algorithm could solve DP-SCO in nearly linear time, but adapting their algorithm to DP-ERM would not result in linear time. What are the differences between DP-ERM and DP-SCO, and how do they lead to this difference?

Thanks for pointing it out. To enhance clarity, we rephrase the discussion on Page 9 as follows: "To avoid any confusion, a nearly linear-time algorithm for DP-SCO in the smooth/strongly convex setting that achieves optimal rates exists (Feldman et al., 2020). But to the best of our knowledge, the problem remains open for DP-ERM until this paper." DP-ERM and DP-SCO have different optimal rates. For the strongly convex case, the optimal DP-SCO rate is $\frac{G^2}{\alpha}(\frac{1}{n} + \frac{d^2}{n^2\epsilon^2})$ for $\epsilon$-DP or $\frac{G^2}{\alpha}(\frac{1}{n} + \frac{d}{n^2\mu^2})$ for $\mu$-GDP, while DP-ERM's optimal rates are $\frac{d^2}{n^2\epsilon^2}$ and $\frac{d}{n^2\mu^2}$, respectively. The latter are harder to achieve in the constant-$d$, large-$n$ regime. Existing near-linear-time DP-SCO algorithms with optimal rates (only under zCDP and approximate DP; no such results exist for pure DP or pure Gaussian DP) mostly leverage the fact that $\frac{G^2}{\alpha n}$ is the bigger term, so one can stop versions of DP-GD or DP-SGD training early.
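To make the regime comparison concrete, a quick numeric illustration of the two $\epsilon$-DP rates above; constants are dropped exactly as in the asymptotic expressions, and all parameter values are hypothetical:

```python
# Hypothetical constants; only the scaling with n matters here.
G, alpha, eps, d = 1.0, 1.0, 1.0, 10

for n in (10**3, 10**5, 10**7):
    dp_sco = (G**2 / alpha) * (1 / n + d**2 / (n**2 * eps**2))
    dp_erm = d**2 / (n**2 * eps**2)
    print(f"n={n:>8}  DP-SCO ~ {dp_sco:.2e}  DP-ERM ~ {dp_erm:.2e}")

# For constant d and large n, the 1/n term dominates the DP-SCO rate,
# so the DP-ERM target d^2/(n^2 eps^2) is strictly smaller, hence harder.
```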

Regarding DP-SCO vs DP-ERM: we love both settings. DP-ERM is a bit stronger and cleaner due to the separation between optimization and generalization. ERM, its generalization bounds, and its adaptivity (e.g., low noise, fast spectral decay, stability) have been studied for much longer than DP-SCO. DP-SCO algorithms focus on reducing computation so as to just achieve the worst-case excess (population) risk for coarse classes of problems, but they do not adapt to finer subclasses. DP-ERM, however, automatically inherits whatever adaptivity and other properties ERM enjoys. This is a concrete benefit of DP-ERM over DP-SCO in our view.

  3. Isn't Lemma 5 effectively assuming that the loss is bounded, as Lipschitzness of the loss implies continuity, and $\Theta$ must be bounded due to the upper bound on $\gamma$?

The key insight from Lemma 5 is that it avoids assuming a bound on $F$, which is potentially large. The loss may actually be unbounded unless we assume there is a point $\theta\in\Theta$ such that the loss $F(\theta)$ is bounded. For example, say the loss is $\log(1+e^{-y (\theta\cdot x)})$, with $\mathcal{X}=[-1,1]$ and $\Theta = [10000,10001]$. Then if $y=-1, x=1$, the loss can be as large as $10001$. This is a vacuous bound compared to $G \cdot \mathrm{Diam}(\Theta) \leq 1$. Compared to the classical analysis based on a bounded loss, the bounded-range analysis is both cleaner and tighter.
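A quick numeric check of this example (a sketch; the numerically stable `log1pexp` helper is an assumption of ours, not from the paper):

```python
import math

def log1pexp(z):
    # numerically stable log(1 + e^z); exp(10001) would overflow otherwise
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def loss(theta, x=1.0, y=-1.0):
    return log1pexp(-y * theta * x)   # logistic loss log(1 + e^{-y theta x})

lo, hi = 10000.0, 10001.0             # Theta = [10000, 10001]
print(loss(hi))                        # magnitude of the loss itself: ~10001
print(loss(hi) - loss(lo))             # range over Theta: <= G * Diam(Theta) = 1
```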

  4. In the proof of Theorem 1, shouldn't the expressions for $K$ have $\Omega$, not $\mathcal{O}$, so they would be lower bounds instead of upper bounds?

Yes. They should be lower bounds. Thanks for pointing it out.

  5. The biggest weakness of the paper is the purely theoretical nature of the method in its current form. Implementing it would be difficult, as the required number of MALA iterations is only given as an asymptotic expression, so it would not be possible to know the required number of iterations for any given problem. If possible, it would be great to have a non-asymptotic expression for the number of iterations in the Appendix, or some other way to compute it.

Thank you for pointing it out. We omit the constants in our paper for simplicity and will discuss them in the appendix in the upcoming revision. The constant is at most 500 and can be found in the proof of Lemma 7 of [Ma et al. 2019], which is Lemma 9 in the arXiv version.

Reference:

Ma, Y. A., Chen, Y., Jin, C., Flammarion, N., & Jordan, M. I. (2019). Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42), 20881-20885.

Comment

Thanks for the reply. You addressed my main points. In particular, if you get the experiments that you mentioned in the other reply done during the discussion period, please update the submission to include them.

The key insight from Lemma 5 is that it avoids assuming a bound on $F$, which is potentially large.

So the key benefit is not needing to use a potentially large bound on $F$ in the privacy calculations? I would explicitly mention that as a benefit of Lemma 5 if that is the case.

Comment

Yes, we will explicitly mention this in our revision. Thank you for your suggestions.

Review (Rating: 6)

The authors propose a novel MCMC algorithm for pure DP and Gaussian DP empirical risk minimization. The new MCMC sampling scheme does not introduce extra privacy failure probability and thus preserves pure DP. Moreover, the algorithm runs in nearly linear time in the dataset size.

Strengths

  1. DP-ERM is a very fundamental task. The paper addresses very practical and important issues in differentially private optimization.
  2. The paper proposes a new algorithm for DP-ERM. The major benefit is that it avoids catastrophic privacy failures, thus guaranteeing pure DP, and runs linearly in the dataset size $n$.
  3. The paper has interesting technical contributions, for example, a new MCMC algorithm and the relation between TV distance and Wasserstein-infinity distance.

Weaknesses

  1. $\theta^*$ seems to be defined as the minimizer of the regularized loss in Section 2.1, but in Assumption 3 $\theta^*$ is used again for the minimizer of the original loss. The authors should clarify what $\theta^*$ is in Theorem 3. This is a crucial problem, as in the existing DP-ERM literature (e.g., Bassily'14) we should care about the excess risk w.r.t. $\min_{\theta}\mathcal{L}(\theta)$, not the regularized loss. If $\theta^*$ is the minimizer of the regularized loss, then the authors should clarify why it is used instead of the commonly adopted formulation of excess risk.
  2. In view of 1, the notation causes confusion in understanding the main theoretical results. There are other notation issues: $J$ was first defined as $\sum_{i}\ell_i$, which collides with $\mathcal{L}$, and then redefined in Algorithm 3 as the regularized loss centered at $\theta_0$. The authors should check for other notation issues.
  3. The algorithm requires rejection sampling, which in the worst case may not terminate.
  4. A minor issue is that the work lacks empirical evaluations. Even a small experiment on the simplest convex optimization problem could provide very strong support for the computational efficiency of this work and demonstrate the practicality of the proposed algorithm.

==== Update ==== Raised my score to 6 after the authors have addressed my concerns.

Questions

  1. See weakness 1.
  2. If in Algorithm 1 we are only allowed to perform the sampling $N$ times, and thus must return a FAIL state if none of the $\theta^K$s are valid, would the algorithm still be pure DP or pure Gaussian DP?
Comment
  1. $\theta^*$ seems to be defined as the minimizer of the regularized loss in Section 2.1, but in Assumption 3 $\theta^*$ is used again for the minimizer of the original loss. The authors should clarify what $\theta^*$ is in Theorem 3. This is a crucial problem, as in the existing DP-ERM literature (e.g., Bassily'14) we should care about the excess risk w.r.t. $\min_{\theta}\mathcal{L}(\theta)$, not the regularized loss. If $\theta^*$ is the minimizer of the regularized loss, then the authors should clarify why it is used instead of the commonly adopted formulation of excess risk.

In this paper, we only consider the excess risk without the regularizer, i.e., $\lambda=0$. We set $\lambda=0$ in the first line of the input of Algorithm 3, as well as in Table 2. In this context, the notations align, with $J=\mathcal{L}=\sum_i \ell_i$ and $\theta^*=\mathrm{argmin}_{\theta}\mathcal{L}(\theta)$. We keep the regularizer notation to allow more flexibility for our algorithm and future work.

  2. In view of 1, the notation causes confusion in understanding the main theoretical results. There are other notation issues: $J$ was first defined as $\sum_{i}\ell_i$, which collides with $\mathcal{L}$, and then redefined in Algorithm 3 as the regularized loss centered at $\theta_0$. The authors should check for other notation issues.

See the response to 1. Thank you for your suggestions; we will check and revise accordingly.

  3. If in Algorithm 1 we are only allowed to perform the sampling $N$ times, and thus must return a FAIL state if none of the $\theta^K$s are valid, would the algorithm still be pure DP or pure Gaussian DP?

We did not include the case when the sampler can fail in the submitted version for the sake of simplicity. As a consequence, as you alluded to, there is a very small chance that the algorithm does not halt within a short time. What you proposed can be combined quite seamlessly with the ASAP technique that we proposed in this paper to obtain a worst-case run-time bound, by stopping after $N$ attempts and outputting an arbitrary point from the domain $\Theta$. Let's say the chance of failure after $N$ attempts is $\delta$; then we can essentially just add $\delta$ to our $\xi$ parameter, so we get a TV-distance bound of $\delta+\xi$ before converting it to $W_\infty$ and using ASAP to obtain a pure-DP mechanism. Details are as follows:

Let $X$ be the "SUCCESS or FAIL" random variable with $\mathbb{P}(X=1)=1-\delta$ and $\mathbb{P}(X=0)=\delta$. Denote by $p_S$ the distribution of the output in a successful case, and by $p_F$ an arbitrary distribution over the domain $\Theta$. The output $\tilde{\theta}$ of Algorithm 1 (with failure output) follows $\tilde{\theta}\mid X=1 \sim p_S$ and $\tilde{\theta}\mid X=0 \sim p_F$. Denote by $\tilde{p}$ the distribution of $\tilde{\theta}$. Then we have $\tilde{p}=(1-\delta)\, p_S+\delta\, p_F$. Thus $$d_{TV}(\tilde{p},p_S)=\sup_{A}|\tilde{p}(A)-p_S(A)|=\sup_{A}|(1-\delta)p_S(A)+\delta p_F(A)-p_S(A)|\leq \delta.$$ Denote by $p^*$ the reference distribution. Since $d_{TV}(p_S, p^*)\leq \xi$, we thereby have $$d_{TV}(\tilde{p}, p^*)\leq d_{TV}(\tilde{p}, p_S)+d_{TV}(p_S, p^*)\leq \delta+\xi.$$ Replacing $\xi$ by $\delta+\xi$ in Theorem 1 implies pure DP of the algorithm.
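A small numerical sanity check of the mixture bound above; the discrete distributions here are arbitrary toy choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p_S = rng.dirichlet(np.ones(10))        # output law on SUCCESS (toy)
p_F = rng.dirichlet(np.ones(10))        # arbitrary law on FAIL (toy)
delta = 0.03                             # failure probability after N attempts

p_tilde = (1 - delta) * p_S + delta * p_F
tv = 0.5 * np.abs(p_tilde - p_S).sum()   # TV distance on a finite space
print(tv, tv <= delta)                   # the bound d_TV <= delta always holds
```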

  4. The algorithm requires rejection sampling, which in the worst case may not terminate.

We demonstrate in our analyses that the algorithm will terminate with high probability $1-q$ in $\Omega\big( d (\ln \kappa+\kappa \ln(d/\rho) + \ln n ) \max\lbrace \kappa^{3/2}\sqrt{d(\kappa \ln(d/\rho) + \ln n )}, d\kappa \rbrace \ln(1/q) \big)$ steps. You are correct; there is an exponentially small chance that it will take longer, but the probability diminishes with the amount of runtime. Moreover, as clarified in our response to Question 3, our algorithm maintains the pure DP guarantee even if we need to terminate it after $N$ attempts.

  5. A minor issue is that the work lacks empirical evaluations. Even a small experiment on the simplest convex optimization problem could provide very strong support for the computational efficiency of this work and demonstrate the practicality of the proposed algorithm.

Thank you for your suggestions. Our next revision of the paper will likely include some experiments. We should also note that "Private Convex Optimization via Exponential Mechanism" by Gopi et al. doesn't have any experiments, nor do many existing works on sampling. So while we agree on the importance of including experiments to demonstrate practicality, we want to point out that there is also precedent for theory-only work.

Reference:

Gopi, S., Lee, Y. T., & Liu, D. (2022). Private convex optimization via exponential mechanism. In Conference on Learning Theory, pages 1948-1989. PMLR.

Comment

We deeply appreciate your invaluable review. In our updated manuscript and response, we have dedicated ourselves to thoroughly addressing all your concerns and questions.

Our revision incorporates experiments and clarifies the notation of $\theta^*$ in Section 2.1. Additionally, our response emphasizes that our algorithm maintains the pure DP or pure Gaussian DP guarantee, even in the presence of a potential FAIL case.

Could you please spare a moment to review our response? We are eager for your feedback and prepared to provide any further clarifications.

Thank you for your time and consideration.

Comment

Thank you for your response. I think you have addressed my concerns. I am willing to raise my score to 6 if I don't find other major issues.

Review (Rating: 6)

The paper studies the problem of approximate posterior sampling with a pure differential privacy guarantee, for which prior works either have only proved weaker notions of DP guarantees (such as approximate DP and Rényi DP) or have only proposed computationally expensive pure DP algorithms. The difficulty in establishing pure differential privacy guarantees lies in the gap between approximate and exact posterior sampling, contributing to a non-zero probability of unbounded information leakage. The authors circumvent this limitation by proving that as long as the posterior sampling process converges in $W_{\infty}$ distance to a reference distribution that satisfies pure DP, then perturbing the MCMC sample with calibrated noise also satisfies pure DP. For strongly convex smooth loss functions, the authors then design a posterior sampling algorithm that converges in $W_{\infty}$ distance by combining the Metropolis-Hastings algorithm (which converges in total variation distance) and a rejection step that only accepts samples within a bounded ball (this boundedness enables conversion from a total variation distance bound to a $W_{\infty}$ distance bound). Finally, the authors showcase an application of their DP posterior sampling algorithm to the DP empirical risk minimization problem, which achieves the optimal rates in near-linear runtime.

Strengths

  • The idea of perturbing an MCMC sampling process (that only satisfies weaker notions of DP guarantees) with noise to achieve the stronger pure-DP guarantee is interesting and novel. To make this idea concrete, the paper has used an interesting new technique that converts the weaker TV distance bound to a $W_{\infty}$ distance bound under the conditions that the domain is a bounded ball and the posterior density is bounded away from zero.

  • The proposed algorithm achieves a near-optimal rate within near-linear runtime under pure DP for the DP-ERM problem, under a strongly convex and smooth loss function on an $\ell_2$-bounded ball. The proved rate improves by a multiplicative factor of $\kappa\log n$, where $\kappa$ is the condition number and $n$ is the number of data records.

Weaknesses

  • The improvement for DP ERM due to the designed approximate posterior sampling algorithm (and several parts of the proofs) requires further clarification. See questions 1 and 2 for more details.

  • The setting is quite constrained, i.e., strongly convex and smooth loss functions over an $\ell_2$-bounded ball. It is unclear whether the privacy analysis, especially the critical conversion theorem (from a total variation distance bound to a $W_{\infty}$ distance bound), is generalizable to more general forms of bounded domains or loss functions.

Questions

  1. In the proof for Theorem 3 (improved utility bound for DP-ERM), equations (5) and (6) utilize Lemma 6 to analyze the approximate posterior sampling component MALA with constraint. However, Lemma 6 only holds under exact posterior sampling from $p(\theta)\propto \exp(-\gamma F(\theta))$. Could the authors explain this discrepancy and justify this usage of Lemma 6?

  2. In the proof of Theorem 1, when proving the iteration complexity guarantee for convergence of MALA in total variation distance, the authors cited [Dwivedi et al. 2019] and [Ma et al. 2019]. Could the authors cite the specific theorems? Otherwise, it is hard to validate these claims.

  3. Could the authors elaborate on the possibility of extending the privacy bound to more general settings, such as a general bounded domain (e.g., the probability simplex) or more relaxed convex/non-convex/non-smooth loss functions?

Minor comments:

  • The idea of perturbing distributions with bounded $W_{\infty}$ distance with noise to strengthen the differential privacy guarantee seems quite similar to the shift reduction lemma in the "privacy amplification by iteration" analysis [Lemma 20, a]. The main difference seems to be that [Lemma 20, a] focuses on Rényi DP guarantees while the authors focus on pure DP and Gaussian DP guarantees. Maybe the authors could add more discussion regarding the non-trivialness of this extension.

Reference:

  • [a] Feldman, Vitaly, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. "Privacy Amplification by Iteration." arXiv preprint arXiv:1808.06651 (2018).
Comment
  1. In the proof for Theorem 3, equations (5) and (6) utilize Lemma 6 to analyze the approximate posterior sampling component MALA with constraint. However, Lemma 6 only holds under exact posterior sampling from $p(\theta)\propto \exp(-\gamma F(\theta))$. Could the authors explain this discrepancy and justify this usage of Lemma 6?

Thank you for bringing up this point. We omitted the analysis regarding the discrepancy between the approximate posterior and the exact posterior for the sake of simplicity. This simplification doesn't compromise the results, as this discrepancy is dominated by the error arising from the noise added in the final step of ASAP, as shown in (4). To clarify why this simplification does not impact the established bound in Theorem 3, we present a detailed analysis below.

Take $\Delta=\frac{d G \log(d/\rho)}{2 n^2 \alpha \varepsilon}$ as in Table 2. Denote by $\tilde{\theta}\sim \tilde{p}$ the output of MALA with constraint, and by $p^*$ the exact posterior. Since $W_\infty(\tilde{p},p^*)\leq \Delta$, applying Lemma 21, there exists a coupling of $\tilde{p}$ and $p^*$, denoted $\zeta$, such that $\mathrm{ess\,sup}_{(x,y)\in (\Theta\times\Theta, \zeta)}\lVert x-y \rVert_2\leq \Delta$. Take $\theta_{\mathrm{exact}}\mid \tilde{\theta}\sim \zeta(\cdot\mid\tilde{\theta})$; then we know $\theta_{\mathrm{exact}}\sim p^*$. With the $nG$-Lipschitz continuity of $\mathcal{L}$, this discrepancy is bounded by $$\mathbb{E}_{(\tilde{\theta},\theta_{\mathrm{exact}})\sim \zeta}[\mathcal{L}(\tilde{\theta})-\mathcal{L}(\theta_{\mathrm{exact}})]\leq n G\, \mathbb{E}_{(\tilde{\theta},\theta_{\mathrm{exact}})\sim \zeta}[\lVert\tilde{\theta}-\theta_{\mathrm{exact}}\rVert_2]\leq n G \Delta.$$ This discrepancy is then dominated by the upper bound for $\mathbb{E}[\mathcal{L}(\hat{\theta})-\mathcal{L}(\tilde{\theta})]$, which is $\frac{\sqrt{2 d}\,n G \Delta}{\varepsilon}$ as shown in (4). Therefore, the discrepancy between the approximate posterior and the exact posterior has a negligible impact on the empirical risk bound.
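A toy numeric check of the Lipschitz step above; this is a sketch with made-up dimensions and a linear loss, none of which comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, G, Delta, d = 100, 1.0, 1e-3, 5

w = rng.normal(size=d)
w *= n * G / np.linalg.norm(w)           # linear loss L(th) = w @ th is nG-Lipschitz

theta_tilde = rng.normal(size=d)
v = rng.normal(size=d)
theta_exact = theta_tilde + Delta * v / np.linalg.norm(v)  # coupled pair within Delta in l2

gap = abs(w @ theta_tilde - w @ theta_exact)
print(gap, gap <= n * G * Delta + 1e-12)  # |L(th~) - L(th_exact)| <= nG * Delta
```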

Thanks again for bringing up this point; we will add a more detailed analysis to our upcoming revision.

  2. In the proof of Theorem 1, when proving the iteration complexity guarantee for convergence of MALA in total variation distance, the authors cited [Dwivedi et al. 2019] and [Ma et al. 2019]. Could the authors cite the specific theorems?

We cited Theorem 1 of [Dwivedi et al. 2019] and Lemma 8 of [Ma et al. 2019], which is in the appendix. Lemma 8 of [Ma et al. 2019] corresponds to Lemma 10 in the arXiv version.

  3. Could the authors elaborate on the possibility of extending the privacy bound to more general settings, such as a general bounded domain (e.g., the probability simplex) or more relaxed convex/non-convex/non-smooth loss functions?

Yes, as addressed in the response to qRJd, our results can be extended to non-convex settings by leveraging techniques from [Ma et al. 2019].

We would like to clarify that our parameter space is not necessarily an $\ell_2$ ball. Rather, we first localize it to an $\ell_2$ ball and then perform ASAP. With regard to the probability simplex example, we can use a mirror map to Euclidean space, perform localized ASAP in Euclidean space, and then map the output back to the simplex.
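As a hypothetical illustration of the mirror-map remark, the log-ratio/softmax pair below is one possible choice of map (our assumption, not necessarily the one the paper would use):

```python
import numpy as np

def simplex_to_euclidean(p, tiny=1e-12):
    p = np.clip(p, tiny, None)
    return np.log(p[:-1]) - np.log(p[-1])    # additive log-ratio map into R^{d-1}

def euclidean_to_simplex(z):
    z = np.concatenate([z, [0.0]])
    e = np.exp(z - z.max())                   # numerically stable softmax
    return e / e.sum()

p = np.array([0.2, 0.3, 0.5])
z = simplex_to_euclidean(p)                   # run localized ASAP on z here (not shown)
print(euclidean_to_simplex(z))                # maps back to [0.2, 0.3, 0.5]
```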

  4. The idea of perturbing distributions with bounded $W_\infty$ distance with noise to strengthen the differential privacy guarantee seems quite similar to the shift reduction lemma in the "privacy amplification by iteration" analysis [Lemma 20, a]. The main difference seems to be that [Lemma 20, a] focuses on Rényi DP guarantees while the authors focus on pure DP and Gaussian DP guarantees. Maybe the authors could add more discussion regarding the non-trivialness of this extension.

Thanks for pointing out this interesting technical connection. As you said, the results are different: they focused on Rényi DP, while we state results for pure DP and Gaussian DP in Theorem 2. The proof techniques are also quite different. We leveraged Lemma 15, a measure-theoretic adaptive composition theorem of privacy profiles (Hockey-Stick divergences, or their dual representation, $f$-DP) that we proved in this paper, which allows the exclusion of a measure-zero set. This is a non-trivial technical step towards our relatively clean proof of Theorem 2. In fact, if we apply adaptive composition theorems of RDP, then we get an RDP bound for ASAP right away. On the other hand, it is well known that RDP bounds do not imply tight $f$-DP bounds even for the Gaussian mechanism. Thanks again for your suggestion; we will add the above discussion in our revision.

References:

Feldman, V., Mironov, I., Talwar, K., & Thakurta, A. (2018, October). Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS) (pp. 521-532). IEEE.

Ma, Y. A., Chen, Y., Jin, C., Flammarion, N., & Jordan, M. I. (2019). Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42), 20881-20885.

Comment

Thanks to the authors for the response. My primary concerns regarding clarity are addressed, and I have raised the score. However, the analysis still seems limited to me, as it necessitates a localization step, which seems challenging for non-convex problems. Additionally, the algorithm may return "Fail" or incur infinite running time with a small probability (as reviewer kgTq pointed out). This seems an important limitation, given that most runtime analyses for prior DP-ERM algorithms are non-probabilistic.

Comment

Thank you for your invaluable review. To extend our work to a non-convex setting, for the sampling part, we can leverage techniques from [Ma et al. 2019]; regarding the localization step, we acknowledge the necessity for a more thorough discussion. We would also like to highlight that our algorithm, with the localization step, can be extended to the general convex case. Additionally, as clarified in our response to reviewer kgTq's question 3, our algorithm maintains the pure DP and pure Gaussian DP guarantee, even in the presence of a potential FAIL case.

Thanks again for your insightful feedback.

Reference:

Ma, Y. A., Chen, Y., Jin, C., Flammarion, N., & Jordan, M. I. (2019). Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42), 20881-20885.

AC Meta-Review

The paper addresses the problem of approximate posterior sampling with a pure differential privacy guarantee. This is a well-known approach for solving DP-ERM (and DP-SCO) problems, which, however, previously either gave only approximate DP guarantees or required computationally expensive sampling. For the (rather restrictive) smooth and strongly convex case, this work gives a posterior sampler that achieves pure DP and solves DP-ERM with asymptotically optimal rates in near-linear runtime. The result relies on localization and an additional perturbation to convert $W_\infty$ distance to a divergence (in a way similar to privacy amplification by iteration and its application to Langevin analysis by Altschuler and Talwar, 2023). Overall, this is an interesting new result about one of the basic problems in DP convex optimization, and the tools might find additional uses.

Why not a higher score

The smooth and strongly convex setting is rather restrictive, and the authors appear unaware that some of the same ideas have already been used in this context.

Why not a lower score

See above.

Final Decision

Accept (poster)