Privacy Amplification Through Synthetic Data: Insights from Linear Regression
Releasing synthetic data while keeping the generative model hidden can lead to better privacy guarantees, as demonstrated in our study of linear regression.
Abstract
Reviews and Discussion
The paper offers a theoretical analysis of the privacy loss of releasing synthetic samples in linear regression. It demonstrates that, under a strong threat model where an adversary controls the seed of the generative model, releasing even a single synthetic sample can result in privacy leakage equivalent to that of releasing the full generative model in the worst case. Conversely, when the seed is random, the authors prove a form of privacy amplification.
Questions for Authors
N/A
Claims and Evidence
There are a couple of points that I find problematic or unclear:
- "It is clear that the adversary can recover the model parameters from queries...Strikingly, we now show that the adversary can in fact recover the model parameter with just one query." --> The discussion here is very confusing. Privacy leakage is not about recovering model parameters (as a matter of fact, they are already known after training), but rather about inferring information about the training samples. Additionally, what you really show here is that with one query it is possible to achieve the maximum privacy leakage (as specified by the privacy budget of training the generative model) for some worst-case datasets.
- "Since Label DP is a weaker notion than standard DP, these results also imply negative results for standard DP" --> I don't follow this claim. How does a construction where releasing a single synthetic sample achieves maximum privacy leakage under Label DP translate into a construction for standard DP?
- "However, these results do not imply that for every possible seed , the privacy loss is strictly smaller than " --> I'm confused about this argument. Mathematically, it seems to be the case that does not necessarily hold. On the other hand, the post-processing property of DP guarantees that the privacy loss of releasing a single data point is upper bounded by the privacy loss of training the generative model. How can these two observations be reconciled?
Methods and Evaluation Criteria
N/A
Theoretical Claims
- For output perturbation, Chaudhuri et al., 2011 assumes certain properties of the loss function, specifically bounded gradient (or equivalently, Lipschitz), to upper bound the L2 sensitivity of the minimizer of the regularized least-square objective. To satisfy this property, they primarily focus on classification losses such as cross-entropy and hinge loss and assume that the samples have bounded norm. In contrast, the current paper directly borrows the results from Chaudhuri et al., 2011 but applies them to linear regression. The assumptions from Chaudhuri et al., 2011 are not formally stated, and the gap between the two settings is not addressed, which is problematic. In fact, the gradient of the square loss is not necessarily bounded without additional assumptions.
- The authors claim that "As a discretized Langevin dynamical system with a convex objective, it is known that [the process] converges in distribution to its stationary Gibbs distribution". This claim is made without any references or citations. My understanding is that while Langevin dynamics with a convex objective do converge to the stationary Gibbs distribution in continuous time, this convergence is not guaranteed for discretized processes without further assumptions on the step size. This lack of rigor is concerning.
Experimental Design and Analysis
N/A
Supplementary Material
I checked Appendix A but did not read Appendix B in detail. The proofs generally make sense to me. Minor: in the proof of Proposition A.5, sigma --> \sigma.
Relation to Prior Work
The paper contributes to the literature on DP data synthesis by providing a formal analysis of the privacy loss of synthetic samples. It also extends the literature on privacy amplification by identifying an alternative mechanism---privacy amplification through synthetic data---beyond traditional approaches such as subsampling and iteration, once again showcasing the power of randomness in privacy protection.
Missing Important References
For DP synthetic data generation, the authors should discuss [1]. In particular, Figure 1 offers a useful overview of the current state of the field.
At a high level, the phenomenon uncovered in this work resembles privacy amplification by iteration: releasing only the final model checkpoint, rather than all the intermediate ones, leads to better privacy. In addition to Feldman et al., 2018, the authors should consider discussing [2,3,4], which provide last-iterate privacy analysis under the assumptions that the loss function is convex and/or smooth, showing that the privacy loss remains bounded as the number of iterations goes to infinity. Moreover, it would be beneficial to review several works on privacy amplification by subsampling [5,6] (a standard subsampling bound is recalled after the reference list below). Collectively, these studies highlight the power of randomness in privacy protection.
[1] Hu, Yuzheng, et al. "SoK: Privacy-preserving data synthesis." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.
[2] Altschuler, Jason, and Kunal Talwar. "Privacy of noisy stochastic gradient descent: More iterations without more privacy loss." Advances in Neural Information Processing Systems 35 (2022): 3788-3800.
[3] Ye, Jiayuan, and Reza Shokri. "Differentially private learning needs hidden state (or much faster convergence)." Advances in Neural Information Processing Systems 35 (2022): 703-715.
[4] Chien, Eli, and Pan Li. "Convergent privacy loss of noisy-sgd without convexity and smoothness." arXiv preprint arXiv:2410.01068 (2024).
[5] Balle, Borja, Gilles Barthe, and Marco Gaboardi. "Privacy amplification by subsampling: Tight analyses via couplings and divergences." Advances in Neural Information Processing Systems 31 (2018).
[6] Steinke, Thomas. "Composition of differential privacy & privacy amplification by subsampling." arXiv preprint arXiv:2210.00597 (2022).
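For concreteness, the bound recalled above (a standard statement of amplification by Poisson subsampling; notation ours):
\[
\text{If } M \text{ is } \varepsilon\text{-DP and } M_q \text{ applies } M \text{ to a Poisson subsample that keeps each record with probability } q, \text{ then } M_q \text{ is } \varepsilon'\text{-DP with}
\]
\[
\varepsilon' \;=\; \log\!\big(1 + q\,(e^{\varepsilon} - 1)\big) \;\approx\; q\,\varepsilon \quad \text{for small } \varepsilon .
\]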
Other Strengths and Weaknesses
Strengths: The paper provides, to the best of my knowledge, the first theoretical analysis of the privacy loss of the synthetic samples generated by DP-trained generative models. Although the setting appears somewhat toyish, the techniques employed, particularly for releasing multiple points, are non-trivial. Overall, this work could serve as a promising first step toward an important research direction.
Weaknesses: The main factor lowering my overall rating is related to the theoretical claims; I feel that the rigor in this work does not meet the bar for ICML. Additionally, the paper could be strengthened by:
- Including a notation section. For instance, $\|\cdot\|$ is typically interpreted as the 2-norm, but here it is mostly used as the Frobenius norm. Moreover, a symbol that appears in both Sec 3.1 and Prop 4.2 is never formally defined.
- Providing an overview of the proof techniques. It would be helpful to discuss the technical challenges and how the paper addresses them.
- Discussing the implications of the main results. For example, how does Theorem 4.8 relate to privacy amplification? How do the quantities it involves relate to each other?
Other Comments or Suggestions
N/A
We sincerely thank Reviewer pnmm for providing valuable feedback and pointing out several valid issues. Below, we address and discuss each point.
Claims And Evidence
Privacy leakage is not about recovering model parameters
We agree that our sentence might be confusing. What we mean here is that if the adversary is able to recover the model parameters, then no privacy amplification is possible compared to the privacy guarantee given by post-processing the model (this is what we call "maximum privacy leakage" in this context). We show that a single query is sufficient for an adversary to achieve this maximum privacy leakage (Proposition 3.1).
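To illustrate what "maximum privacy leakage" means here, consider the following simplified picture (our own notation and assumptions: a noiseless linear generator on top of spherical Gaussian output perturbation, which may differ from the exact construction of Proposition 3.1; $\theta_D$ and $\theta_{D'}$ denote the non-private regression solutions on two adjacent datasets):
\[
\hat\theta = \theta_D + \sigma\,\xi,\quad \xi \sim \mathcal N(0, I_d), \qquad g_{\hat\theta}(x) = \hat\theta^{\top} x .
\]
\[
\text{Querying at } x^{\star} = \frac{\theta_D - \theta_{D'}}{\lVert \theta_D - \theta_{D'}\rVert} \text{ gives }\;
T\big(g_{\hat\theta_D}(x^{\star}),\, g_{\hat\theta_{D'}}(x^{\star})\big) \;=\; T\big(\mathcal N(\theta_D, \sigma^2 I_d),\, \mathcal N(\theta_{D'}, \sigma^2 I_d)\big),
\]
so a single adversarially chosen seed already extracts the same privacy loss as releasing $\hat\theta$ itself (both sides equal the Gaussian trade-off $G_{\lVert\theta_D-\theta_{D'}\rVert/\sigma}$).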
How… Label DP translate into a construction for standard DP
DP upper bounds the privacy leakage across all possible pairs of adjacent datasets. In Label DP, adjacent datasets differ only in their labels. Since any two datasets that are adjacent under Label DP remain adjacent under standard DP (where both features and labels can differ), a lower bound on the privacy leakage in the Label DP setting also applies to standard DP.
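In symbols (our formalization of the argument above):
\[
\mathcal{A}_{\mathrm{label}} = \{(D,D'): D, D' \text{ differ in one label}\} \;\subseteq\; \mathcal{A}_{\mathrm{std}} = \{(D,D'): D, D' \text{ differ in one record}\},
\]
\[
\Longrightarrow\quad \sup_{(D,D')\in\mathcal{A}_{\mathrm{std}}} \mathrm{PL}\big(M(D),M(D')\big) \;\ge\; \sup_{(D,D')\in\mathcal{A}_{\mathrm{label}}} \mathrm{PL}\big(M(D),M(D')\big),
\]
so any worst-case pair exhibited under label DP also lower-bounds the privacy loss under standard DP.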
Mathematically, it seems to be the case that the strict inequality does not necessarily hold
You are right, thanks for catching this. There is a minor error in the upper bound. We address this in the "Theoretical claims" section below, where you also raised a related question about the convergence of NGD.
Theoretical claims
Output perturbation is not possible without a Lipschitz condition on the objective function
You are right and we thank you for pointing out this oversight. To ensure the loss is Lipschitz, we can assume that the samples have bounded norm and restrict the parameter space to a centered ball of bounded radius (the latter condition always holds for ridge regression). Such conditions are common in the analysis of private linear regression; see, e.g., [1]. The objective is then Lipschitz, with a constant determined by these bounds, allowing us to use the output perturbation mechanism and to keep our results unchanged.
[1] Y. X. Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. AISTATS 2018.
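A minimal sketch of the mechanism described above, in our own notation (this is an illustration, not the paper's code: it assumes rows clipped to unit norm, labels clipped to [-1, 1], the average-loss formulation of regularized least squares, the 2L/(n*lambda) sensitivity bound of Chaudhuri et al., 2011 for an L-Lipschitz loss, and the Gaussian mechanism in place of their original noise distribution):
```python
import numpy as np

def dp_ridge_output_perturbation(X, y, lam=1.0, eps=1.0, delta=1e-5, L=2.0, rng=None):
    """Ridge regression released via output perturbation (illustrative sketch only)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Clip rows to unit norm and labels to [-1, 1] so the per-sample loss is L-Lipschitz
    # on the bounded parameter ball (L is an assumed constant here).
    row_norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    Xc, yc = X / row_norms, np.clip(y, -1.0, 1.0)
    # Minimizer of the average regularized least-squares objective.
    theta = np.linalg.solve(Xc.T @ Xc / n + lam * np.eye(d), Xc.T @ yc / n)
    # L2-sensitivity bound for regularized ERM with an L-Lipschitz loss.
    sensitivity = 2.0 * L / (n * lam)
    # Gaussian mechanism calibrated to (eps, delta)-DP.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return theta + sigma * rng.normal(size=d)
```
The exact constants in the paper may differ; the point is only that bounded data plus a bounded parameter ball make the ridge objective Lipschitz, so output perturbation applies.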
Discrete convex Langevin dynamical systems do not necessarily converge
Again, you are right. The ergodicity of the process is required for convergence, and it is ensured when the objective function is strongly convex and smooth (see, e.g., [2]), which is the case in our setting. Furthermore, for NGD with full-batch training, it can be shown that if the learning rate is sufficiently small, then the iterates converge to a normal distribution, which coincides with the Gibbs distribution in the limit of vanishing step size. The corrected result reads as follows.
With the notation above, and assuming without loss of generality that the two datasets differ, suppose the step size is small enough for the process to converge. Then the trade-off function between the two limiting output distributions is Gaussian with an explicit parameter. Moreover, for two given datasets, the adversary can choose the query input so as to maximize this parameter, and in particular, if the two datasets are adjacent (label DP), the adversary can attain the worst-case parameter exactly.
This result corrects both the confusion about the convergence of the NGD iterates and Propositions 3.2 to 3.4. Due to space limitations, we are not able to give the sketch of proof here, but we are happy to provide it as a follow-up comment to the reviewer.
[2] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res. 20:73, 2019
References not discussed
Thank you for highlighting these references. We will add a citation to this recent survey and discuss how our work relates to other privacy amplification results. As you noted, synthetic data release is a distinct phenomenon that extends beyond privacy amplification by iteration. In the latter, the final model is released at the end of private training, while our approach further conceals the model itself and discloses only synthetic data generated from random inputs to the model.
Thanks for the response. I am raising my score to 3 due to the authors’ efforts in improving the rigor of the paper.
We thank Reviewer pnmm for raising their score. For completeness, we include a proof sketch of the corrected Propositions 3.2 to 3.4, as outlined in our rebuttal.
Proof sketch of the corrected Propositions 3.2 to 3.4
We consider NGD with an update of the usual form, but with the noise rescaled so that our results hold. Since the gradient of the ridge objective is affine in the parameters, the recursion is a Gaussian autoregressive process: each iterate is Gaussian, with mean and covariance given by explicit recursions, and we assume the step size is small enough for the recursion to be contractive.
By Lévy's continuity theorem, the iterates converge in distribution to a Gaussian limit (with a slight abuse of notation, since we use the vectorized notation); in the limit of vanishing step size, this Gaussian is exactly the Gibbs distribution. The matrix square roots used below are well defined because the relevant matrices commute.
By Lemma A.2, the trade-off function between the two limiting distributions (one per dataset) is Gaussian, with a parameter governed by the difference of the corresponding ridge solutions. Using a change of variables and the invertibility of the limiting covariance, the parameter the adversary can achieve corresponds to the 2-norm of this (rescaled) difference, and it is maximized by choosing the query along the right singular vector associated with the largest singular value of the relevant matrix. In the label DP setting, this matrix has rank one, so the maximum is attained exactly.
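For concreteness, here is a worked version of the limiting-distribution computation in our own notation, with an assumed noise scaling (a sketch consistent with the outline above, not necessarily matching the exact constants of the corrected propositions):
\[
L_D(\theta) = \tfrac12\lVert X\theta - y\rVert^2 + \tfrac{\lambda}{2}\lVert\theta\rVert^2, \qquad H_D = X^{\top}X + \lambda I, \qquad
\theta_{t+1} = \theta_t - \eta\,\nabla L_D(\theta_t) + \sqrt{2\eta}\,\sigma\,\xi_t,\quad \xi_t \sim \mathcal N(0, I_d).
\]
Since $\nabla L_D(\theta) = H_D\theta - X^{\top}y$, the recursion is affine with Gaussian noise, so every iterate is Gaussian. If $0 < \eta < 2/\lambda_{\max}(H_D)$, then $\theta_t$ converges in distribution to
\[
\mathcal N\big(\theta_D^{\star},\, \Sigma_\eta\big), \qquad \theta_D^{\star} = H_D^{-1}X^{\top}y, \qquad
\Sigma_\eta = 2\eta\sigma^2\big(I - (I-\eta H_D)^2\big)^{-1} = \sigma^2\big(H_D - \tfrac{\eta}{2}H_D^2\big)^{-1} \;\longrightarrow\; \sigma^2 H_D^{-1} \ \text{ as } \eta \to 0,
\]
and $\mathcal N(\theta_D^{\star}, \sigma^2 H_D^{-1})$ is exactly the Gibbs distribution $\propto \exp(-L_D(\theta)/\sigma^2)$, since $L_D$ is quadratic with Hessian $H_D$ and minimizer $\theta_D^{\star}$.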
This paper investigates privacy amplification from synthetic data release within the specific setting of linear regression.
The authors first establish negative results, showing that an adversary controlling the seed of the generative model can induce the maximum possible privacy leakage from a single query.
Conversely, they demonstrate that generating synthetic data from random inputs amplifies privacy beyond the model's inherent guarantees when releasing a limited number of synthetic data points. The amplification holds in the regime where few synthetic samples are released and the ambient dimension d is large.
This highlights the crucial role of randomization in the privacy of synthetic data generation.
Update after rebuttal
The paper presents an interesting theoretical observation, which I appreciate, albeit with limited potential applications. The rebuttal reinforces my stance.
Questions for Authors
- Are there any downstream applications the authors envision for releasing private synthetic data points from a linear model?
Claims and Evidence
The theoretical claims are supported by proofs.
Methods and Evaluation Criteria
N/A
Theoretical Claims
I briefly checked the proofs but not in thorough detail.
Experimental Design and Analysis
N/A
Supplementary Material
I reviewed the proofs in the appendix but not in thorough detail.
Relation to Prior Work
The key contribution is the formal proof that releasing synthetic data can in fact reduce privacy loss compared to releasing a privatized model. This is in contrast to most prior works where the privacy loss is bounded using post-processing once there is a privately learned generative model. While privacy amplification from synthetic data is not completely new, past works only study the simpler case of univariate Gaussian data (Neunhoeffer et al., 2024).
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- This work provides a decently thorough examination of privacy amplification from synthetic data release in the setting of linear regression, including an impossibility result when the internal randomness of the generator is controlled by an adversary, and a positive result when the internal randomness is actually random.
Weaknesses:
- The amplification only holds when the number of generated points is less than the dimension. However, we need at least $d$ samples to learn the parameters of a linear model, so it doesn't seem possible to learn the linear model from the synthetic data while still benefiting from privacy amplification.
- Another potential weakness is the limitation of the model as well as the techniques, which are unlikely to extend beyond linear regression to, e.g., neural networks, where synthetic data is much more useful.
That being said, I believe the theoretical results are interesting within their scope and can be a plausible addition to ICML.
Other Comments or Suggestions
It might be helpful to contextualize the amplification results in terms of (eps, delta)-DP, perhaps even in restricted choices of the parameters, to better interpret the quantitative improvement over simple post-processing.
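One standard way to do this, if the amplified guarantee is expressed in terms of $\mu$-Gaussian DP (whether the paper's bounds take exactly this form is for the authors to confirm), is the conversion of Dong et al. (2022):
\[
\mu\text{-GDP} \;\Longleftrightarrow\; (\varepsilon, \delta(\varepsilon))\text{-DP for all } \varepsilon \ge 0, \qquad
\delta(\varepsilon) = \Phi\!\Big(-\frac{\varepsilon}{\mu} + \frac{\mu}{2}\Big) - e^{\varepsilon}\,\Phi\!\Big(-\frac{\varepsilon}{\mu} - \frac{\mu}{2}\Big),
\]
where $\Phi$ is the standard normal CDF; comparing the $\mu$ of the synthetic-data release against the $\mu$ obtained by post-processing the model would quantify the improvement at any fixed $(\varepsilon, \delta)$.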
We thank Reviewer ukYX for their feedback. Below, we address each concern separately.
Weaknesses
The amplification only holds when the number of generated points is less than the dimension
This is correct. However, our results do not imply that privacy amplification does not happen when the number of generated points is larger. Our work focuses on the theoretical existence of privacy amplification by synthetic data release. Making it more generally applicable is an interesting open problem.
"... limitations of the model as well as techniques, which are unlikely to extend beyond linear regression"
We agree that the generalization of our results to deeper models presents significant challenges. However, we believe our findings have the potential to be leveraged in broader settings. For instance, the post-processing theorem ensures that the results also apply to regression problems with activation functions---such as logistic or ReLU regression---provided that Lipschitzness and convexity are preserved.
A promising direction for broader applicability is private fine-tuning of the last layer in deeper neural networks, which maintains the linear regression framework. However, modeling the distribution of the noise in this setting becomes more challenging, as the transformation of the Gaussian input through the layers alters its statistical properties. We leave this exploration for future work.
Questions
"Are there any downstream applications the authors envision for releasing private synthetic data points from a linear model?"
As you pointed out in your comment, revealing fewer than $d$ synthetic data points has limited utility. However, our work is the first to highlight scenarios where amplification is provable, laying the groundwork for deeper theoretical exploration in broader, more realistic contexts. We are confident that our privacy bounds can be extended to practical settings; this is an objective for future work.
This paper explores the privacy amplification properties of hiding the generative model in private synthetic data generation contexts. Differentially private generative models produce synthetic data that formally inherits the same privacy guarantees. In practice, it has been observed that when the amount of synthetic data generated is small enough, it meets stronger guarantees than the generating model, through an amplification effect. This paper formally shows that this amplification effect exists in cases where synthetic data is generated from random inputs to private linear regression models, used as a case study. In particular, releasing synthetic data leads to stronger privacy guarantees than releasing the generative model when the number of released samples is small enough. The paper also demonstrates that in the case where the adversary has access to the seed of the generative algorithm, there is no such amplification of privacy.
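Schematically, the release mechanism under study can be pictured as follows (our own reading of the setup; in particular, whether extra noise is added to the synthetic labels is an assumption):
```python
import numpy as np

def release_synthetic_points(theta_private, k, input_dim, rng=None):
    """Release k synthetic (x, y) pairs from a privately trained linear model.

    theta_private: parameter vector produced by a DP training procedure
                   (e.g., output perturbation or noisy gradient descent).
    The seeds x_i are drawn at random and released together with the labels;
    the model theta_private itself is never published.
    """
    rng = np.random.default_rng(rng)
    X_seed = rng.normal(size=(k, input_dim))   # random (non-adversarial) seeds
    y_synth = X_seed @ theta_private           # synthetic labels from the hidden model
    return X_seed, y_synth

# The paper's question: for k small relative to input_dim, is the privacy loss of
# (X_seed, y_synth) strictly smaller than that of releasing theta_private directly?
```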
Questions for Authors
No specific questions at the moment.
Claims and Evidence
The claims in the paper are well supported by theorems, propositions and lemmas. I reviewed the theoretical results in the main paper, which appear well structured and correct. I did not review proofs and other results in the Appendix.
Methods and Evaluation Criteria
The goal of the paper is to provide an initial theoretical framework to study the phenomenon of privacy amplification through synthetic data. This is achieved mainly via theoretical analysis that is appropriate with respect to the overall goal.
Theoretical Claims
I checked all proofs and results in the main text, and to the best of my knowledge they seem correct. I did not, however, have the time to review the results in the Appendix.
Experimental Design and Analysis
N/A
Supplementary Material
No, I did not review the Appendix due to time constraints.
Relation to Prior Work
The main contribution of the paper is to set up the theoretical framework to study privacy amplification via synthetic data, a phenomenon that was empirically highlighted in work by Annamalai et al. (2024), and in part explored by Neunhoeffer (2024) in a more limited context where training data is one-dimensional and the generative model is a Gaussian with mean and variance estimated privately from the data. This paper uses linear regression to study the phenomenon in a more extended way. The contribution is two-fold: i) the author(s) first prove a negative result: for both output perturbation and noisy gradient descent as methods to privately train the generative model, releasing synthetic data from fixed inputs does not lead to privacy amplification (Theorem 3.1 and 3.4 respectively); ii) then, the paper proves privacy amplification for the single release case (Theorem 4.8) and the more general case of multiple releases (Theorem 4.11). To the best of my knowledge, these contributions are novel, and pave the way for new valuable results in this line of research.
Missing Important References
I don't think any essential related work was left out of the discussion.
Other Strengths and Weaknesses
Strengths:
- The paper addresses an important open question by developing a theoretical framework for quantifying privacy guarantees in synthetic data release, specifically in the context of linear regression. This rigorous approach helps fill a gap in understanding privacy amplification in generative models.
- The paper presents both positive and negative results. It demonstrates that privacy amplification is possible under certain conditions (with random inputs) while also highlighting scenarios where the privacy benefits don't hold, such as when an adversary controls the synthetic data generation seed.
Limitations:
- Restricting the focus to linear regression provides a clean case study but limits the generalizability of the findings: it’s unclear how well these results could extend to more complex models.
- As stated by the author(s), while these findings lay the groundwork for better insights into private synthetic data, their practical impact is limited.
Other Comments or Suggestions
No additional comments or suggestions at the moment.
We thank Reviewer c2xJ for their interesting and positive feedback. Below, we address each concern separately.
Limitations
Restricting the focus to linear regression provides a clean case study but limits the generalizability of the findings: it’s unclear how well these results could extend to more complex models.
We agree that the generalization of our results to deeper models presents significant challenges. However, we believe our findings have the potential to be leveraged in broader settings. For instance, the post-processing theorem ensures that the results also apply to regression problems with activation functions---such as logistic or ReLU regression---provided that Lipschitzness and convexity are preserved.
A promising direction for broader applicability is private fine-tuning of the last layer in deeper neural networks, which maintains the linear regression framework. However, modeling the distribution of the noise in this setting becomes more challenging, as the transformation of the Gaussian input through the layers alters its statistical properties. We leave this for future work.
This paper investigates the privacy amplification effect that could be gained when hiding the model that has been used to generate differentially-private synthetic data. The objective is to be able to quantify the privacy gain obtained by releasing only a limited number of synthetic data and not the model itself. More precisely, the authors show that releasing a number of synthetic profiles smaller than the input dimension provides strong privacy guarantees.
Questions for Authors
It would be great if the authors could comment on the potential of the approach to generalize to other types of models.
Claims and Evidence
Currently, the paper does not contain any experiments for validating the theoretical claims made. If possible, it would have been great to conduct some auditing experiments on controlled datasets to be able to verify these claims.
Methods and Evaluation Criteria
The paper takes a novel approach of trying to model the worst case for the generative process by giving control of the seed to the adversary. While this approach is promising, there is, however, no experimental methodology proposed for validating the performance of such an adversarial approach in practice.
Theoretical Claims
The theoretical claims are made with respect to two different variants of differential privacy, namely f-DP and Rényi DP. Ideally, it would have been great if the authors could have elaborated on why such notions are necessary compared to the classical DP definition.
Nonetheless, the authors have been able to show that, in the specific case of differentially-private linear regression, there exist situations in which, if the adversary is able to manipulate the randomness used by the generative process, they can achieve the theoretical upper bound on privacy leakage.
To be frank, the proofs are highly technical and specialized and I do not have the expertise to validate them thoroughly.
Experimental Design and Analysis
The theoretical analysis seems sound, although as mentioned earlier I do not have the technical expertise to validate it fully. However, no experiments are set up to validate it.
Supplementary Material
I have reviewed the supplementary material, however as mentioned previously the proofs in appendices are technically heavy and I do not have the expertise to thoroughly validate them.
Relation to Prior Work
The main results of the paper contribute to a better understanding of the privacy guarantees that are possible in a context in which only synthetic data is released and not the model itself. However, the impact is limited in the sense that the results only hold if the number of released synthetic samples is very small compared to the input dimension, which greatly limits their practical interest.
Missing Important References
I do not see any missing important references, rather the authors have done a good job at reviewing the corresponding state-of-the-art.
Other Strengths and Weaknesses
The paper is well-written and the authors have done a good job at explaining the current state of the art on the evaluation of DP guarantees of synthetic data. The theoretical analysis conducted is interesting but only holds for a very small release of synthetic data, and thus I consider that the term "privacy amplification" used in the title is exaggerated. There is also a lack of experiments to validate the theoretical claims in practice.
Other Comments or Suggestions
A small typo: « Without loss of generarilt » -> « Without loss of generality »
We thank Reviewer RSvt for their feedback. Below, we address each concern separately.
Weaknesses
The term "privacy amplification" used in the title is exaggerated
In our paper, the phrase "privacy amplification from synthetic data release" refers to potential privacy gains achieved by releasing only synthetic data while keeping the generative model hidden. We demonstrate that this privacy amplification does not occur when the adversary controls the seed. However, existing empirical studies suggest the existence of this effect when the seed is randomized (e.g., [1]). This empirical observation motivates our research question: Can privacy amplification occur from synthetic data release, or are existing membership inference attacks simply insufficiently powerful to achieve the maximal privacy leakage?
To address this, we conduct a rigorous theoretical analysis in a simplified linear regression setting. Our results are the first to show that under certain conditions, privacy amplification can indeed occur—even achieving perfect privacy as the dimension $d$ increases. While our analysis applies to a specific setting, it does not rule out amplification in more general cases. Instead, our work highlights scenarios where amplification is provable, laying the groundwork for deeper theoretical exploration in broader, more realistic contexts. This motivation is reflected in our title: "Insights from linear regression".
[1] Annamalai, M. S. M. S., Ganev, G., and Cristofaro, E. D. "What do you want from theory alone?" Experimenting with tight auditing of differentially private synthetic data generation. USENIX Security 2024
There is also a lack of experiments to validate practically the theoretical claims
This is a theoretical paper, and experiments are not necessary to support rigorously proven claims. However, we agree that empirically estimating the privacy guarantees, for example to assess the tightness of our theoretical results, is an interesting idea and we thank the reviewer for this suggestion.
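As a possible starting point, a simple auditing experiment along these lines could look as follows (a hypothetical sketch, not taken from the paper: it estimates an empirical epsilon lower bound from a threshold membership test on a scalar statistic of the released output, ignoring delta and confidence intervals):
```python
import numpy as np

def empirical_eps_lower_bound(run_mechanism, D, D_adj, statistic, threshold,
                              n_trials=10_000, rng=None):
    """Crude DP audit: threshold test on outputs from two adjacent datasets.

    run_mechanism(dataset, rng) -> released output (e.g., synthetic samples);
    statistic(output) -> scalar used by the distinguishing attack.
    """
    rng = np.random.default_rng(rng)
    s0 = np.array([statistic(run_mechanism(D, rng)) for _ in range(n_trials)])
    s1 = np.array([statistic(run_mechanism(D_adj, rng)) for _ in range(n_trials)])
    fpr = np.mean(s0 > threshold)    # attack says "D_adj" although data was D
    fnr = np.mean(s1 <= threshold)   # attack says "D" although data was D_adj
    fpr, fnr = max(fpr, 1.0 / n_trials), max(fnr, 1.0 / n_trials)  # avoid log(0)
    # For an eps-DP mechanism, 1 - fnr <= exp(eps) * fpr and 1 - fpr <= exp(eps) * fnr.
    return max(np.log((1.0 - fnr) / fpr), np.log((1.0 - fpr) / fnr))
```
Comparing the resulting lower bound for the synthetic-data release against the theoretical trade-off functions would give an empirical sense of how tight the amplification bounds are.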
Questions
It would be great if the authors could comment on the potential of the approach to generalize to other types of models.
We agree that the generalization of our results to deeper models presents significant challenges. However, we believe our findings have the potential to be leveraged in broader settings. For instance, the post-processing theorem ensures that the results also apply to regression problems with activation functions---such as logistic or ReLU regression---provided that Lipschitzness and convexity are preserved.
A promising direction for broader applicability is private fine-tuning of the last layer in deeper neural networks, which maintains the linear regression framework. However, modeling the distribution of the noise in this setting becomes more challenging, as the transformation of the Gaussian input through the layers alters its statistical properties. We leave this for future work.
About $f$-DP and Rényi DP: "Ideally, it would have been great if the authors could have elaborated on why such notions are necessary compared to the classical DP definition."
We chose to consider $f$-DP because it is a tight way to track the privacy guarantees at all budgets as trade-off functions. In fact, $f$-DP is the most informative DP notion for the Blackwell order [2]. While alternatives such as privacy profiles could also be considered, our analysis fundamentally relies on the approximation of trade-off functions. For other privacy definitions, this would require other tools and may lead to looser results. In addition to $f$-DP, we considered Rényi DP for its more interpretable privacy bounds, which are easier to grasp than trade-off functions.
[2] Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3–37, 2022.
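For readers less familiar with these notions, the definitions relied on here are, following Dong et al. [2]:
\[
T(P, Q)(\alpha) \;=\; \inf\big\{\, 1 - \mathbb{E}_{Q}[\phi] \;:\; \mathbb{E}_{P}[\phi] \le \alpha \,\big\},
\]
the optimal type-II error of a test $\phi$ of $H_0\!: P$ versus $H_1\!: Q$ at type-I error level $\alpha$. A mechanism $M$ is $f$-DP if $T\big(M(D), M(D')\big) \ge f$ for all adjacent $D, D'$, and $\mu$-Gaussian DP corresponds to $f = G_\mu := T\big(\mathcal N(0,1), \mathcal N(\mu,1)\big)$.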
This paper studies the problem of private synthetic data generation in linear regression.
The reviewers agree that the paper presents interesting theoretical results and novel analytic tools. The authors also try their best to address the reviewers' concerns, especially the one regarding the technical flaws. Since most of the reviewers support the paper, I recommend weak accept.