PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining
PANORAMIA is a practical privacy audit for ML models that does not retrain the target model or alter its training set or training algorithm.
Abstract
Reviews and Discussion
The paper proposes a method to estimate a lower bound on the (pure) Differential Privacy parameter $\varepsilon$ for already trained machine learning models, using a post-hoc empirical evaluation based on conducting Membership Inference Attacks. Building on the analysis by Steinke et al., 2023, for auditing DP with O(1) training runs, this paper tweaks the setting by using synthetically generated canary data points that closely resemble the training data population, rather than inserting or omitting pathologically crafted canaries that degrade the ML model's performance. This approach is useful because if the canary distribution and the training distribution are (nearly) identical, an already trained model can be considered as the output of the training algorithm on a random partition of the combined training and canary datasets. This means any post-hoc auditing analysis based on membership inference on a randomly and independently selected training dataset yields a valid lower bound on $\varepsilon$. To get such a canary distribution, the authors suggest training a synthetic data generation model to create a canary dataset with a distribution similar to the population. Additionally, the authors extend Steinke et al.'s analysis to situations where the synthetic data distribution is $c$-close to the true population in a DP-like divergence, although the estimator does not technically yield a lower bound in this case. Through extensive empirical evaluations, the paper demonstrates that the proposed estimation technique is useful as it provides reasonable $\varepsilon$ estimates that approximate the DP lower bounds.
Strengths
The paper focuses on an important problem and has the following merits.
- The paper proposes an auditing method that seeks to eliminate the need to alter anything about the training process of ML models, thus allowing post-hoc auditing.
- The paper proposes that using synthetic data that resembles the training data, instead of adversarially created canaries, can reduce the drop in model performance while offering useful DP estimations.
- Empirical evaluations suggest that the estimator correlates with the DP upper bounds and can help identify situations of high privacy leakage.
- The appendix provides an attempt to extend the budget estimator to $(\varepsilon, \delta)$-DP.
Weaknesses
- I'm not entirely certain that the analysis presented in Proposition 2 is correct. In the proof of Proposition 2 (lines 433 and 434), the authors use the fact that the model $f$ is $\varepsilon$-DP with respect to the Bernoulli random variables $s_i$. But in Algorithm 1, we see that $f$ was trained on the member data in phase 1, which is independent of the $s_i$ sampled in phase 2. So, the model $f$ is independent of the selection (i.e., $f \perp s$). In other words, the equation after line 434 should evaluate to 1, as given $f$, the prediction remains the same whether $s_i = 0$ or $s_i = 1$. Perhaps Algorithm 1 requires one training run like Steinke et al., where the model is trained on the subset selected by $s$ as described in line 2 of Algorithm 1? If that is the case, then the claim that the auditing does not require retraining becomes invalid. If I'm mistaken, could you explain Proposition 2 further?
- In line 196, the authors mention that Algorithm 1 (lines 4-7) does a sweep over all the recall values of the attack, and they adjust the overall significance level (the $p$-value) by taking a union bound over all applications of equation (1), with the significance level discounted to $\beta/m$. The value of $m$ in the experiments ranges from 500 to 30,000. For such values, sweeping over all recall values requires the discounted level $\beta/m$, whereas when the level of recall is predetermined, the significance level in equation (1) is only $\beta$. So, I'm not sure if doing a sweep over recall values will give larger DP estimate values.
- Use of synthetic data from a distribution that is $c$-close (with a small $c$) can make it difficult for the membership inference attacks to work well. If the goal is auditing DP, perhaps the higher MIA precision for a given recall outweighs the drop in the ML model's performance.
- The figures haven't been explained very well. In particular, the theoretical maximum precision in Figures 3 and 12 and the empirical maximum value in Figure 4 aren't clear.
- The paper only studies the problem of estimating pure DP and does not present an operationalizable algorithm for $(\varepsilon, \delta)$-DP, although some results along these lines are motivated in the appendix. Additionally, the obvious weakness (which the authors acknowledge) is that their method does not technically provide a lower bound on $\varepsilon$.
Minor Points
- Intuitively, when $c = 0$, or in the Real-Member;Real-Nonmember (RM;RN) case where the non-members follow the same distribution, I think the no-retraining-needed argument could work. This would involve modifying Algorithm 1 and reworking Proposition 2 by (A) assuming entries in the member and non-member sets are i.i.d. from the same distribution and (B) setting $s$ according to the original train-test split instead. On the other hand, when $c > 0$, I'm not sure if such an argument can be made to work, at least not trivially.
- It's not clear how the mechanism in Proposition 1 incorporates the helper model mentioned in Section 5.1 and used in the experiments.
- Algorithm 1 (lines 4-7) does not seem to reflect the $p$-value adjustment discussed in lines 198-200. Perhaps the authors overlooked this $p$-value adjustment in the experiments as well?
Questions
- How is the helper model used in the baseline trained?
- The lexicographic order in equation (2) appears to be essential for Corollary 2 and seems to be aimed at ensuring a particular relation between the $c$ and $\varepsilon$ estimates to hold. Could the authors provide more details on this?
- How exactly does the formulation in the paper differ from the setting introduced by Steinke et al., 2023, specifically with regard to the independence assumptions and the probabilistic dependencies between the concerned random variables?
Limitations
The authors have acknowledged some limitations in the paper. There do not appear to be any negative societal impacts associated with this work. I encourage the authors to address the issues and questions raised in this review.
Thank you for the thoughtful review.
How is the helper model used in the baseline trained? (Q1)
The helper model has the same classification task and architecture as the target model. To train it, we generate separate sets of training and validation data using our generator. For image data, the synthetic samples do not have a label. We thus train a labeler model, also a classifier with the same task, on the same dataset used to train the generator. We use the labeler model to provide labels for our synthetic samples (training and validation sets above). We train the helper model on the resulting training set, and select hyperparameters on the validation set. We give more experimental details for each modality in App. C (l. 473-474, 545-546, 567-568, 586-592). App. D1 further highlights the various architectures of the helper model we tried (l. 638, Table 5), including training the helper model on real non-member data. Table 5 shows the helper model trained on generated data gives the strongest baseline (highest $c_{lb}$).
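For concreteness, a minimal, runnable sketch of this labeler-then-helper pipeline, with toy stand-ins (Gaussian features, a 2-class task, logistic-regression classifiers); the names are illustrative, not our implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1) Train a labeler on the (real) data that was also used to train the generator.
real_X = rng.normal(size=(5000, 16))
real_y = (real_X[:, 0] > 0).astype(int)            # stand-in classification task
labeler = LogisticRegression(max_iter=1000).fit(real_X, real_y)

# 2) Generate synthetic train/validation sets (stand-ins for generator samples).
synth_train = rng.normal(size=(4000, 16))
synth_val = rng.normal(size=(1000, 16))

# 3) Train the helper model on labeler-labeled synthetic data;
#    the validation split is used for hyper-parameter selection.
helper = LogisticRegression(max_iter=1000).fit(synth_train, labeler.predict(synth_train))
print(helper.score(synth_val, labeler.predict(synth_val)))
```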
The lexicographic order in Eq. (2) appears to be essential for Corollary 2 and seems to be aimed at ensuring a particular relation between the $c$ and $\varepsilon$ estimates to hold [...]? (Q2)
The lexicographic order in Eq. (2) is not there to ensure such a relation (counter-examples exist). The technical step relying on this order is the construction of the confidence interval (Corollary 2, l. 187-188) based on the hypothesis tests from Prop. 1 and 2.
The reason for this order, with $c$ first, is that Prop. 2 asks: ``if the generator is $c$-close, is the target model $\varepsilon$-DP?'' We thus must reject any hypothesis with a $c$ we can disprove based on data, before computing $\varepsilon$ based on a plausible $c$. Without this order (e.g., putting $\varepsilon$ first), we would not necessarily compute $\varepsilon$ based on a plausible $c$. We would assign some of the MIA performance to privacy leakage, when we know for a fact it comes from the generator being too far from the data distribution.
How exactly does the formulation in the paper differ from the setting introduced by Steinke et al., 2023 [...]? (Q3)
The key difference lies in how we create the audit set. In Steinke et al., 2023, the audit set is fixed, and data points are randomly assigned to member or non-member by a Bernoulli random variable $s_i$. Members are actually used in training the target model $f$, while non-members are not (so assignment happens before training). In our framework, we take a set of known i.i.d. members (after the fact), and pair each point with a non-member (generated i.i.d. from the generator distribution). We then flip $s_i$ to sample which one will be shown to the ``auditor'' (MIA/baseline) for testing, thereby creating the test task of our privacy measurement.
Regarding the independence assumptions, $s$ is independent of everything by construction, as in Steinke et al., 2023. In fact, if we replace our generated data with in-distribution independent non-members, we exactly enforce the same independence, except that we waste some data by drawing our auditing game after the fact. Using generated data adds another complexity, as $s_i$ ``leaks'' through the actual data point shown to the auditor (based on differences between the generator and member data distributions), which we address with the baseline model.
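A minimal sketch of this pair-and-flip construction (names such as `build_audit_set` are illustrative, not our implementation):

```python
import numpy as np

def build_audit_set(members, generated, seed=0):
    """Pair each known member with a generated non-member, then flip a fair coin
    s_i to decide which of the two is shown to the MIA/baseline.
    `members` and `generated` are equal-length collections of i.i.d. samples."""
    rng = np.random.default_rng(seed)
    m = len(members)
    s = rng.integers(0, 2, size=m)  # membership bits, Bernoulli(1/2)
    shown = [members[i] if s[i] == 1 else generated[i] for i in range(m)]
    return shown, s                 # the auditor sees `shown` and is scored against `s`
```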
Correctness of Proposition 2. (W1)
We hope Q3 above already helps resolve the misunderstanding about the equation at line 434 and Prop. 2. To expand: while $f$ is independent of $s$ (marginally, $f \perp s$), they are not conditionally independent when conditioning on $X$ (where $X$ is the test set for the MIA/baseline), so $f \not\perp s \mid X$. This is because each $x_i$ is either a member or a non-member based on $s_i$, and the member is a data point on which $f$ was trained!
Hence, the Eq. below l. 434 does not evaluate to $1$. Deliberately ignoring $c$ for simplicity, the Eq. reads as the ratio between: (numerator) the probability of the membership guess (a post-processing of $f$ and $x_i$) conditioned on the event that the shown input $x_i$ is a generated non-member; and (denominator) the probability of the same membership guess (still a post-processing of $f$ and $x_i$) conditioned on the event that $x_i$ is now a member of $f$'s training set. If $f$ is DP, the membership guess is a post-processing of a DP result on two neighboring datasets ($x_i$ wasn't in the training set vs. $x_i$ was in the training set), and hence obeys the inequality shown after l. 434. Note that collisions (where we have $s_i = 0$ but $x_i$ is also in the training data) make the ratio equal to $1$: the inequality is still true and the theory applies, but the privacy measurement has no power (see discussion on overfit generators in Q1 of reviewer 55Rd).
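Written out schematically (with $f$ the target model, $x_i$ the shown point, $T(f, x_i) \in \{0,1\}$ the membership guess, and $c$ ignored; the notation here is a simplified rendering of the paper's), the inequality below l. 434 is the standard DP post-processing bound:

$$\Pr\big[T(f, x_i) = 1 \mid s_i = 0\big] \;\le\; e^{\varepsilon}\, \Pr\big[T(f, x_i) = 1 \mid s_i = 1\big],$$

i.e., guessing ``member'' on the shown point cannot be much more likely when $x_i$ is the generated non-member than when it is the member on which $f$ was trained.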
Adjusting significance level does not give larger DP estimate values. (W2)
We adjust the significance level because we need to compare the highest lower-bounds implied by the MIA and the baseline. Notice on Fig. 3a and 4a that the highest precision values and implied lower-bounds can happen at different (unknown in advance) recall levels. If we were to fix a recall value in advance, we could get misleading results based on whether that value is closer to the best value for the baseline or for the MIA. We thus have to make the comparisons at different recall levels.
In this case our tests from Prop. 1, 2 (which are at a fixed number of guesses) need a union-bound to be correct. We could select the maximum without a union bound as a heuristic, but in practice it barely changes the results. This is because both the baseline and MIA numbers are lowered in a similar way by their respective union-bounds, so the gap between the two remains very similar.
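An illustrative sketch of this union-bounded sweep, simplified to the i.i.d. Bernoulli(1/2) member-detection game of Steinke et al., 2023 (this is not our exact Algorithm 1; `eps_lower_bound` and the epsilon grid are hypothetical choices):

```python
import numpy as np
from scipy.stats import binom

def eps_lower_bound(scores, labels, beta=0.05, eps_grid=np.linspace(0.0, 10.0, 201)):
    scores, labels = np.asarray(scores), np.asarray(labels)  # labels: 1 = member, 0 = non-member
    m = len(scores)
    order = np.argsort(-scores)          # guess "member" for the r highest-scored points
    correct = np.cumsum(labels[order])   # correct guesses among the top r
    best = 0.0
    for r in range(1, m + 1):            # sweep over recall levels (number of guesses)
        v = correct[r - 1]
        for eps in eps_grid[::-1]:       # largest certifiable epsilon first
            p = np.exp(eps) / (1.0 + np.exp(eps))
            # Under eps-DP, #correct among r guesses is (roughly) dominated by Binomial(r, p);
            # reject eps-DP at level beta/m (union bound over the m recall levels).
            if binom.sf(v - 1, r, p) <= beta / m:
                best = max(best, eps)
                break
    return best
```

The same routine would be run for both the baseline and the MIA scores, and the two resulting numbers compared, which is why their union bounds largely cancel out.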
Auditing for $(\varepsilon, \delta)$-DP algorithms. (W5)
We do analyze the $(\varepsilon, \delta)$-DP case in App. E2, and show results in Table 10 in App. E3.
We believe we addressed the technical concerns raised in the review, and clarified the role of our lexicographic ordering, how our approach differs from that of Steinke et al., 2023, and cleared the concern about Proposition 2. Please let us know if there are any more concerns on these or other topics!
This paper proposes a novel privacy auditing procedure called PANORAMIA. The method works with a single model that uses all available training data, with no modifications to the training procedure, and with only access to a subset of the training data. This is achieved by synthetically generating non-member samples for auditing. The method produces an estimate of the differential privacy parameter, which, however, is not necessarily a lower bound.
Summary of the method given a model and its training data:
- Train a generative model on a subset of the training data.
- Use the remaining training samples as "members" for auditing.
- Generate an equal amount of "non-members".
- Split both into training and test sets.
- Fit a baseline classifier that distinguishes synthetic and real data on the training set (the only inputs are the samples themselves).
- Fit a membership inference classifier on the training set (inputs are samples and model statistics).
- Use predictions on the test set and hypothesis testing to obtain a confidence interval for quantities of interest.
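A toy, runnable sketch of the pipeline above (all components are illustrative stand-ins: a Gaussian "generator", a scalar "model statistic" that is lower on members, and logistic-regression attackers; this is not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

members = rng.normal(0.0, 1.0, size=(2000, 8))     # known training members kept for auditing
generated = rng.normal(0.1, 1.0, size=(2000, 8))   # synthetic non-members from the generator
stat_member = rng.normal(0.5, 0.3, size=2000)      # target-model loss on members (memorization -> lower)
stat_generated = rng.normal(1.0, 0.3, size=2000)   # target-model loss on non-members

X = np.vstack([members, generated])
stat = np.concatenate([stat_member, stat_generated])
s = np.concatenate([np.ones(2000), np.zeros(2000)])  # 1 = member, 0 = synthetic non-member

idx = rng.permutation(len(X))
train, test = idx[:2000], idx[2000:]

baseline = LogisticRegression(max_iter=1000).fit(X[train], s[train])                      # samples only
mia = LogisticRegression(max_iter=1000).fit(np.column_stack([X, stat])[train], s[train])  # samples + model statistics

# Held-out predictions of both classifiers feed the hypothesis tests that turn
# precision at a given recall into the closeness and privacy measurements.
baseline_scores = baseline.predict_proba(X[test])[:, 1]
mia_scores = mia.predict_proba(np.column_stack([X, stat])[test])[:, 1]
```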
The hypothesis is of the form "The generative model is $c$-close to the true data distribution and the training procedure is $\varepsilon$-DP", where $c$-closeness is a distance measure defined in this paper. The baseline predictions are used to reject the first claim, and the membership predictions to reject the second one. Hence, the returned $\varepsilon$ is a lower bound on the true $\varepsilon$ only if the generative model is indeed at least $c$-close to the data distribution. The theoretical analysis relies on results from the O(1) procedure by Steinke et al., 2023, but adapts them to the relaxed setting in this paper.
For evaluation, the paper applies the new auditing procedure to various CNNs with and without DP guarantees on image data, small GPT2 models on text, and on tabular data. As baselines, the evaluation uses the O(1) method by Steinke et al., 2023, and a variant with real instead of generated non-members. PANORAMIA does not outperform the baselines, but achieves reasonably close results on image and text data.
Strengths
- This paper considers a practical and relevant setting for privacy auditing. In particular, training the model to audit using all training data is a big benefit: otherwise, the audit either uses a slightly different setting that uses fewer training samples, or one has to forfeit a potentially large amount of training data (which hurts utility). Additionally, the procedure works even when the auditor knows only part of the training data, which can be relevant (e.g., federated learning).
- The experiments are broad (considering both DP and non-DP training, and different degrees of overfitting), sound, and the conclusions are convincing. Although the results are weaker than for existing methods, those existing methods use a less appealing setting. A particularly convincing result is that PANORAMIA with synthetic non-members often comes close to the same method but using real non-member samples.
- The paper is overall structured well and the auditing procedure is laid out clearly. The authors manage to explain the auditing results well, despite their complex interpretation. I also appreciated that Algorithm 1 collects all parts of the method into one place, including the calculations to obtain confidence intervals.
Weaknesses
- The considered privacy semantics are partially misleading and could be made more clear.
- The presented auditing game is incomplete; Definition 2 only describes how samples are selected but should also include the goal and actions of the adversary.
- The paper states (as a benefit) that PANORAMIA audits a target model, not the algorithm (L46--47). However, the results of the procedure are DP parameters, and DP is always a property of an algorithm (not model). If the considered semantics are meant to be different from DP, they should be made more clear and discussed.
- Similarly, the paper claims that PANORAMIA does not require worst-case canaries because it audits the target dataset (via a model). However, this ignores that certain samples in the training data might be more vulnerable; hence, auditing should still consider the worst-case sample in the training data (especially since privacy is averaged over the dataset). This might be an explanation why the reported "lower bounds on $\varepsilon$" are still far from tight in Figure 6c (as is the prior O(1) procedure in a white-box setting).
- Relatedly, the definition of c-closeness (Definition 3) might not require a generator to capture the tail of the data distribution. However, this tail contains outliers that are often particularly vulnerable to privacy leakage. Hence, ignoring such outliers might significantly underestimate the lower bound. I would have appreciated a more thorough discussion of this limitation (or a clarification).
- The paper could be more explicit about whether PANORAMIA is intended to be a procedure to be used in practice, or an important stepping stone towards more practical procedures. Right now, PANORAMIA achieves worse results than the existing O(1) method in a white-box setting, and requires training of a strong synthetic data generator (which might be non-trivial). Nevertheless, I believe this paper is still relevant from a conceptual perspective as a path for future work.
- There are minor notation issues in Propositions 1 and 2: the symbols mentioned in the statements do not match those used in the inequalities. Additionally, the left-hand sides both take probabilities over a random variable on which they simultaneously condition. Fixing those points makes the propositions clearer. Also, the last sentence on L148 seems misplaced.
Questions
- Might there be a way to use the membership classifier itself as a baseline, e.g., by somehow averaging over possible values for model statistics? This would avoid skewed scenarios where the MIA is better at detecting synthetic data than the baseline detector (even when ignoring membership signal).
- In Figure 6/Section 5.3, do all models use (approximately) the same number of training samples? If not, could there be confounding effects (e.g., if models trained on fewer samples leak more privacy)?
- Is there a path to extend this paper's method to always yield a lower bound on $\varepsilon$ (w.h.p.)? Or would this require radical changes?
Limitations
The authors are very transparent about the limitations of their work and discuss them openly.
Thank you for the thoughtful review. Hereafter, we start with answers to the questions, before addressing other weaknesses listed that we believe stem from a miscommunication.
Questions:
Q1: Might there be a way to use the membership classifier itself as a baseline, e.g., by somehow averaging over possible values for model statistics? This would avoid skewed scenarios where the MIA is better at detecting synthetic data than the baseline detector (even when ignoring membership signal).
Thanks for this interesting suggestion. If we could somehow capture a distribution of "non-member losses" for this data point, we might be able to marginalize out the dependency of the MIA on the target model and extract the input-dependent part. A key challenge with such a design would be to make sure that the resulting baseline is indeed as strong as it can be, and that input differences do not ``leak through'' the target model loss for member data. In our experiments, the baseline is often better at picking up signal from the data point than the MIA model is (hence the cases, like in Figure 5 or some of the tabular data plots in the appendix, in which the baseline outperforms the MIA), which is in a sense the opposite problem. We believe it to be an interesting future work, and in general, any progress on discriminative models and MIAs will directly plug into and benefit our approach.
Q2: In Figure 6/Section 5.3, do all models use (approximately) the same number of training samples? If not, could there be confounding effects (e.g., if models trained on fewer samples leak more privacy)?
Experiments for all data modalities in S5.3 show the effect of varying test set sizes on the performance of both PANORAMIA and O(1), while keeping all other experimental variables, including the training dataset size, constant across all the models. This is to ensure the results we observe are primarily due to varying the test dataset size while keeping everything else controlled. We mention the training details for the models in Appendix C. In addition, we also show, in Appendix D4, how varying the training data for a fixed test set size impacts both PANORAMIA and baseline performance.
Q3: Is there a path to extend this paper's method to always yield a lower bound on $\varepsilon$ (w.h.p.)? Or would this require radical changes?
Great question! This is plausible, and we hope that we, or another group, will figure out how in the future. The most straightforward way to get a proper lower-bound on $\varepsilon$ in our framework is to figure out how to measure (or construct) an upper-bound on $c$, as the $\{c+\varepsilon\}$ lower bound we already measure then yields a lower bound on $\varepsilon$. Getting an upper-bound on $c$ in the general case seems challenging though. There might be a way to leverage DP to do this by construction (since DP bounds the gap between distributions), though we haven't figured out how to do it yet. This is definitely a promising avenue for future work.
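Schematically, with $c_{\max}$ denoting such a hypothetical upper bound and $\{c+\varepsilon\}_{lb}$ the quantity our MIA-based test already certifies:

$$\varepsilon \;\ge\; \{c+\varepsilon\}_{lb} - c_{\max},$$

so any valid $c_{\max}$ would turn our measurement into a proper high-probability lower bound on $\varepsilon$ (with the confidence levels combined accordingly).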
Other important clarifications:
Weakness 2 - Relatedly, the definition of c-closeness (Definition 3) might not require a generator to capture the tail of the data distribution. However, this tail contains outliers that are often particularly vulnerable to privacy leakage. Hence, ignoring such outliers might significantly underestimate the lower bound. I would have appreciated a more thorough discussion of this limitation (or a clarification).
In cases where the generative model is unable to effectively capture the long tail of the real data distribution it is modeling, it becomes easier for a baseline classifier to distinguish between real and synthetic non-members, and hence to tell the two distributions apart ($c$-closeness, like pure DP, does not allow such discrepancies in the tail, though our relaxed notion of closeness in Appendix E does). In practice, we do observe this phenomenon, especially with tabular data. We highlight this limitation via results on the tabular setting in Appendix D6 (we will add a pointer to the paper for clarity). In such a case, we show that the baseline is as strong or stronger than the MIA and PANORAMIA is unable to effectively detect any privacy leakage from the respective target model.
Weakness 3 - The paper could be more explicit about whether PANORAMIA is intended to be a procedure to be used in practice, or an important stepping stone towards more practical procedures. Right now, PANORAMIA achieves worse results than the existing O(1) method in a white-box setting, and requires training of a strong synthetic data generator (which might be non-trivial). Nevertheless, I believe this paper is still relevant from a conceptual perspective as a path for future work.
Thank you for the helpful feedback. We will clarify that we believe that our work is an important step towards solving the privacy measurement setup that we tackle, namely measurements without control of the training process and with distribution shifts between members and non-members. The end-goal for future work is a full-fledged approach (that provides a proper lower bound) with potential improvements over other approaches (e.g., optimizing the generator to improve the audit). However, we also believe that our framework can already be useful, for instance for providing improved measurements with more data (Fig. 6a); or to tackle privacy measurements in models for which there are no known in-distribution non-members, which as recently pointed out in [1] suffers from the same distribution shifts we tackle by using out-of-distribution non-members (our theory would apply to such a case, though whether the generative model part can help is still an open question).
[1] Das, Debeshee, et al. "Blind Baselines Beat Membership Inference Attacks for Foundation Models." 2024. https://arxiv.org/abs/2406.16201
I thank the authors for their response, which answered all my questions thoroughly and resolved Weakness 3.
However, after reading all other reviews and rebuttals, my concerns about the privacy semantics remain. In particular, a robust evaluation procedure should measure the privacy of the worst-case samples (in the dataset), not just an average over the dataset. For example, the high-level arguments in https://differentialprivacy.org/average-case-dp/ still apply for a fixed dataset. Often, samples from the tail of the data distribution are worst-case samples (in a fixed dataset). At the same time, this paper's method seems to fail if there are such samples, because the generator fails to synthesize similar non-members (I greatly appreciate the authors' transparency in this regard). While the O(1) procedure of Steinke et al., 2023 can suffer from similar issues, there one can simply pick an audit set that solely consists of worst-case samples. I don't see a similarly easy fix for PANORAMIA.
Nevertheless, I do not think this is a deal-breaker, and believe this paper's idea and method carry merit in themselves (e.g., towards a solution of the problems highlighted by Das et al., 2024).
The paper proposes a novel way for privacy auditing of machine learning models through MIAs. The framework they propose, PANORAMIA, aims to audit the privacy of an ML model post-hoc (so with no control over the training procedure), and with access to a known member subdataset. The method first consists of training a generative model on the known subset of members, which is then used to generate synthetic data samples from the same distribution. These synthetic points are then used as non-members, which, combined with the known member dataset, allows fitting and applying an MIA on the target model. Importantly, the authors recognize that there might be a distribution shift between members and synthetic non-members, so they fit a classifier without leveraging the target model as a baseline.
Next, they use the difference between the MIA performance (thus using the target model) and this baseline to estimate the privacy loss. Authors provide the formula (and add proof in appendix) to compute a value of epsilon approximating a lower bound on epsilon-DP.
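Schematically, as I read it (with $c_{lb}$ the baseline's certified closeness bound and $\{c+\varepsilon\}_{lb}$ the MIA's certified bound), the returned estimate is of the form

$$\tilde{\varepsilon} \;=\; \{c+\varepsilon\}_{lb} - c_{lb},$$

which is why it approximates, but does not guarantee, a lower bound on $\varepsilon$ (the true $c$ may exceed $c_{lb}$).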
They further apply the privacy auditing to three kinds of ML models (image classification, language modeling, and tabular data classification). The authors consider models with varying degrees of overfitting, DP training, and increasing amounts of member data available to the auditor.
Strengths
- Originality: The paper introduces a way to audit the privacy of ML models post-hoc, without any control of the training data, which is novel. They cite and position themselves correctly to relevant prior work such as Steinke et al.
- Quality: The proposed method is technically interesting, formally supported and evaluated extensively across data/model modalities.
- Clarity: NA
- Significance: The paper proposes a way to compute an approximation for the lower bound on epsilon, to audit ML models post-hoc, which is technically interesting.
Weaknesses
- Originality: Authors should include an appropriate related work section, touching on other privacy auditing techniques and potentially other (post-hoc) membership inference attacks.
- Quality: The proposed method strongly depends on the quality of a generator and the baseline MIA, the impact and limitations of which can be further explored (see questions/limitations).
- Clarity: I find that the paper's clarity can be improved significantly. The results section (tables, text) in particular is quite notation heavy, making it hard to follow what everything refers to.
- Significance: While technically interesting, the relevance of a technique to compute a proxy lower bound privacy loss in practice needs more compelling motivation (see questions).
Questions
- I understand how c-closeness allows estimating the quality of the used generator. However, what I struggle to understand is what happens when you develop a generator that is perfect (c=0) but just samples randomly from the known member subset. Then, a baseline classifier would not be able to distinguish members from non-members, and nor would an MIA, leading to a privacy estimate that will be far off. In less extreme cases, the generator might be slightly overfitted and indeed generate non-members very similar to members, which might also impact the MIA performance. Am I correct that this could have a significant impact on the validity of the procedure? And if so, how would you address the concern?
- More generally, PANORAMIA largely depends on a good generator and a baseline MIA. Can the authors elaborate on the associated limitations? For instance, can good generators be developed across all use-cases (smaller datasets, data modalities, etc.)? And how should a baseline be developed or evaluated to be used as part of the PANORAMIA framework?
- I understand that the method authors provide does not give a formal lower bound for the privacy loss, but rather a proxy for it. In practice, if indeed a hospital as part of an FL setup would like to assess the privacy leakage incurred by their data, why would they opt to compute an estimate for epsilon using your method? Instead, they could for instance generate non-members in the same way as panoramia and just quantify the MIA performance with an AUC or TPR at low FPR compared to a baseline. In general, authors should further motivate the relevance of a proxy for a lower bound privacy loss in practice.
- To further motivate post-hoc privacy auditing (without any control of the training data), I wonder if it makes sense to also emphasize the context of generative AI models such as LLMs? These models are increasingly trained on all the data model developers can acquire so the absence of non-member data is very real in practice.
- Can authors add a related work section?
Limitations
The authors are very clear that their method only provides a proxy for the lower bound of epsilon and thus carefully caveat their method. However, can the authors elaborate on the limitations associated with the development of both a generator and a baseline, both of which seem fundamental for PANORAMIA? Currently, their implementation seems ad hoc rather than adequately discussed for wider applications.
Thank you for the thoughtful review.
Q1: What happens when you develop a generator that is perfect (c=0) which just samples randomly from the known member subset.
In that case (a highly overfitted generator), the baseline and MIA will both output measurements close to $0$ (though technically $c \neq 0$, since $c$-closeness quantifies the distance between the generator's distribution and the distribution of members, not the specific member data points). This happens mainly because both the MIA and baseline detect leakage based on the same data, and assess the data in two ways: (1) the distributional difference in the data points themselves; and (2) for the MIA, the difference in distributions of loss values the target model attributes to the member vs. non-member data. In this case, our theory still applies but returns $\tilde{\varepsilon} \approx 0$. Thus, we would not faultily ascribe privacy leakage to the target model, but the measurement would also fail to detect any real privacy leakage (one could see that the measurement is failing though, since the measured $c_{lb} \approx 0$).
Q2: More generally, Panoramia largely depends on a good generator and a baseline MIA. Can authors elaborate on the associated limitations? [C]an good generators be developed across all usecases (smaller datasets, data modalities etc)? And how should a baseline performance be developed or evaluated to be used as part of the panoramia framework?
The ability of the generator to capture the data distribution well is a key dependency of PANORAMIA and the reason why we have thoroughly evaluated our approach. In particular, in the tabular case, datasets typically contain a large number of ``average'' samples and a long tail of extreme values or rare classes. In such a case, when the generated data does not capture the long tail, the baseline easily classifies all outliers as real data, as no synthetic data points are similar to them. This leads to a strong baseline that is hard to beat for the MIA, which means that the privacy measurement then fails (though we can diagnose why). This is a limitation of our work that we discuss in detail in App. D6 as a possible cause for the failure of the tabular data modality. In general, the baseline should be made as strong as possible to detect such failures of the generative model, following traditional ML best practices. For this reason, we have dedicated a lot of effort to designing and evaluating our baselines, including the helper models (S5.1 and App. D1).
Our design and theory offer practical advantages as well:
- As generative models improve, especially for smaller dataset sizes, so will our approach.
- The generator opens an interesting design space, in which one could try to optimize synthetic data for audit quality. While we have not yet explored this design space, we leave it as an interesting avenue for future work.
- Finally, our theory applies to other out-of-distribution non-member data, such as when using other datasets as non-members [1].
Q3: [T]he method authors provide does not give a formal lower bound for the privacy loss, but rather a proxy for it. [...] Why would [one] opt to compute an estimate for epsilon using your method? Instead, [one could] generate non-members in the same way as panoramia and just quantify the MIA performance with an AUC or TPR at low FPR compared to a baseline. [...] Authors should further motivate the relevance of a proxy for a lower-bound privacy loss in practice.
As is the consensus in the field, we believe that DP provides the best semantics to define and quantify this type of privacy leakage. Thus, privacy audits usually aim to provide a lower bound for the privacy loss. While we do not yet provide a lower bound, using DP semantics is still useful. For instance, [2] makes a convincing case that accuracy or AUC are not good metrics for privacy leakage. Rather, TPR at low FPR is better, but notice that we need to compare the MIA with the baseline, and their performance most revealing of privacy leakage happens at different FPR values (cf. Figures 3a and 4a, though in precision/recall terms). In this case, we cannot directly subtract TPR values. Our theory tells us: (1) which FPR we should choose and (2) how to scale the TPR at this FPR (by mapping it to a $c$ or $\varepsilon$ value) to make the values comparable between the baseline and the MIA.
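Schematically (ignoring the confidence corrections of Prop. 1 and 2), the mapping works as follows: a guesser limited by a multiplicative bound $e^{a}$ (with $a = c$ for the baseline and $a = c + \varepsilon$ for the MIA) cannot push its per-guess precision above $e^{a}/(1+e^{a})$ in our Bernoulli(1/2) game, so a measured precision $p$ at the chosen recall maps back to

$$a \;\gtrsim\; \log\frac{p}{1-p},$$

which puts the baseline and MIA numbers on the same scale and makes their difference meaningful.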
Taking a step back, we hope that our framework will be a stepping stone for further research, maybe enabling a proper lower-bound on $\varepsilon$ by measuring (or enforcing by construction) an upper-bound on $c$ (then $c + \varepsilon$ yields a lower bound on $\varepsilon$). Though we do not know how to do it yet, this is a promising future work.
Q4: To further motivate post-hoc privacy auditing (without any control of the training data), I wonder if it makes sense to also emphasize the context of generative AI models such as LLMs? These models are increasingly trained on all the data model developers can acquire so the absence of non-member data is very real in practice.
Thanks for this great point. Actually, a paper made public after our submission [1] highlights this exact issue. As you mention, in this case there is no known non-member data from the same distribution, so MIA benchmarks use out-of-distribution non-member data (e.g., more recent datasets). Thus, using generated data might yield better non-member data. Our theory applies equally to both real and synthetic non-member data, and we believe that it may be an interesting building block for that setting.
Q5: Can authors add a related work section?
We propose to add a more formal related work section to collect the closest works, and an extended one in appendix.
[1] Das et al. "Blind Baselines Beat Membership Inference Attacks for Foundation Models." 2024. https://arxiv.org/abs/2406.16201
[2] Carlini et al. "Membership inference attacks from first principles." S&P 2022.
We believe we clarified how our approach deals with overfit generators, and why the theory we developed goes beyond comparing AUC or TPR at low FPR between a MIA and baseline. We will add these clarifications to our paper and mention the generative AI use-case from the review (thank you!) as additional motivation. Please let us know if there are any more concerns on these or other topics!
Many thanks to the authors for their elaborate rebuttal and for addressing the concerns I raised.
While the value of a proxy for the lower bound arguably remains limited, I believe the method and evaluation in this paper provide valuable insights into developing MIAs in the absence of non-members with an appropriate model-less baseline. I believe this pushes the field forward, especially in light of recent concerns raised by [1,2,3]. I hence change my score to 5. Best of luck.
[1] Das et al. "Blind Baselines Beat Membership Inference Attacks for Foundation Models." 2024. https://arxiv.org/abs/2406.16201
[2] Carlini et al. "Membership inference attacks from first principles." S&P 2022.
[2] Duan, M., Suri, A., Mireshghallah, N., Min, S., Shi, W., Zettlemoyer, L., ... & Hajishirzi, H. (2024). Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841.
[3] Meeus, M., Jain, S., Rei, M., & de Montjoye, Y. A. (2024). Inherent Challenges of Post-Hoc Membership Inference for Large Language Models. arXiv preprint arXiv:2406.17975.
The authors propose a privacy auditing technique that utilizes partial access to member data to generate synthetic non-member data, which in turn is used to train a meta-classifier that can empirically measure privacy leakage relating to record membership. The authors evaluate their technique on models as large as GPT2 and find correlation between audit scores and the expected leakage from models.
Strengths
- L226-228: Good to include negative results!
- Having access to non-members is often taken for granted (especially when the auditor is also the model trainer), but in most third-party cases getting good-quality non-member data that is not significantly different from member distribution is hard. The methods proposed in this work can be useful there, with a generator and discriminator that can help ensure any differences in members and non-members (and subsequent MIA performance) do not arise from distributional differences.
- The paper is well written and structured, and experiments are thorough, spanning quite a lot of models and modalities.
Weaknesses
- As an auditor, access to validation/test data used in model training would not be a far stretch. Why not use the non-member set as in the standard MIA setup (validation/test data)? What is the added benefit from this extra step of generating synthetic non-members?
- Regarding the contributions, (1) and (2) have already been explored by [1, 2]. While the method in [1] does not currently support large models/datasets like CIFAR10, it would be nice to highlight the differences here. If you indeed have knowledge of a decent-sized chunk of members and non-members, I'd imagine you could do something better like [2], and other related methods.
- I have some concerns over the dependency on how well the baseline in/out distribution detection system works. As a concrete example, consider Figure 7(a, b): even as a human I see a very clear difference in resolution of the generated images and find it hard to believe that the distinguisher does not work well here. Even a non-ML technique (one that picks up on the blurring) would work pretty well here.
- Table 1: The accuracy of the models here is not good enough, especially when using the entire data, with very clear signs of overfitting. If using the entire data (and not a half split, as in most MIA evaluations [3, 4], where even with half the dataset test accuracy is ~92%), one should be able to train well-performing models that do not overfit so heavily. Please see this resource.
- Figure 1: While the trends suggest that the proposed audit correlates with actual leakage, one could also argue that (100 - test accuracy) is also a useful privacy metric in this comparison, given the correlation. The utility of the audit would be more apparent when studying models that have comparable test performance but (by design) different leakage, perhaps focusing on moderate/large values of $\varepsilon$ with Differential Privacy training.
- L288-290: (1) and (2) are ambiguously enforced: if you have a large portion of the train data, you can control the process in most cases, and also obtain non-member data. As far as (3) is concerned, that is a compute-related constraint, not a threat model difference. The auditor could, for instance, use available data knowledge to train "in" models.
- L331-334: I do not find the justification behind having access to partial member data convincing. While I am okay with just stating that "this is a possible limitation, but okay to assume for an auditor", relating it to how it could be useful in situations like FL is not practical. For instance, the empirical experiments here use more than 1/5th of the data; no FL training will have just 5 participants. This is closer to a specific case of distributed learning, or pure data aggregation. Even in such cases, the data distributions per client will be considerably different, as opposed to the experiments here which use uniform samples.
Minor comments
- Please consider a more descriptive abstract. "scheme" here is not very descriptive
- L121: "We formalize a privacy game" - this is standard membership inference evaluation and not a contribution of this work. Please rephrase to make this clear.
- L493-494: Please provide direct distinguishing accuracies of these baseline classifiers for reference and easier understanding.
References
- [1] Suri, Anshuman, Xiao Zhang, and David Evans. "Do Parameters Reveal More than Loss for Membership Inference?." High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning.
- [2] Nasr, Milad, Reza Shokri, and Amir Houmansadr. "Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning." 2019 IEEE symposium on security and privacy (SP). IEEE, 2019.
- [3] Carlini, Nicholas, et al. "Membership inference attacks from first principles." 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022.
- [4] Zarifzadeh, Sajjad, Philippe Cheng-Jie Marc Liu, and Reza Shokri. "Low-cost high-power membership inference by boosting relativity." (2023).
Questions
- Figure 3a suggests an ordering of 100 > 20 > 50 in terms of leakage for the high-recall region, whereas the numbers in Table 1 suggest that the trend should be 100 > 50 > 20 if the proposed method is indeed measuring what it is supposed to. Why is that so?
- L480-481: this means the "labeler" is much weaker and might be generating incorrect labels more often? Are there any numbers for what the actual test performance of this labeler is?
Limitations
There are a few statements that currently serve as justification for limitations (see above) but should just be posed directly as base assumptions. Apart from that (and some potential shortcomings in models used for evaluation), most limitations are already stated in the paper.
Thank you for the thoughtful review. We start with answers to the two explicit questions, before addressing two of the weaknesses listed that we believe stem from a miscommunication or misunderstanding.
Questions:
Q1 - Figure 3a suggests an ordering of 100 > 20 > 50 in terms of leakage for high-recall region, whereas numbers in Table 1 suggest that the trends should be like 100 > 50 > 20 if the proposed method is indeed measuring what it is supposed to. Why is that so?
In our experiments on all data modalities and models, the precision values that lead to the highest lower-bounds on $c$ or $\varepsilon$ are achieved at fairly low recall, but at different recall values (especially for images). For this reason, when training the baseline and MIA models, we use a validation set to select hyper-parameters and training stopping time that maximize the lower bounds (effectively maximizing the precision in the lower range of recall values) on the validation set. We will clarify this in Appendix C. This is why for some instances, like images, MIA models have the wrong order at high recall values. While they could probably be tuned to achieve higher precision and be ordered correctly, we do not do it since that regime is not where the highest lower-bounds happen. Indeed, we can see in both Figure 3 and the associated results in Table 1 that in the recall regions at which the lower-bounds are maximal, the order is as expected (100 > 50 > 20).
Q2 - L480-481: this means the "labeler" is much weaker and might be generating incorrect labels more often? Are there any numbers for what the actual test performance of this labeler is?
The labeler used to train the helper model can indeed be weaker and generate wrong labels. For instance, our labeler for image data has 80.4% test accuracy on the CIFAR10 real data test set. The rationale for this approach is to augment the baseline with a model providing good features (here for generated data) to balance the good features provided to the MIA by the target model outside of the membership information. In practice, the labeled generated data seems enough to provide such good features, despite the fact that the labeler is not extremely accurate. We studied alternative designs (e.g., a model trained on non-member data, no helper model) in Appendix D.1, Table 5, and the helper model trained on the synthetic data task performs best (while not requiring non-member data, which is a key point).
Other important clarifications:
Weakness 1 - [W]hy not use the non-member set as used in standard MIA setup (validation/test data). What is the added benefit from this extra step of generating synthetic non-members?
We do study this setup in Fig. 6, comparing a MIA using up to the full test data and our approach using generated data. On CIFAR10 we can observe that, for the same non-member dataset size, using real data performs better (when using an ML-based MIA instead of a loss-based attack in this case). However, generated data enables us to use more data points for training and evaluating the MIA (and baseline), leading to larger measurements of privacy leakage. This is true for both regular and DP models (Fig. 7a and 7c), despite the fact that the CIFAR10 test set is fairly large compared to the training set (20%) and such a large portion of training data may not be kept for testing in general.
Going in the same direction, a very recent work made public after our submission [1] shows that a similar member/non-member distribution shift issue happens when measuring privacy leakage from foundation models. In that case, the models are trained on vast amounts of data, and there is no known non-member data from the same distribution. As a result, MIA benchmarks use out-of-distribution non-member data (e.g., more recent datasets). Thus, using generated data might yield better non-member data. Additionally, our theory applies to both real and synthetic non-members data. Consequently, we believe that our approach may be an interesting building block for the setting described in [1].
Finally, we believe that the generator opens an interesting design space, in which one could try to optimize generated data for audit quality. We have not yet explored this avenue of research, but we believe that it is an interesting future work.
Weakness 3 - I have some concerns over the dependency on how well the baseline in/out distribution detection system works. As a concrete example consider Figure 7(a, b) - even as a human I see a very clear difference in resolution of the generated images and find it hard to believe that the distinguisher does not work well here. Even a non-ML technique (that can work around with blurring) would work pretty well here.
This is a great point, thank you. We apologize, as this is a miscommunication on our end: our CelebA experiments use a reduced image resolution, and thus our generative model generates images at that (lower) resolution. In Figure 7a though, we wrongly showed the full-resolution images, hence the resolution difference. We fixed this figure to display the real images as we use them, and included it in the associated PDF answer (Fig. 1), in which we can see that the difference in resolution disappears. Also note that the generator only needs to generate some good images, enough that it is hard to detect real members with high precision. This is because we play a one-sided member-detection game, as explained in Remark 3, ll. 137-144. Of course, it is always possible that much stronger baselines exist, although we did spend quite a bit of effort and time on making them as good as we could (see details in Appendix D of the submission).
[1] Das, Debeshee, et al. "Blind Baselines Beat Membership Inference Attacks for Foundation Models." 2024. https://arxiv.org/abs/2406.16201
...such a large portion of training data may not be kept for testing in general.
No model (at least in practice) is ever trained without any validation/test data. While it may not be as high as 20%, getting an estimate of privacy leakage of the underlying training mechanism/model should not require more than a small fraction of the data. I think a convincing argument can be made to use generative techniques, and the authors are close to one but not there yet.
Most of my other concerns remain unaddressed, specifically: claims around contributions, underperforming models, and how the score generated via the proposed technique is any better than directly measuring test accuracy as a signal to help understand relative leakage (and other misc. comments). I am hoping the authors will respond to them
Thank you for your follow-up!
There are two things we would like to emphasize about the validation set, in case they were lost in the rest of the answer. First, we do see that generated data can provide higher privacy leakage measurements than the ones we can measure using 20% non-member held-out data (Figures 7a and 7c). Second, we believe that foundation models are a counter-example to the claim that there is always such a large held-out set: as far as we know, there typically is no publicly known held-out set of in-distribution non-members (this is from an external point of view; it is likely that there is one internally). However, there are many known member data points, and externally measuring privacy leakage for such models is a topic of interest (see [1] in our answer).
We are happy to discuss the other comments as well!
Claims around contributions
In our contributions paragraph, we list three key properties of the setting we tackle. We do not mean to claim that we are the first to study each separately! We will rephrase this paragraph to clarify that the novelty of our approach lies in the combination of those properties. The two papers cited in the review focus on membership attacks, which is just one part of our approach. It would be interesting to see if we can use those attacks in Panoramia to yield better measurements, but this paper focuses on developing the privacy measurement framework that we propose. Regarding the specific membership inference papers:
- The paper cited in the review as ``[1] Suri, Anshuman, Xiao Zhang, and David Evans. "Do Parameters Reveal More than Loss for Membership Inference?."'' was first posted on arxiv in June'24, after our submission. It is indeed a very interesting candidate to use in Panoramia's privacy measurement in the future.
- The paper cited as ``[2] Nasr, Milad, Reza Shokri, and Amir Houmansadr. "Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning."'' has several membership attacks, and some might be compatible with our approach and could be interesting to try.
Note that in both cases, we would still need non-members to measure and assess the performance of the membership attacks (and hence the generative model in our setting), and we would still need a theoretical framework to meaningfully compare the results of the MIA and baseline to quantify privacy leakage (our theory). We will make sure to mention the papers and which parts of our framework they could improve.
underperforming models
In this paper, we focused on developing our privacy measurement framework, and studying its behavior when varying levels of overfitting (increasing the number of training epochs), changing the number of parameters, or making models DP. For this reason, we did not focus on using the most recent, state-of-the-art models, though it would be an interesting measurement study in the future. On the specific models we use for image data, we use a ResNet101, which reaches 93.75% test accuracy on CIFAR10 with ImageNet pre-training (that is why our version is a bit weaker at 91.61%, as we do not do this pre-training). We note that other papers listed in the review evaluate similarly performing models (a WideResNet with 92.4% in [2], 92% in [3], and 90% accuracy in [4]). We will add results for a stronger model. We will report results if they finish by the end of the discussion period, though we are not sure they will be ready given the computation power we have access to.
how the score generated via the proposed technique is any better than directly measuring test accuracy as a signal to help understand relative leakage
There are two main facets to this answer. First, DP is a well-accepted definition to measure privacy leakage and a good metric for our approach. Second, we need to compare two models (the MIA and baseline), and using our technique makes them comparable. They would not be comparable under other metrics. In more detail (slightly adapted from our answer to reviewer 55Rd):
As is the consensus in the field, we believe that DP provides the best semantics to define and quantify this type of privacy leakage. Thus, privacy audits usually aim to provide a lower bound for the privacy loss. While we do not yet provide a lower bound, using DP semantics is still useful. For instance, [2] makes a convincing case that accuracy or AUC are not good metrics for privacy leakage.
TPR at low FPR is a better metric than accuracy or AUC, but notice that we need to compare the MIA with the baseline, and their performance most revealing of privacy leakage happens at different FPR values (cf. Figures 3a and 4a, though in precision/recall terms). In this case, we cannot directly subtract TPR values. Our theory tells us: (1) which FPR we should choose and (2) how to scale the TPR at this FPR (by mapping it to a $c$ or $\varepsilon$ value) to make the values comparable between the baseline and the MIA.
Taking a step back, we hope that our framework will be a stepping stone for further research, maybe enabling a proper lower-bound on $\varepsilon$ by measuring (or enforcing by construction) an upper-bound on $c$ (then $c + \varepsilon$ yields a lower bound on $\varepsilon$). Though we do not know how to do it yet, this is a promising future work.
and other misc. comments
Please consider a more descriptive abstract. "scheme" here is not very descriptive
Thank you for the feedback, we will rework the abstract to avoid this generic term.
L121: "We formalize a privacy game" - this is standard membership inference evaluation and not a contribution of this work. Please rephrase to make this clear.
While formalizing a privacy game is standard, the specific one we introduce (definition 2) is a variation of the game from Steinke et al., 2023 (please see our answers to Reviewer Z4qp, Q3 and W1 for details on the difference) and, to the best of our knowledge, unique to our work.
L493-494: Please provide direct distinguishing accuracies of these baseline classifiers for reference and easier understanding.
We do not believe that accuracy is a meaningful metric for privacy leakage measurement (see [2] for a detailed case). Our focus on member detection, as opposed to both members and non-members (see Remark 3, l. 137), which is due to practical requirements on the generator, exacerbates these issues, as accuracy measures the detection of both members and non-members. As we argue above, we do believe that the metrics we report based on DP are the best suited to measuring privacy leakage.
We also show the whole precision/recall curve for some of the models in plots like those in Fig. 3. We could add numbers for the (precision, recall) tuples that imply the bounds in Table 3 (drawback: each bound is at a different recall), or for the precision at a fixed low recall (drawback: this will not be the ``best'' recall for all models, so comparisons are a bit misleading). Because of these drawbacks, we decided not to show this information in our submission, but we can add it if it is valuable.
[2] Carlini et al. "Membership inference attacks from first principles." S&P 2022.
Thank you for your detailed responses. I have increased my score.
..as we know there typically is no publicly known held-out set of in-distribution non-members (this is from an external point of view, it is likely that there is one internally).
For a privacy audit (which is the focus of this paper), the held-out data need not be public. And if the audit is indeed external, getting access to a large fraction of member data (to train the generator) is also an equally strong (if not stronger) assumption.
For this reason, we did not focus on using the most recent, state-of-the-art models
The idea here is not to necessarily pick the "absolute best" model, but have target models comparable in performance to the ones used in the literature. Using a weaker model (and likely higher overfitting, which is indeed the case) would naturally demonstrate higher membership leakage which might make readers believe that the proposed audit is a much better than existing MIAs.
We performed experiments on higher-performing models. Our early results (see table) are qualitatively similar to those in the paper: we can detect privacy leakage, with numbers consistent with those of the O(1) approach, and we detect more leakage from overfit models (e.g. WideResNet-28-10 trained for 300 epochs). We will update our paper with a full evaluation (multiple seeds, similar to Fig. 4) and growing test set sizes (similar to Fig. 6).
Thank you for your valuable suggestions, score increase, and involvement in the discussion!
| Target model | Test accuracy | Audit | $c_{lb}$ | $\{c+\varepsilon\}_{lb}$ (resp. $\varepsilon_{lb}$ for O(1)) | $\tilde{\varepsilon}$ |
|---|---|---|---|---|---|
| WideResNet-28-10_E150 (generalized model) | 95.67% | PANORAMIA | 2.51 | 5.05 | 2.54 |
| | | O(1) | - | 2.95 | 2.95 |
| WideResNet-28-10_E300 (overfit model) | 94.23% | PANORAMIA | 2.51 | 6.10 | 3.59 |
| | | O(1) | - | 3.76 | 3.76 |
| ViT-small-timm-transfer_E35 (pre-trained on imagenet) | 96.38% | PANORAMIA | 2.51 | 4.99 | 2.48 |
| | | O(1) | - | 2.83 | 2.83 |
Thank you for the thought-provoking reviews and suggestions! In our answers, we focus on key misconceptions regarding our paper and misunderstandings in the reviews. Hereafter, we summarize the most important points addressed:
- Potential baseline weakness (reviewer CsXP): this is an issue due to Figures 7(a, b) in which we wrongly put full-resolution member images while we work with a lower-resolution dataset for our experiments (hence with a generator outputting lower-resolution images). We fixed the figure (Figure 1 in the PDF attachment) to display the member images that we actually used.
- Technical question on the soundness of equation line 434 (reviewer Z4qp): we explain why this equation is the DP bound as we wrote, and why it does not evaluate to 1. We also clarify how our auditing game differs from (and resembles) the one from (Steinke et al., 2023), which is related to the soundness of the equation at line 434.
- Motivation for the approach (reviewers 55Rd, kScF, Z4qp), including: (1) why using DP is useful even without a lower-bound (i.e., it lets us compare the MIA and baseline on the same scale while being a good way to formalize and quantify privacy leakage); (2) the usefulness of our framework (e.g., the theory can help with other out-of-distribution non-member data, such as when using other datasets as non-members [1], our generator opens an interesting design space to maximize audit efficiency); and (3) that we believe that our current work might be a stepping stone to a more complete approach that yields a proper lower-bound.
We address all these issues and some of the other comments from each reviewer in the individual answers.
[1] Das, Debeshee, et al. "Blind Baselines Beat Membership Inference Attacks for Foundation Models." 2024. https://arxiv.org/abs/2406.16201
The paper presents a method for privacy auditing without access to real non-members by using generated examples as non-members. The paper builds a theoretical framework for addressing the auditing question approximately and presents experiments to test the method on a number of tasks in diverse domains (images, NLP, tabular data).
The reviews are generally positive, commending the paper on addressing a relevant and interesting problem and experiments covering diverse tasks, but raise concerns about correctness of certain results, lack of careful evaluation of how the quality of the generator impacts the results in practice, and the fact that the method does not produce strict lower bounds.
The only reviewer recommending rejection, Reviewer Z4qp, raised concerns about the correctness of some key results, which likely contributed to the negative review. The authors have addressed these in their response, but the reviewer has not reacted to the response. It would seem to me that the most serious concerns regarding the paper have thus been addressed although the review has not been updated to reflect this.
In light of these reviews, I believe the paper should be accepted.
I would however recommend the authors to consider the comments raised by the reviewers and add a more thorough evaluation and discussion of the impact of the quality of the generator to the results. It appears to me that the training protocol for the generators used in the experiments is really elaborate. This raises concern that it might be very difficult to replicate the strong results from the image and NLP models without equally arduous generator training.
Minor comments:
- The authors claim to use one of top-$k$/top-$p$ sampling in the NLP generator (e.g. line 224), but the parameters given on line 533 of the Appendix suggest it would be more correctly labeled as the other, since the claimed filtering has no effect under these parameters.
- Unlike what they claim in the paper checklist response #8, the authors do not disclose the compute time required by the experiments.