ICLR 2025 · Decision: Rejected

Overall rating: 3.5/10 (4 reviewers; lowest 3, highest 5, std 0.9)
Individual ratings: 3, 3, 5, 3
Confidence: 3.8 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.5

Semantic Membership Inference Attack against Large Language Models

Submitted: 2024-09-26 · Updated: 2025-02-05


Keywords: Membership Inference Attack, Large Language Models

Reviews and Discussion

Review (Rating: 3)

This paper proposes a new membership inference attack (MIA) for LLMs, called SMIA, with the goal to measure memorization. SMIA's intuition is that semantically similar “neighbors” of a sample contain membership signal. In the threat model, the adversary can query the target LLM to obtain losses/log probs, and has access to a set of known training members and non-members.

Concretely, given a target sample, the attack works as follows (a brief illustrative sketch follows the list):

  1. Generate "neighbors" of the target via repeated random masking and infilling (via T5 3B).
  2. Calculate semantic embeddings of the target sample and all neighbors (via Cohere Embedding v3).
  3. For every neighbor, calculate the difference to the target sample in terms of i) loss and ii) semantic embedding (full direction). Use the resulting 1 + 1024 values as the input to a moderately-sized MLP classifier that predicts a membership score.
  4. Return "member" if the average membership score over all neighbors is larger than some threshold.
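Below is a minimal sketch of the per-neighbor feature construction and scoring described in steps 3–4, assuming the neighbor texts, target-model losses, and 1024-dimensional embeddings have already been obtained; the classifier architecture, all function names, and the toy inputs are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

EMB_DIM = 1024  # Cohere Embedding v3 dimensionality mentioned in the summary

class MembershipMLP(nn.Module):
    """Moderately sized MLP mapping (loss diff, embedding diff) -> membership score."""
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(features)).squeeze(-1)

def smia_score(target_loss: float, target_emb: torch.Tensor,
               neighbor_losses: torch.Tensor, neighbor_embs: torch.Tensor,
               classifier: MembershipMLP) -> float:
    """Average the per-neighbor membership scores (steps 3-4 above)."""
    loss_diff = (neighbor_losses - target_loss).unsqueeze(-1)   # (N, 1)
    emb_diff = neighbor_embs - target_emb.unsqueeze(0)          # (N, EMB_DIM)
    features = torch.cat([loss_diff, emb_diff], dim=-1)         # (N, 1 + EMB_DIM)
    with torch.no_grad():
        return classifier(features).mean().item()

# Toy usage with random placeholders for the losses queried from the target LLM
# and the embeddings obtained from an embedding model.
clf = MembershipMLP()
score = smia_score(target_loss=2.1,
                   target_emb=torch.randn(EMB_DIM),
                   neighbor_losses=2.1 + 0.1 * torch.randn(25),
                   neighbor_embs=torch.randn(25, EMB_DIM),
                   classifier=clf)
print("member" if score > 0.5 else "non-member")   # threshold is attacker-chosen
```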

The paper evaluates SMIA for Pythia and GPT-NEO models on Wikipedia and MIMIR data. All results are reported in terms of AU-ROC; a few results also in terms of TPR @ FPR. For Wikipedia, the evaluation dataset uses Wikipedia text in the Pile training split as members, and one of two types of non-members: WT uses Wikipedia text from the Pile test split (before March 2020); WC uses Wikipedia pages published after August 2023 (not in the Pile training split). For MIMIR, the evaluation uses the highest n-gram overlap subsplit with various splits.

Strengths

The paper introduces SMIA and its intuition very thoroughly. Even though there is no code, the attack's description should be sufficient to reimplement SMIA and reproduce most reported results. I also find it very interesting that directions in embedding space seem to carry non-trivial membership signal and think that the idea of considering not only exact verbatim samples for MI in LLMs is important. Lastly, although MI for LLMs remains a hard task, this paper manages to achieve small but consistent improvements over most existing methods in some settings.

Weaknesses

Evaluation datasets measure distribution shift, not MI: Some evaluation datasets in this paper might be deeply flawed, in that they do not measure MI performance but just distribution shifts. For Wikipedia data, SMIA only yields a big improvement on the WC split, which uses temporally shifted Wikipedia articles as non-members. However, Das et al., 2024 show that simple "blind" baselines that classify membership just based on the data (without any signal from the target model) already achieve very high MIA success. Hence, SMIA could potentially just be a good distribution shift detector and the evaluation misleading. This is particularly concerning since the membership classifier is a NN, which might be particularly prone to overfitting on spurious correlations in the evaluation data. One way to alleviate this concern could be a blind baseline that uses inputs independent of the target model (e.g., random or constant loss). For MIMIR, this is less of a concern. However, there SMIA only achieves significantly better MIA success for GitHub data and Wikipedia data for one model; existing attacks perform similarly or better on all other subsplits (in terms of TPR @ 2% FPR).

AU-ROC alone is insufficient to evaluate MIAs: It has long been known that only reporting AU-ROC in the evaluation of MIAs is deeply flawed and misleading (Carlini et al., 2021). Yet, this paper uses only AU-ROC as its main metric, and only reports TPR@low FPR values for a very limited subset of results in App. C.1. In particular, the effects of deduplication (Table 6), varying number of neighbors (Table 7), and slight modifications (Table 3) are only reported in terms of AU-ROC, and can hence not be judged soundly. The paper should report TPR @ low FPR (e.g., 2% or 1%) together with full ROC curves as the main metric, and defer AU-ROC metrics to the appendix (or omit them entirely). NB: Consequently, while I looked at all tables in the main matter, I can only judge SMIA's performance by the values on Tables 4 and 5 (App. C.1).

Complex approach without ablation: SMIA is a complex approach that relies on many moving parts; yet, there is very little investigation about why every part is necessary. Hence, this paper requires a more thorough ablation study (e.g., what if the classifier only acts on losses/embedding directions, or what if the NN is replaced by a linear classifier?).

Other minor points/feedback:

  1. SMIA requires a known member and non-member subset from the training data distribution. This is a non-trivial assumption that requires some discussion.
  2. The lowest reported FPR threshold is 2%. This is still relatively high; it would be interesting to also report the attack's performance at lower FPRs such as 1% and 0.1% if feasible.
  3. I found two things in Figure 2 slightly confusing: i) the axis label, and ii) that all neighbors are larger on one axis; I understood that the similarity is somewhat independent of the parameter space? Also, e.g., x_1^m seems to be further from x^m than x^n. Right now, I think the figure does not carry much information and could be confusing, hence it might be dropped. Finally, there is a typo in the caption (taregt).
  4. Sec. 5.1 mostly repeats the results displayed in Table 1. I think the space could be used better by dropping this repetition and instead show some of the interesting studies currently in App. C.

Questions

  1. L218: Does the replacement actually happen in terms of words, or should it be tokens?
  2. For the neighbor generation, are the resulting neighbors all unique, or could it be that some neighbours are duplicates/equal to the original sample?
  3. Were T5 or the Cohere Embedding V3 model trained on some of the evaluation datasets? In particular, could it be that one of those models was only trained on the members for WC but not the non-members?
  4. L382--384: Why are the MIMIR splits constrained to only 1000 samples, especially before splitting?
  5. L391: The intuition about "Why SMIA Outperforms Other MIAs" mentions that using a neural network is one of the key success factors. However, I think simply using a neural network does not imply a stronger attack; could the authors elaborate this a bit more?
Comment

We thank the reviewer for their feedback. We provide responses for the weaknesses and questions raised below:


(w1)-Evaluation datasets measure distribution shift, not MI

To address the potential issue of our evaluation datasets measuring distribution shifts rather than true membership inference performance, we have added Section C.6 to our paper. In this section, we implement two blind attacks from Das et al. (2024): (a) greedy rare word selection and (b) bag-of-words classification, which operate without any signal from the target model. Our results, summarized in Table 13, show that these blind attacks achieve high AUC-ROC scores on the WC split (58.30% and 83.6%, respectively), indicating that WC is susceptible to distribution shift exploitation.
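As a point of reference, here is a minimal sketch of a bag-of-words blind baseline in the spirit of Das et al. (2024): a classifier fit purely on the candidate texts, with no signal from the target model. The toy corpora below are deliberately distribution-shifted placeholders so that the baseline succeeds, mirroring the WC behavior described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy, deliberately shifted corpora standing in for known members / non-members
# (e.g., pre-cutoff vs. post-cutoff Wikipedia text in a WC-style split).
member_texts = [f"older wikipedia article {i} about a historical event" for i in range(100)]
nonmember_texts = [f"newer wikipedia article {i} about a recent announcement" for i in range(100)]
texts = member_texts + nonmember_texts
labels = [1] * len(member_texts) + [0] * len(nonmember_texts)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels)

vectorizer = CountVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

scores = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
print(f"Blind bag-of-words AUC-ROC: {roc_auc_score(y_test, scores):.3f}")
# An AUC far above 0.5 with no target-model signal means the split itself is
# separable, i.e., the benchmark measures distribution shift rather than membership.
```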

As we mention in Section 5.1, WC contains distribution-shifted data, and we discuss why different MIAs achieve good results on this dataset due to the distinct distributions of members and non-members. Our main evaluations focus on the WT dataset, which provides a realistic and challenging scenario where members and non-members come from closely aligned distributions. Evaluations on WC are included for comprehensiveness but are not the primary basis for our conclusions.

On the WT split, the same blind attacks perform no better than random guessing, with AUC-ROC scores around 52% and TPRs at low FPRs near zero. This stark contrast emphasizes the limitations of datasets like WC for evaluating MIAs, as their success is largely driven by distributional differences rather than true privacy leakage. Importantly, our proposed SMIA method maintains strong performance on the WT dataset (as shown in Tables 7 and 8), effectively detecting membership where blind attacks fail. This indicates that SMIA is not merely capturing distribution shifts but is robustly inferring membership information based on the model's learned representations, thereby addressing true privacy leakage concerns.


(w2)- AU-ROC alone is insufficient to evaluate MIAs

We agree that relying solely on AU-ROC can be insufficient and potentially misleading for evaluating membership inference attacks. In our paper, we provide TPR at low FPR values for our main results in Tables 4 and 5, and we have now added TPR@1% FPR to Table 4 for a more comprehensive evaluation. While previous works like Min-K and Min-K++ did not report TPR for FPRs lower than 5%—possibly due to the low absolute values—we recognize the value of including these metrics to facilitate a thorough comparison. From Table 4, it is evident that SMIA achieves the best results in most scenarios, especially at low FPR thresholds, demonstrating its effectiveness over existing methods. We will update the other tables to include TPR@1% FPR values in the main paper to enhance transparency.
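For reference, a small helper showing one standard way to read TPR at a fixed low FPR off attack scores; the synthetic labels and scores below are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels: np.ndarray, scores: np.ndarray, target_fpr: float) -> float:
    """Largest TPR achievable while keeping FPR at or below target_fpr."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(tpr[fpr <= target_fpr].max())

# Synthetic attack scores: members (label 1) score slightly higher on average.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
scores = np.concatenate([rng.normal(0.3, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])

for f in (0.02, 0.01, 0.001):
    print(f"TPR @ {f:.1%} FPR: {tpr_at_fpr(labels, scores, f):.3f}")
```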


(w3)- Complex approach without ablation

We have added Section C.7 to present additional ablation experiments that demonstrate the robustness of SMIA under variations in both embedding models and classifier network sizes. Specifically, Table 10 includes results using the E5-mistral-7b-instruct embedding model alongside the original Cohere v3 model, showing consistent performance across different embedding models. Additionally, Table 11 explores the impact of varying classifier network sizes by comparing the original network with a smaller linear classifier and a larger network with additional fully connected layers. These results collectively highlight SMIA’s adaptability to different model configurations.


(w4)- SMIA requires a known member and non-member subset from the training data distribution. This is a non-trivial assumption that requires some discussion.

While SMIA does require access to known member and non-member subsets from the training data distribution, we believe this assumption is realistic in several practical scenarios. In privacy auditing or unlearning verification, for instance, auditors or red team members often have access to portions of both member and non-member data to evaluate a model's compliance with privacy standards. Even in general attack settings, adversaries can leverage knowledge about the model's training cutoff date to infer likely members and non-members—for example, assuming that data published before the cutoff (e.g., Wikipedia articles, arXiv papers, GitHub repositories) are included in the training set, while newer data are not. We discuss this in Section 3.2, illustrating how an adversary might collect such data for SMIA. Therefore, while the requirement is non-trivial, it reflects practical conditions under which membership inference attacks are relevant and actionable.


Comment

(w5)- I found two things in Figure 2 slightly confusing: i) the axis label, and ii) that all neighbors are larger on one axis; I understood that the similarity is somewhat independent of the parameter space? ...

In this figure, the y-axis represents the loss values assigned by the model to members, non-members, and their neighbors, illustrating the model's behavior. The x-axis is a compressed representation of the semantic differences between data points, capturing how semantically similar or different the inputs are. We acknowledge that neighbors can indeed have varying positions on both axes; sometimes, a neighbor generated by word replacement may result in higher or lower loss values and semantic differences. To address your concerns, we have revised the figure by moving x_1^m further from x^n to better depict these variations and enhance clarity. We believe this figure is informative as it visually demonstrates the two key features that SMIA leverages in a simplified scenario. Additionally, we have fixed the typo in the caption.


(q1)- Does the replacement actually happen in terms of words, or should it be tokens?

In our method, the replacements are performed at the word level rather than at the token level. Replacing at the token level can lead to inconsistent or nonsensical content because tokens may represent subword units or fragments of words. Replacing only part of such a tokenized word would result in meaningless or grammatically incorrect text. Therefore, following previous works [Mattern et al. (2023) and Mitchell et al. (2023)], we replace entire words with other words or phrases generated by a T5 model. This approach ensures that the modified text remains coherent and semantically meaningful, which is crucial for maintaining the integrity of the data and the validity of our semantic analysis.
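A minimal sketch of word-level mask-and-infill with a T5 span-corruption checkpoint, in the spirit of the neighbor generation described above; it uses t5-small for brevity (the paper reportedly uses T5 3B), and the single-word masking and output parsing are simplified assumptions rather than the authors' exact procedure.

```python
import random
import re
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-small as a lightweight stand-in; the paper reportedly uses a T5 3B model.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def make_neighbor(text: str) -> str:
    """Mask one random word and let T5 infill it (simplified neighbor generation)."""
    words = text.split()
    idx = random.randrange(len(words))
    masked = " ".join(words[:idx] + ["<extra_id_0>"] + words[idx + 1:])
    inputs = tokenizer(masked, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5, do_sample=True, top_k=50)
    decoded = tokenizer.decode(output[0], skip_special_tokens=False)
    # Keep whatever T5 proposes for the first sentinel span; fall back to the original word.
    match = re.search(r"<extra_id_0>(.*?)(?:<extra_id_1>|</s>)", decoded)
    filler = match.group(1).strip() if match else ""
    words[idx] = filler or words[idx]
    return " ".join(words)

print(make_neighbor("The committee approved the proposal after a lengthy debate"))
```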


(q2)- For the neighbor generation, are the resulting neighbors all unique, or could it be that some neighbours are duplicates/equal to the original sample?

In our approach, the generated neighbors are generally unique and differ from the original samples. While it is theoretically possible for the word replacement process to select the same word, this is highly unlikely. We use a T5 model to perform word replacements, which typically generates semantically similar but different words, leading to unique neighbors. For instance, as shown in Figure 5(b), we provide an example where a neighbor generated for a Wikipedia sample includes a replaced word that differs from the original, creating a distinct and meaningful variation for our analysis.


(q3)- Were T5 or the Cohere Embedding V3 model trained on some of the evaluation datasets? In particular, could it be that one of those models was only trained on the members for WC but not the non-members?

We are not aware of the specific datasets these models were trained on, as their training data are not publicly disclosed in detail. However, we utilize these models as general-purpose tools for word replacement and embedding generation, assuming they provide broad semantic capabilities rather than being tailored to specific datasets. To address the concern that these models might have been trained predominantly on the member data and not the non-members—potentially influencing our results—we conducted additional experiments using a different embedding model, E5-mistral-7b-instruct, as presented in Section C.7. The consistent performance of SMIA with this alternative embedding model demonstrates the robustness of our method irrespective of the specific embedding model used. This suggests that our findings are not significantly affected by the training data of the embedding models, and SMIA remains effective even when the embeddings are generated from models with different training histories.
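To illustrate the kind of embedding-model swap used in this ablation, here is a minimal sketch with a small public E5 checkpoint standing in for Cohere v3 or E5-mistral-7b-instruct; the model choice, the "query:" prefixes, and the example texts are assumptions for demonstration only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small public E5 checkpoint as a stand-in for Cohere v3 / E5-mistral-7b-instruct.
encoder = SentenceTransformer("intfloat/e5-small-v2")

# E5 models expect a "query:" (or "passage:") prefix on inputs.
target = "query: The article traces the early history of the city archives."
neighbor = "query: The article traces the early history of the municipal archives."

embeddings = encoder.encode([target, neighbor], normalize_embeddings=True)
direction = embeddings[1] - embeddings[0]             # embedding-difference feature
cosine = float(np.dot(embeddings[0], embeddings[1]))  # neighbors should stay close
print(direction.shape, round(cosine, 3))
```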


(q4)- Why are the MIMIR splits constrained to only 1000 samples, especially before splitting?

We utilized the MIMIR dataset exactly as it is provided, without imposing any additional constraints on the number of samples. Each subsplit in the MIMIR dataset consists of 1,000 member samples and 1,000 non-member samples, as specified in its documentation on HuggingFace. We did not filter out or exclude any samples in our experiments; instead, we used all the data available in each subsplit.


Comment

(q5)- The intuition about "Why SMIA Outperforms Other MIAs" mentions that using a neural network is one of the key success factors. However, I think simply using a neural network does not imply a stronger attack; could the authors elaborate this a bit more?

We agree that simply using a neural network does not inherently result in a stronger attack. In the context of SMIA, the key to its high performance lies in leveraging semantic embeddings of the input texts—making it the first MIA against LLMs to utilize input semantics. The neural network serves as an effective tool for integrating these high-dimensional embeddings with the model's loss values into a unified feature space. This allows the classifier to capture complex patterns and subtle differences between members and non-members that simpler models might miss. As demonstrated by Nasr et al. (2019), neural networks can extract informative features in membership inference attacks. Therefore, it is the combination of utilizing semantic embeddings and the neural network's capacity to model intricate relationships in the data that contributes to SMIA's superior performance, rather than the use of a neural network alone.


Comment

I thank the authors for their detailed response, updates, and addressing all my questions.

While I appreciate the authors' efforts, and the updates regarding weaknesses w3, w4, and w5 do convince me, the two largest weaknesses (w1 and w2) still persist. I hence cannot recommend accepting this paper and keep my score.

Re weakness 1 (Evaluation datasets measure distribution shift, not MI): I appreciate the "blind baselines". However, the flawed "WC" baseline is still included in the paper and used to "sell" the method. The authors mention "Evaluations on WC are included for comprehensiveness but are not the primary basis for our conclusions.", but a reader just skimming the paper and not carefully studying the appendix will miss this. While including the flawed WC dataset in the original version can be seen as an accidental mistake, leaving it in the paper despite knowing its flaws leaves a bad taste in my mouth.

Re weakness 2 (reliance on AUROC as the main metric): While I appreciate the authors' adding TPR@0.1% FPR to Table 4, most problems persist. The issue is not just the omission of TPR@low FPR; it is also the focus on AUROC as the main metric. Yet, the revised paper still seems to "sell" the method by only reporting AUROC in the main matter, does not provide TPR@low FPR values for most results, and hides TPR@low FPR in the appendix. While I agree that other MIA papers (especially for LLMs) also make similar mistakes, this does not make it less of a mistake.

Comment

Thank you for your thoughtful feedback and for acknowledging our efforts in addressing your previous concerns. We appreciate your time and constructive criticism, which helps us improve our work.

We understand your concern about including the WC dataset in our evaluations. Our intention was to provide a comprehensive analysis by including both challenging datasets like WT and the four splits of MIMIR—which are our primary focus for assessing true membership inference performance—and datasets like WC to illustrate the behavior of MIAs in scenarios with distributional shifts.

In the main body of the paper, we have emphasized that WT and MIMIR are our main datasets for evaluating SMIA's effectiveness in realistic settings. Including the WC dataset was meant to offer additional insights into how MIAs perform when there is a distributional difference between members and non-members. We believe that presenting results on the WC dataset provides readers with valuable information about the performance of MIAs in such contexts.

In Tables 2 and 4, we provide the performance of all baseline attacks on the WC dataset to demonstrate that even in a dataset with distributional shifts, SMIA outperforms other methods. For example, on WC, SMIA achieves a TPR@1% FPR of 36.4%, while the LOSS attack achieves 6.9% and Min-K++ achieves 18%. This comparison highlights SMIA's effectiveness relative to existing baselines, even in different scenarios.

Comment

I thank the authors for their prompt response. However, it does not change the aforementioned concerns and I will have to keep my score.

Comment

After reading the other reviewers' comments and due to no response from the authors, I keep my verdict and recommend rejecting this paper. The authors have now responded; see below.

Review (Rating: 3)

The paper introduces the Semantic Membership Inference Attack (SMIA), a novel approach that conducts Membership Inference Attacks by leveraging the semantic content of inputs and their perturbations. SMIA trains a model to detect variations in model behavior between members and non-members by analyzing how output probability distributions change with input perturbations. Evaluated on two models and two datasets, SMIA outperforms existing MIA techniques, achieving higher AUC and TPR at low FPR in detecting membership.

Strengths

  1. The proposed method shows good novelty.

  2. The proposed method performs well in terms of both AUC and TPR at low FPR.

  3. Owing to its design, the proposed method successfully identifies membership even when the data undergoes slight modifications.

Weaknesses

  1. Though the authors claim the proposed method is designed for grey-box models, there are no direct experiments on real grey-box models.

  2. The ablation experiments are not very comprehensive. For example, the authors could report results when changing the embedding model or the classification model.

  3. There is no intuitive explanation of the proposed method, e.g., how the loss trend on members differs from the loss trend on non-members.

Questions

  1. Could the authors provide some examples of grey-box models?

  2. How does the performance of SMIA change when the embedding model or the classification model is changed?

  3. What is the general behavior of the loss on members? For example, will the loss on members change less than on non-members across neighbours?

Comment

Considering the weaknesses raised by other reviewers and the authors' inactivity, I would like to adjust my score to 3.

Comment

We thank the reviewer for their feedback. We provide responses for the weaknesses and questions raised below:


(w1 & q1)- Could the authors provide some examples of grey-box models?

In our work, the grey-box scenario refers to a threat model where the adversary has access only to the loss values returned by the target model for texts of their choosing. This setting is practical for models accessed via APIs that provide limited feedback, such as loss or perplexity scores, without exposing internal parameters or detailed outputs like per-token logits. Examples of grey-box models include cloud-based language services and commercial language models that offer summary metrics but restrict deeper access to prevent reverse engineering or misuse. Our attack, SMIA, is designed to operate effectively under these conditions, relying solely on loss values. This contrasts with prior works that require white-box access (e.g., [AA, BB]) or exact per-token logits (e.g., MinK and MinK++). In all our experiments, we simulate this grey-box scenario by using only the loss values from the target models, demonstrating the practicality and applicability of SMIA in real-world settings.

[AA] : Milad Nasr, Reza Shokri, and Amir Houmansadr. "Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning". In 2019 IEEE symposium on security and privacy (SP)

[BB] Suri, Anshuman, Xiao Zhang, and David Evans. "Do Parameters Reveal More than Loss for Membership Inference?." arXiv preprint arXiv:2406.11544 (2024).


(w2 & q2)- How will the performance of SMIA change if changing the embedding model or classification models?

We have added Section C.7 to present additional ablation experiments that demonstrate the robustness of SMIA under variations in both embedding models and classifier network sizes. Specifically, Table 9 includes results using the E5-mistral-7b-instruct embedding model alongside the original Cohere v3 model, showing consistent performance across different embedding models. Additionally, Table 10 explores the impact of varying classifier network sizes by comparing the original network with a smaller linear classifier and a larger network with additional fully connected layers. These results collectively highlight SMIA’s adaptability to different model configurations.


(w3 & q3)- What is the general behavior of loss from members? For example, will the loss from members change less than non-members considering neighbours?

In MIAs, it is commonly observed that target models assign lower loss values to members due to overfitting, resulting in a distinct loss trend compared to non-members. We provide these baseline results as the LOSS attack in our experiments. However, relying solely on loss values can be problematic because certain non-member inputs—such as short or repetitive texts—may also receive low loss values, leading to false positives. Our proposed method, SMIA, addresses this issue by incorporating semantic embeddings of the text. By leveraging the semantic information, SMIA captures the contextual relationships between data points, allowing us to more effectively distinguish between members and non-members. Specifically, we analyze how the loss values change across the semantic neighbors of each input. Members tend to have loss patterns that are not only lower but also more consistent within their semantic neighborhoods. This approach enhances the attack's precision by considering both the loss and the semantic context.
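For context, here is a minimal sketch of the per-sample loss a grey-box attacker can query, which underlies both the LOSS baseline and SMIA's loss feature; the small Pythia checkpoint and the threshold value are illustrative stand-ins, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"   # small stand-in for the target LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def sample_loss(text: str) -> float:
    """Average per-token negative log-likelihood the target model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    return model(**enc, labels=enc["input_ids"]).loss.item()

loss = sample_loss("Membership inference asks whether a sample was part of the training set.")
print(f"sample loss = {loss:.3f}")

# LOSS baseline: predict "member" when the loss falls below a calibrated threshold.
is_member = loss < 3.0   # threshold would be tuned on known members / non-members
```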


Review (Rating: 5)

This paper introduces SMIA (Semantic Membership Inference Attack), an innovative approach to membership inference attacks against LLMs that leverages semantic analysis. The key insight is that LLMs exhibit distinct behavioral patterns when processing semantically similar variants of their training data versus unseen data. SMIA capitalizes on this by generating semantic neighbours for input texts, analyzing how the target model responds to these variations, and training a neural network to detect membership based on these response patterns. The authors evaluate SMIA extensively on the Pythia and GPT-Neo model families using Wikipedia and MIMIR datasets, demonstrating significant improvements over existing approaches - notably achieving an AUC-ROC of 67.39% on Pythia-12B compared to previous best of 58.90%.

Strengths

The authors present a well-designed pipeline that combines neighbour generation using masked language models, semantic embedding analysis via the Cohere model, and neural network classification. This comprehensive approach allows SMIA to detect both exact matches and semantically similar content, representing a significant advancement over existing methods. The experimental evaluation is particularly thorough, examining performance across different model sizes, architectures, and datasets. The authors carefully consider the impact of dataset distribution by using two different non-member datasets - one from the same distribution (Wikipedia Test) and another from a different time period (Wikipedia Cutoff). This reveals important insights about how data distribution affects attack performance. Additionally, the analysis of modified text scenarios (through word additions, deletions, and duplications) provides practical insights into the method's robustness.

Weaknesses

  1. The choice of key hyperparameters, particularly the use of 25 neighbours, lacks thorough clarification. While Table 7 shows performance improvements with increasing neighbour count, there's no clear analysis of the trade-off between computational cost and performance gain. The paper should examine the diminishing returns beyond 25 neighbours and justify why this specific number optimally balances effectiveness and efficiency.
  2. A concerning weakness emerges in the method's inconsistent performance across different types of text modifications. As shown in Table 3, while SMIA achieves an AUC-ROC of 62.47% for word deletions on Pythia-12B (WT dataset), it drops to 55.13% and 54.19% for word duplications and additions respectively. This significant performance gap suggests an inherent bias in the method's ability to handle different types of text alterations. The authors don't adequately explain this asymmetry or propose potential solutions.
  3. A minor problem is that the heavy reliance on the Cohere Embedding model, a third-party service, introduces both a potential point of failure and a privacy concern - users must share their data with an external service to generate embeddings. The authors don't explore alternative embedding approaches or analyze how the choice of embedding model impacts performance. Furthermore, the paper lacks a comprehensive analysis of the cost implications of using such commercial services at scale.
  4. The method shows notably better performance with word deletions compared to additions or duplications, as evidenced in Table 3. For instance, with Pythia-12B on the Wikipedia Test dataset, SMIA achieves an AUC-ROC of 62.47% for word deletions but only 54.19% for additions. This behavioral asymmetry suggests fundamental limitations in how SMIA handles different types of semantic modifications, yet the paper offers limited analysis of why this occurs or how it might be addressed. This becomes particularly relevant when considering real-world scenarios where adversaries might deliberately modify texts in ways that exploit these weaknesses.
  5. While authors demonstrate the attack's effectiveness, they provide minimal insight into how model owners might protect against such attacks. This omission is particularly notable given the paper's focus on privacy implications and its potential impact on real-world applications of LLMs. A more comprehensive treatment of potential countermeasures would significantly enhance the paper's practical value and provide important context for the security community.

Questions

  1. The scalability of SMIA deserves further exploration - how would the method perform with larger training datasets and more complex model architectures? The current evaluation uses 6,000 members and non-members each, which may not fully reflect real-world scenarios.
  2. The choice of neural network architecture for the SMIA model itself appears somewhat arbitrary. Have the authors explored alternative architectures that might improve performance or efficiency? Additionally, how sensitive is the method to the quality and dimensionality of the semantic embeddings?
  3. How might defenders mitigate against SMIA?
Comment

(q1)- The scalability of SMIA deserves further exploration - how would the method perform with larger training datasets and more complex model architectures? The current evaluation uses 6,000 members and non-members each, which may not fully reflect real-world scenarios.

In Figure 3, we investigated the impact of varying training dataset sizes on the performance of SMIA by experimenting with 6,000/6,000, 4,000/4,000, 2,000/2,000, and 1,000/1,000 members and non-members. The results demonstrate that increasing the size of the training data enhances the effectiveness of SMIA, indicating its potential scalability. However, we acknowledge that scaling to much larger datasets and more complex models would entail significantly higher computational costs, particularly for neighbor generation and embedding calculations. Addressing these computational challenges and extending our evaluation to larger, real-world scenarios is an important direction for future work.


(q2)- The choice of neural network architecture for the SMIA model itself appears somewhat arbitrary. Have the authors explored alternative architectures that might improve performance or efficiency? Additionally, how sensitive is the method to the quality and dimensionality of the semantic embeddings?

We based our architecture on the work of Nasr et al. (2019), which has proven effective in white-box MIAs against deep neural networks in vision tasks. To explore alternative architectures and assess sensitivity to embedding quality and dimensionality, we have added Section C.7 with additional ablation experiments. Specifically, Table 9 presents results using the E5-mistral-7b-instruct embedding model (with 4096 dimensions) alongside the original Cohere v3 model (with 1024 dimensions), demonstrating consistent performance across different embeddings. Additionally, Table 10 investigates the impact of varying classifier network sizes by comparing our original network with a smaller linear classifier and a larger network with extra fully connected layers. These findings collectively highlight SMIA's robustness and adaptability to different model configurations, suggesting that its performance is not overly sensitive to the choice of neural network architecture or embedding model used.

Comment

We thank the reviewer for their feedback. We provide responses for the weaknesses and questions raised below:


(w1)- The choice of key hyperparameters, particularly the use of 25 neighbours ...

In Section B of our paper, we provide a detailed cost estimation that encompasses generating neighbors, computing embeddings, and evaluating loss values for the target model. Our experiments, as shown in Table 7, indicate that performance improves with an increasing number of neighbors up to 25. Beyond this point, we observed diminishing returns in performance gains while computational costs continue to rise significantly. Specifically, utilizing more than 25 neighbors would necessitate larger training datasets to capitalize on the additional information, which leads to increased computational overhead without proportional benefits. Therefore, we selected 25 neighbors as it strikes an optimal balance between effectiveness and efficiency, offering substantial performance improvements without incurring excessive computational costs.


(w2 & w4)- A concerning weakness emerges in the method's inconsistent performance across different types of text modifications .... The method shows notably better performance with word deletions compared to additions or duplications, as evidenced in Table 3 ...

This discrepancy stems from how these modifications affect the coherence and consistency of the input texts, which in turn influences the model's loss values. In the case of word duplication, we duplicate an exact word consecutively within the text. This repetition can introduce redundancy without significantly altering the semantic meaning, which may substantially impact the model's loss values. Consequently, the loss patterns between members and non-members become less distinguishable, affecting SMIA's ability to infer membership accurately.

For word additions, we currently insert only the first word generated by the T5 model into the original text. This approach can lead to incomplete or out-of-context additions, disrupting the text's coherence and causing inconsistent loss values when processed by the target model. Such inconsistencies make it challenging for SMIA to detect meaningful patterns for membership inference.

In contrast, word deletions often result in more coherent sentences, as the removal of a word may not drastically alter the overall meaning of the text. This maintains a consistent loss landscape, allowing SMIA to more effectively differentiate between members and non-members based on their loss and semantic relationships.


(w3)- A minor problem is the heavy reliance on the Cohere Embedding model, a third-party service, introduces both a potential point of failure and a privacy concern ...

In response, we have added Section C.7 to our paper, where we present additional ablation experiments demonstrating the robustness of SMIA under different embedding models and classifier network sizes. Specifically, Table 9 includes results using the E5-mistral-7b-instruct embedding model alongside the original Cohere v3 model, showing consistent performance across different embeddings. This indicates that SMIA does not rely heavily on any specific third-party service, thereby mitigating potential points of failure and privacy concerns associated with data sharing. Furthermore, we have provided a comprehensive cost analysis in Section B, detailing the computational expenses of generating neighbors, computing embeddings, and evaluating loss values for the target model. This addresses the cost implications of using such services at scale and demonstrates the practicality of our approach.


(w5 & q3)- How might defenders mitigate against SMIA?

While our paper primarily focuses on demonstrating the effectiveness of SMIA to highlight existing vulnerabilities in large language models, we acknowledge the importance of discussing potential defenses against such attacks. However, an in-depth exploration of countermeasures such as Differential Privacy (DP)—the main line of defense against membership inference attacks—is beyond the scope of our current work. We consider this an important direction for future research and defer a comprehensive treatment of potential defenses to future studies.


Review (Rating: 3)

This paper studies the problem of membership inference in large language models by proposing "semantic MIAs". The key idea is to use various perturbations to the input text, then analyze the behavior of a model on these points, train an auxiliary model on these behaviors, and finally use that model to predict membership. The authors show that SMIAs outperform prior MIAs on different benchmarks.

Strengths

  1. The paper does a good job at explaining their methodology, which considers the fact that learning a classifier on top of LLM behaviors can allow learning membership signals well.
  2. The work demonstrates strong gains over past MIAs across multiple benchmark datasets.

Weaknesses

There are various works in the literature at this point which have argued that "WC"-style assessment of MIAs, where data is split across a cut-off date, is not sound. This work bases many of its gains on that setting.

[1] Do Membership Inference Attacks Work on Large Language Models? https://arxiv.org/abs/2402.07841.
[2] LLM Dataset Inference: Did you train on my dataset? https://arxiv.org/abs/2406.06443.
[3] Blind Baselines Beat Membership Inference Attacks for Foundation Models. https://arxiv.org/abs/2406.16201.

Key weaknesses:

  1. The method should be tested across all 20 train-test splits of Pile. Paper [2] on LLM Dataset Inference particularly shows how most MIAs perform well only on a few datasets.
  2. The blind baseline of n-gram is not considered in this work (paper [3]). This is an important comparison to understand if the gains in this work are meaningful.
  3. Given that the goal of MI in LLMs is hard, I would love to see experiments on dataset inference.
  4. The idea of learning an auxiliary classifier on top of the membership signal from neighbours seems to have been explored in [2].

Questions

  • How does the member data of the embedding model impact the performance of SMIA?
Comment

We thank the reviewer for their feedback. We provide responses for the weaknesses and questions raised below:


(w1)- There are various works at this point in literature which have argued that "WC" based assessment of MIAs when data is split across a cut-off date, is not sound. This work bases many of their gains on that setting.

We agree that evaluating membership inference attacks using datasets split across a cut-off date (e.g., 'WC' dataset) can lead to unsound assessments due to temporal distribution shifts. To address this concern, our main experiments, as detailed in Sections 4.2.1 and 5.1, are conducted using the 'WT' dataset. In 'WT', both members and non-members are drawn from the same distribution: members come from the training split of the Pile dataset, and non-members are from the test split of Pile. This ensures that the data distributions for members and non-members are consistent, providing a sound basis for evaluating MIAs.

We included the 'WC' dataset to demonstrate how temporal shifts can lead to misleading conclusions about privacy leakage. As reported in Section 5.1, we observed significant differences in MIA performance between the 'WT' and 'WC' datasets. For instance, the SMIA method achieves an AUC-ROC of 67.39% on 'WT' but increases to 93.35% on 'WC' for the Pythia-12B model. Similarly, the true positive rates at low false positive rates improve markedly on 'WC' compared to 'WT' (e.g., TPR of 3.8% vs. 46.2% at 2% FPR), which is consistent with the findings of the cited papers [1, 3].


(w2)- The method should be tested across all 20 train-test splits of Pile. Paper [2] on LLM Dataset Inference particularly shows how most MIAs perform well only on a few datasets.... Given that the goal of MI in LLMs is hard, I would love to see experiments on dataset inference.

We understand the importance of evaluating our method across diverse datasets. However, we'd like to clarify a key distinction: The paper you mentioned introduces a dataset inference attack aimed at detecting whether an entire dataset was used in a model's training. In contrast, our research focuses on membership inference attacks targeting individual training samples of much smaller size (between 130 to 150 words). These are fundamentally different problems: dataset inference operates at the dataset level, while our MIA operates at the individual sample level. Because of this difference, the authors in [2] evaluated their approach on all train-test splits of the Pile to suit their specific problem.

Testing our method across all train-test splits of the Pile would be computationally infeasible due to the sheer scale of the data. Additionally, some splits contain copyrighted material that we cannot use due to legal and ethical considerations. To the best of our knowledge, no prior work on MIAs has considered all splits of the Pile for individual sample inference, largely because of these constraints.

Most existing works in this area utilize the MIMIR dataset, which we also employ in our study. For our experiments, we selected four significant and commonly used subsets of the Pile: Wikipedia (en), GitHub, PubMed Central, and ArXiv. Notably, Figure 3 in the cited paper [2] shows that these four datasets are the hardest to analyze, with AUC-ROCs close to 50%, indicating performance near random chance.


(w3)- The blind baseline of n-gram is not considered in this work (paper [3]). This is an important comparison to understand if the gains in this work are meaningful.

To address this, we have added Section C.6 to our paper, where we implement two blind attacks from Das et al. (2024): (a) greedy rare word selection (n-gram approach) and (b) bag-of-words classification. These attacks operate without any signal from the target model, and serve to evaluate whether our gains are meaningful beyond exploiting distribution shifts. Our results, summarized in Table 8, show that these blind attacks achieve high AUC-ROC scores on the WC split (58.30% and 83.6%, respectively), indicating that WC is susceptible to exploitation due to distributional differences between members and non-members. As discussed in Section 5.1, WC contains distribution-shifted data, and we acknowledge that this can inflate the performance of MIAs that exploit such shifts. Our main evaluations focus on the WT dataset, where members and non-members are from closely aligned distributions. On the WT split, the same blind attacks perform no better than random guessing, with AUC-ROC scores around 52% and TPRs at low FPRs near zero. In contrast, our proposed SMIA method maintains strong performance on WT (as shown in Tables 7 and 8), effectively detecting membership where blind attacks fail. This demonstrates that SMIA is not merely capturing distribution shifts but is robustly inferring membership information based on the model's learned representations, thereby addressing true privacy leakage concerns.

Comment

(w4)- The idea of learning an auxiliary classifier on top of the membership signal from neighbours seems to have been explored in [2]

The cited paper [2] employs simple attacks such as n-gram statistical models and bag-of-words classifiers for membership inference. These methods rely on basic textual features and do not capture the deeper semantic relationships or the behavior of the target model. In the above experiment (showing that WT is a sound evaluation dataset), we demonstrate that these attacks are ineffective on our datasets, yielding results close to random guessing (AUC-ROC scores close to 50%). In contrast, our SMIA model leverages the semantic content of the input data and the behavior of the target model as input features for the auxiliary classifier. This allows us to capture more nuanced patterns that are indicative of membership, leading to a more effective attack.

The work most closely related to ours is actually [AA]. In Section 2 of our paper, we discuss the differences between our approach and that of [AA]. Specifically, while [AA] also trains an auxiliary classifier for membership inference, their work focuses on image classification models and operates in a white-box setting with access to the model's gradients. Our method is designed for LLMs and operates in a gray-box setting, relying solely on input-output behavior without access to model internals.

[AA] : Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE symposium on security and privacy (SP)


(q1)- How does the member data of the embedding model impact the performance of the SMIA

In our experiments, we utilized the Cohere v3 model to generate embeddings for the input samples. The training data for Cohere v3 is not publicly disclosed, so we do not have specific information about whether it includes our member data. As a result, we cannot definitively assess how the presence or absence of member data in the embedding model's training set influences the performance of the SMIA.

However, we operate under the assumption that large pre-trained models like Cohere v3 provide robust and generalizable embeddings for a wide range of textual inputs, regardless of whether they have encountered those exact samples during training. These models are designed to capture semantic relationships and produce meaningful representations even for novel inputs.

Our rationale is that the embeddings generated by Cohere v3 effectively capture the semantic content and nuances of the target samples, which is crucial for the SMIA to function correctly. Since the embedding model aims to generalize well across different data, we expect it to produce consistent embeddings for both member and non-member samples.


Comment

Dear Authors,

Thank you for your detailed and thoughtful rebuttal. I appreciate the care you've taken to address each concern systematically. Let me share my thoughts on your responses:

Regarding w1 (WC-based assessment):
I appreciate your acknowledgment of the potential issues with WC-based evaluation and your clarification about the primary focus on WT datasets. However, I remain concerned that the paper's current structure and results presentation might still lead readers to draw conclusions from the WC results. I suggest:

  • Moving the WC results entirely to an appendix
  • Adding explicit warnings about interpretation of WC results in all relevant figure/table captions
  • Strengthening the emphasis on WT as the primary evaluation metric in the abstract and introduction
  • Make the nomenclature for WT and WC absolutely clear. It should not be hard to find their meaning.

Regarding w2 (Testing across Pile splits):
I acknowledge your explanation about the fundamental difference between dataset-level and sample-level inference. My comment was in regard to Fig 3, which you point to later. I am not sure why this should be computationally expensive. Could you take a 1000-example subset of train and test to do this?

Regarding w3 (Blind baseline comparison):
I commend you on adding Section C.6 with the blind baseline comparisons. However, I have some important recommendations:

  • It is important to move these results to the main paper rather than the appendix. This is a critical baseline. The independent table in appendix offers no comparative discussion to the average reader.
  • Add a brief discussion of why your method succeeds where blind attacks fail
  • Include error analysis for cases where your method outperforms the baselines

Regarding the use of MIMIR dataset:
Please check the line below Table 2 in this paper. The authors explicitly mention: "We clarify that this step is not a suggestion for researchers to alter their benchmarks."

In conclusion: This method might very well be promising. But the current presentation and the focus on irrelevant and unsound benchmarks lead to the authors hurting their own paper and its promise. This work needs to be re-written with a focus on the right practices for membership inference research, with a focus on the right train-test splits. I would strongly encourage the authors to extend this to multiple Pile subsets following the recommendations in [2], to ignore the use of the MIMIR dataset as suggested by [1], and to include blind baselines in the main paper as suggested in [3]. On top of this, WC should only see a place in the Appendix, if at all.

Comment

Thank you for your thoughtful and detailed feedback. We greatly appreciate the time you've taken to help us improve our paper. While we are unable to update the draft (the deadline has passed), we will incorporate your suggestions in the next version.

We understand your concerns about the potential misinterpretation of the WC results. To address this, we will move all WC results to the appendix to emphasize that our primary focus is on the WT datasets. We will also add clear warnings in all relevant figure and table captions about the limitations of interpreting WC results.

Regarding testing across different Pile splits, we agree that this is important. However, our main obstacle is that we cannot access some of the Pile subsplits due to copyright regulations. Additionally, [2] applies 52 different membership inference attacks in stage one of their methodology, which is computationally intensive and poses a challenge for replication within our resource constraints.

Thank you for acknowledging our efforts in including blind baseline comparisons. We will move the results from Section C.6 in the appendix to the main body (Section 4.2) to highlight their importance. Our rationale for using MIMIR dataset was to establish a consistent benchmark, given the subjective selection of Pile splits in different studies. We chose a challenging setting with over 80% overlap in 13-grams to push the limits of our method, compared to other works that use settings with 20% overlap in 7-grams.

We are committed to enhancing our paper by adopting the right practices and addressing all the points you've raised. We kindly request that you consider these planned improvements when evaluating our submission.

Comment

We sincerely appreciate the time and effort you invested in reading our paper and providing feedback. We would like to inform you that we have updated our draft in response to the valuable feedback provided during the review process. The new additions are highlighted in blue in the revised draft.

The main changes are:

  • 1. Section C.6 – Evaluating Dataset Complexity of WT and WC against MIAs: We have implemented two blind attacks from Das et al. (2024) to assess the complexity of the WT and WC datasets in the context of membership inference attacks. This evaluation demonstrates that the WT dataset presents a challenging and appropriate setting for our main evaluations of SMIA, while highlighting that the WC dataset is susceptible to distributional shifts that can be exploited by blind attacks. These findings reinforce the robustness of our approach and validate our choice of the WT dataset for primary evaluations.

  • 2. Section C.7 – Ablation Study on SMIA Performance with Different Embedding Models and Classifier Networks: We have conducted additional ablation experiments to examine the robustness of SMIA under variations in embedding models and classifier network architectures. Specifically, we present results using the E5-mistral-7b-instruct embedding model alongside the original Cohere v3 model, showing consistent performance across different embeddings. Additionally, we explore the impact of varying the size of the classifier network by comparing the original architecture with both smaller and larger networks. These experiments underscore SMIA's adaptability and effectiveness across different configurations.

We believe these additions address most of the concerns raised during the review process and enhance the clarity and comprehensiveness of our work. We appreciate your time and consideration.

AC Meta-Review

This paper introduces a novel approach to membership inference attacks against Large Language Models, centered around how these models exhibit distinct behavioral patterns when processing semantically similar variants of their training data versus unseen data. The authors develop a method that analyzes model responses to semantic neighbors of input texts, demonstrating improvements over existing approaches.

The technical approach combines neighbor generation using masked language models, semantic embedding analysis, and neural network classification. This allows the method to detect both exact matches and semantically similar content, representing an advancement in the field. The experimental evaluation examines performance across different model sizes, architectures, and datasets, while considering dataset distribution effects through different non-member datasets. The analysis of modified text scenarios through word additions, deletions, and duplications provides valuable practical insights into the method's robustness.

However, the submission has several significant weaknesses. A fundamental concern is that the evaluation datasets might be measuring distribution shifts rather than true membership inference performance. Recent work demonstrates that simple baselines can achieve high success on temporally shifted articles, casting doubt on the reported improvements. The paper's heavy reliance on a single metric is problematic, and the limited reporting of other metrics makes it difficult to assess real-world effectiveness. Additionally, while the method involves multiple complex components, there's insufficient investigation of their necessity through ablation studies.

The paper omits several important comparisons that would strengthen its evaluation, including performance analysis across train-test splits, comparison against baseline approaches, and exploration of alternative embedding approaches. The selection of key parameters lacks thorough justification and analysis of trade-offs between computational cost and performance gain.

While the technical approach is novel and well-presented, the fundamental concerns about evaluation methodology cast serious doubt on the paper's main claims. The possibility that the method is primarily detecting distribution shift rather than true membership undermines its purported advances. To strengthen this work, the authors should expand the evaluation to include comprehensive comparisons against baselines, incorporate additional metrics, conduct thorough ablation studies, and provide clearer justification for design choices and parameter selection. These improvements would help establish the method as a meaningful advancement in membership inference attacks against language models.

Additional Comments from the Reviewer Discussion

The paper review process revealed significant concerns about the SMIA methodology and its evaluation approach. The primary issue centered on the use of temporally split Wikipedia data for evaluation, as recent research has shown that simple baselines can achieve high success rates on such splits, suggesting potential confusion between distribution shifts and actual membership information. The choice of evaluation metrics also drew criticism, with arguments that the AUC-ROC metrics alone were insufficient for properly assessing membership inference attacks. Though some additional metric results were provided in response, they were not comprehensive across all experimental settings. Critics also pointed out the lack of thorough ablation studies and clear justification for various architectural and hyperparameter choices, with only partial responses addressing these concerns. Practical implementation concerns were raised regarding the dependency on specific embedding models and associated privacy implications, along with questions about scalability for larger datasets and more complex architectures. These issues remained largely unaddressed. The final rejection decision was primarily influenced by the fundamental concerns about the evaluation methodology, as the potential confusion between distribution shift and membership inference significantly weakened the paper's core arguments. While the technical limitations could have been addressed through revision, the fundamental methodological issues would require substantial changes to the paper's approach and experimental design.

Final Decision

Reject