Rethinking the Role of Verbatim Memorization in LLM Privacy
Increased verbatim memorization doesn't necessarily lead to greater chat extractability, and model quality is a greater privacy threat than memorization
Abstract
Reviews and Discussion
This work proposes a novel analysis of memorisation in LLMs. Specifically, the authors divide memorisation into two distinct phases, pre-training memorisation and instruction-tuning memorisation, showing that some of the pre-existing understanding of what it means for an LLM to memorise may not represent the whole picture.
Strengths and Weaknesses
Strengths:
The paper is well-written, easy to follow, has a large number of nice experimental results, and does indeed propose novel conclusions based on the results of these experiments. To the best of my knowledge, such a separation between memorisation stages has not been empirically evaluated before.
Weaknesses:
While the work is very interesting and has a number of novel insights, I have my doubts about how much one can read into these results to come to these conclusions. For instance, the argument on 'Model size does not significantly impact chat extraction, while model quality is crucial' - this conclusion is a bit far-fetched given the underlying experimental results. The authors evaluated 3 different generations of models (Llama 1-3), where it is not just the size that matters, but the model itself (the quality - supporting the claim) and, more importantly, the training dataset. While it is not known which 'exact' data went into this model family during pre-training, we know the relative training corpus sizes: 1T to 15T tokens between them, so I am not sure if you can attribute the difference in extractability simply to 'better models'. There is a higher chance that the model is better, but there is also a higher chance that a similar-looking sequence has been memorized due to its recurring appearance in the pre-training dataset (be it part of the sequence or individual tokens next to each other).
Some other conclusions have already been established, e.g. 'Increased verbatim memorization does not imply more chat extraction' - which we knew from a number of prior works (a summary of which can be found in [1]) - or that 'IDs are more challenging to encode compared to other attributes' - which we also knew already, hence why these semi-random sequences have been used as canaries in most extraction-related prior literature. The same goes for 'Verbatim extraction of unique attributes requires significantly longer unsupervised fine-tuning compared to other attributes' - we already know from [2] and other similar works on unique PII memorization, such as [3], that duplication of tokens in the training dataset leads to their memorization. So while I see that there is indeed a difference between the stages of training, which makes some difference when discussing these insights, the overall principle is the same and we have known about these phenomena for a while. Note: I do not argue against the validity of these conclusions, but rather that we already knew this and that the impact of these insights is relatively small even if applied to pre-training instead of instruction-tuning.
I also had concerns regarding the dataset composition: as I understand it, this is the same dataset as the one used in Zhu & Li? I am not certain how diverse the dataset is, given the examples provided in the original manuscript and in this work. Here is my reasoning: Llama models are likely (as is the case with the majority of publicly available LLMs) pre-trained almost exclusively on data oriented towards the US (e.g. city names, states, academic institutions etc.), meaning that a) the model is already predisposed to generate sequence completions that are aligned with US-based data, b) while the extraction results hold in this instance, I am not certain the findings scale beyond this geography, and c) I've skimmed the original manuscript and it seems to me that some data was generated using Llama models, further amplifying the bias and potentially reproducing some part of the pre-training data. From my understanding, none of these concerns were addressed or discussed in the manuscript at all, significantly limiting the potential impact of the findings. I would like the authors to address these through e.g. experiments and extended limitation/discussion sections.
[1] - Usynin, Dmitrii, Moritz Knolle, and Georgios Kaissis. "Memorisation in machine learning: A survey of results." Transactions on Machine Learning Research (2024).
[2] - Carlini, Nicholas, et al. "Quantifying memorization across neural language models." The Eleventh International Conference on Learning Representations (ICLR). 2023.
[3] - Lukas, Nils, et al. "Analyzing leakage of personally identifiable information in language models." 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023.
Questions
I've already outlined these in weaknesses and these mostly target a) concerns regarding the BIO dataset, b) novelty of some conclusions compared to prior work and c) reliability of some conclusions given the experimental results.
Limitations
I've outlined the limitations in the weaknesses section; they mostly concern the dataset composition, the reasoning behind some conclusions, and the novelty of some insights. Regardless of the above, I believe there is some value in this work as it decomposes a problem that has been extensively studied before into separate sub-problems that have not been given enough attention on their own. So I am willing to increase my score should the authors adequately address my comments.
Justification for Final Rating
As per response: most concerns addressed, slight raise.
Formatting Issues
None
We thank the reviewer for their positive assessment of the paper's quality and clarity and for the feedback. We appreciate the opportunity to clarify our contributions and address the valid concerns raised.
While the work is very interesting and has a number of novel insights, I have my doubts about how much can one read into these results to come to these conclusions. For instance, the argument on 'Model size does not significantly impact chat extraction, while model quality is crucial' - this conclusion is a bit far fetched given the underlying experimental results... I am not sure if you can attribute the difference in extractability simply to 'better models', as there is a higher chance that the model is better, but there is also a higher chance that similar-looking sequence has been memorized due to its recurring appearance in the pre-training dataset. [...]
Our claim is not that model size is irrelevant, but rather that "model quality" (as measured by standard benchmarks like MMLU) appears to be a better predictor of chat extraction success than parameter count. However, we agree that the pre-training data is a significant confounding variable that we cannot control here, but all the biographies are synthetic, so we are certain that none were in the pre-training data of any of the models. In particular, in Figure 3, we see that it is significantly easier to chat-extract IDs from Llama3-8B than from Llama2-13B, whereas there is no reason to believe that Llama3's pre-training data had a bigger emphasis on random sequences.
Some other conclusions have been previously established already e.g. 'Increased verbatim memorization does not imply more chat extraction'... or that 'IDs are more challenging to encode compared to other attributes'... So while I see that there is indeed a difference between the stages of training which makes some difference when discussing these insights, the overall principle is the same and we have already known about these phenomena for a while.
We agree there is a strong link with the knowledge extraction literature, which we have stated in the manuscript. However, our contributions either
- are, to the best of our knowledge, new even within that literature, or
- highlight consequences of already-known facts from a privacy-centric angle.
Specifically, to the best of our knowledge, we are the first to show that:
- Verbatim memorization (pre-instruction tuning) does not correlate well with chat-based extraction (post-instruction tuning). However, removing information in the verbatim sense (see the unlearning section, e.g. Fig. 6) does remove chat extractability.
- Information can become "un-extractable" verbatim after instruction tuning, yet remain encoded and extractable through other means (chat or after realignment), creating a false sense of privacy.
- Model quality, not just size, is a key factor in chat-based leakage.
Again, our contribution also lies in being the first to systematically investigate these effects from a privacy perspective, as they have significant implications for how we audit models for privacy. We will refine our framing to make it clear that our contribution is this comparative, privacy-centric angle.
I also had concerns regarding the dataset composition... Llama models are likely... pre-trained almost exclusively on data oriented towards the US... I am not certain if this is likely scalable beyond this geography... From my understanding none of these concerns were addressed or discussed in the manuscript at all, significantly limiting the potential impact of the findings.
The reviewer is correct; our synthetic dataset is US-centric, mainly because it is based on the dataset from Zhu & Li. We fully agree that this raises questions about generalizability. How these extraction dynamics play out with data from geographies or domains that are underrepresented in the pre-training corpus is an open question that we do not tackle. We can add a dedicated discussion to our limitations section to address the dataset's geographic bias and its implications for the scalability of our findings.
However, the ID attributes that we add, for instance, are completely neutral, and we respectfully disagree with the reviewer that this significantly limits the generalizability of our claims: we care about the results in relative terms rather than, e.g., the absolute extraction rates. Which statement in particular does the reviewer think it impacts?
I've skimmed the original manuscript and it seems to me that some data was generated using Llama models, further amplifying the bias and potentially reproducing some part of the pre-training data.
We would like to point out that we use the BIO-S dataset, which does not contain any data attributes generated by Llama models, and that the biographies are created by assembling attributes randomly.
We thank the reviewer for their positive feedback, which will improve our work.
We believe that the message, namely that "a better model better manipulates its internal knowledge, and it is thus easier to extract memorized private information via chat from better models", is not over-stretched and is supported by empirical evidence, especially the difference in chat extractability of ID and ID seq between the different models. However, we fully agree with the reviewer that our definition of model quality might have been confusing. We will make what we mean by "model quality" crystal clear.
We remain available for any additional question that the reviewer might have.
I would like to thank the authors for their comprehensive response!
The majority of my concerns have been addressed; the main one that remains is the question of model size vs model quality and the corresponding extraction attack outcomes. I am not in full agreement with how the authors define what it means for a model to be 'better', since this is more complex than a single metric (regardless of how well-adopted it is, it is still a single number representing a large number of confounding factors that affect the model's quality). Now, this does not invalidate the conclusions of this work per se, but I would like the authors to be extra precise about what they mean by models of higher quality in the definitions, how the pre-training data of different models contributes to their quality (and consequently to extraction success), etc.
I will slightly raise my score towards acceptance, because fundamentally the insights here are novel. I am just still not fully convinced that the authors have clearly identified the specific underlying phenomena that contribute to these (specifically, what part of the 'model being better' results in better attack performance?).
This paper studies the memorization of private information in LLMs both after pre-training and post-training (only SFT), using a small-scale synthetic data setting. The authors construct the dataset automatically using templates for personal attributes such as school, degree, and birth city, and hard attributes such as ID. They then "pre-train" (which in this paper's setting is basically unsupervised fine-tuning) the Llama-2-7B model on this data for 0-20 epochs. They then take the checkpoint and do SFT (chat QA template) for 10 epochs. The dataset is split during the SFT stage so that only a part of the data is used for SFT, leaving the tail bucket data for evaluation to avoid train-test overlap. The authors then present some findings based on the experimental results.
Strengths and Weaknesses
Strength:
- The experimental setting separates the training/evaluation populations; I think this is a carefully considered design.
- The paper presents several findings through the experiments which are easy to understand.
- The paper proposes several methods to improve the chat extraction and verbatim extraction accuracy.
Weakness:
- The main contributions summarized in the introduction are mainly findings from the experiments. I think these findings have been shown by previous work in a similar knowledge extraction setting (Allen-Zhu & Li, 2024).
- In terms of the experimental setting, I think the pre-training part is done correctly, but the post-training part (SFT) is missing some key results, such as varying the fixed 10 SFT epochs, mixing chat SFT data with pre-training data, and RL approaches (see questions).
- Overall, I think the scope of this project, with its limited synthetic-data experiments, and its contributions are of fair quality but not significant.
Questions
- It seems that the SFT stage starts from the epoch-N checkpoint of the pre-training stage and runs SFT for a fixed 10 epochs. Do the 10 SFT epochs lead to an overfitting problem? Can you show the verbatim extraction and chat extraction accuracy at each SFT epoch, starting from each pre-training epoch checkpoint? I think it would show a clearer picture and eliminate the influence of the SFT stage.
- I notice that in Figure 5, the paper claims that after pre-training again on non-tail/head data, the model's verbatim extraction accuracy recovers back to 100% on the tail bucket for easy attributes such as birth city, degree, and company. So it seems the model still memorizes this information, but SFT leaves the model unable to extract it. What would happen if, during the SFT stage, you mixed the pre-training data and SFT data together, so the model is trained on both types of data with their respective training objectives? Would this result in better chat extraction accuracy while at the same time preserving the verbatim accuracy?
- The main experiments focus on SFT to learn the chat capability. As the paper shows this leads to low chat extraction accuracy, I wonder what would happen if, instead of SFT using RLHFlow, the experiment were conducted using reinforcement learning such as RLHF or RL with verifiable reward (in your case, the reward is simply string matching to the attribute) with the same training framework? As we know, RL is a standard post-training method in modern LLM training recipes, and many models went through this step.
Minor: Line 142: Person mame --> person name
Limitations
N/A
Justification for Final Rating
The authors clarified my Q1 but did not show the results initially. After looking at the data related to Q1, it is clearer how the SFT stage impacts the accuracy.
The authors claim Q2 is not practical, but this is a common practice used during the RLHF stage (Training language models to follow instructions with human feedback).
The authors' study only focuses on SFT and does not cover the common practice of RLHF during the post-training stage.
The authors clarified my W1 and emphasized that the unique contribution of this work is mainly about the misunderstanding, i.e., the false sense of security after instruction tuning. I think this makes the main message clearer.
Overall, I think the scope of this project, with its limited synthetic-data experiments, and its contributions are of fair quality but not significant. I changed my score from 2 to 3.
Formatting Issues
N/A
We thank the reviewer for their feedback. We appreciate the suggestions for extending our experimental setup, which we believe would strengthen the work. We address the main points below.
The main contributions summarized in the introduction are mainly findings from the experiment. I think these findings have been showed by previous work in a similar knowledge extraction setting (Allen-Zhu & Li, 2024)
We agree there is an overlap with the knowledge extraction literature. But our contributions as stated are new, specifically:
- Verbatim memorization (pre-instruction tuning) does not correlate well with chat-based extraction (post-instruction tuning). However, removing information in the verbatim sense (see the unlearning section, e.g. Fig. 6) does remove chat extractability.
- Information can become "un-extractable" verbatim after instruction tuning, yet remain encoded and extractable through other means (chat or after realignment), creating a false sense of privacy.
- Model quality, not just size, is a key factor in chat-based leakage.
These key new insights don't just imply that chat extraction differs from verbatim memorization, but that this disconnect creates an issue for privacy auditing. When standard audits (like verbatim extraction or loss-based MIAs) fail after instruction tuning, they can give a false sense of security, even though the information is still very much present and extractable via other means. Our work is the first to systematically demonstrate this failure mode and map the factors that influence it, which we believe is a significant and practical contribution to the privacy community.
It seems at the SFT stage, it starts from epoch N checkpoint from the pre-train stage and SFT for a fixed 10 epochs. Does the 10 SFT epoch lead to overfitting problem? Can you show the verbatim extraction and chat extraction accuracy at each SFT epoch starting from each pre-train epoch checkpoint?
That's a fair point. We fixed the SFT epochs to 10 based on preliminary runs where performance plateaued, but we agree that showing the full learning curve would provide a clearer picture and rule out overfitting. We ran the experiments and have the curve ready to be added: the verbatim memorization does not keep decreasing over the 10 epochs of SFT; it stabilises after some initial instability.
What would happen if during the SFT stage, you mix the pre-training data and SFT data together? So the model is trained on both type of data with the training objective. Would this result in better chat extraction accuracy and at the same time preserve the verbatim accuracy?
From a practical standpoint, mixing the data would likely be a powerful strategy to improve chat extraction (by reinforcing the knowledge during instruction tuning) while simultaneously preserving verbatim extraction (by maintaining the next-token prediction objective), but we do not see it as something practitioners realistically do during SFT. In the same vein, we show in e.g. Figure 5 that "re-alignment" restores a lot of verbatim extraction, which leads to a conclusion similar to the one suggested by the reviewer.
As the paper show it lead to low chat extraction accuracy, I wonder what if instead of SFT using RLHFlow, the experiment is conducted using reinforcement learning such as RLHF...
We chose SFT as our focus because it is a foundational and universally applied step in aligning chat models, and we show in particular that verbatim extraction does not work well after this step, which is one practical take-away from our work. We hypothesize that RLHF would also weaken verbatim extraction, but this would need to be tested.
Thanks for your response.
For Q1, you mention you have the curve ready to be added. Could you show the results in a table?
For Q2, you mention it is not practical. I think this is a well-established practice used during RLHF for post-training (Training language models to follow instructions with human feedback).
Thanks for clarifying W1 and emphasizing that the unique contribution of this work is mainly about the misunderstanding, i.e., the false sense of security after instruction tuning. I think this makes the main message clearer.
We thank the reviewer for the additional questions asked.
Q1: Note that Figure 5 shows that although most attributes encoded during unsupervised finetuning cannot be recovered after SFT via chat or verbatim extraction, they can be almost fully recovered when reversing the process by re-training on (different!) biographies and breaking the alignment. This suggests that the pre-trained knowledge is not forgotten even after 10 epochs of SFT, which in turn suggests that the SFT part is not too "heavy".
Here is a table with results from the new experiment, where we measure the verbatim extraction rate after each SFT epoch on the final model (unsupervised training for 20 epochs on BIOs). It corresponds to Llama-2-7B-BIOS, and one epoch of SFT is 625 steps as described in Section 3. Llama-2-7B-BIOS before SFT has 100% verbatim extractability, as shown in Figure 2b. As we can see, the verbatim memorization does not keep decreasing over the 10 epochs of SFT; it even stabilises after some initial instability. We will add this point to the discussion.
| SFT Epoch | ID | birthcity | Company | field | University | ID seq |
|---|---|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1 | 0.01 | 0.04 | 0.63 | 0.05 | 0.03 | 0.01 |
| 2 | 0.01 | 0.01 | 0.67 | 0.07 | 0.00 | 0.01 |
| 3 | 0.01 | 0.26 | 0.82 | 0.27 | 0.08 | 0.01 |
| 4 | 0.01 | 0.24 | 0.82 | 0.42 | 0.19 | 0.01 |
| 5 | 0.01 | 0.23 | 0.85 | 0.45 | 0.18 | 0.01 |
| 6 | 0.00 | 0.25 | 0.85 | 0.50 | 0.20 | 0.00 |
| 7 | 0.01 | 0.25 | 0.86 | 0.49 | 0.19 | 0.00 |
| 8 | 0.01 | 0.23 | 0.87 | 0.49 | 0.18 | 0.01 |
| 9 | 0.01 | 0.23 | 0.87 | 0.49 | 0.18 | 0.01 |
| 10 | 0.01 | 0.24 | 0.88 | 0.49 | 0.19 | 0.00 |
If we wanted to validate that 10 epochs is optimal, we would need to check verbatim and chat extraction at the end of N SFT epochs, where N takes values from 1 to 10 as in this table, but with adequate learning-rate schedules for each N. However, this is not our point here; rather, we want to show that after the SFT epochs, there is a discrepancy between the chat and verbatim extractability of the attributes.
Q2: Yes, we agree, but studying SFT seemed like the logical first step, and we wanted to follow a set-up similar to that of Allen-Zhu & Li (2023). It is indeed very possible that RLHF improves fact retrieval: this seems like an interesting follow-up.
We thank again the reviewer for their feedback which will improve our work, and hope that our rebuttal has clarified things.
Thank you for providing the data for Q1. It shows a clearer picture of the SFT stage.
The paper studies data extraction in LLMs, in particular comparing extractability when prompted verbatim on (partial) sequences seen during unsupervised training and when prompted via chat.
To study this, the authors use synthetically generated biographies containing a range of PII for fictional identities. They then take a pretrained LLM, further finetune it (unsupervised) on the bios, and then conduct instruction tuning on question/answer pairs related to the biographies.
They then evaluate:
- verbatim extraction (greedy next token prediction), using the same prefix as during training
- Chat extraction: asking questions similar to the ones used in instruction tuning
They find that verbatim extraction quickly works quite well, with perfect extraction happening after 12 epochs.
The authors then take the unsupervised finetuned models (at different epochs) and do instruction tuning for 10 epochs on question-answering pairs from data already seen during unsupervised training. While the extractability does increase with the number of epochs, it plateaus quickly and stays well below 100%.
They then draw insights from further experiments:
- chat extraction becomes easier when models are better.
- Chat extraction becomes better when data is augmented (e.g. rephrased) rather than repeated exactly, both in unsupervised as in supervised finetuning.
- Instruction tuning on Q/A about the same data as seen during pretraining is crucial for things to become chat-extractable.
- The fact that sequences are not verbatim extractable does not mean they are not encoded; e.g., a supervised finetuned model cannot elicit verbatim completions because its finetuning objective steered it away from next-token prediction, but when finetuned again to do next-token prediction (on non-overlapping data) it succeeds again.
- Applying unlearning in the verbatim sense during the unsupervised stage does reduce chat extraction.
Strengths and Weaknesses
Strengths
- Studying how information memorized verbatim during pretraining is then extractable during chat interaction is quite a fascinating and impactful problem statement, also very much beyond privacy concerns.
- the dataset used in the paper aligns very well with the problem statement
- The secondary findings in the paper are all well thought through and lead to fascinating results: for instance, that better models, not larger models, are more prone to chat extraction, that augmentation helps rather than repetition, or that no verbatim extraction does not mean not memorized.
Weaknesses:
- Generally, I find the writing and structure of the paper often quite poor. In particular:
- A substantial amount of typos, see formatting comments below.
- I find the related work quite unstructured. For instance, reading the 'other work on LLMs and privacy' does not flow very nicely and it's unclear what is relevant for this particular work.
- Sometimes certain points of reference are not well articulated and hard to follow. For instance, the argument around DP in lines 50-53 is not clear, nor is the argument around MIAs in 193-194 (also see questions).
- Sometimes it's not clear how things have been implemented exactly (eg the setup used in figure 4 is not clear, I don't really understand the unlearning implementation on lines 321-327).
- I don't really understand the points made in lines 199-213 - why does this exact formulation matter if you measure extraction too?
- One major experiment that seems to be missing is the impact of going from greedy decoding (used to measure verbatim extraction) to stochastic decoding (used to measure chat extraction). For instance, to properly understand the gap between Figures 2b and 2c, I would argue that you'd also need another figure like 2c but with greedy decoding. I understand this is less realistic, but currently it's unclear whether the drop from 100% to a plateau well below that comes from the data not being extractable or from the stochastic decoding.
On a related note, it might be useful to consider some related work on probabilistic extraction [1].
[1] Hayes, J., Swanberg, M., Chaudhari, H., Yona, I., Shumailov, I., Nasr, M., ... & Cooper, A. F. (2025, April). Measuring memorization in language models via probabilistic extraction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 9266-9291).
- The paper only provides results for one dataset. While I believe the synthetic biographies are a great way to study this from a privacy perspective, it would make sense to also include results for another dataset (to see how it all generalizes, but potentially also in a setting more related to utility/encoding of factual knowledge).
- I do find a lot of these results relevant also beyond privacy, e.g. to encode and retrieve factual knowledge such as the date of birth of Napoleon. I think the paper would benefit from either more results on a related dataset for this or a paragraph on this in a discussion section.
Overall I really like the problem the paper studies, its execution and its findings, but the quality of the writing, the level of detail in presenting results/experiments, and the general structure need to be improved for me to raise my score.
Questions
- Could you clarify the argument around DP made in lines 50-53? Currently I don't understand the point.
- What do you mean with the paragraph on MIAs in lines 193-196? I understand that the loss of the biography data goes down during training, but to my understanding it's hard to link this to MIA as you don't consider any non-members?
- What's the exact setup used in Figure 4? To my understanding, details around model, which bucket B_i, the number of epochs are missing.
- Why do you need the buckets as specified in lines 145-156?
- When just mentioned very briefly on line 354, it is very hard to understand what you mean by "Membership Inference Attacks or detecting training data splits".
Limitations
Yes
Justification for Final Rating
I like the problem the paper studies, how the authors have approached it and I believe the findings are interesting. I do think the paper can benefit from improved writing and structure, which I believe can be addressed during revision. I keep my score of weak accept.
Formatting Issues
- Line 33: twice the
- Sentence on line 35 seems off.
- Line 51, 'and is' does not seem right
- Line 103: assumptions?
- 143: constructed
We thank the reviewer for their feedback. We appreciate the constructive criticism regarding the paper's clarity and structure, and we will address each point to improve the manuscript.
Writing, Structure, and Clarity
We will perform a thorough revision to fix all typos, improve the flow, and clarify ambiguous points. Unfortunately we don’t have the opportunity to update the manuscript during the rebuttal, but we commit to doing it for the camera ready version.
I find the related work quite unstructured... Sometimes certain points of reference are not well articulated and hard to follow. For instance, the argument around DP in lines 50-53 is not clear, nor is the argument around MIAs in 193-194. [...]
We will restructure the related work section for better narrative flow.
- On DP (lines 50-53): DP provides guarantees by bounding the total information encoded in the model weights for each sample. Our work shows that even when information is encoded (or verbatim memorised), its practical extractability via chat is a separate issue. We wanted to highlight our choice to go "beyond verbatim", looking past the classical privacy angle (such as the one taken by DP) that deals only with information encoding.
- On MIAs (lines 193-196): Our point is to introduce a nuanced, three-level view of memorization. The first level, which we show in Figure 2a, is the model's lower loss on training data, the very signal that loss-based MIAs exploit. This is a necessary but insufficient indicator of extraction risk. The second level is direct verbatim extraction (Figure 2b), a stronger form of memorization. The third is chat extraction (Figure 2c), where information is leaked conversationally. By presenting these three distinct levels, we argue that privacy audits must look beyond simple loss-based signals or MIAs, and even beyond verbatim extraction.
- Figure 4 Setup: The details are mentioned in Section 3.2 but we will clarify the setup in the caption and text. All experiments in Figure 4 use the Llama2-7B model, are evaluated on the tail set (B1), and follow our standard protocol of 20 unsupervised epochs followed by 10 instruction-tuning epochs.
- Unlearning Implementation: We will clarify the unlearning process. To test if removing verbatim memorization also removes chat extractability, we modified the training process for a specific subset of our data (the 'unlearning set'). For the 'ID' attribute in this set, instead of minimizing the standard cross-entropy loss, we trained the model to achieve a target loss of 2 by minimizing abs(loss - 2) only for the ID tokens, and normal training for the rest. We chose this target because a loss of ~2.3 (ln(10)) is what one would expect from a model randomly guessing a digit from 0-9. A high loss of 2 thus removes verbatim extraction. To prevent the model from forgetting how to process IDs entirely, we continued normal training on a separate 'regularization set'. We then evaluated how this targeted verbatim unlearning affected chat extractability on both the unlearning set and a heldout set.
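For concreteness, here is a minimal sketch of this objective; the function name, tensor shapes, and the choice to aggregate the ID-token losses by their mean are illustrative assumptions, not our exact implementation.

```python
import torch.nn.functional as F

TARGET_LOSS = 2.0  # close to ln(10) ~ 2.3, the loss of a uniform guess over ten digits

def targeted_unlearning_loss(logits, labels, id_token_mask):
    """logits: (B, T, V); labels: (B, T), assumed already shifted for next-token
    prediction; id_token_mask: (B, T) bool, True on the tokens of the ID attribute."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).view(labels.shape)
    # Standard next-token objective on all non-ID tokens.
    normal_term = per_token[~id_token_mask].mean()
    # Push the loss on ID tokens toward the target instead of toward zero,
    # removing verbatim extractability without training the IDs to any new value.
    unlearn_term = (per_token[id_token_mask].mean() - TARGET_LOSS).abs()
    return normal_term + unlearn_term
```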
I don't really understand the points made in lines 199-213 - why does this exact formulation matter if you measure extraction too?
This formulation connects the model's training objective (minimizing cross-entropy loss) to verbatim extraction. It formally highlights that the expected number of queries needed to extract a sequence is exponentially related to the loss, whatever decoding scheme is used. While we measure extraction empirically, this formulation explains why a low loss is a prerequisite for verbatim extraction, and why the loss essentially tells you everything you should expect from verbatim extraction. We then show that it says little about chat extraction.
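To spell out that step, here is a minimal sketch of the argument, assuming temperature-1 sampling from the model distribution (the notation below is ours and illustrative):

```latex
\[
  \Pr[\text{model emits } s \mid p]
  = \prod_{t=1}^{\ell} p_\theta\!\left(s_t \mid p,\, s_{<t}\right)
  = \exp\!\Big(-\sum_{t=1}^{\ell} \mathcal{L}_t\Big)
  = e^{-\ell\,\bar{\mathcal{L}}},
  \qquad
  \mathbb{E}\big[\#\text{ queries to extract } s\big] = e^{\ell\,\bar{\mathcal{L}}},
\]
```

where $s$ is the target continuation of length $\ell$ given the training prefix $p$, $\mathcal{L}_t$ is the cross-entropy loss on token $t$, and $\bar{\mathcal{L}}$ is their mean. Since independent sampling attempts succeed with probability $e^{-\ell\bar{\mathcal{L}}}$, the expected number of queries until a verbatim hit is its reciprocal, hence exponential in the loss.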
One major experiment that seems to be missing is the impact of going from greedy decoding (done to measure verbatim extraction) to stochastic decoding (done to measure chat extraction)... it's unclear whether the drop from 100% to a plateau well below that comes from the data not being extractable or from the stochastic decoding.
Our goal was to compare two realistic extraction settings: a verbatim extraction attack, where an adversary has access to the logits and would likely use greedy decoding to maximize success, and a typical chat interaction through an API, which involves stochastic decoding. We acknowledge that the decoding strategy is a confounding variable when comparing the absolute extraction rates between these two settings.
However, the central claim of this part of our study is not so much about the absolute drop in performance, but more about the relative extractability of different attributes in a chat setting. Since the same stochastic decoding parameters were used for all attributes during chat extraction, our conclusion that some attributes are significantly harder to extract via chat than others remains valid.
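For concreteness, a minimal sketch of the two probes as a reader might reproduce them with a Hugging Face-style API; the helper names, chat-template handling, and generation parameters below are illustrative assumptions, not our exact implementation.

```python
import torch

def verbatim_extract(model, tokenizer, bio_prefix, target_value):
    """Greedily continue the exact training prefix, then exact-match the attribute."""
    ids = tokenizer(bio_prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=32, do_sample=False)  # greedy decoding
    completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return target_value in completion

def chat_extract(model, tokenizer, question, target_value):
    """Ask the instruction-tuned model a question phrased like the SFT Q/A templates."""
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Stochastic decoding, as in a typical chat API; temperature is an assumed value.
        out = model.generate(ids, max_new_tokens=64, do_sample=True, temperature=0.7)
    answer = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return target_value in answer
```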
On a related note, it might be useful to consider some related work on probabilistic extraction [1].
Thank you for this valuable reference, it is very relevant. We will incorporate it into our related work section, as it complements our discussion of extraction methodologies.
The paper only provides results for one dataset... it would make sense to also include results for another dataset (to see how it all generalizes, but potentially also in a setting more related to utility/encoding of factual knowledge).
We chose synthetic biographies to have precise control over the properties of private information, which was essential for our privacy-focused study. We agree our findings have broader implications for knowledge retrieval, as the reviewer notes. We will add a paragraph to the discussion.
Why the buckets (lines 145-156)?
The bucket system allows us to control for data repetition and augmentation, key factors in memorization. The tail (B1, seen once) serves as our consistent evaluation set for "rare" data leakage. The head (B10, seen 10 times with augmentation) is used for instruction tuning. This simulates a realistic scenario where a model is taught to answer questions about common knowledge, and we then test whether this transfers to extracting rare, private information it was not explicitly taught to retrieve. This setup cleanly separates the data used for teaching the skill from the data we are testing for leakage. As for the other buckets, we have results showing, as expected, that the extraction rate is an increasing function of the bucket number; we did not include them because we already have a lot of experiments, but we can add them to the appendix.
Line 354 "detecting training data splits":
This phrase refers to a common, basic form of privacy audit where one simply checks if the model regurgitates long, contiguous chunks of its training data verbatim. Our argument is that modern privacy evaluations must be more sophisticated, e.g. go beyond verbatim.
Many thanks for the rebuttal, and answering my questions. I still tend to accepting the paper, and will keep my score.
The paper argues that in the age of instruction-tuned chat-based LLMs, measuring privacy risks through the technique of verbatim memorization of training sentences is unrealistic and does not capture the real risks of information leakage. The paper shows trends in both directions, i.e., information that might seem extractable using verbatim memorization of the base model is not extractable anymore after instruction-tuning, and information that might not seem extractable using verbatim memorization of the final model is still memorized and can become extractable if the model again undergoes unsupervised fine-tuning. The paper supports these claims with an extensive set of results across various Llama models and types of information encoded.
Strengths and Weaknesses
Quality: The paper is technically sound, and it has an extensive set of empirical results to support its claims.
Clarity: The paper is clearly written and well organized.
Significance and Originality: The framing of the work, in my opinion, hurts its significance and originality. The motivation behind the work, i.e., trying to better understand how to appropriately audit information leakage in chat-based models, is strong. However, the actual questions tackled in the paper, the difference between verbatim memorization during original unsupervised training vs chat-based memorization after instruction-tuning, have been studied before (knowledge extraction literature, also noted in the paper).
Some results are interesting, such as an investigation of the 'head' vs the 'tail' of the distribution, studying the repetition of data during training and its impact on extraction, and the importance of fine-tuning on some data from unsupervised learning (this was the most interesting result to me, and I wish was explored deeper). But overall, a more practical perspective on trying to test the limits of information extraction from chat-based models instead of just studying the difference between verbatim and chat-based memorization would have been a stronger work. More details are discussed in the 'Limitations' section.
Questions
Please take the rebuttal as an opportunity to further clarify how your work stands out from the existing literature on knowledge extraction in chat-based models.
Limitations
Consider the problem: How to best audit information leakage in chat-based instruction-tuned LLMs?
This question spans several sub-questions, like,
- Does verbatim memorization still perform well, or do we need a different chat-based approach?
- What is the vulnerability of information present in the original unsupervised training vs the information present in instruction-tuning? Is there a relationship between the two?
- How does the distinction of 'extractable' vs 'discoverable' memorization translate in this setting? Does having access to how the data was presented in training, i.e., the discoverable memorization setting, help (for instance, useful in auditing copyright issues)? Does asking a large number of questions without access to any information, i.e., the extractable memorization setting, help (for instance, auditing how easily the training data can be stolen)?
- How does the variability in question formulation affect the extraction? How does the form in which the information is present in the training data, or its repetition, affect extraction?
- And many more questions...
My complaint with this paper is that it only briefly touches on and never answers any of these questions in depth. What the paper actually answers is a question that has already been explored in the knowledge extraction literature, only translating the experiment to a privacy-specific dataset.
To begin, the first question is not really answered in the paper. What the paper shows is that the verbatim memorization audit of the 'intermediate model' after unsupervised training but before instruction-tuning is different from the chat-based memorization audit of the 'final model'. While there is an experiment showing the verbatim memorization audit of the final model, it is not compared directly with the chat-based memorization. For instance, comparing Figures 2(c) and 5, it seems that verbatim memorization of the final model extracts more information from the category 'Company' than chat-based memorization of the same. On a similar note, for all other categories, chat-based memorization seems to extract far more information. A discussion of this trend and a deeper exploration of the anomalous behaviour of just the 'company' category would have been quite interesting.
Similarly, the paper briefly answers questions 2 and 4, but no in-depth exploration is done in the paper.
To be clear, I understand that the original question I put at the top is quite wide, and not something that can be covered in one research paper. I believe the paper would have been stronger if any one of these questions were explored in depth. Instead, in its current form, I find that the paper doesn't distinguish itself enough in terms of technical contribution from existing knowledge extraction literature.
Justification for Final Rating
My concerns were mostly clarified by the authors. I will raise my score to accept. I believe this is a good paper that I would like to see appear at NeurIPS. I encourage the authors to continue making efforts to improve the structure of their arguments as discussed in my review and in their rebuttal, which can further increase the impact of their work.
To summarize the final justification, I believe the paper tackles an important question, and all relevant parts of the answer are already present in the paper; it just needs to be arranged better to focus on more concrete research questions instead of how they are currently presented.
Formatting Issues
NA
We thank the reviewer for their feedback. We appreciate their positive assessment of the paper's technical quality and clarity. We will address the main concerns below and will revise the paper to clarify these points.
The framing of the work, in my opinion, hurts its significance and originality... the actual questions tackled in the paper, the difference between verbatim memorization during original unsupervised training vs chat-based memorization after instruction-tuning, have been studied before (knowledge extraction literature, also noted in the paper).
One key insight is that verbatim memorization, a standard tool for privacy audits, is a poor proxy for the actual information leakage risks in instruction-tuned models. We will revise the introduction and related work sections to more explicitly highlight this distinction and the privacy-centric motivation of our study. However, to the best of our knowledge, the fact that instruct models have lost most of their verbatim memorization (among other things) is a new observation even outside of the privacy scope. See the next answer for the clarified contributions.
Please take the rebuttal as an opportunity to further clarify how your work stands out from the existing literature on knowledge extraction in chat-based models.
Our work stands out by focusing on the disconnect between different forms of memorization and their privacy implications. Specifically, we show:
- Verbatim memorization (pre-instruction tuning) does not correlate well with chat-based extraction (post-instruction tuning). However, removing information in the verbatim sense (see the unlearning section, e.g. Fig. 6) does remove chat extractability.
- Information can become "un-extractable" verbatim after instruction tuning, yet remain encoded and extractable through other means (chat or after realignment), creating a false sense of privacy.
- Model quality, not just size, is a key factor in chat-based leakage.
These findings are important for effective privacy auditing methods for modern LLMs, and are new even outside of the privacy scope.
My complaint with this paper is that it only briefly touches on and never answers any of these questions in depth. What the paper actually answers is a question that has already been explored in the knowledge extraction literature, only translating the experiment to a privacy-specific dataset.
We agree that the questions posed are broad. Our goal was not to provide an exhaustive answer to all of them, but rather to demonstrate that the conventional approach to privacy auditing is insufficient and to map out the complex interplay of factors that a more robust audit would need to consider. We chose to provide a breadth of initial findings to motivate this. We will clarify the scope in the introduction to better manage expectations.
However:
- Question 1 is tackled by comparing the pink bars in Figure 5 (verbatim extraction after SFT) and Figure 2c (chat extraction after SFT).
- Question 2, on the relationship between SFT and unsupervised finetuning data: we show that the SFT data needs to include some unsupervised finetuning data to even be effective (Fig. 4c).
- Question 4: this is answered in Figs. 4a and 4b.
We are very much open to clarifying any of these points.
[...] For instance, comparing Figures 2(c) and 5, it seems that verbatim memorization of the final model extracts more information from the category 'Company' than chat-based memorization of the same. On a similar note, for all other categories, chat-based memorization seems to extract far more information. A discussion of this trend and a deeper exploration of the anomalous behaviour of just the 'company' category would have been quite interesting.
First, our methodology ensures that after the initial unsupervised fine-tuning, all attributes are fully encoded and achieve a 100% verbatim extraction rate (as shown in Figure 2b). This establishes that the information is perfectly memorized in a verbatim sense.
Second, after instruction tuning, we evaluate chat extraction (Figure 2c). Here, we observe that the extraction success varies significantly across attributes. This difficulty appears to correlate with factors like the uniqueness of an attribute's value (Table 1), where attributes with many unique values (like IDs) are harder to extract via chat than those with fewer (like Birth City).
Finally, Figure 5 explores a different phenomenon: the fragility of verbatim extraction itself after SFT. The y-axis in Figure 5 is for verbatim extraction, and it shows that after instruction tuning, the verbatim extraction capability is very low (the pink bars), even though the information is still extractable via chat (as seen in Fig 2c). However, if we "realign" the model by continuing unsupervised training, verbatim extraction is largely restored. This highlights that a failure in verbatim extraction does not mean the information is gone.
We will add a clearer interpretation of this trend. Regarding the company attribute specifically, the observation made by the reviewer is correct, but we do not have a clear explanation for it. We will highlight this point, namely that SFT removes verbatim extraction for most of the attributes.
I appreciate the clarifications made by the authors. I will raise my score to accept. I believe this is a good paper that I would like to see appear at NeurIPS. I encourage the authors to continue making efforts to improve the structure of their arguments as discussed in my review and in their rebuttal, which can further increase the impact of their work. To summarize, I believe the paper tackles an important question, and all relevant parts of the answer are already present in the paper; it just needs to be arranged better to focus on more concrete research questions instead of how they are currently presented.
We thank the reviewer for their positive feedback, which will improve our work. We remain available for any additional questions the reviewer might have.
The main idea of the paper is to challenge the conventional wisdom that verbatim memorization is the primary way LLMs lose privacy. The authors argue that for instruction-tuned LLMs, privacy loss is a more complex issue. They demonstrate through experiments that chat-based extraction is a more effective way to get private information than verbatim extraction. They show that better-performing models are more likely to leak private data through chat, even if they are smaller in size. Moreover, they show that repeated training does not always increase chat extraction, but rephrasing and augmenting data can significantly increase it. Finally, they show that information that is no longer verbatim-extractable after instruction tuning can still be retrieved through a conversational approach.
Overall, the reviewers are in agreement that the paper studies an important problem and makes some interesting contributions. Some concerns were raised by the reviewers around the lack of clarity in writing and some missing experimental details.
The authors' rebuttals effectively addressed most of the reviewers' concerns, with the authors committing to a thorough revision of the paper to improve its structure and clarify key points.
In light of the above, I support the paper for acceptance.