Watermarking Makes Language Models Radioactive
LLM watermarking, intended for generated text detection, has the secondary effect of revealing when synthetic data are used to fine-tune another model.
Abstract
Reviews and Discussion
This paper considers the problem of detecting whether watermarked text was used as training data for a language model. It identifies several different settings under which to study this question, and proposes detection methods for identifying language models trained on watermarked text. Experiments analyze the effectiveness of these detection methods across a variety of settings using a particular configuration of the Kirchenbauer watermark, and briefly analyze the Aaronson watermark in the closed-model setting.
Strengths
The paper introduction, background, and problem formulation are well-motivated and easy to read.
Identifying that a model has been trained on watermarked text is an interesting problem, of broad relevance to the community.
Empirical results show some evidence of detectability of hash-based watermarks, especially in the supervised setting (Figure 5).
Weaknesses
The algorithms in Section 4--the central contribution of the paper--are not well described. What explicitly is the null hypothesis for which we are computing a p-value? How does filtering/de-duplication change the null hypothesis? How explicitly is the p-value computed?
Based on the discussion around line 258 "Influence of the de-duplication on the correctness of radioactivity tests" and Table 4, it appears that the detection protocol is entirely heuristic; it is not clear to me that the computed values are p-values in any formal sense.
The methods are specific to fixed-window, hash-based watermarks. Most of the experiments focus specifically on the Kirchenbauer watermark, with the exception of Table 5, which considers the Aaronson watermark (another fixed-window hashing watermark). Contrary to the general claims of the title and exposition, it is not clear how broadly these results hold, e.g., for distribution-preserving watermarks (variable-length hashes) [1] or watermarks that aren't based on hashes. Based on the decay in detectability vs. k shown in Table 5, I strongly suspect that at least the watermark [1] is not radioactive.
The abstract of the paper claims to give statistical guarantees for detection of models trained on watermarked text. No such guarantees are given, only experimental evidence.
[1] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable Watermarks for Language Models.
Questions
Regarding the "supervised" and "unsupervised" settings: this word choice was initially quite confusing to me. These words typically refer to training regimes (with or without labels), but training is not what is being studied in this paper. Would "observed" vs. "unobserved" be clearer terminology?
Is Table 5 the supervised or unsupervised setting? Why is the watermark more detectable in model outputs (Rad) than in the training data (Orig)?
detection tests can be empirically inaccurate due to biases in the distribution of tokens, violating the independence assumption.
What independence assumption? Perhaps this is discussed in Fernandez et al. but in the present context it's not clear what is being assumed.
radioactivity can only be observed for watermarked (k + 1)-grams {watermark window + current token} that were part of A’s outputs in B’s training data
Is this true? For Kirchenbauer-style watermarks at least, it seems like a weaker observation could hold, because there is a distribution shift.
Limitations
Definitions of "text radioactivity" and "model radioactivity" are introduced in Section 3, but the experiments seem to exclusively study model radioactivity.
We thank the reviewer for the feedback. We have addressed each point individually. While we understand the concern, we argue that the p-values are not “heuristic”. We add a detailed response to clarify the reliability of the p-values and refer to App. D.2.1 “More details on token scoring and de-duplication” and App. D.2 “Correctness”. We will further emphasize this in the manuscript.
W1. The algorithms in Section 4 are not well described [...] How explicitly is the p-value computed?
H0 asserts that the observed tokens from B were not generated following watermark W with secret key K (such that "B was not trained on watermarked data from A" is included in H0, as in Def. 1). Filtering/de-duplication does not change the null hypothesis. It only modifies the observation from which we compute a score, such that we know the distribution of this score under H0. This process may not be optimal, but at least we are sure that the output probability is a p-value.
The p-value is computed as the probability P(s(X,K) > t | H0), where:
- X represents the observation
- s is the score function (e.g. number of 'green' tokens)
- K is the random secret key.
With the deduplication, X is a set of N unique (k+1)-tuples of tokens. With our filtering/tape, the tuples are not repeated from the prompt. This ensures that under H0, each (k+1)-tuple has a probability γ of leading to an increment in the score (in expectation over the random secret key K, and assuming an ideal hashing function). Therefore, s(X,K) is distributed as a binomial B(N, γ), and we compute P(s(X,K)>t|H0) using the regularized incomplete beta function (see App. C).
Without deduplication, the (k+1)-tuples of tokens are not unique: it is not easy to derive the distribution of s(X,K) or P(s(X,K)>t | H0). Similarly, without our filtering/tape, we do not know the distribution of the score under H0, as the model might simply have repeated watermarked tokens that were present in the attention span (cf. Appendix D.2.1).
Works in the literature resort to bounds (e.g., Markov), or to estimation from random simulations (Monte Carlo, like Kuditipudi et al. do). We prefer to use a sub-optimal test (due to filtering/deduplication) whose p-values are sound, i.e., not based on heuristics.
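For concreteness, here is a minimal sketch of the scoring and p-value computation described above (our own illustration in Python, not the paper's code; the keyed hash `is_green` is a toy stand-in for the watermark's actual hash function, and k, γ, and the key are arbitrary):

```python
# Minimal sketch (our own illustration, not the paper's code) of the scoring
# described above: deduplicate (k+1)-tuples, count how many end in a "green"
# token, and compute P(score >= s | H0) for a Binomial(N, gamma) null with the
# regularized incomplete beta function.
import hashlib
from scipy.special import betainc          # betainc(a, b, x) = I_x(a, b)


def is_green(window, token, key, gamma):
    """Toy greenlist test: (window, token) is 'green' iff a keyed hash of the
    pair falls below gamma -- probability gamma under an ideal hash."""
    payload = f"{key}|{'-'.join(map(str, window))}|{token}".encode()
    h = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    return h / 2**64 < gamma


def radioactivity_pvalue(tokens, key, k=2, gamma=0.25):
    """Deduplicate the (k+1)-tuples of a suspect text and return the p-value of
    the green-token count under H0 (tuples independent of the watermark key)."""
    tuples = {tuple(tokens[i:i + k + 1]) for i in range(len(tokens) - k)}
    n = len(tuples)
    s = sum(is_green(t[:-1], t[-1], key, gamma) for t in tuples)
    # P(S >= s) for S ~ Binomial(n, gamma) equals I_gamma(s, n - s + 1)
    return betainc(s, n - s + 1, gamma) if s > 0 else 1.0


if __name__ == "__main__":
    import random
    random.seed(0)
    toks = [random.randrange(32_000) for _ in range(5_000)]   # unwatermarked text
    print(radioactivity_pvalue(toks, key=42))                  # ~Uniform[0,1] under H0
```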
W2. Based on the discussion around line 258 [...] it is not clear to me that the computed values are p-values in any formal sense.
While the filtering and deduplication rules are indeed heuristic, the p-value computations are adapted from established tests in Kirchenbauer et al. 2023, Aaronson et al. 2023, and Fernandez et al. 2023 and are theoretically grounded as detailed in our previous answer to W1.
We also validate them experimentally: App. D.2 "Correctness" elaborates on our tests and their validity. Figures 8 and 9 demonstrate that under H0 (models not trained on watermarked data), the p-values are uniformly distributed between 0 and 1 across all settings, confirming their practical value. We show in Tab. 4 that this is not the case without filtering/deduplication. We are committed to clarifying this further and welcome suggestions on how to convincingly demonstrate that the p-values are rigorously derived.
W3. The methods are specific to fixed-window, hash-based watermarks [...]
In App. A (limitations) and App. C.3, we discuss the generalization of radioactivity detection to other watermarks. The claim of our paper is not "All LLM watermarking schemes are radioactive" but "Some very commonly used LLM watermarking schemes are". In particular (see general comment), we focus on the most used methods, for which reliable p-values can be theoretically computed, which is not the case for [Christ et al.] or [Kuditipudi et al.], which rely on Monte Carlo simulations.
W4. The abstract of the paper [...] No such guarantees are given, only experimental evidence.
We disagree: the guarantees are given under H0: the false positive rate is provably small and under control (see Tab. 4, App. D.2 and answer to W2). The experimental evidence is here to support the theoretical guarantees. There is indeed no guarantee under H1, as usual in the watermarking literature.
Q1. Regarding the "supervised" and "unsupervised" settings [...]
This is a very good suggestion, we will change the manuscript accordingly.
Q2. Table 5 [...]
Tab. 5 is under the unsupervised setting.
Orig: We detect watermarks in the training data. The goal aligns with the original application of the watermark, i.e., detection of AI generated text. We focus on a "real-life" scenario with 100-token texts.
Rad: In this setup, we aim to detect if a model is radioactive. We can score a larger number of tokens; in this case, we score 30k tokens.
The tasks are different, which explains why the p-values differ. What we study is whether, given a fixed watermark strength in the training data (= p-value on a fixed number of tokens), different watermarks lead to different radioactivity signals.
Q3. “[...] independence assumption.” [...]
Please refer to the previous answer to W1: the tests rely on the assumption that the score increments for every token are i.i.d. This holds true if the generated (k+1)-tuples are i.i.d. under the null hypothesis, which is not the case a priori: some (k+1)-tuples of tokens are more frequent than others in the natural text distribution, and certain prompts can lead to the more frequent appearance of specific tuples.
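To make this concrete, here is a small self-contained simulation (our own illustration, not code from the paper; the repetition factor and all parameters are arbitrary): if the same (k+1)-tuple is scored every time it occurs, the score under H0 is no longer Binomial(N, γ) and the nominal p-values over-reject, whereas scoring each unique tuple once restores the advertised false positive rate.

```python
# Toy simulation (not from the paper): re-scoring repeated (k+1)-tuples breaks
# the i.i.d. assumption behind the Binomial(N, gamma) null and inflates the FPR.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
gamma, M, c, runs = 0.25, 500, 10, 2_000      # M distinct tuples, each repeated c times
N = M * c                                     # tokens scored without deduplication

greens = rng.random((runs, M)) < gamma        # each distinct tuple is green w.p. gamma under H0
p_no_dedup = binom.sf(c * greens.sum(axis=1) - 1, N, gamma)   # repeats counted c times
p_dedup = binom.sf(greens.sum(axis=1) - 1, M, gamma)          # each tuple counted once

print("empirical FPR at p<0.05, no dedup:", (p_no_dedup < 0.05).mean())  # >> 0.05
print("empirical FPR at p<0.05, dedup:   ", (p_dedup < 0.05).mean())     # ~= 0.05
```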
Q4. “radioactivity can only be observed for watermarked (k + 1)-grams [...]”
The bias that radioactivity detection is capturing is at the token level. We are not detecting an overall "distribution shift". As there is no correlation between the greenlists of different watermark windows (since the partition is only a function of the watermark window through a hash function), radioactivity can only be observed for watermarked (k+1)-tuples {watermark window + current token} that were part of A’s outputs in B’s training data.
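As a small illustration of this point (our own sketch; the keyed hash is a stand-in for the schemes' actual hash functions, and the vocabulary size, γ, and windows are arbitrary), the green/red partitions induced by two different watermark windows behave like independent γ-fraction subsets of the vocabulary, so knowing the greenlist of one window reveals nothing about another:

```python
# Toy check (our own sketch): greenlists seeded by a keyed hash of the watermark
# window are essentially uncorrelated across different windows.
import hashlib
import numpy as np


def greenlist(window, key, vocab_size=32_000, gamma=0.25):
    """Green/red split of the vocabulary, seeded by a keyed hash of the window."""
    seed = int.from_bytes(hashlib.sha256(f"{key}|{window}".encode()).digest()[:8], "big")
    return np.random.default_rng(seed).random(vocab_size) < gamma


g1 = greenlist(window=(17, 4242), key=0)
g2 = greenlist(window=(17, 4243), key=0)       # the window differs by a single token
print("overlap:", (g1 & g2).mean())            # ~ gamma**2, as for independent splits
print("corr:   ", np.corrcoef(g1, g2)[0, 1])   # ~ 0
```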
Dear Reviewer, we hope our rebuttal addresses your concerns about the inaccuracies of the p-values. If our responses are satisfactory, could you consider revising your score? If anything remains unclear, please let us know so we can clarify.
This paper proposes a method to detect whether a language model is trained on (a subset of) watermarked outputs from another (victim) model. The method utilizes the fact that watermarking schemes shift the output token distribution, such that a model trained on the watermarked outputs will also exhibit this distribution-shifting behavior. The authors propose detecting such shifting to determine whether the model's training dataset contains watermarked content. The promising results illustrate the effectiveness of their method with high accuracy.
Strengths
- The proposed method is sound with promising evaluation results.
- The studied problem is important and timely.
Weaknesses
- The proposed method seems limited to the green-red list splitting-based watermarks.
- The presentation can be improved.
Questions
- My major concern is the generalizability of the method. The KGW [ICML'23] variant watermarks are not distortion-free, which in theory leaks the watermarked tokens' distribution. Kuditipudi et al. [2023] proposed a watermarking scheme that does not rely on the splitting of green and red lists. I think it might be hard to generalize the method to this watermark as it is proven to be distortion-free. Can the authors comment on this?
- Additionally, the false positive rate is an important metric in such detection systems. From the results in Figure 5, the p-value is not large enough when ρ = 0. Can the authors provide further explanations on this?
- Minor: The presentation can be further improved. For instance, you can put the images near the text where they are referred to for the first time, so that readers do not need to jump back and forth.
Limitations
See my concerns in the Question section.
We thank the reviewer for their feedback on our paper. We address each point specifically. Please note that Appendix A "Limitations" and Appendix C.3 "Does the radioactivity generalize to other watermarking schemes?" address some of the concerns. Importantly, we emphasize the reliability of our p-values in Appendix D.2.2 "Correctness experiments." We believe the reliability of our p-values is one of our paper's main contributions and hope it addresses the reviewer's concerns.
My major concern is the generalizability of the method. The KGW [ICML'23] variant watermarks are not distortion-free, which in theory leaks the watermarked tokens' distribution. Kuditipudi et al. [2023] proposed a watermarking scheme that does not rely on the splitting of green and red lists. I think it might be hard to generalize the method to this watermark as it is proven to be distortion-free. Can the authors comment on this?
We invite the reviewer to refer to the general comment and the aforementioned appendices for a discussion on generalization. There is specifically a discussion on the Kuditipudi et al. [2023] scheme. To summarize, while this scheme may exhibit radioactivity, detecting it would be prohibitively expensive, requiring many times more resources to properly evaluate p-values. Additionally, we analyze the scheme of Aaronson et al. in Section 6, which does not rely on green/red lists. This should address some concerns about generalization. It is true, however, that we only focus on the most prominent hashing-based watermarking methods.
Note about "distortion-free": the term is not yet perfectly agreed upon in the literature. Aaronson's scheme is said to be "distortion-free" by the authors in the sense that the probability of selecting a token is the same in expectation over the random vectors. On the contrary, Kuditipudi et al. contrast hashing-based methods (like the one of Aaronson et al.) with distortion-free methods: for them, hashing previous (k-1)-grams to create the secret vectors produces a bias towards certain k-grams, which introduces a distortion, while using a sequence of vectors as secret keys does not. We would argue that both methods are distortion-free only in the limit where the watermark window (for Aaronson et al. and Kirchenbauer et al.) or the sequence length (for Kuditipudi et al.) is large enough. There is an interesting discussion about this in Section 4.1 of [Kuditipudi et al.].
We also invite the reviewer to take a look at our attached PDF, where we show that even multi-bit watermarks are radioactive, in response to a question raised by reviewer q8YZ.
Additionally, the false positive rate is an important metric in such detection systems. From the results in Figure 5, the p-value is not large enough when 𝜌=0. Can the authors provide further explanations on this?
We agree that assessing the FPR is crucial, and this is why it was important for us to get reliable p-values. There is a direct link between the FPR of the detection system and the p-values obtained for different samples. Given an LLM for which the statistical test gives a certain p-value, we would flag the LLM as radioactive at every FPR ≥ p-value. For instance, if all samples have p-value ≤ ε, then the observed TPR would be 1.0 at FPR = ε.
When 𝜌=0%, there is no watermark in the training dataset (H0), so the detection should output an average p-value of ≈0.5, since p-values under H0 should be uniformly distributed on [0, 1] (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6629378/#s2title). This is what we observe in Figure 5. In Table 4 "Average p-values under H0" and Appendix D.2.2 "Correctness Experiments", we specifically focus on this and show that the test yields random p-values under H0 in all considered settings, validating the reliability of our p-values.
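As a quick numerical illustration of this point (our own sketch; N, γ, and the number of runs are arbitrary), scores drawn from the Binomial(N, γ) null give p-values whose average is indeed close to 0.5:

```python
# Toy simulation (not from the paper): under H0 the binomial p-values are
# ~uniform on [0, 1], so their average is ~0.5.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
N, gamma, runs = 10_000, 0.25, 1_000
scores = rng.binomial(N, gamma, size=runs)      # null scores: no watermark signal
pvals = binom.sf(scores - 1, N, gamma)          # P(S >= score | H0)
print(f"mean p-value under H0: {pvals.mean():.3f}")   # ~ 0.5, as in Tab. 4
```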
Minor: The presentation can be further improved. For instance, you can put the images near the text where they are referred to for the first time. So that the readers do not need to jump back and forth.
We will update the paper accordingly.
Thanks for the response. Please incorporate these discussions in the revision. I'll keep my score.
We thank the reviewer for their response. We will include the discussion in the revision.
The paper studies the "radioactivity" of watermarked texts, i.e. if using such texts in LLM finetuning leaves noticeable watermark signal that can be reliably detected in future outputs. The main case study considered is the common scenario of using LLM-generated data for IFT. Authors use off-the-shelf LLM watermarks but propose several improvements to the standard watermark detection pipeline, demonstrating that they are necessary to obtain correct and sufficiently small p-values for radioactivity detection. The experiments consider four different threat models and include several additional studies of radioactivity across different dimensions.
Strengths
I very much enjoyed reading this paper and believe it is an impactful contribution to the field. I highlight the key strengths below.
- Novelty, focus on an important underexplored problem: While individually all the following fields have been studied, to the best of my knowledge no prior work studied the current wave of LLM watermarking research from the perspective of active methods for tracing unauthorized data use and data radioactivity. Thus the authors identified an important gap in the literature.
- Impact for different subcommunities: The work can be valuable for several mostly disjoint subcommunities with different foci, such as: design of new LLM watermarking schemes, OSS watermarking, passive tracing of unauthorized data use (membership inference), active tracing (e.g., backdooring), model IP protection, but also model distillation, and instruction fine-tuning as such. This holds independent of the "strength" of the final takeaway -- even if radioactivity does not occur, such a study is equally valuable.
- Thorough exploration of the problem nuances: I appreciate the careful setup of 4 settings and a thorough discussion of when they are realistic/important, the extensive additional studies, and the detailed comparison to related and adjacent work such as membership inference attacks.
- Rigorous and extensive experimental evaluation: The evaluation part offers many clearly communicated insights that arise from carefully constructed evaluation scenarios and I particularly appreciate that nothing is "swept under the rug", especially the issue of mismatched empirical and theoretical p-values. The appendices cover many of the additional questions that the reader may have.
- High-effort writing and presentation: The paper is exceptionally well written and structured, the information is logically organized, and special care is taken to provide figures to ease understanding. Even when simple (e.g., Fig2) these steps help with reading.
Weaknesses
I can identify several weaknesses, none of which are fundamental.
- The last abstract sentence renders as an overclaim given that it applies only to the less realistic open case, while most readers would assume the more realistic closed case, where the number is 10%. I strongly suggest the authors make this clear in all places, as this number does not really affect the merit of the work.
- Main results (e.g. Fig5) compare results on N=225k tokens (open) and N=500k tokens (closed) models. As shown in Figure 14 increasing N improves the performance of detection, thus this comparison on different levels of N is misleading. Why did the authors not use the same N in both experiments?
- Table 1 is insufficiently discussed and hard to understand just from the main paper, but makes some strong claims. In particular: (i) "Without WM (MIA)" would make the scope of these columns clearer; (ii) it is unclear if "X" means "fundamentally inapplicable" or "achieves bad results". MIA+unsup. is the latter but IPP+unsup. is the former, with the caveat that we focus on current methods and not IPP in general. (iii) MIA+Closed are essentially "label-only" attacks ("Label-Only Membership Inference Attacks", Choquette-Choo et al. 2020), so this is not fundamentally inapplicable, while it may be that for LLMs no such technique was demonstrated viable; this should be made clearer. (iv) The tilde is unclear; after reading the appendix I take this to mean "very limited results demonstrated in this setting + technical issues when trying to reproduce". All this should be more carefully unpacked, as it is important to position the current work.
- L170 is in the context of "our contributions" yet states an approach ("ignore repeated ngrams") that is very common in prior work. A citation (given later) should already be given here to clarify which part of the Tape is novel.
L8 typo: "methods"+"detects"
Questions
- Where do the unwatermarked instructions used for B (1-rho percentage) come from? Using A without the watermark may introduce unnecessary entanglement, so I hope these pairs are fully independent, e.g., human written.
- Is there a reason why the open setting studies only d=1? Line 231 is unclear as it may temporarily make the reader think that "unsupervised" <=> "d=1", which is not true. This choice should be discussed.
Limitations
The authors adequately discuss the limitations in one of the appendices.
We thank the reviewer for their positive feedback on the paper, as well as valuable questions and comments.
- The last abstract sentence renders as an overclaim given that it applies only to the less realistic open case, while most readers would assume the more realistic closed case, where the number is 10%. I strongly suggest the authors make this clear in all places, as this number does not really affect the merit of the work.
We agree with this comment and will modify the abstract and the claims in the paper in consequence.
- Main results (e.g. Fig5) compare results on N=225k tokens (open) and N=500k tokens (closed) models. As shown in Figure 14 increasing N improves the performance of detection, thus this comparison on different levels of N is misleading. Why did the authors not use the same N in both experiments?
A reason for this choice is to show that the open model setting does not need as many tokens as in the closed model setting and is thus more effective. This can be observed in Figure 5, where lower p-values are achieved with fewer tokens than with the closed model.
Another reason is that p-values plateau beyond a certain point (see Figure 6), so scoring more tokens in the open model does not significantly enhance detection.
We acknowledge the potential for misunderstanding as it mixes two effects (open vs closed and number of tokens) and the numbers may appear arbitrary. We will clarify this in our revision.
- Table 1 is insufficiently discussed and hard to understand just from the main paper, but makes some strong claims. In particular: (i) "Without WM (MIA)" would make the scope of these columns clearer; (ii) it is unclear if "X" means "fundamentally inapplicable" or "achieves bad results". MIA+unsup. is the latter but IPP+unsup. is the former, with the caveat that we focus on current methods and not IPP in general. (iii) MIA+Closed are essentially "label-only" attacks ("Label-Only Membership Inference Attacks", Choquette-Choo et al. 2020), so this is not fundamentally inapplicable, while it may be that for LLMs no such technique was demonstrated viable; this should be made clearer. (iv) The tilde is unclear; after reading the appendix I take this to mean "very limited results demonstrated in this setting + technical issues when trying to reproduce". All this should be more carefully unpacked, as it is important to position the current work.
We discuss this point in section 5.5 "Other approaches" and in Appendix E and F, but agree that it could appear sooner in the paper, for instance when we present the table. In this table:
- “X” means that no method in the literature currently tackles this problem with LLMs;
- “~” means that the only methods that address the problem have strong technical issues when trying to reproduce: the statistical guarantees do not hold.
We will emphasise this for better clarity.
- L170 is in the context of "our contributions" yet states an approach ("ignore repeated ngrams") that is very common in prior work. A citation (given later) should already be given here to clarify which part of the Tape is novel.
Although [KGW2023] and [FCT2023] introduce the deduplication of tokens, the tape is indeed different in our paper as we also care about the prompts (closed-scenario) and about the LLM context (open-scenario), so we must also properly deduplicate these tokens in a way that is specific to our detection (proven in Tab. 4 and appendix D.2.2 “Correctness”). We will update this paragraph accordingly to clarify this contribution.
References:
[KGW2023] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023a.
[FCT2023] Pierre Fernandez, Antoine Chaffin, Karim Tit, Vivien Chappelier, and Teddy Furon. Three bricks to consolidate watermarks for large language models. 2023 IEEE International Workshop on Information Forensics and Security (WIFS), 2023.
I thank the authors for the rebuttal. I have read other discussion threads as well, and will maintain my score.
We thank the reviewer for their answer and support! We will incorporate the rebuttal in our manuscript.
The paper investigates the detection problem of whether LLM-generated texts are used to train another LLM, a phenomenon referred to as 'radioactivity'. The paper finds that it is feasible to detect the radioactivity of LLM-generated text via LLM watermarking. Consequently, the authors design radioactivity detection methods for four scenarios: closed-model, open-model, supervised setting, and unsupervised setting. In the experiment section, the authors present reliable detection under only 5% watermarked training texts, validating the effectiveness of their designed methods.
Strengths
- The paper is well-structured and easy to follow.
- The topic of this paper, i.e., the radioactivity of LLMs-generated text, is interesting.
- The paper conducts extensive experiments and provides an in-depth analysis.
Weaknesses
- Although the paper designs detection methods for four scenarios, no new methods are actually proposed. The methods in the paper still follow the watermark detection approach and are merely applied in different scenarios. In other words, the so-called new methods in the paper are only about how to construct scenarios to better detect radioactivity.
- The paper lacks exploration into the impact of the latest watermarking methods, especially multi-bit watermarks such as [1] and [2]. Multi-bit watermarks better align with the requirements of real-world scenarios.
- The overall contribution of the paper appears relatively weak, as some existing model protection methods (e.g., [3]) have already explored the impact of watermarking on radioactivity detection. Although the settings in [3] differ somewhat from those in your paper, similar results can still be obtained.
References
[1] Towards Codable Watermarking for Injecting Multi-bits Information to LLMs
[2] Advancing Beyond Identification: Multi-bitWatermark for Large Language Models via Position Allocation
[3] Protecting language generation models via invisible watermarking
Questions
See the Weaknesses part
Limitations
NA
We thank the reviewer for the feedback. We have addressed each point individually. We kindly invite the reviewer to refer to Appendix A ("Limitations") for details on weaknesses 1 and 2, and Appendix F ("Comparison to Active IP Protection Methods") for the comparison related to weakness 3. We also provide additional experiments to address weakness 2, and if it addresses some of the reviewer’s concern, we would greatly appreciate it if they could consider updating their rating to reflect this.
- Although the paper designs detection methods for four scenarios, no new methods are actually proposed. The methods in the paper still follow the watermark detection approach and are merely applied in different scenarios. In other words, the so-called new methods in the paper are only about how to construct scenarios to better detect radioactivity.
We indeed adapt existing watermark detection techniques to tracing data usage by other models. But our proposed methods are not only to "construct better scenarios to detect radioactivity”. We show that the naive approach does not work: filtering / deduplication is necessary to score millions of tokens and get reliable p-values (see Table 4 and Appendix D.2.2). Without this novelty, it is not possible to demonstrate watermark radioactivity, which is a main discovery of our paper.
Our methods are not trivial adaptations either: contrary to the classic detection setup analyzing if a given piece of text is watermarked, we analyze whether an LLM is contaminated. For instance we are the first to use the model itself to detect traces of the watermark (open-model scenario).
- The paper lacks exploration into the impact of the latest watermarking methods, especially multi-bit watermarks such as [1] and [2]. Multi-bit watermarks better align with the requirements of real-world scenarios.
We have addressed the selection of watermarking methods and their implications in Appendix A ("Limitations"). In summary, we focused on these two methods (Kirch., Aar) because a lot of watermarking methods build upon them (Fu et al. [2024a,b], Hu et al. [2023], Kirchenbauer et al. [2023b], Kuditipudi et al. [2023], Wu et al. [2023], Zhao et al. [2024], Yoo et al. [2024], …)
The specific case of multi-bit watermarking is indeed interesting, and we provide 2 elements of answer:
- We disagree with the claim that “multi-bit watermark better aligns with … real-world scenarios”: regulations (EU AI Act, California Act, BH Act, …) only require AI-generated content detection. Moreover, tracing users may be forbidden under the GDPR in Europe, for instance.
- our experiments cover the schemes of Kirchenbauer et al. and Aaronson et al. that have been extended to multi-bit (see Fernandez et al [2023]), so our detection test should transfer to multiple messages (corresponding to rolled versions of a secret key) as well.
Additionally, we provide new experiments with [2] “Advancing Beyond Identification: Multi-bit Watermark for Large Language Models via Position Allocation”, aka MPAC, which was mentioned in the review. Please refer to the PDF for the corresponding figures.
In these experiments, we adopt the same framework as in Sec. 5: "Radioactivity in Instruction Datasets". We generate watermarked instructions from A=Llama2-chat with a random binary message of size len_msg (more precisely, we take bits two by two to generate a message m = m_1, m_2, ..., m_k where m_i ∈ {0, 1, 2, 3}, corresponding to r=4 in MPAC and b = len_msg//2).
We then fine-tune B=Llama1 with instructions, ρ% of which are watermarked with the above method.
Finally, we detect its radioactivity in the supervised / closed-model setup, i.e., access to the data used for fine-tuning and no access to the model. We filter and deduplicate the tokens used in the prompts as explained in Sec. 4.2, par. “Token scoring and de-duplication.”
We plot in Fig. 1 (a) (ρ = 100% of watermarked fine-tuning data) and (b) (ρ = 10% of watermarked fine-tuning data) the bit accuracy against the number of scored tokens that we are able to obtain from the fine-tuned model. This is done for several lengths of the binary message. Furthermore, we provide the same curves in a control experiment where the key differs from the one used for training, to verify that the bit accuracy is approximately 0.5, as it should be under H0. We observe that the bit accuracy:
- is significantly higher for the model trained on watermarked instructions, and random otherwise
- is higher for smaller messages,
- is lower when ρ is lower,
- is higher the more tokens we score.
Note that every experiment is run 10 times for different text output by B, which explains the 95% confidence interval in the plots.
- The overall contribution of the paper appears relatively weak, as some existing model protection methods (e.g., [3]) have already explored the impact of watermarking on radioactivity detection. Although the settings in [3] differ somewhat from those in your paper, similar results can still be obtained.
We disagree: the related work section (2.1), Appendix A ("Limitations"), and Appendix F ("Comparison to Active IP Protection Methods") discuss [3] specifically. In their setting, p-values become meaningless when scoring a large volume of tokens. This is a key contribution of our paper, highlighting the limitations of existing IP protection methods in various scenarios where, to our knowledge, results similar to ours have not been obtained. Conversely, we demonstrate that established watermarking techniques, enhanced by our radioactivity detection methods, lead to reliable detection guarantees and are broadly applicable.
Dear reviewer, we hope our rebuttal and additional results on the multi-bit setup address your main concerns. If so, could you consider increasing your rating? If anything is unclear, please let us know so that we can clarify. Thank you.
The paper investigates the "radioactivity" of text generated by large language models (LLMs), focusing on the detectability of synthetic text used as training data. It introduces a novel method to reliably identify whether the outputs of a watermarked LLM have been employed to fine-tune another language model. The study reveals that watermarking in LLMs is radioactive, allowing for the detection of weak watermark signal residuals in fine-tuned models. The authors link the level of radioactivity to watermark robustness, its proportion in the training set, and the fine-tuning process. Notably, the research demonstrates that training on watermarked synthetic instructions can be detected with high confidence, even when a small percentage of the training text is watermarked. The paper contributes radioactivity detection methods for different scenarios based on model access and training data, showing how to obtain reliable p-values for watermark detection and proving the practicality of detecting radioactivity in realistic settings.
Strengths
This paper presents a novel study of the "radioactivity" of watermarked text generated by Large Language Models (LLMs), with a particular focus on the detectability of such text when used as training data for fine-tuning other models. The paper makes a significant contribution by designing radioactivity detection methods for various scenarios based on model access (open/closed) and training data exposure (supervised/unsupervised). Among other things, the authors effectively relate the level of radioactive contamination to key factors such as watermark robustness, the proportion of watermarked data in the training set, and the fine-tuning process, and they innovatively propose to utilize filtering and de-duplication for detection enhancement. The results show that even a small percentage of watermarked synthetic instructions in the training data (as low as 5%) can be detected with high confidence, with a p-value of less than 10^-5. The proposed method provides more reliable statistical guarantees for detecting whether LLM outputs are used in the training dataset than existing methods such as membership inference or active IP protection. The study also provides valuable insights on the effects of fine-tuning parameters and watermarking methods on radioactivity, which helps to deepen the understanding of the underlying mechanisms.
Weaknesses
- The writing lacks rigor in many places, with semantic or formatting errors. For example:
(1) Uniformity of punctuation: in Section 2.1, Related work (MIAs) should be formatted consistently with the context and is missing periods; similar missing periods appear in the second sentence of the response to Figure 4, ρ = 0% in Section 5.3, and in the two bolded subheadings in Appendix C.3.
(2) Uniformity of formatting: in the last question/answer pair in Figure 15 of Appendix H.7, the Context content is missing line breaks; in the main text, Section 2.2 has a formula numbered 1 and the formula in Appendix C.1 below is numbered 2, but the formula in the middle, under Appendix C, is missing its number; in Table 1, "closed" in "Open/closed" needs to be capitalized.
(3) Content flaws: in the first scenario of the Problem Formulation (Access to Bob's model) in Section 3, the tense of "open-sources" is not used properly; also in Section 3, for Definition 1, "B was not trained on D" contains an error in D according to the authors' definition of the data labeling; in Figure 2, the last sentence (after "and") lacks the necessary verb.
(4) Consistency of singular and plural: the authors should standardize singular and plural usage; for example, "text generated by **" is used in the abstract and some other places, while "texts generated by **" is used elsewhere, and the same phenomenon occurs with "data/datas".
- The watermark detection method used in this paper is only given for specific fine-tuning cases; can the method achieve the same anomaly detection effect if the model is fine-tuned in a more complex or covert way?
- In Section 5, the article emphasizes the consideration of a realistic scenario for watermark detection, but it does not discuss in detail how the "radioactivity" would be affected if someone attempted to remove the original watermark from the training data, which is likely to be more common in reality as well.
- The authors mention that they rely on proofs from previous papers, but they should give the necessary information to keep the paper logically coherent. For example, in Appendix C.1, the authors cite Kirchenbauer et al. for the design of LLM watermarking, mentioning that "The logit of every token in the greenlist is incremented by δ." However, δ is not defined beforehand; in the original paper, Kirchenbauer et al. propose the algorithm "Text Generation with Soft Red List", in which the parameter δ is used to modify the logits and obtain the probability distribution over words.
Questions
See weaknesses.
Limitations
None.
We thank the reviewer for the constructive comments and suggestions.
- The writing lacks rigor in many places, with semantic or formatting errors. For example: (1) Uniformity of punctuation: [...] for example, "text generated by **" is used in the abstract and some other places, while "texts generated by **" is used elsewhere, and the same phenomenon occurs with "data/datas".
We thank the reviewer for the careful consideration given to the manuscript. We agree with all of these points and will update the manuscript accordingly.
- The watermark detection method used in this paper is only given for specific fine-tuning cases, can the method achieve the same anomaly detection effect if the model is fine-tuned in a more complex or covert way?
We address additional fine-tuning scenarios in Section 6.1 and Table 6. While not exhaustive, these examples illustrate that increased data fit enhances model radioactivity. For further details, please see Appendix H (Additional Results), which includes ablations on "Bigger teachers," "Mixing instruction datasets from different sources," and "Radioactivity purification." We welcome suggestions from the reviewer on more "complex or covert way[s]" of fine-tuning.
- In Section 5, the article emphasizes the consideration of a realistic scenario for watermark detection, but it does not discuss in detail how this would affect the "radioactivity" if someone attempted to remove the original watermark from the training data, which is likely to be more common in reality as well.
We addressed this important point in Appendix A (Limitations), noting that radioactivity correlates with watermark robustness; attempts to paraphrase or alter the watermark will indeed weaken radioactivity. In Appendix H.3 ("Radioactivity Purification"), we observe that if Bob fine-tunes his LLM on non-watermarked data to deliberately remove traces of the watermark, the radioactivity decreases (but remains detectable), illustrating a similar scenario to what the reviewer described.
However we argue that compromising the quality of Alice's high-quality LLM outputs through paraphrasing may not be a "common” approach, as (1) it could degrade the fine-tuning benefits, and (2) Bob might just not know that Alice’s outputs are watermarked.
- The authors mention that they rely on proofs from previous papers, but they should give the necessary information to keep the paper logically coherent. For example, in Appendix C.1, the authors cite Kirchenbauer et al. for the design of LLM watermarking, mentioning that "The logit of every token in the greenlist is incremented by δ." However, δ is not defined beforehand; in the original paper, Kirchenbauer et al. propose the algorithm "Text Generation with Soft Red List", in which the parameter δ is used to modify the logits and obtain the probability distribution over words.
We will add a definition of δ to the manuscript in Section 2.2: "For instance, Kirchenbauer et al. [2023b] create a "greenlist" of tokens whose logits are augmented by a quantity δ, increasing their sampling probability." We thank the reviewer for pointing this out.
We thank all reviewers for their insightful comments and suggestions. We address two main weaknesses that emerged from the reviews:
Radioactivity is only demonstrated for some LLM watermarking schemes.
- We focus on LLM watermarking schemes designed for AI-generated text detection, which is a zero-bit watermarking problem, not multi-bit. We focus on two foundational methods (Kirchenbauer et al., Aaronson et al.) because:
- (a) most watermarking approaches are based on these (Fu et al. [2024a,b], Hu et al. [2023], Kirchenbauer et al. [2023b], Kuditipudi et al. [2023], Wu et al. [2023], Zhao et al. [2024], Yoo et al. [2024], …). The paper's objective is to demonstrate that some very commonly used LLM watermarking schemes are radioactive, not all of them.
- (b) they are the only ones, to our knowledge, to provide theoretical guarantees on p-values and therefore to allow for very low FPRs. We answer all reviewers individually, but also refer to Appendix A "Limitations" and Appendix C.3 "Does the radioactivity generalize to other watermarking schemes?", where we justify our choices in more detail.
- We provide additional results on a multi-bit scenario with the method of Yoo et al., as suggested by reviewer q8YZ. Please refer to the pdf for the figures and to the rebuttal to q8YZ for more details.
Reference: [Yoo, KiYoon, Wonhyuk Ahn, and Nojun Kwak. "Advancing beyond identification: Multi-bit watermark for language models." arXiv preprint arXiv:2308.00221 (2023).]
The detection of radioactivity seems heuristic.
The detection consists of 2 components:
- (a) filtering / deduplication and score computation
- (b) p-value computation
While (a) involves heuristic steps, (b) is proven to be theoretically sound and experimentally validated. A core strength of our study is that our p-values are not heuristic. The paragraph "Influence of the de-duplication on the correctness of radioactivity tests" in Section 5.4 and Appendix D.2 "Correctness" confirm the proper distribution of our p-values under the null hypothesis. This is explained in more detail to reviewer JVYy.
We hope this rebuttal clarifies any ambiguities, and we will ensure to incorporate the essential points in the main text.
This submission analyzes several watermarking schemes for LLM outputs and investigates how these schemes would impact the traceability when later LLMs are trained on watermarked text.
Overall, reviewers were interested in the submission and we agree that this study contains a number of interesting observations. Initially, concerns were raised about the validity of the hypothesis tests done in this work. Yet, after an extended discussion between the reviewers, all reviewers were ultimately convinced that the hypothesis test proposed in this work is indeed sound.
I do think the authors could have avoided this problem by defining their null hypothesis directly as a statement about the final distribution instead of defining it in terms of the general, biased statistic first, and then defining "filters". I would ask the authors to synthesize the results of the discussion here and in Appendix D.2 and rewrite the main body to clarify this point.
Further, there are a number of other outstanding improvements to the paper that the reviewers outlined, such as improvements to the abstract (ZoLZ), and changes to improve writing quality (Y2ta,JVYy), that I urge the authors to include, to make sure that the paper is not just accepted, but actually well-received and read by the community at large.