PaperHub
Rating: 8.2 / 10 · Spotlight · 4 reviewers
Individual ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 4.0
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

OpenReview · PDF
Submitted: 2025-04-06 · Updated: 2025-10-29
TL;DR

We propose SEEK, a novel watermarking method for large language models that simultaneously improves robustness against both scrubbing and spoofing attacks, achieving a Pareto-optimal balance superior to existing approaches.

Abstract

Keywords

machine learning security, llm watermarking

Reviews and Discussion

Official Review
Rating: 5

The paper addresses the vulnerability of watermarking techniques used to protect large language models (LLMs) from misuse. Traditional watermarking methods face challenges from scrubbing and spoofing attacks, which can undermine their effectiveness. The authors introduce a novel mechanism called equivalent texture keys, which allows multiple tokens within a watermark window to independently support watermark detection. This innovation leads to a new watermarking scheme called Sub-vocabulary decomposed Equivalent tExture Key (SEEK).

In summary, the paper's contributions include:

  • Introducing the equivalent texture key mechanism to enhance watermark robustness.
  • Proposing SEEK, a new watermarking scheme that balances scrubbing and spoofing defenses while maintaining text quality.
  • Demonstrating through experiments that SEEK offers a Pareto improvement over prior watermarking schemes.

Strengths and Weaknesses

Strengths

  • Proposes a novel mechanism (equivalent texture keys) and the SEEK scheme to overcome the inherent scrubbing-spoofing robustness trade-off in LLM watermarking. Achieves Pareto improvement by enabling large watermark windows (spoofing resistance) while maintaining scrubbing robustness via hash-space optimization.
  • Demonstrates state-of-the-art results across 5+ datasets (WikiText, C4, Dolly-CW, MMW):
    • +82–92% spoofing robustness (FPR reduction)
    • +6–24% scrubbing robustness (TP@1% gain)
    • Maintains text quality (perplexity/log-diversity ≈ baselines).
  • Formally proves key properties: hash collision probability bounds (Prop 4.1); that the expected number of tokens an attacker must erase for watermark removal increases as the hash space shrinks (Prop 4.2); and robustness guarantees under adversarial modifications (Appendix D).

Weakness

  • While the paper shows strong performance across various datasets, it could benefit from a more detailed discussion on how the proposed method generalizes to different types of language models and real-world scenarios. Additional experiments on diverse models and tasks could strengthen this aspect.
  • SEEK’s sub-vocabulary decomposition assumes uniform token distribution. Performance may degrade for low-entropy/specialized domains (e.g., medical/legal jargon) where sub-vocabulary imbalance could affect watermark stability.
  • Experiments only compare KGW variants (e.g., KGW-Min/Left/Sum) and lack comparisons with other methods, e.g., SIR [1].
  • The paper lacks an ablation study: there is no component-wise analysis disentangling the contributions of the equivalent texture keys (hash-space reduction) and the sub-vocabulary decomposition, which obscures whether both innovations are essential for the Pareto gains.

[1] Liu, Aiwei, et al. "A Semantic Invariant Robust Watermark for Large Language Models." 2024. https://arxiv.org/abs/2310.06356.

Questions

  • While Figure 2 shows spoofing robustness improves with larger windows for KGW, Table 2 reports SEEK results only for (h=6, d=6) and (h=16, d=16). How does SEEK's scrubbing/spoofing trade-off specifically vary with window size?
  • Table 6 tests SEEK on 7B–13B models, showing consistent gains. Is SEEK's improvement consistent across scales (e.g., <1B or >30B models)?
  • The paper only compares against KGW-family methods. Yet recent works like UNIGRAM (fixed green list), SemStamp (semantic hashing), and CATER (contextual watermarks) also address spoofing/scrubbing trade-offs. Do you compare with these methods?

Limitations

yes

Final Rating Justification

This paper features a novel token-level watermarking method with clear theory and rich experiments, showing solid performance against attacks. The rebuttal effectively addressed baseline selection rationale and added supplementary experiments on domain generalization and model scalability, resolving all key concerns. No significant unresolved issues remain.

Formatting Issues

No formatting issues in this paper

Author Response

To Reviewer CfYA

W1&W2: Performance may degrade for low-entropy/specialized domains

We greatly appreciate the reviewer's insightful question regarding the structural assumptions and generalizability. To systematically assess the impact of this design on text generation quality across different domains (especially specialized fields), we evaluated our method against the KGW-Min baseline using the LLAMA2-7B model. The evaluation was performed on various datasets, including general, news, biomedical, legal, and financial texts. We compared the PPL (using LLAMA2-7B as the oracle model, which represents the original unwatermarked generation logic), as shown in the table below:

| Dataset | Task | KGW-Min (delta=2) | SEEK (delta=2) | KGW-Min (delta=4) | SEEK (delta=4) |
|---|---|---|---|---|---|
| WikiText-103 | General | 6.21 | 5.88 | 13.78 | 13.33 |
| MMW | News | 8.72 | 8.53 | 18.24 | 15.75 |
| C4 | Long-form Text | 7.44 | 8.38 | 19.81 | 16.62 |
| PubMed | Biomedical | 8.44 | 7.54 | 17.02 | 16.71 |
| FreeLAW | Legal | 7.55 | 6.53 | 17.25 | 14.98 |
| Financial-News | Finance | 6.33 | 6.27 | 18.29 | 15.63 |

In response to the reviewer’s concern, we also conducted experiments on well-established low-entropy text datasets, specifically for code generation tasks (HumanEval, MBPP), using the StarCoder2-7B model. The results are summarized below:

| Dataset | No-watermark | KGW-Min | SEEK |
|---|---|---|---|
| HumanEval | Pass@1: 33.4 | Pass@1: 27.4; AUROC: 88.1 | Pass@1: 28.1; AUROC: 89.3 |
| MBPP | Pass@1: 37.5 | Pass@1: 33.9; AUROC: 80.3 | Pass@1: 34.7; AUROC: 82.7 |

By consistently outperforming or matching the baseline across the vast majority of domains, these results provide strong empirical evidence that our approach generalizes effectively without compromising semantic coherence.

Q1: How the robustness trade-off varies with window size.

We thank the reviewer for this important and practical question. Below, we present our parameter discussion on how the SEEK trade-off varies with window size.

Our core design principle is as follows: the difficulty of spoofing attacks increases exponentially with h, while scrubbing (erasure) robustness is positively correlated with the expected hash collision rate (as detailed in Propositions 4.1/4.2). We therefore use a sufficiently large h to resist spoofing attacks, and adjust d to achieve the desired hash collision probability, thereby enhancing scrubbing robustness. Since the UNIGRAM scheme represents an extreme case of our method, we explicitly advise against such extreme configurations. The paper's configuration of (h=6, d=6) is a choice based on this principle.
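For intuition only (this is our illustrative calculation, not the paper's Proposition 4.1), assume the h hash values inside a window are i.i.d. uniform over d buckets. A standard birthday-style argument then gives

$$
\Pr[\text{at least one collision in the window}] = 1 - \prod_{i=1}^{h-1}\left(1 - \frac{i}{d}\right),
\qquad
\mathbb{E}[\text{colliding pairs}] = \binom{h}{2}\frac{1}{d}.
$$

With h = 6 this probability is $1 - 6!/6^6 \approx 0.985$ for d = 6 but only about 0.21 for d = 64, which illustrates why shrinking the hash space makes equivalent keys within a window far more likely.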

First, the primary role of increasing the window size h is to exponentially increase the statistical complexity for spoofing attacks. A larger h forces an attacker to learn a vastly larger mapping from context to output, dramatically raising the cost of an attack. As the table below illustrates, the number of probability vectors required to reproduce a statistical attack grows explosively with h. When reproducing a statistical attack with h > 8 in our experiments, the number of required probability vectors exceeds 180 million, surpassing the cache capacity of our current hardware.

| Window size | h=3 | h=4 | h=5 | h=6 | h=8 |
|---|---|---|---|---|---|
| Num. of probability vectors | 7M | 12M | 36M | 66M | 180M |


We conducted additional experiments and employed the LLAMA2-7B model to generate text sequences of at least 200 tokens on C4-Eval, applied a scrubbing attack using GPT (TP@1%), and conducted a spoofing attack using statistical methods (FPR@0.1%).

| Attack / Param. | h=6, d=4 | h=6, d=6 | h=6, d=8 | h=8, d=6 | h=8, d=8 | h=8, d=10 |
|---|---|---|---|---|---|---|
| Scrubbing (TP@1%) | 85.9 | 84.3 | 81.4 | 85.7 | 83.9 | 80.7 |
| Spoofing (FPR@0.1%) | 0.14 | 0.13 | 0.16 | 0.08 | 0.10 | 0.08 |

Future work will include further parameter ablation studies and attempts to mathematically prove the optimal solution under attacks.

The results provide evidence that the number of probability vectors an attacker must learn expands dramatically with h, rendering large-scale statistical spoofing attacks computationally infeasible.

Q2: Generalization Across Model Scales.

To empirically validate this, we conducted a new set of experiments across three models of varying scales (0.5B, 7B, and 32B). We evaluated robustness against ChatGPT attacks. The results (AUROC / TP@1% / TP@5% ) are summarized below:

| Dataset / Model | Qwen2.5-0.5B | Qwen2.5-7B | Qwen2.5-32B |
|---|---|---|---|
| KGW-Min (h=4) | 90.7 / 79.5 / 88.9 | 92.3 / 83.4 / 91.4 | 94.7 / 81.8 / 90.5 |
| SEEK | 92.3 / 83.4 / 90.6 | 95.7 / 90.1 / 94.3 | 95.9 / 89.1 / 96.3 |

As the table demonstrates, SEEK exhibits strong and consistent robustness across the 0.5B, 7B, and 32B models, confirming its scalability. The slightly lower performance on the 0.5B model is attributed to the lower intrinsic coherence of its generated text. This content is more susceptible to substantial rewriting when subjected to a scrubbing attack by a vastly more powerful model (ChatGPT, >175B), which can more easily disrupt the watermark's structure.

W3 & Q3: Comparison with Related Work.

We sincerely thank the reviewer for this insightful question regarding the scope of our baselines. Our work is specifically focused on addressing the scrubbing-spoofing trade-off within the dominant paradigm of token-level watermarking. We therefore prioritized direct comparisons with the most representative methods in this category, namely the KGW-family. Here we discuss other related excellent works you mentioned:

  1. UNIGRAM: We agree on UNIGRAM's relevance and have already included it as a baseline in our analysis (see Figure 1). UNIGRAM can be considered a special case of KGW (h=0). However, it completely sacrifices spoofing resistance for scrubbing robustness and has been shown to be highly vulnerable to spoofing attacks in multiple studies [1, 15, 17, 23].

  2. CATER: This method represents an earlier traditional technical paradigm based on deterministic, rule-based substitution. It relies on predefined synonym rules and linguistic triggers rather than modifying the model's probabilistic output, as KGW-family methods do. Its primary threat model involves an adversary reverse-engineering and removing these rules, and it does not address spoofing attacks.

  3. SemStamp/SIR (Sentence-Level Methods): Methods like SemStamp belong to another line of research: sentence-level watermarking. We did not select them as direct competitors for several key reasons:

    • Additional Overhead: They typically require an external encoder for both generation and detection, introducing significant computational costs.
    • Low Statistical Efficiency: Operating at the sentence level, they require impractically long texts to achieve acceptable detection performance (e.g., SemStamp has an FPR of ~1%), and their z-score p-value approximation is only accurate with a large number of sentences, a condition rarely met in practice.
    • Vulnerability to Fragment Attacks: They are not robust against paragraph-level manipulations like copy-paste attacks (as shown in our Table 8).
    • The state-of-the-art spoofing attacks [17] used in our evaluation are specifically designed for token-level schemes.

    Therefore, while the methods you mentioned are all significant contributions, their different technical paradigms and threat models place them outside the scope of our direct comparative analysis. We will expand our Related Work section in the final manuscript to discuss these distinctions more thoroughly and better situate our contribution.

W4: Ablation Study.

We sincerely appreciate the reviewer's valuable suggestion. Although ablation analysis is not presented as a standalone section, the core structure of our paper is inherently organized as a step-by-step ablation study. We begin with the standard KGW method and initially introduce the "equivalent texture key" mechanism by reducing the hash space. The experimental results in Table 1 and Figure 2 demonstrate the effectiveness of this component in improving the trade-off. Next, we discuss the potential compromise in text generation quality introduced by this component. To address this issue, we introduce the "sub-vocabulary decomposition," which culminates in the final SEEK method. The results in Figure 3 further validate that this component successfully restores text quality.

Thank you for guiding us toward a clearer and more polished presentation!

Comment

Thank you for your detailed response, which has effectively addressed the questions and concerns I raised earlier.

The supplementary experiments on low-entropy/specialized domains and models of varying scales provide strong evidence for the generalizability of your method. Please integrate these results into the relevant sections of the paper to allow readers to gain a comprehensive understanding of the method's performance.

Regarding the related works mentioned in W3&Q3, I appreciate your explanation of why they were not included as direct baselines. Given their significance in the watermarking field, it would be beneficial to introduce them in the "Related Work" section, clearly elaborating on the differences from your method in terms of technical paradigms, threat models, etc. This will help better contextualize your research contributions.

Overall, your rebuttal has sufficiently resolved my concerns, and I will increase my score. Please ensure that these revisions are incorporated into the final version of the paper.

Comment

Thank you again for your valuable and constructive feedback. We will incorporate all of these revisions into the next version of the paper.

Official Review
Rating: 5

The paper introduces a novel watermarking algorithm designed to be robust against both spoofing and scrubbing attacks. The core idea leverages two insights: increasing the context window enhances resistance against spoofing attacks, while reducing the cardinality of the hash space used to generate the watermark key enhances robustness against scrubbing attacks. Consequently, the proposed algorithm, SEEK, minimizes the output space of the hashing function while maintaining a large context window size for watermark key construction. Additionally, SEEK adapts the method for partitioning vocabulary into green and red sets based on this refined watermark key construction. Extensive evaluations demonstrate SEEK’s robustness against both spoofing and scrubbing attacks.
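For readers less familiar with the KGW family that SEEK builds on, the following is a minimal, illustrative Python sketch of the green-list detection statistic used by such schemes; the `is_green` predicate (a keyed hash of the watermark window plus the candidate token) is a placeholder, not the paper's actual implementation.

```python
import math
from typing import Callable, List


def detect_zscore(tokens: List[int],
                  green_fraction: float,
                  is_green: Callable[[tuple, int], bool],
                  window: int) -> float:
    """KGW-style detection: count tokens whose green-list membership
    (derived from a keyed hash of the preceding `window` tokens) holds,
    then compute a z-score against the unwatermarked null hypothesis."""
    hits, total = 0, 0
    for i in range(window, len(tokens)):
        context = tuple(tokens[i - window:i])  # the watermark window
        if is_green(context, tokens[i]):
            hits += 1
        total += 1
    if total == 0:
        return 0.0
    expected = green_fraction * total
    std = math.sqrt(total * green_fraction * (1.0 - green_fraction))
    return (hits - expected) / std
```

Under the null hypothesis the hit count is approximately Binomial(total, green_fraction), so a large z-score flags watermarked text; scrubbing attacks try to push this statistic below the detection threshold, while spoofing attacks try to push it above the threshold on attacker-crafted text.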

Strengths and Weaknesses

Strengths

  1. The algorithm effectively reduces the trade-off between robustness against spoofing attacks and scrubbing attacks, which is a common limitation in existing watermarking algorithms.
  2. The paper is clearly written and easy to follow. Figure 4, in particular, provides an excellent overview of the differences between SEEK and KGW.
  3. The motivation behind the algorithm, specifically how the window size and hash space cardinality impact watermark robustness against spoofing and scrubbing attacks, is clearly articulated. This analysis highlights the potential reasons for the observed trade-offs.

Weaknesses

  1. The evaluation of text quality has significant limitations, resulting in an unclear assessment of SEEK's actual impact on generated text quality. Additional experimental results are required to conclusively demonstrate that SEEK-generated text remains usable and maintains high quality.
  2. The paper claims reduced hash space cardinality improves robustness against scrubbing attacks without affecting spoofing resistance. However, an implicit trade-off exists and should be explicitly analyzed.
  3. SEEK may incur increased computational overhead during text generation, potentially impacting its practical application, particularly with larger vocabulary sets. This concern should be explicitly addressed in the evaluation to clarify the practical limitations.

Questions

  1. In Algorithm 1, Line 5, how is the vocabulary partitioned into sub-vocabularies? Does the partition depend solely on $\zeta$? Please explicitly clarify this in Line 221 and Algorithm 1, Line 5.

  2. The paper claims, "reducing the hash space size does not influence the watermark window size, thereby maintaining similar spoofing resistance." However, there seems to be an implicit relationship between the window size ($h$) and the cardinality of the hash space ($d$). My question: With a fixed hash space cardinality ($d$), isn't there a limit to how large or small the window size ($h$) can practically be?

    1. If $h$ is significantly larger than $d$, it becomes likely that all hash values appear within the window, making SEEK behave like a unigram watermark (assuming sub-vocabularies partitioned solely by $\zeta$). For example, with $d=6$ (as in the paper), if $h=30$, the probability of encountering all hash values is already greater than 0.96. Could you verify and discuss this explicitly?
    2. Conversely, if $h$ is much smaller than $d$, collisions become less likely (as indicated by Proposition 4.1), potentially decreasing robustness against scrubbing attacks. Given these points, isn't there still an explicit trade-off between robustness against spoofing and scrubbing attacks by varying $h$ for a fixed $d$? If this reasoning is correct, please explicitly state it in the paper and provide evaluations of SEEK against spoofing and scrubbing attacks for different combinations of $h$ and $d$.
  3. There are significant concerns about the text quality evaluation:

    1. The P-SP scores reported in Figure 6 are very low (<0.4). Previous literature (e.g., [29]) indicates that a score above ~0.7 is necessary to maintain linguistic fidelity. Thus, current results do not support the claim of preserving linguistic fidelity. Clarify or provide new evidence supporting your claims.
    2. The paper uses OPT-2.7B as the oracle model to evaluate LLaMA2-13B-generated text. Typically, the oracle model should have higher quality than the target model (e.g., [28], [31]). Using OPT-2.7B might bias results unfairly against LLaMA2-13B. Clarify this and justify the oracle model choice. Given this, Figure 6, as presented, inadequately demonstrates SEEK’s true impact on text quality compared to KGW. Proper and additional evaluations or metrics are necessary to substantiate your claims clearly.
  4. Although SEEK aims to increase vocabulary diversity, log-diversity metrics are missing. Include these results explicitly.

  5. The paper mainly evaluates the case where $\delta=5.0$ and $\gamma=0.25$. These settings inherently cause significant performance drops (as indicated in Figure 6) and simplify spoofing attacks. Include evaluations using more practical values of $\delta$ that maintain acceptable text quality.

  6. What is the time complexity of text generation for SEEK compared to KGW? How does complexity scale with increased vocabulary size, given the necessity of partitioning vocabularies and assigning different keys? Provide empirical timing results explicitly for SEEK and KGW.

Limitations

yes

Final Rating Justification

  1. The paper initially misrepresented the trade-off that SEEK addresses. SEEK cannot be simultaneously highly robust against both spoofing and scrubbing attacks—there is an inherent trade-off. The authors acknowledged this during the rebuttal and have committed to revising the relevant parts of the paper.

  2. There were some issues with the text quality evaluation setup. The authors have provided new results and committed to updating the paper accordingly.

With these changes, I believe my main concerns have been addressed.

Formatting Issues

No concerns.

Author Response

To Reviewer kvdR

We sincerely thank you for your detailed and constructive feedback. Below, we address your questions and weaknesses.

Q1: How is the vocabulary partitioned into sub-vocabularies?

Sorry for the confusion. The sub-vocabulary partitioning depends on the key ξ, ensuring that, without the key, an attacker cannot predict the sub-vocabulary of any given token. To avoid potential ambiguity, we will provide further clarification in both the main text and the algorithm in a later version.
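For illustration only, one way such a keyed partition could be realized is sketched below; the function name and hashing choice are our own hypothetical example rather than the authors' actual construction.

```python
import hashlib


def sub_vocab_index(token_id: int, key: bytes, d: int) -> int:
    """Assign a token to one of d sub-vocabularies via a keyed hash,
    so the partition looks pseudo-random to anyone without the key."""
    digest = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:8], "big") % d
```

Any construction of this form makes the sub-vocabulary of a given token unpredictable without the secret key, which is the property described above.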

W2&Q2: An implicit trade-off exists and should be explicitly analyzed.

Thank you for pointing this out. We would first like to clarify that the trade-off between robustness to scrubbing and spoofing in our work reflects movement along a Pareto frontier.

We fully acknowledge that such a trade-off also exists within SEEK, and we do not claim otherwise in our paper. However, we would like to re-emphasize our key contribution: SEEK breaks the existing trade-offs observed in the KGW-family, which are widely believed to be governed by window size, and expands the boundary of the previous Pareto frontier.

Notably, SEEK with d=6,h=6 achieves strong performance on both fronts (about 0.98 AUROC against ChatGPT-scrubbing and about 0.1 FPR@0.1% against spoofing), which demonstrates that it is practically viable for real-world deployment scenarios.

As with any method, it is important to avoid arbitrary hyperparameter choices—in SEEK’s case, the parameters d and h play a critical role. The difficulty of spoofing attacks increases exponentially with h, while scrubbing robustness is positively correlated with the expected collision rate (as detailed in Proposition 4.1/4.2). Our core design principle is: use a sufficiently large h to resist spoofing attacks, while adjusting d to achieve the desired hash collision probability, thereby enhancing scrubbing robustness.

The perspective you raised in Q2-1 is correct. As we mentioned in the paper, UNIGRAM is an extreme case of our method. When the difference between d and h becomes too large (like d << h or d >> h), SEEK will degenerate into UNIGRAM. Therefore, we explicitly advise against such extreme configurations.
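As a quick numerical check of the degeneration point raised in Q2-1 (assuming, purely for illustration, that the h hash values in a window are i.i.d. uniform over d buckets), the probability that all d values appear in a window of size h follows from inclusion-exclusion; the sketch below reproduces the reviewer's figure of roughly 0.97 for d = 6, h = 30.

```python
from math import comb


def prob_all_values_seen(d: int, h: int) -> float:
    """Inclusion-exclusion (coupon-collector style): probability that all d
    hash values occur among h i.i.d. uniform draws."""
    return sum((-1) ** k * comb(d, k) * ((d - k) / d) ** h for k in range(d + 1))


print(round(prob_all_values_seen(6, 30), 3))  # ~0.975, consistent with Q2-1
```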

In practice, we generally offer the following recommendations for parameter settings:

  1. Avoid Extreme Parameter Configurations: We do not recommend using excessively disparate h and d values, as this causes the SEEK design to degrade to h=0, losing its advantage in balancing both robustness types. A d ≈ h configuration is generally a more balanced starting point.
  2. Prefer Larger h Values: A larger watermark window h exponentially increases the difficulty of statistical spoofing attacks. We recommend deploying the watermarking scheme with a larger h to significantly raise the cost for potential attackers.

We conducted additional experiments and employed the LLAMA2-7B model to generate text sequences of at least 200 tokens on C4-Eval, applied a scrubbing attack using GPT (TP@1%), and conducted a spoofing attack using statistical methods (FPR@0.1%).

| Attack / Param. | h=6, d=4 | h=6, d=6 | h=6, d=8 | h=8, d=6 | h=8, d=8 | h=8, d=10 |
|---|---|---|---|---|---|---|
| Scrubbing (TP@1%) | 85.9 | 84.3 | 81.4 | 85.7 | 83.9 | 80.7 |
| Spoofing (FPR@0.1%) | 0.14 | 0.13 | 0.16 | 0.08 | 0.10 | 0.08 |

The default parameter configuration of (h=6, d=6) mentioned in the paper is a reasonable experimental setting based on this principle. The table below demonstrates its generalizability across various datasets.

| Dataset | MMW | PubMed | FreeLAW | Financial |
|---|---|---|---|---|
| KGW-Min | 73.5 / 0.95 | 80.1 / 0.83 | 75.8 / 0.77 | 83.5 / 0.70 |
| SEEK | 80.5 / 0.18 | 83.5 / 0.10 | 80.1 / 0.13 | 85.7 / 0.18 |

With the current parameter settings, we achieve a superior Pareto frontier by delivering a much better balance of robustness than the baseline. Future work will include further parameter ablation studies and attempts to mathematically prove the optimal solution under attacks.

We hope this detailed analysis and the additional experiments have addressed your concerns and clarified the contributions of our work.

W3 & Q6: Time Cost of Watermarking.

For watermarked text generation, methods that modify logits directly (e.g., SEEK, KGW, SWEET) are highly efficient; the primary time cost of watermark deployment still arises from the forward propagation of the large language model itself. To validate this, we conducted empirical timing tests in the following environment:

  • Hardware: NVIDIA A40-48GB
  • Language Model: LLAMA2-7B
  • Configuration: batch_size=1, temperature=1.0, torch_seed=42, num_beams=1, gamma=0.25
  • Data: Averaged over 500 prompts from the C4 dataset
| Method | KGW-Min | SEEK (d=6) | SEEK (d=8) | SEEK (d=10) | SEEK (d=16) |
|---|---|---|---|---|---|
| Generation | 16.6561 s | 16.8245 s | 16.7202 s | 16.7445 s | 16.7855 s |
| Detection | 44.13 ms | 43.38 ms | 45.54 ms | 46.66 ms | 49.26 ms |
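For context on why the overhead is so small, the sketch below (ours, not the authors' code) shows how a logits-modifying watermark in the KGW/SEEK family slots into a standard sampling step: the only extra per-step work is computing the green set from the keyed context hash and adding a constant bias delta to those logits, which is negligible next to the model's forward pass.

```python
import torch


def apply_watermark_bias(logits: torch.Tensor,
                         green_ids: torch.Tensor,
                         delta: float) -> torch.Tensor:
    """Add a constant bias to the logits of green-list tokens; this is the
    only per-step cost a logits-modifying watermark adds on top of the
    model's own forward pass."""
    biased = logits.clone()
    biased[green_ids] += delta
    return biased


# Toy example with a 10-token vocabulary; in practice `green_ids` would be
# derived from a keyed hash of the preceding watermark window.
logits = torch.randn(10)
green_ids = torch.tensor([1, 4, 7])
probs = torch.softmax(apply_watermark_bias(logits, green_ids, delta=2.0), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```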

Q3-1: Clarification on P-SP Scores.

We sincerely appreciate your insightful question regarding the P-SP scores. We would like to clarify a key distinction regarding the context in which the P-SP scores are evaluated.

In the referenced work [29], the threshold "P-SP > 0.7" is applied in their work to evaluate the quality of human scrubbing attacks. Specifically, it compares (watermarked text) vs. (human-scrubbed text) to select more meaning-preserving attack samples, ensuring that the attack is high-precision.

In contrast, our experiment in Figure 6 is designed to measure whether there is a significant gap in the semantic deviation introduced by different watermarking schemes, which we assess by comparing (watermarked text) vs. (unwatermarked text). In fact, the study in [29] also conducts a similar experiment, as shown in Figure 11 [29], comparing P-SP scores between watermarked and non-watermarked texts. The results indicate that the P-SP scores are similarly well below 0.7 (typically <0.5), and decrease further with increasing watermark strength (z-score). Additionally, in Section A5 of [29], it is stated that "We note that the drift between human text and unwatermarked text can be estimated as just under 0.50 P-SP”. Even considering high-quality human-generated text, the P-SP score between such text and unwatermarked text is still below 0.7. Therefore, it is reasonable that watermarked texts cannot reach the threshold, and the P-SP scores reported in Figure 6 of our paper fall within a reasonable range.

W1&Q3-2: Concerns about the text quality evaluation.

Our choice of OPT-2.7B as the oracle model follows the experimental setup established in seminal prior work, such as KGW [28] (see Section 6) and WaterMax [16]. As noted in [16], “while absolute perplexity values may differ depending on the oracle model, the relative performance between different watermarking algorithms remains consistent. Using a common oracle like OPT-2.7B is therefore recommended to ensure comparability with state-of-the-art methods”.

To fully address your concern and provide even stronger evidence, we have added a new evaluation to our generalizability study, comparing our method against KGW-Min using the LLAMA2-7B model to systematically assess PPL across different domains (using the base model as its own oracle provides the most direct measure of distributional shift), as shown in the table below:

| Dataset | Task | KGW-Min (delta=2) | SEEK (delta=2) | KGW-Min (delta=4) | SEEK (delta=4) |
|---|---|---|---|---|---|
| WikiText-103 | General | 6.21 | 5.88 | 13.78 | 13.33 |
| MMW | News | 8.72 | 8.53 | 18.24 | 15.75 |
| C4 | Long-form Text | 7.44 | 8.38 | 19.81 | 16.62 |
| PubMed | Biomedical | 8.44 | 7.54 | 17.02 | 16.71 |
| FreeLAW | Legal | 7.55 | 6.53 | 17.25 | 14.98 |
| Financial-News | Finance | 6.33 | 6.27 | 18.29 | 15.63 |

In response to your concern, we also conducted experiments on well-established low-entropy text datasets like code generation tasks (HumanEval, MBPP), using the StarCoder2-7B model. The results are summarized below:

| Dataset | No-watermark | KGW-Min | SEEK |
|---|---|---|---|
| HumanEval | Pass@1: 33.4 | Pass@1: 27.4; AUROC: 88.1 | Pass@1: 28.1; AUROC: 89.3 |
| MBPP | Pass@1: 37.5 | Pass@1: 33.9; AUROC: 80.3 | Pass@1: 34.7; AUROC: 82.7 |

By consistently outperforming or matching the baseline across the vast majority of domains, these results provide strong empirical evidence that our approach generalizes effectively without compromising semantic coherence.

W4: Diversity Metric in SEEK.

Thank you for your question. The diversity metric for SEEK is already included in Figure 3 of the paper.

W5: Delta Parameter Settings.

We selected a larger value for δ in line with the parameter settings used by [26] for more challenging spoofing attacks on the victim’s watermark model M. As you correctly noted, a larger δ generates a stronger watermark signal, which creates a more favorable condition for the attacker, who can then learn the watermark pattern with fewer queries. We adopted this setting, which means our defense (the SEEK method) was evaluated under a more challenging and stringent scenario.

We agree that such a delta might not be the first choice in a practical deployment where text quality is the priority. However, it still ensures fairness in experimental comparisons as long as delta is kept consistent across all methods, and all experimental conclusions drawn remain valid.

As mentioned in our previous responses, we have already provided evaluations of text quality (PPL) for the more practical settings of δ=2 and δ=4.

If our responses address your concerns, we would be grateful if you would consider raising your final rating to a higher score.

Comment

For W2 & Q2:

Please clarify this point in the paper: the improvement in breaking the trade-off is substantially stronger than what is currently reflected. To better illustrate this, consider adding a Pareto frontier figure, which would help readers understand the exact trade-off being addressed.

Additionally, please emphasize that the hyper-parameters h and d should be chosen dependently. The paper claims that increasing h improves robustness against spoofing, and decreasing d strengthens resistance to scrubbing attacks. A naive configuration using a large h and small d simultaneously may seem optimal. However, as I pointed out in my previous review—and as acknowledged in your rebuttal—this independent tuning breaks the validity of your robustness claims. It's crucial to reinforce that h and d must be set in coordination for the theoretical guarantees to hold.

For W1 & Q3-2:

It is not convincing to justify your evaluation setup by stating that KGW [28] also uses OPT-2.7B. In [28], OPT-2.7B is used to evaluate generations from OPT-1.3B, whereas in your case, you evaluate LLaMA2-13B-generated text using OPT-2.7B, which is not a directly comparable setup.

I appreciate the additional results you've included. Please make sure to include them in the final version of the paper.

For W4:

In the paper, you state: “As illustrated in Figure 3, we observe a clear degradation in generation quality with the decrease of cardinality d.”

However, it is unclear:

  • Was Figure 3 computed with or without the sub-vocabulary decomposition?
  • What dataset was used to produce this result?
  • What temperature setting was applied during generation?
  • Is this degradation consistently observed across all datasets and experimental settings evaluated in the paper?
Comment

We sincerely thank the reviewer for the feedback and suggestions. Your recommendations are crucial for improving the clarity and rigor of our paper, and we are committed to incorporating your guidance into the final version.

W2 & Q2

We will amend the experiment setting section to explicitly state that h and d should be chosen dependently and give clear guidance for parameter selection to prevent degeneration.

W1 & Q3-2

Thank you for the suggestion. We will replace the evaluation based on OPT-2.7B with the results provided in our previous response (using the base non-watermarked model as the oracle).

W4

  1. In Figure 3, KGW-Min (with small d) does not employ decomposition, whereas SEEK employs the sub-vocabulary decomposition.
  2. This analysis is conducted on the C4 dataset.
  3. The temperature setting is 1.0.
  4. Yes. For the KGW-Min scheme, the degradation in generation quality as d decreases is a theoretically expected outcome, since its diversity is upper-bounded by O(d). To further illustrate this, we have conducted experiments on the WikiText and LFQA datasets with LLAMA-7B.

| Dataset | UNIGRAM (d=1) | KGW (d=4) | KGW (d=128) | SEEK (d=6) | SEEK (d=16) | KGW (d=V) | No-watermark |
|---|---|---|---|---|---|---|---|
| WikiText | 0.53 | 0.66 | 0.71 | 0.79 | 0.87 | 0.89 | 1.0 |
| LFQA | 0.69 | 0.72 | 0.77 | 0.78 | 0.85 | 0.84 | 1.0 |

We recognize that there are some inaccuracies and errors in the article, which will be corrected in subsequent editions. Thank you for guiding us toward a clearer and more polished presentation!

Comment

> We will amend the experiment setting section to explicitly state that h and d should be chosen dependently and give clear guidance for parameter selection to prevent degeneration.

You should also emphasize this when introducing the algorithm—e.g., by including a remark. This dependency is essential for SEEK to remain robust against spoofing and scrubbing attacks. For example, the statement on line 191, "On the other hand, reducing the hash space size does not influence the watermark window size, thereby maintaining similar spoofing resistance," is potentially misleading as we discussed. Please revise this in Section 4 to clarify the interaction between these parameters and the robustness of the method.

> using the base non-watermarked model as the oracle

I am not suggesting using the non-watermarked model as the oracle. Rather, you should use a fixed, strong model—such as LLaMA2-13B—as the oracle for all evaluations to ensure consistency and fairness.

For the diversity results, please include them in the experimental section as well.

Overall, I believe the rebuttal adequately addresses my concerns, and I will increase my score. Please incorporate these changes into the final version of the paper.

Comment

We will incorporate all of these changes into the next version of the paper. Thank you once again for your constructive guidance.

Official Review
Rating: 5

The authors study LLM watermarking, aiming to address vulnerability to spoofing and scrubbing attacks. To this end, they propose SEEK, a variant of the KGW approach that simultaneously improves the resistance to both, while not degrading text quality. The authors support this with extensive experiments.

Strengths and Weaknesses

Strengths:

  • The authors focus on an important problem (security and robustness of LLM watermarking).
  • Experiments are well set up and thorough along many dimensions (properties of interest, models, baselines, experimental parameters), and the key results seem to convincingly demonstrate that SEEK has benefits over the main baselines.
  • While SEEK is heavily based on KGW, the proposed modifications are well-motivated and to the best of my knowledge novel.
  • The paper is well-written. I especially appreciate how the authors introduce their method by first motivating and addressing the security-robustness tradeoff by reducing the hash space size (4.1), and then identifying and resolving a new issue that appears in this regime, which is the lack of diversity (4.2). This contextualizes SEEK and provides new insight into the parameter space of KGW.

Weaknesses:

  • In my opinion, the spoofing-scrubbing tradeoff is misrepresented in the paper which may confuse readers as this is the key motivation behind the paper. In particular, my interpretation of [26] slightly differs from the authors'. While it is true that the context size controls the scrubbing-spoofing tradeoff, I believe the point of [26] was that when stealing-based spoofing becomes possible, scrubbing is also much easier as using the learned knowledge enables "assisted scrubbing attacks". Thus "too small h" => "both spoofing and scrubbing are easy", while "too large h" => "spoofing hard, scrubbing easy". As the authors' main motivation is the second regime and they successfully address this I do not find this to be a major weakness, but I believe the tradeoff should be stated more precisely. On top of that, for completeness, the experiments should include assisted scrubbing and not only blind scrubbing attacks---I do not expect them to work on SEEK as the spoofing also fails.
  • It is not fully clear to me what the tradeoffs of (d,h) are with SEEK as SEEK-L seems sometimes better and sometimes worse than the main variant chosen by the authors. The paper would be stronger with such a discussion, in particular: what parameter choices are obviously bad? Do authors always suggest setting d==h? If d>>h, I assume running watermark stealing as if the scheme was Unigram would recover the split given by $\theta_{default}$? Such explorations seem to be missing and would strengthen the paper.

In conclusion, despite the weaknesses above I believe the paper is a valuable addition to the conference and will inspire meaningful discussion as it provides a new perspective on key design components of LLM watermarking schemes and the security-robustness tradeoff.

  • Nit: the authors use both 1e-2 and 1% to denote the FPR of 0.01 across the paper; it would be cleaner to unify this.

Questions

  • Can you comment on the questions in the second weakness above related to tradeoffs of SEEK?
  • Why does Figure 2 use both 1e-2 and 1e-3 on different axes as the FPR?
  • How are the numbers used in the abstract computed? I did not find this clarified.

Limitations

Yes

Final Rating Justification

The authors' rebuttal answers my questions which were relatively minor; overall I remain positive about the paper.

Formatting Issues

/

Author Response

To Reviewer 8aZQ

W1: Explanation of the tradeoff may confuse readers.

Thank you for your valuable feedback, and your understanding is indeed correct. One perspective presented in [26] is that once the watermark is stolen, the attacker can perform spoofing and then use the stolen watermark for assisted scrubbing attacks. However, the scrubbing attacks in [26] are not as effective as those from LLMs like ChatGPT. If strong spoofing is not achieved, the scrubbing process becomes less efficient and thus unsuitable as a robust scrubbing method. Due to limitations in paper length and computational resources, we primarily used [26] as a spoofing test method in the earlier version of our paper. To address any potential for confusion, we will revise the manuscript introduction section to articulate the explanation of [26] and our core motivation more clearly.

Q1 & W2: What parameter choices are obviously bad?

Thank you for your valuable feedback. Here, we present our design philosophy for parameter guidance.

Specifically, the difficulty of spoofing attacks increases exponentially with h, while erasure robustness is positively correlated with the expected collision rate (as detailed in Proposition 4.1/4.2). Our core design principle is: use a sufficiently large h to resist spoofing attacks, while adjusting d to achieve the desired hash collision probability, thereby enhancing erasure robustness.

The insight you raised in W2 is indeed correct. As we mentioned in the paper, UNIGRAM is an extreme case of our method. When the difference between d and h becomes too large (e.g., d >> h), SEEK will degenerate into UNIGRAM. If an attacker has access to the actual parameters of the watermarking scheme, they could exploit the degenerated watermark in the same way as attacking UNIGRAM. As with any method, it is important to avoid arbitrary hyperparameter choices. Therefore, we explicitly advise against such extreme configurations.

In summary, we offer the following recommendations:

  1. Avoid Extreme Parameter Configurations: We do not recommend using excessively disparate h and d values, as this causes the SEEK design to degrade to h=0, losing its advantage in balancing both robustness types. A d ≈ h configuration is generally a more balanced starting point.
  2. Prefer Larger h Values: A larger watermark window h exponentially increases the difficulty of statistical spoofing attacks. We recommend deploying the watermarking scheme with a larger h to significantly raise the cost for potential attackers. Additionally, when reproducing a statistical attack with h > 8 in the experiment, the number of required probability vectors exceeds 180 million, surpassing the cache capacity of our current hardware.

We conducted additional experiments and employed the LLAMA2-7B model to generate text sequences of at least 200 tokens on C4-Eval, applied a scrubbing attack using GPT (TP@1%), and conducted a spoofing attack using statistical methods (FPR@0.1%).

| Attack / Param. | h=6, d=4 | h=6, d=6 | h=6, d=8 | h=8, d=6 | h=8, d=8 | h=8, d=10 |
|---|---|---|---|---|---|---|
| Scrubbing (TP@1%) | 85.9 | 84.3 | 81.4 | 85.7 | 83.9 | 80.7 |
| Spoofing (FPR@0.1%) | 0.14 | 0.13 | 0.16 | 0.08 | 0.10 | 0.08 |

While the principles above provide a framework for tuning, deriving a single parameter set that is provably optimal across all datasets, domains, and evolving attack models is a formidable theoretical challenge. Future work will include further parameter ablation studies and attempts to mathematically prove the optimal solution under attacks. The parameter configuration of (h=6, d=6) mentioned in the paper is a reasonable experimental setting based on this principle. The table below demonstrates its generalizability across various datasets.

| Dataset | MMW | PubMed | FreeLAW | Financial |
|---|---|---|---|---|
| KGW-Min | 73.5 / 0.95 | 80.1 / 0.83 | 75.8 / 0.77 | 83.5 / 0.70 |
| SEEK | 80.5 / 0.18 | 83.5 / 0.10 | 80.1 / 0.13 | 85.7 / 0.18 |

We hope these additional experiments and clarifications address your concern.

Q2: FPR Metric in Figure 2.

Thank you for your question. In Figure 2, the vertical axis uses FPR@1e-3 to represent the success rate of the spoofing attack. We are uncertain where the value "1e-2" you refer to appears. Could you please give more information?

Q3: Other Writing Improvements.

Thank you for your feedback. We will make the following revision to the paper:

  • We will change "1e-2" to "1%" in the next version.
  • The spoofing robustness gains were calculated by comparing SEEK against KGW-MIN (h=4) under statistics-based spoofing attacks, using the FPR@0.1% metric. We acknowledge that the basis for the scrubbing robustness gains was not clearly specified. To rectify this and improve clarity, we will revise this claim in the final manuscript. The gain will be defined as the improvement of SEEK over KGW-MIN (h=4) on the TP@1% metric under the DIPPER-I attack. Based on Table 2, this will update the abstract to: "achieving scrubbing robustness gains of +10.8% on WikiText, +13.4% on C4, and +8.6% on LFQA, respectively."

We recognize that there are some inaccuracies and errors in the article, which will be corrected in subsequent editions. Thank you for guiding us toward a clearer and more polished presentation!

Comment

Re Q2: I meant Figure 1, where the y-axis is at FPR 1e-3 but the x-axis is at FPR 1e-2.

Comment

Thank you again for this valuable suggestion. We agree that using a consistent evaluation threshold will improve the clarity and directness of the comparison in Figure 1. Accordingly, in the final version, we will revise Figure 1 to present the robustness against both attacks at the 1% FPR level.

Official Review
Rating: 5

This paper introduces SEEK, a watermarking scheme for large language models that resolves the inherent trade-off between resistance to scrubbing and spoofing attacks. The key insight is the use of equivalent texture keys, which enable multiple tokens within a window to independently support watermark detection. By decomposing watermark construction across sub-vocabularies, SEEK preserves text quality while improving robustness. The method establishes a new design paradigm for secure and resilient watermarking in generative language models.

Strengths and Weaknesses

Strengths:

  • The paper introduces the concept of equivalent texture keys, where multiple tokens in a watermark window can independently support watermark detection. This mechanism enables redundancy in detection, making it harder for scrubbing attacks to eliminate the watermark fully.

  • The paper addresses a core limitation in existing LLM watermarking methods by eliminating the trade-off between scrubbing and spoofing robustness. Prior methods typically required choosing between short windows and long windows, but SEEK manages to improve both simultaneously.

  • The proposed SEEK scheme uses sub-vocabulary decomposition to maintain a high density of equivalent texture keys while ensuring a diverse set of green-red partitions. This allows the method to achieve robustness without degrading text generation quality, which is a common issue in reduced-hash watermark designs.

  • The threat model is clearly defined and aligns with real-world deployment scenarios, where attackers can access model outputs via APIs and have full knowledge of the watermarking scheme. This makes the evaluation credible and practically relevant.

Weaknesses:

  • The method relies on reducing the hash space to increase the likelihood of equivalent texture keys, which introduces a potential risk of overfitting to specific green-red list patterns. Although the sub-vocabulary decomposition aims to mitigate this, the paper does not thoroughly analyze long-term distributional shifts or adversarial adaptation under this structure.

  • While the notion of equivalent texture keys is novel, the concept builds heavily on existing KGW-Min mechanisms and hash-collision strategies. The technical advancement, though well-executed, may be viewed as incremental rather than fundamentally transformative.

  • The empirical evaluation, though extensive, is limited to statistical and distillation-based spoofing attacks. It does not explore more advanced or adaptive spoofing strategies that may emerge once attackers are aware of the SEEK design, such as gradient-based spoofing or token-level substitution with semantic constraints.

  • The use of disjoint sub-vocabularies introduces a structural assumption about token distribution and semantic independence across partitions. The paper does not investigate whether this assumption holds consistently across languages, domains, or low-resource settings.

  • SEEK requires careful tuning of hash space size and watermark window length. The method's sensitivity to these hyperparameters is acknowledged but not deeply explored, which may limit reproducibility and practical deployment without further ablation studies.

Questions

  1. The method leverages hash space reduction to promote equivalent texture keys, which introduces redundancy into detection. However, it remains unclear how this redundancy behaves under distribution shift or adaptive attackers who may exploit repeated patterns. Could the authors provide a theoretical or empirical analysis of how SEEK performs under longer sequences, multilingual corpora, or adversaries trained to detect and exploit common green list configurations? A more thorough analysis here could significantly strengthen confidence in the method’s generality.

  2. The sub-vocabulary decomposition approach assumes that splitting the vocabulary into disjoint subsets does not harm semantic coherence or introduce artifacts. Could the authors provide an empirical study or qualitative examples across different domains to support this assumption? For instance, is there a drop in quality for tasks involving specialized vocabulary or in low-resource domains? Clarification here would increase confidence in SEEK's applicability across a broader range of use cases.

  3. The method introduces several hyperparameters (e.g., watermark window size and hash space cardinality), but their tuning appears somewhat ad hoc. Could the authors provide a sensitivity analysis or offer guidance for selecting these parameters in practice? In particular, how does the method perform when these parameters are misaligned with the dataset’s characteristics? More insight here would improve reproducibility and strengthen the case for real-world adoption.

Limitations

Please see the weaknesses.

Formatting Issues

None.

Author Response

To Reviewer 7bZb

We sincerely thank you for your detailed and constructive feedback. Your comments have helped us identify key areas for clarification and improvement. Below, we address your questions and the noted weaknesses.

Q2 & W1 & W4: Could the authors provide an empirical study or qualitative examples across different domains?

To systematically assess the impact of this design on text generation quality across different domains, we compared our method with the KGW-Min baseline using the LLAMA2-7B model. The evaluation was performed on various datasets, including general, news, biomedical, legal, and financial texts. We compared the PPL (using LLAMA2-7B as the oracle model), as shown in the table below:

| Dataset | Task | KGW-Min (delta=2) | SEEK (delta=2) | KGW-Min (delta=4) | SEEK (delta=4) |
|---|---|---|---|---|---|
| WikiText-103 | General | 6.21 | 5.88 | 13.78 | 13.33 |
| MMW | News | 8.72 | 8.53 | 18.24 | 15.75 |
| C4 | Long-form Text | 7.44 | 8.38 | 19.81 | 16.62 |
| PubMed | Biomedical | 8.44 | 7.54 | 17.02 | 16.71 |
| FreeLAW | Legal | 7.55 | 6.53 | 17.25 | 14.98 |
| Financial-News | Finance | 6.33 | 6.27 | 18.29 | 15.63 |

Furthermore, to address the concern about specialized, low-entropy text, we evaluated SEEK on code generation tasks (HumanEval, MBPP) using the StarCoder2-7B model. The results are summarized below:

| Dataset | No-watermark | KGW-Min | SEEK |
|---|---|---|---|
| HumanEval | Pass@1: 33.4 | Pass@1: 27.4; AUROC: 88.1 | Pass@1: 28.1; AUROC: 89.3 |
| MBPP | Pass@1: 37.5 | Pass@1: 33.9; AUROC: 80.3 | Pass@1: 34.7; AUROC: 82.7 |

By consistently outperforming or matching the baseline across the vast majority of domains, these results provide strong empirical evidence that our approach generalizes effectively without compromising semantic coherence.

Q1 & W3: More advanced or adaptive spoofing strategies.

We understand your concerns regarding the robustness of SEEK against other attacks.

In our work, we adopted attack strategies that are widely recognized in the community. Specifically, for scrubbing attacks, we followed the setup of [29]; for spoofing attacks, we employed the statistical method [26] and the distillation-based method [17], both of which are considered among the most effective.

We opted not to include “token-level substitution” attacks, as prior work has shown them to be less effective than the attacks we considered (e.g., LLM-based paraphrasing or DIPPER, which also perform reordering). Regarding “gradient-based spoofing”, we surveyed recent literature but did not find a standard, widely adopted attack of this nature that directly applies to our watermarking framework; we would be grateful for any specific references you might suggest and will gladly incorporate them. Given the scope of this paper and our space and computational constraints, we prioritized a thorough evaluation against the community's most well-defined threats. We agree that exploring a broader range of attacks is a crucial next step, and we have noted this as a key direction for future work.

W2: The advancement is incremental rather than fundamentally transformative.

We concur with your observation that our work is built upon the KGW framework. Indeed, the KGW framework serves as the foundation for most watermarking techniques, such as UNIGRAM [52], which can be seen as a special case of KGW for h=0, and SWEET [32], which can be viewed as a variant of KGW for low-entropy tokens only. However, this does not diminish the significance of these contributions within the field. While our approach is rooted in KGW, its underlying insights provide a novel perspective.

W5: Provide guidance for selecting parameters in practice.

We thank you for this important and practical question. Below, we present our design philosophy for parameter guidance.

Specifically, the difficulty of spoofing attacks increases exponentially with h, while erasure robustness is positively correlated with the expected collision rate (as detailed in Proposition 4.1/4.2). Our core design principle is as follows: use a sufficiently large h to resist spoofing attacks, while adjusting d to achieve the desired hash collision probability, thereby enhancing erasure robustness.

As we mentioned in the paper, UNIGRAM is an extreme case of our method. When the difference between d and h becomes too large, SEEK degenerates into UNIGRAM. Therefore, we explicitly advise against such extreme configurations.

In summary, we offer the following recommendations:

  1. Avoid Extreme Parameter Configurations: We do not recommend using excessively disparate h and d values, as this causes the SEEK design to degrade to h=0, losing its advantage in balancing both robustness types. A d ≈ h configuration is generally a more balanced starting point.
  2. Prefer Larger h Values: A larger watermark window h exponentially increases the difficulty of statistical spoofing attacks. We recommend deploying the watermarking scheme with a larger h to significantly raise the cost for potential attackers. Additionally, when reproducing a statistical attack with h > 8 in the experiment, the number of required probability vectors exceeds 180 million, surpassing the cache capacity of our current hardware.

We conducted additional experiments and employed the LLAMA2-7B model to generate text sequences of at least 200 tokens on C4-Eval, applied a scrubbing attack using GPT (TP@1%), and conducted a spoofing attack using statistical methods (FPR@0.1%).

| Attack / Param. | h=6, d=4 | h=6, d=6 | h=6, d=8 | h=8, d=6 | h=8, d=8 | h=8, d=10 |
|---|---|---|---|---|---|---|
| Scrubbing (TP@1%) | 85.9 | 84.5 | 81.4 | 85.7 | 83.9 | 80.7 |
| Spoofing (FPR@0.1%) | 0.14 | 0.13 | 0.16 | 0.08 | 0.10 | 0.08 |

While the principles above provide a framework for tuning, deriving a single parameter set that is provably optimal across all datasets, domains, and evolving attack models is a formidable theoretical challenge. Future work will include further parameter ablation studies and attempts to mathematically prove the optimal solution under attacks. The parameter configuration of (h=6, d=6) mentioned in the paper is a reasonable experimental setting based on this principle. The table below demonstrates its generalizability(TP@1%/FPR@0.1%) across various datasets.

| Dataset | MMW | PubMed | FreeLAW | Financial |
|---|---|---|---|---|
| KGW-Min | 73.5 / 0.95 | 80.1 / 0.83 | 75.8 / 0.77 | 83.5 / 0.70 |
| SEEK | 80.5 / 0.18 | 83.5 / 0.10 | 80.1 / 0.13 | 85.7 / 0.18 |

We hope these additional experiments and clarifications address your concern. Thank you for guiding us toward a clearer and more polished presentation!

Final Decision

This paper presents a novel watermarking method that effectively balances robustness against both scrubbing and spoofing attacks, a known trade-off in existing techniques. Reviewers found the experiments thorough and the results convincing. My recommendation is to accept the paper. I encourage the author(s) to improve the discussion on parameter selection and broader comparisons with related work in the camera-ready version.