Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models
A no-go theorem states that we can maintain either the sampling efficiency or the watermark strength, but not both.
Abstract
Reviews and Discussion
The paper investigates whether speculative sampling is compatible with watermarking for LLMs.
Strengths
S1. This paper is original in the landscape of LLM watermarking.
S2. This paper shows an interesting "no-go" theoretical result (Th. 1).
S3. This paper proposes two designs sustaining either the sampling efficiency or the watermark strength.
S4. I like very much the proposed measurement of the watermark strength: the average exponent of the p-value.
S5. The presentation is crystal clear.
Weaknesses
W1. Speculative sampling. I am not an expert in sampling for LLM. I do not know how "speculative sampling" is key compared to more common methods like nucleus or top-k sampling, which prevents me from judging this paper's impact.
W2. More comments on the experimental results. Say at least that MWS is much better than MSE, in the sense that the MWS loss of sampling efficiency is barely visible, whereas the MSE loss of watermarking strength is significant. The LLMs used in the experimental protocols are old and their entropy is higher than that of more recent ones. It might be worth stating that this choice gives a high ANLPPT.
Questions
Q1. I understood that ANLPPT equals $-\frac{1}{n}\sum_{t=1}^{n}\log p_t$, where $p_t$ is the measured P-value at token $t$. Which logarithm? What is the typical value of $n$? My experience with the Aaronson scheme is that the score fluctuates a lot. A median might be a more reliable statistic than a mean over tokens.
Q2. I am very surprised that DeltaGumbel is way better than Gamma. Is this due to
Q3. Line 33: "Unbiased watermarking schemes [12] have been developed." Well, Aaronson [1] was the first person to introduce this concept, isn't it?
Q4. It is quite curious that no citation or reference is given for "DeltaGumbel" and "Gamma" schemes. DeltaGumbel is known as Aaronson [1] (everywhere in the literature), and Gamma looks like ITS from Kuditipudi [15]. Isn't it?
Q5. Some details of the experimental protocol are missing. I suspect the measurements are done on an "idealized" setup where the secret key changes at random from one token to another, and the detector knows this. This is not realistic as it is absolutely not robust. To be practical, one has to make the secret key dependent on the previous tokens (see Kirchenbauer [13, 14]). This might hurt speculative sampling since a rejection implies a recomputation of the hash. Moreover, repeated token blocks need to be skipped at the detection side (see Fernandez [8]); otherwise the p-value is incorrect. This hurts the ANLPPT. I don't believe these tweaks modify the general conclusions of this work, but this "idealized" setup should be clearly stated, with the implications (ANLPPT and AATPS are lower in practice).
Limitations
No limitation is given. A limitation about the lack of practicality of the experimental protocol would be welcome.
We are thrilled to receive such a meticulous and knowledgeable review of our research. It is a privilege to have our paper assessed by a reviewer with such extensive knowledge in the landscape of LLM watermarking. Your acknowledgment of the originality of our work is truly rewarding. In the following sections, we will respond to your inquiries and feedback.
how "speculative sampling" is key compared to more common methods like nucleus or top-k sampling
Speculative sampling and nucleus or top-k sampling operate at different design levels and are not inherently conflicting. They can be used simultaneously. For instance, the seminal paper on speculative sampling, "Accelerating Large Language Model Decoding with Speculative Sampling" [5], mentions that "With standard sampling methods such as nucleus, top-k sampling and adjusting temperature, we can modify the probabilities accordingly before applying this rejection sampling scheme."
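To make this composition concrete, here is a minimal single-step sketch (our own illustration with hypothetical helper names, NumPy only; not the implementation of [5] or of our code) in which nucleus filtering is applied to both the target and draft distributions before the standard speculative accept/reject step:

```python
import numpy as np

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p, then renormalize."""
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(csum, top_p)) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def speculative_accept(target_probs, draft_probs, draft_token, rng):
    """Standard speculative-sampling accept/reject step, applied to the already filtered distributions."""
    p, q = target_probs[draft_token], draft_probs[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token                                   # accept the draft token
    residual = np.clip(target_probs - draft_probs, 0.0, None)
    residual /= residual.sum()
    return rng.choice(len(target_probs), p=residual)         # resample from the residual

rng = np.random.default_rng(0)
target = nucleus_filter(np.array([0.40, 0.30, 0.20, 0.07, 0.03]), top_p=0.9)
draft = nucleus_filter(np.array([0.50, 0.25, 0.15, 0.07, 0.03]), top_p=0.9)
draft_token = rng.choice(len(draft), p=draft)
print(speculative_accept(target, draft, draft_token, rng))
```

The accept/reject rule guarantees that the output follows the (filtered) target distribution regardless of the draft distribution, which is why the two techniques compose cleanly.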
More comments on the experimental results. Say at least that MWS is much better than MSE, in the sense that the MWS loss of sampling efficiency is barely visible, whereas the MSE loss of watermarking strength is significant. The LLMs used in the experimental protocols are old and their entropy is higher than that of more recent ones. It might be worth stating that this choice gives a high ANLPPT.
We greatly appreciate your suggestion and are eager to include more discussion of the experimental results using the additional page in the camera-ready version. We concur that MWS's loss of sampling efficiency relative to VSpS is barely visible, whereas MSE's loss of watermark strength is significant, which makes MWS highly practical. In fact, we strongly recommend using MWS in practice. We will also elaborate on the impact of different models on entropy, and how entropy influences watermark strength, using the additional page.
Q1. I understood that ANLPPT equals $-\frac{1}{n}\sum_{t=1}^{n}\log p_t$, where $p_t$ is the measured P-value at token $t$. Which logarithm? What is the typical value of $n$? My experience with the Aaronson scheme is that the score fluctuates a lot. A median might be a more reliable statistic than a mean over tokens.
The logarithm used here is the natural logarithm, and $n$ is the number of tokens in the generated sentences. We generate many sentences with a max_length of 128, but they may be shorter due to early stopping. We can also calculate the median, introducing the Median Negative Log P-value Per Token (MNLPPT).
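For concreteness, a minimal sketch of how both statistics can be computed from per-token p-values (the numbers below are made up for illustration, not taken from our experiments):

```python
import numpy as np

def anlppt_and_mnlppt(p_values):
    """Mean and median of the negative natural-log p-values over the tokens of a sentence."""
    scores = -np.log(np.asarray(p_values, dtype=float))
    return scores.mean(), np.median(scores)

p_values = [0.8, 0.3, 0.05, 0.6, 0.01]   # hypothetical per-token p-values
anlppt, mnlppt = anlppt_and_mnlppt(p_values)
print(f"ANLPPT = {anlppt:.3f}, MNLPPT = {mnlppt:.3f}")
```

The table below reports both statistics.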
Text summarization task with Llama-7b as the target model and Llama-68m as the draft model:
| K | method | reweight | n (avg. tokens) | ANLPPT(U Score) | MNLPPT(U Score) | ANLPPT(maximin-LLR) | MNLPPT(maximin-LLR) |
|---|---|---|---|---|---|---|---|
| 1 | Basic | No Reweight | 122.0±0.7 | 0.0±0.0 | 0 | 0.0±0.0 | 0 |
| 1 | VUW | DeltaGumbel | 121.0±0.8 | 0.376±0.009 | 0.385 | 1.71±0.03 | 1.933 |
| 1 | VUW | Gamma | 121.9±0.7 | 0.097±0.002 | 0.098 | 0.272±0.005 | 0.333 |
| 1 | VSpS | No Reweight | 122.6±0.7 | 0.0±0.0 | 0 | 0.0±0.0 | 0 |
| 1 | MSE | DeltaGumbel | 121.5±0.8 | 0.153±0.004 | 0.141 | 0.640±0.014 | 0.660 |
| 1 | MSE | Gamma | 121.9±0.7 | 0.0433±0.0012 | 0.036 | 0.0605±0.0019 | 0.054 |
| 1 | MWS | DeltaGumbel | 121.3±0.8 | 0.374±0.009 | 0.380 | 1.71±0.03 | 1.921 |
| 1 | MWS | Gamma | 121.9±0.7 | 0.098±0.002 | 0.100 | 0.275±0.005 | 0.335 |
| 2 | VSpS | No Reweight | 122.7±0.7 | 0.0±0.0 | 0 | 0.0±0.0 | 0 |
| 2 | MSE | DeltaGumbel | 122.5±0.7 | 0.111±0.003 | 0.095 | 0.419±0.010 | 0.403 |
| 2 | MSE | Gamma | 122.4±0.7 | 0.0322±0.0010 | 0.024 | 0.0310±0.0014 | 0.021 |
| 2 | MWS | DeltaGumbel | 121.4±0.8 | 0.374±0.009 | 0.379 | 1.71±0.03 | 1.913 |
| 2 | MWS | Gamma | 122.8±0.7 | 0.096±0.002 | 0.097 | 0.272±0.005 | 0.332 |
| 3 | VSpS | No Reweight | 121.4±0.8 | 0.0±0.0 | 0 | 0.0±0.0 | 0 |
| 3 | MSE | DeltaGumbel | 121.4±0.8 | 0.094±0.003 | 0.079 | 0.331±0.009 | 0.306 |
| 3 | MSE | Gamma | 121.8±0.7 | 0.0281±0.0009 | 0.020 | 0.0214±0.0012 | 0.011 |
| 3 | MWS | DeltaGumbel | 121.2±0.8 | 0.374±0.009 | 0.380 | 1.70±0.03 | 1.919 |
| 3 | MWS | Gamma | 122.3±0.7 | 0.097±0.002 | 0.098 | 0.274±0.005 | 0.335 |
| 4 | VSpS | No Reweight | 122.5±0.7 | 0.0±0.0 | 0 | 0.0±0.0 | 0 |
| 4 | MSE | DeltaGumbel | 122.2±0.7 | 0.083±0.002 | 0.067 | 0.280±0.008 | 0.249 |
| 4 | MSE | Gamma | 122.6±0.7 | 0.0258±0.0008 | 0.018 | 0.0167±0.0011 | 0.007 |
| 4 | MWS | DeltaGumbel | 121.1±0.8 | 0.375±0.009 | 0.380 | 1.71±0.03 | 1.923 |
| 4 | MWS | Gamma | 122.2±0.7 | 0.096±0.002 | 0.097 | 0.271±0.005 | 0.331 |
Q2. I am very surprised that DeltaGumbel is way better than Gamma
DeltaGumbel devotes all the entropy to watermarking, resulting in an ANLPPT approximately equal to the language model's entropy (in nats). Gamma has a weaker watermarking strength, adding at most 1 bit of watermark per step, so its ANLPPT cannot exceed $\ln 2 \approx 0.693$ and is significantly smaller than that of DeltaGumbel.
Q3. Line 33: "Unbiased watermarking schemes [12] have been developed." Well, Aaronson [1] was the first person to introduce this concept, isn't it?
We fully acknowledge the seminal and pioneering work of Aaronson[1]. It achieves an unbiased distribution for each token. However, Aaronson[1]'s method is not commonly referred to as an unbiased watermark because it still incurs some performance loss, as evidenced in "Mark My Words: Analyzing and Evaluating Language Model Watermarks". Although the distribution for each token is unbiased, the watermarks at different token positions may correlate when watermarking the entire sequence, leading to performance degradation. This issue is tackled in follow-up work like [12], which ensures an unbiased distribution not only for each token but also for the entire sequence.
Q4. It is quite curious that no citation or reference is given for "DeltaGumbel" and "Gamma" schemes. DeltaGumbel is known as Aaronson [1] (everywhere in the literature), and Gamma looks like ITS from Kuditipudi [15]. Isn't it?
You are correct regarding Aaronson [1]. However, Gamma differs from Kuditipudi[15] and originates from [12]. Their details have been moved to Section D due to space limitations. We will add the citations and more explanation in the main paper.
To be practical, one has to make the secret key dependent on the previous tokens (see Kirchenbauer [13, 14]).
We have already implemented this in our code. In lm.py, the step_watermark function handles this logic.
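For readers unfamiliar with this construction, a minimal illustrative sketch of context-dependent key derivation (hypothetical helper and parameter names, not the actual step_watermark implementation in lm.py):

```python
import hashlib

def position_key(secret_key: bytes, prev_tokens: list, context_width: int = 5) -> int:
    """Derive the watermark key for the next position by hashing the secret key together
    with the last `context_width` token ids (Kirchenbauer-style context-dependent seeding)."""
    window = prev_tokens[-context_width:]
    digest = hashlib.sha256(secret_key + repr(window).encode()).digest()
    return int.from_bytes(digest[:8], "big")

# The same context always yields the same key, so the detector can recompute it
# from the observed tokens without querying the model.
print(position_key(b"secret", [17, 4, 233, 9, 81, 5]))
```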
This might hurt speculative sampling since a rejection implies a recomputation of the hash.
In our implementation, there is no hash recomputation triggered by rejection, as long as we carefully pass the computed hash results. The mc_watermark.py code implements this logic, ensuring that step_watermark is called at most once after obtaining the draft tokens, only when all draft tokens are accepted. Moreover, the hash computation cost is relatively low compared to the LLM's computation.
Moreover, repeated token blocks need to be skipped at the detection side (see Fernandez [8]);
We have accounted for this in our code. The detect_pre function in lm.py implements this logic, using the skipped variable to determine whether to skip.
Q5. Some details of the experimental protocol are missing. I suspect the measurements are done on an "idealized" setup where the secret key changes at random from one token to another, and the detector knows this. I don't believe these tweaks modify the general conclusions of this work, but this "idealized" setup should be clearly stated, with the implications (ANLPPT and AATPS are lower in practice). A limitation about the lack of practicality of the experimental protocol would be welcome.
As explained above, our experiments are not "idealized," and the calculated ANLPPT and AATPS are not distorted (except for the changes in ANLPPT due to the change in entropy caused by using a different model, as discussed earlier). We will provide additional explanations of the experimental protocol, detailing how we carefully handled these aspects.
We sincerely appreciate your thoughtful comments and suggestions, especially the supplementary explanations regarding entropy and the experimental protocol. We are delighted to engage in such a positive academic exchange with the reviewer and eagerly await your feedback on whether our explanations have addressed your concerns. Please feel free to raise any further inquiries; we are happy to answer all questions.
We look forward to your valuable feedback.
I acknowledge that I have read the rebuttal. The authors took great care to answer my questions. I confirm my grade. My comments below are just out of curiosity.
You are correct regarding Aaronson [1]. However, Gamma differs from Kuditipudi[15] and originates from [12].
You are right. I got confused between the Gamma (original) and the Delta (very similar to Kuditipudi) schemes of [12].
We fully acknowledge the seminal and pioneering work of Aaronson[1]. It achieves an unbiased distribution for each token...This issue is tackled in follow-up work like [12].
I have difficulty understanding the difference. Aaronson [1] and Hu [12] achieve an unbiased distribution for each token, and both of them use hashing of previous tokens to refresh the key. So, I do see why [12] is unbiased, but I do not see why [1] is not.
Thank you very much for reading our rebuttal and for your prompt response.
We are happy to participate in further discussion on the difference between achieving an unbiased distribution for each token and an unbiased distribution for the entire sequence.
To illustrate this difference, let's consider a thought experiment. Suppose we have a prompt that says:
Continuously generate uniformly distributed random 01 numbers. Output in a specific format:\nA new random 01 number: 1\nA new random 01 number: 0\nA new random 01 number: 1
Assume we have an LLM that is powerful enough to generate a perfect output distribution that fully reflects the prompt. In this case, the entropy would be 0 for the "\nA new random 01 number: " part and 1 bit for the random 01 variable. We can measure the quality by the absolute difference between the number of 0s and 1s after generating 100 random 01 numbers.
Without watermarking, the generated 0s and 1s would be nearly uniform, with only small fluctuations. However, when watermarking is introduced, the situation becomes more interesting. Since the previous tokens are always "\nA new random 01 number: " for each line, the key used in watermarking will be the same every time a random 01 variable is generated. This will result in consistently outputting either 0 or 1, leading to a large difference between the number of 0s and 1s.
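A toy simulation (our own sketch with a DeltaGumbel/Gumbel-max style reweight and made-up helper names, not the paper's code) makes the effect concrete:

```python
import hashlib
import numpy as np

SECRET_KEY = b"secret"

def gumbel_max_sample(p, seed):
    # DeltaGumbel-style reweight: a key-seeded Gumbel-max draw turns p into a point mass,
    # yet averaged over independent keys the sampled token still follows p.
    g = np.random.default_rng(seed).gumbel(size=len(p))
    return int(np.argmax(np.log(p) + g))

def context_seed(context):
    digest = hashlib.sha256(SECRET_KEY + context.encode()).digest()
    return int.from_bytes(digest[:8], "big")

p = np.array([0.5, 0.5])                    # ideal model: uniform over {0, 1}
context = "\nA new random 01 number: "      # identical context before every digit

# Key derived from the repeated context -> identical key at every step
repeated_key = [gumbel_max_sample(p, context_seed(context)) for _ in range(100)]
# Key refreshed independently at every step (what sequence-level unbiasedness requires)
rng = np.random.default_rng(0)
independent_key = [gumbel_max_sample(p, int(rng.integers(2**32))) for _ in range(100)]

print("repeated key:    number of 1s =", sum(repeated_key))     # 0 or 100: heavily biased
print("independent key: number of 1s =", sum(independent_key))  # close to 50
```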
To address this issue, [12] introduced the principle that each watermarking operation should use an independent key. If independence from previous watermarks cannot be guaranteed, the watermark must be skipped, that is, no new watermark is added until an independent key is obtained.
The above example amplifies the difference between achieving an unbiased distribution for each token and an unbiased distribution for the entire sequence. In general use, this difference exists but is relatively small.
We appreciate the opportunity to engage in this positive academic discussion and hope our explanation provides clarity on the nuances between token-level and sequence-level unbiased distributions.
Ok, thanks for the explanation.
This paper explores the inherent trade-off between watermark strength and speculative sampling efficiency in large language models. A no-go theorem is presented, proving that it is impossible to maintain the highest watermark strength and sampling efficiency simultaneously. This paper also proposes a framework called the "two reweight framework" and develops two practical methods that focus on either maintaining watermark strength or sampling efficiency.
Strengths
- New framework. The proposed framework allows for the integration of unbiased watermarking and speculative sampling techniques without altering the output distribution, thereby improving generation efficiency.
- Theoretical proof. This paper rigorously proves a no-go theorem demonstrating that when the vocabulary size exceeds two, it is impossible to maintain both watermark strength and sampling efficiency simultaneously.
Weaknesses
- Limited datasets, models, and task coverage in the experiments. The experiments are conducted only on specific datasets (e.g., CNN_DAILYMAIL) and models (e.g., Llama-7b and Llama-68m). Additional benchmarks on different datasets and models would strengthen the generalizability of the findings. The experiments also only cover a few tasks (text summarization and open-ended text generation).
- Algorithm clarity. The pseudo-code provided for the algorithms could be further detailed, with clearer explanations for each step to improve reproducibility.
- Lack of analysis on the robustness of watermarking. "On the Reliability of Watermarks for Large Language Models" mentions paraphrasing attacks, and copy-paste attacks, etc. Could this article potentially evaluate the robustness of the watermark under the two reweight framework?
Questions
- Impact on the quality of the text. I am intrigued by the potential impact on the quality of generated text when adjusting the balance between watermark strength and sampling efficiency. Apart from LOGPPL, are there alternative evaluation metrics such as ROUGE or BLEU utilized for assessment?
- Selection of target model and draft model. The paper uses Llama-7b as the target model and the Llama-68m as the draft model. How are the target model and draft model selected for the article? Why did the paper choose the Llama series of models? Are there any specific requirements for their selection?
- Choice of different draft sequence length K. Regarding line 613, the choice of draft sequence length is critical in deployment. Regarding Figure 2, when K is chosen from 1 to 4, a larger value of K (such as K = 4) generally demonstrates better performance in both reweighting methods. This implies that selecting a larger K value can enhance sampling efficiency while maintaining watermark strength. Consequently, if K were to be increased even further, what would the effect be?
Limitations
The authors have discussed the limitations and potential negative societal impacts in Appendix F and G.
Thank you for spending time reviewing our work. However, I am afraid that your 4-point rating may be based on an incorrect premise.
We would like to clarify a misunderstanding in your comment. You claimed that our paper only used Llama-7b and Llama-68m models. If you read lines 273-278, you will find that this is not the case. We have explained that we considered language models of different sizes, including Llama-7b and Llama-13b as the target models and Llama-68m as the draft model.
We understand that in your opinion, our contributions only reach the level of '2: fair.', falling short of your expectations for a good contribution. However, our paper is the first study to investigate the inherent trade-off between watermark strength and speculative sampling. The new framework we propose lays a solid foundation, our no-go theorem provides valuable theoretical insights, and our two novel algorithms represent concrete advancements in techniques.
Regarding the weaknesses you mentioned, we are willing to make improvements to meet your expectations. To do so, we kindly request more specific suggestions.
- In your opinion, are the current experiments sufficient or insufficient to verify our findings? If insufficient, what scale of experiments would you consider adequate? Our findings are already proven in Theorems 1, 5, and 6, and we have already invested 1,200 A6000 GPU hours (~$1k) to verify them, yet we understand that the reviewer still considers the experiments limited. We are unclear about the specific reasons for expecting a further increase in experimental scale (and cost), given that our findings are mathematically guaranteed, and we would be grateful if the reviewer could clarify this point.
- Regarding your concern about "Algorithm clarity": do you find any part of the algorithms difficult to understand, or do you simply anticipate that other readers might need clearer explanations? We are determined to help readers understand our algorithms and make the results reproducible. Algorithms 1-4 list every algorithm used in the paper, including each step of the procedure and the specific calculation of each value, and all of the code is provided on OpenReview. If any part of the algorithms is difficult for you to understand, please indicate which parts require further elaboration; if you instead anticipate that other readers might need clearer explanations, please point out which parts of the presentation could be improved. In either case, we will provide targeted explanations.
- Regarding robustness, what specific research question do you expect us to investigate? In the rebuttal below, we provide an analysis based on the close relationship between watermark strength and robustness. The excellent work "On the Reliability of Watermarks for Large Language Models" that you cited reveals that attacks dilute watermark strength; robustness therefore measures how much dilution a watermark can withstand while remaining detectable (e.g., at a certain AUC). That paper addresses the question "Can diluted watermarks still be detected as long as they are sufficiently long?", whereas our work primarily focuses on "Is it possible to accelerate the generation of watermarked content?". We understand that the reviewer expects us to investigate robustness-related questions, but we would appreciate clarification of the specific research questions the reviewer expects us to address.
We seek the reviewer's understanding that we are asking for specific expectations rather than directly supplementing experiments. Without clear experimental suggestions as guidance, we may conduct experiments that the reviewer considers irrelevant or repetitive, wasting valuable computational resources and funding (the experimental cost of this paper is already 1,200 A6000 GPU hours, ~$1k). Only by clarifying the reviewer's expectations can we design and conduct targeted experiments.
Next, we address the reviewer's questions.
"On the Reliability of Watermarks for Large Language Models" mentions paraphrasing attacks, and copy-paste attacks, etc. Could this article potentially evaluate the robustness of the watermark under the two reweight framework?
"On the Reliability of Watermarks for Large Language Models" recruited human subjects to collect hand-written passages for both paraphrasing attacks and copy-paste attacks to study watermark robustness, which is costly.
Since the goal of our paper is to contribute to improving the speed of generating watermarked content, we cannot afford to invest significant time and financial resources in evaluating watermark robustness, as that paper did. Instead, we focus on the unique contribution of this paper: accelerating the generation of watermarked text.
We can provide a brief theoretical analysis:
- For the Maintain Watermark Strength (MWS) method, since its watermark strength is the same as that of the Vanilla Unbiased Watermark (VUW), its robustness is the same as that existing method.
- For the Maintain Sampling Efficiency (MSE) method, since its watermark strength is lower than that of the Vanilla Unbiased Watermark, the watermark strength after content editing is also lower. Therefore, its robustness is lower than that of the existing method.
We believe that our paper and "On the Reliability of Watermarks for Large Language Models" both have unique contributions, each focusing on different directions, and both are original, novel, and significant.
I am intrigued by the potential impact on the quality of generated text when adjusting the balance between watermark strength and sampling efficiency.
We emphasize multiple times in the paper that adjusting this balance does not affect the quality. If you read Theorems 5 and 6, you will find a theoretical guarantee that the generation distribution is unbiased. Therefore, the expectation of any quality metric remains unchanged.
Thank you for telling us that you are intrigued by the potential impact on the quality of generated text, but the answer is that there is no impact.
Apart from LOGPPL, are there alternative evaluation metrics such as ROUGE or BLEU utilized for assessment?
Yes, there are other evaluation metrics, such as ROUGE, BLEU, METEOR, GLEU, MAUVE, SQuAD, and their various variants. If the reviewer regards measuring certain scores as necessary to verify our findings, even though Theorems 5 and 6 already provide theoretical guarantees, please let us know.
The paper uses Llama-7b as the target model and the Llama-68m as the draft model.
I would like to remind you once again that the above statement is wrong. We not only use Llama-7b as the target model but also a larger model, Llama-13b. Computational resource limitations prevent us from using even larger models.
How are the target model and draft model selected for the article? Why did the paper choose the Llama series of models? Are there any specific requirements for their selection?
There are no special selection requirements other than being an autoregressive language model. We chose the Llama series because it is a popular baseline in the community, as seen in [25,35,46], for example.
Consequently, if K were to be increased even further, what would the effect be?
When K (the number of draft tokens) increases:
- The number of accepted tokens also increases, leading to faster generation.
- However, the computational overhead of the draft model and target model rises, causing slower generation.
The average time to generate a token is determined by the competition between these two factors; their interplay is illustrated through quantitative results in Section H of our paper.
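As a rough illustration, here is a hedged sketch of this competition using the standard constant-acceptance-rate model from the speculative sampling literature (Leviathan et al.; Chen et al.); the timings and the acceptance rate below are made up, and real models have token-dependent acceptance rates:

```python
def avg_time_per_token(K, alpha, t_draft, t_target):
    """Expected wall-clock time per generated token with K draft tokens, assuming a
    constant per-token acceptance rate alpha."""
    expected_tokens = (1 - alpha ** (K + 1)) / (1 - alpha)   # accepted drafts + one corrected/bonus token
    return (K * t_draft + t_target) / expected_tokens

# hypothetical timings: 2 ms per draft step, 30 ms per target step, acceptance rate 0.8
for K in range(1, 13):
    print(f"K={K:2d}  avg time per token = {avg_time_per_token(K, 0.8, 2.0, 30.0):5.2f} ms")
```

With these made-up numbers the per-token time first decreases and then increases again as K grows, which is exactly the competition described above; the optimal K depends on the acceptance rate and on the draft/target cost ratio.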
Once again, thank you for the time and effort you have put into reviewing our work.
We sincerely hope to engage in positive academic exchanges with the reviewer. In order to meet the reviewer's expectations, we earnestly request that the reviewer provide specific expectations by answering the three questions outlined above. If there are any additional concerns, please feel free to ask, and we will be happy to answer all questions.
Thanks for the response. I've raised my score.
Thank you very much for going through our response and raising the score. We appreciate your prompt response.
This paper studies the trade-off between sampling efficiency and watermark strength to see whether LLMs can generate watermarked output efficiently. It is proven in this work that it is not possible to simultaneously maintain the highest watermark strength and the highest sampling efficiency. Building on the no-go theorem, the paper provides two methods that maintain either one of them and conducts experiments to validate the effectiveness of the methods.
Strengths
- This paper provides the first study into the relationship between sampling efficiency and watermark strength, which is of common practical interest.
- This paper provides proof of the no-go theorem that it is not possible to simultaneously maintain the highest watermark strength and the highest sampling efficiency.
- From the experiments, the effectiveness of the proposed methods is validated clearly with visualizations.
- Figure 1 provides an overview of the paper, which is clear and informative.
Weaknesses
- The experiment section is relatively short and could include more analysis of the proposed methods, such as an ablation study.
Questions
Is the no-go theorem also true for other watermarking algorithms such as KGW [1]?
[1] A watermark for large language models
Limitations
Although the experiments demonstrate the no-go theorem and the effectiveness of the proposed methods, maybe more analysis can be provided.
Thank you for your valuable feedback and for recognizing our research as the first study into the relationship between sampling efficiency and watermark strength.
Regarding the experiment section, we agree that it appears relatively short in the main paper due to the space limit. To provide a more comprehensive evaluation, we have moved additional experiments to Appendix H, where we consider different models and tasks. To summarize, all conclusions in Section 6 are maintained in these experiments: the MWS method achieves the same watermark strength as the VUW method, while the MSE method maintains the same sampling efficiency as the VSpS method, all without compromising the output quality.
Furthermore, our experiments allow us to observe the impact of entropy on watermark strength. We consider two tasks: open-ended text generation, which has higher entropy, and text summarization, which has lower entropy. We measure watermark strength using ANLPPT and find that it is higher for the open-ended text generation task. We also observe the influence of model size: larger models have a longer Per Token Time (PTT), and different target-draft model pairs result in different speculative sampling efficiency.
Overall, our experiments serve as a validation study, confirming that the properties stated in Theorems 1, 5, and 6 are indeed observed in practice. The total experimental cost amounts to 1,200 A6000 GPU hours.
Is the no-go theorem also true for other watermarking algorithms such as KGW [1]?
Since KGW is not an unbiased watermark, our no-go theorem does not directly apply; according to [2, 3], KGW leads to a decrease in generation quality. The intuition developed for unbiased watermarks may still carry over to biased watermarks, but the precise formulation remains unclear: the theorem cannot simply drop the unbiased-reweight constraint, as that would weaken the condition and invalidate the conclusion. Characterizing non-trivial biased watermarks and proving a corresponding no-go theorem would be an interesting extension and future work.
maybe more analysis can be provided
We would be happy to include additional discussion of the experimental results using the additional page in the camera-ready version. Here are some details we would like to add if space permits: MWS's loss of sampling efficiency relative to VSpS is barely visible, whereas MSE's loss of watermark strength is significant, which makes MWS highly practical; in fact, we strongly recommend using MWS in practice. We will also state the impact of different models on entropy and the influence of entropy on watermark strength.
If you have any further questions or concerns, please do not hesitate to ask. We are more than happy to address all your inquiries.
Considering our additional explanations for the experiments, we would greatly appreciate it if you could re-evaluate our work.
Thank you once again for your valuable feedback.
[1] A Watermark for Large Language Models
[2] Mark My Words: Analyzing and Evaluating Language Model Watermarks
[3] Unbiased Watermark for Large Language Models
Thank you for your response. I will maintain my rating.
This paper shows that it is impossible to maintain the highest watermark strength and sampling efficiency simultaneously for content generation when integrating an unbiased watermarking method [1] with a speculative sampling strategy [2][3], and it provides rigorous theoretical analysis and empirical results.
[1] Hu, Zhengmian, et al. "Unbiased watermark for large language models." arXiv preprint arXiv:2310.10669 (2023).
[2] Leviathan, Yaniv, Matan Kalman, and Yossi Matias. "Fast inference from transformers via speculative decoding." International Conference on Machine Learning. PMLR, 2023.
[3] Chen, Charlie, et al. "Accelerating large language model decoding with speculative sampling." arXiv preprint arXiv:2302.01318 (2023).
Strengths
This paper focuses on an interesting direction: integrating watermarking and speculative sampling to accelerate sampling while maintaining watermark strength. It helps us understand the interactions and trade-offs between watermarking and sampling for content generation in LLMs, which is significant.
Weaknesses
- What do the authors mean by "naively applying speculative sampling to a watermarked target distribution may significantly reduce the overlap probability with the draft distribution Q." in line 112?
- In Fig.2 (a) (b), for MWS, the sampling efficiency (AATPS) is only a little smaller than that of VSpS and MSE. For example, for K=2 in (a) in terms of U score, the AATPS of MWS is smaller than that of VSpS or MSE by about less than 0.1, which should be acceptable since it achieves comparable watermark strength with VUW.
It seems inconsistent with the claim that simultaneously accelerating the sampling efficiency while maintaining the watermark strength is impossible. May the authors explain this result?
Questions
Please refer to the Weakness part.
Limitations
N/A
Thank you for your insightful comments and for recognizing the significance of our work. We are grateful for the opportunity to address your questions.
What do the authors mean by "naively applying speculative sampling to a watermarked target distribution may significantly reduce the overlap probability with the draft distribution Q." in line 112?
The overlap probability is defined as $1 - \mathrm{TV}(P_W, Q)$, where $P_W$ is the watermarked target distribution and $Q$ is the draft distribution. Note that the total variation distance is a convex function, and $\mathbb{E}_{\text{key}}[P_W] = P$, the unwatermarked target distribution. Applying Jensen's inequality, we have:

$$\mathbb{E}_{\text{key}}\big[1 - \mathrm{TV}(P_W, Q)\big] \le 1 - \mathrm{TV}\big(\mathbb{E}_{\text{key}}[P_W], Q\big) = 1 - \mathrm{TV}(P, Q).$$
In practice, the gap in Jensen's inequality can be quite large, leading to a significant reduction in the average overlap probability.
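A toy numerical illustration of this gap (our own sketch with made-up distributions, assuming a DeltaGumbel-style reweight that concentrates the watermarked distribution on a single token):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])   # unwatermarked target distribution (toy)
Q = np.array([0.6, 0.3, 0.1])   # draft distribution (toy)

def overlap(a, b):
    return np.minimum(a, b).sum()            # equals 1 - TV(a, b)

print("overlap(P, Q) =", overlap(P, Q))      # 0.9

# DeltaGumbel-style reweight: each key maps P to a point mass via the Gumbel-max trick,
# so E_key[P_W] = P (unbiased), but each individual P_W overlaps Q far less.
overlaps = []
for _ in range(100_000):
    g = rng.gumbel(size=P.size)
    x_star = int(np.argmax(np.log(P) + g))
    P_W = np.eye(P.size)[x_star]
    overlaps.append(overlap(P_W, Q))
print("E[overlap(P_W, Q)] ~", round(float(np.mean(overlaps)), 3))  # about sum_i P_i * Q_i = 0.41
```

Here the unwatermarked overlap is 0.9, whereas the expected overlap after reweighting drops to about 0.41, which is the kind of Jensen gap described above.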
In Fig.2 (a) (b), for MWS, the sampling efficiency (AATPS) is only a little smaller than that of VSpS and MSE. For example, for K=2 in (a) in terms of U score, the AATPS of MWS is smaller than that of VSpS or MSE by about less than 0.1, which should be acceptable since it achieves comparable watermark strength with VUW.
We agree with your observation that the sampling efficiency (AATPS) of MWS is only slightly lower than that of VSpS and MSE. We believe that MWS is highly suitable for practical use as it maintains watermark strength while achieving performance close to VSpS.
It seems inconsistent with the claim that simultaneously accelerating the sampling efficiency while maintaining the watermark strength is impossible. May the authors explain this result?
We would like to clarify that our findings do not contradict the no-go theorem. Even though the sampling efficiency of MWS is nearly as high as that of VSpS, there is still a statistically significant gap. The specific data can be found in Table 1, where the AATPS for MWS is 1.773 ± 0.003, and for VSpS, it is 1.857 ± 0.003. The no-go theorem states that theoretically, there will always be a gap, and our experiments demonstrate that this gap is small in practice. The two are not contradictory.
If you have any further questions or concerns, please do not hesitate to ask. We are more than happy to address all of your inquiries.
We would also like to emphasize that our work is not only significant but also highly original. We propose simultaneously accelerating sampling while incorporating watermarks, introduce the no-go theorem, and present two novel algorithms, MWS and MSE. All of these contributions are the first of their kind.
Considering our additional explanations, we would be extremely grateful if you could re-evaluate our work. Thank you once again for your valuable feedback and for your time in reviewing our paper.
1x SA, 1x A, 1x WA, and 1x BA. This paper studies the trade-off between sampling efficiency and watermark strength via a no-go theorem. The reviewers agree on the (1) interesting topic, (2) novel study, and (3) rigorous theoretical proof. Most of the concerns, such as the insufficient comparisons and evaluation metrics, have been addressed by the rebuttal. Therefore, the AC leans toward accepting this submission.