PaperHub
Average rating: 4.7 / 10 · Decision: Rejected · 3 reviewers (min 3, max 6, std dev 1.2)
Individual ratings: 6, 3, 5 · Average confidence: 4.0
ICLR 2024

Can AI-Generated Text be Reliably Detected?

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

In this paper, we empirically show, along with theoretical evidence, that AI-text detectors are not reliable in practical scenarios.

Abstract

Keywords: AI text detection, reliable ML, security, attacks

Reviews and Discussion

Official Review
Rating: 6

The authors highlight the potential weaknesses of AI-generated text detectors such as neural-network-based detectors, zero-shot AI text detection, watermarking, and information retrieval-based detectors. Specifically, the authors propose a recursive paraphrasing attack that repeatedly paraphrases an AI-generated text until it is likely to be classified as non-AI-generated. Experimental results using 2000 text passages from the XSum dataset, each roughly 300 tokens in length, and 2 different models used for paraphrasing suggest the attack is significantly effective against a wide range of AI-generated text detectors.
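A minimal sketch of the recursive paraphrasing loop described above, assuming hypothetical `paraphrase` and `detect` interfaces (standing in for a DIPPER-style paraphraser and any of the evaluated detectors); this is an illustration, not the authors' implementation:

```python
# Hedged sketch of recursive paraphrasing: repeatedly paraphrase AI-generated
# text until the detector is likely to classify it as human-written.
# `paraphrase` and `detect` are assumed interfaces, not the paper's code.
from typing import Callable, List

def recursive_paraphrase(
    ai_text: str,
    paraphrase: Callable[[str], str],   # e.g., a DIPPER-style paraphraser
    detect: Callable[[str], float],     # detector score in [0, 1]
    rounds: int = 5,                    # the paper uses 5 rounds
    threshold: float = 0.5,
) -> List[str]:
    """Return all intermediate paraphrases; stop early once the detector
    score for the latest version falls below `threshold`."""
    versions = [ai_text]
    for _ in range(rounds):
        versions.append(paraphrase(versions[-1]))
        if detect(versions[-1]) < threshold:
            break  # likely classified as non-AI-generated
    return versions
```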

Strengths

  • The authors address the important problem of reliable AI-generated text detection. This problem is likely to become increasingly important with the rapid rise in popularity of large language models in society.

  • The proposed approach is simple yet effective in undermining the reliability of current AI-generated text detectors. The simplicity of the attack may also make it more likely to generalize to different models and domains.

  • The authors perform a human evaluation using Amazon's Mechanical Turk (MTurk) to evaluate the resulting paraphrased text passages. Their results suggest the original content of the text passage is preserved in addition to the grammar and overall text quality.

  • The authors also discuss the overall hardness of AI text detection, providing a formal upper bound of detection performance based on the total variation (TV) between AI-generated and human-generated text distributions.

  • The paper is well-written and easy to follow.

Weaknesses

  • The experiments only use text passages from a single dataset, and tend to only evaluate a single model for each detector type. Evaluations on a wider range of datasets and detectors would greatly strengthen any generalizable claims regarding the proposed paraphrasing attack.

  • Lack of baseline methods. In many of the experiments, the proposed attack is the only method being evaluated. Have the authors compared their attack with similar attacks?

  • I appreciate the human evaluation in the "Watermarked AI Text" section; however, this type of evaluation is missing from the other experimental sections. For example, Figure 5 claims a significant drop in detector accuracy with minimal degradation in text quality; however, it is unclear to me how significantly text quality degrades based on a perplexity increase from 6.15 to 13.55.

  • The insight that smaller TV between AI-generated and human-generated text distributions leads to more difficult detection problems seems rather obvious. Although the authors show that more complex models can lead to smaller TV distances, the authors do not provide any empirical evidence that smaller TV distances actually lead to more difficult AI text detection.

  • There is no empirical runtime evaluation of the proposed attack.

  • There are several grammatical errors throughout the paper; consider using a service like Grammarly to fix these issues.

  • Figure 9 is not colorblind friendly.

Questions

  • How is TV distance defined, and why is it difficult to compute for larger datasets?

  • How did the authors determine 5 rounds of paraphrasing to be sufficient?

Comment

Figure 9 is not colorblind friendly

We have updated Figure 9 with a colorblind-friendly color palette.


How is TV distance defined, and why is it difficult to compute for larger datasets?

We use the definition of TV that measures the distance as half of the total absolute difference between the probability mass functions, i.e., the TV between two distributions P and Q is given by:

TV(P, Q) = \frac{1}{2} \sum_x |P(x) - Q(x)|

The difficulty of computing this distance for text distributions does not come from the size of the datasets. Instead, it is due to the size of the sample space. The distribution of text sequences of n tokens from the token set T has the sample space T × T × ⋯ × T (n times). The size of this space is |T|^n, which is exponential in the sequence length n. In order to compute TV, we need to estimate the probability density for each element in the sample space with a high degree of confidence. This significantly increases the number of samples needed to estimate the TV for larger sequence lengths.
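As an illustration of the sample-space issue (our toy sketch, not the authors' code), a plug-in TV estimate must visit every one of the |T|^n sequence cells, which becomes infeasible for realistic vocabularies and sequence lengths:

```python
# Toy plug-in estimate of TV(P, Q) = 0.5 * sum_x |P(x) - Q(x)| over n-token
# sequences; illustrative only. The loop runs over |vocab| ** n cells, which
# explodes long before dataset size becomes the bottleneck.
from collections import Counter
from itertools import product

def tv_estimate(samples_p, samples_q, vocab, n):
    """`samples_p`, `samples_q`: lists of n-token tuples drawn from P and Q."""
    p, q = Counter(samples_p), Counter(samples_q)
    total_p, total_q = len(samples_p), len(samples_q)
    tv = 0.0
    for seq in product(vocab, repeat=n):          # |vocab| ** n iterations
        tv += abs(p[seq] / total_p - q[seq] / total_q)
    return 0.5 * tv

# With a 50,000-token vocabulary and n = 3 there are already 50_000 ** 3
# = 1.25e14 cells, so reliable per-cell probability estimates are out of reach.
```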


How did the authors determine 5 rounds of paraphrasing to be sufficient?

The number of recursion rounds is a hyperparameter, and we chose it to be 5 without any specific reason. More rounds of recursion might further deteriorate detection rates with further tradeoffs in text quality. Note that just 2 rounds of recursion are enough to degrade the watermark detection rates to below 50% in all the settings (revised Figure 10, Appendix A.1).

Comment

We thank you for your review and for noting the problem that we address as “increasingly important”. We are delighted to know that you find our approach to be “simple yet effective” and our paper to be “well-written”. We address your comments below. We also invite you to read our global response, where we discuss common comments and new experiments in detail.


Evaluations on a wider range of datasets and detectors would greatly strengthen any generalizable claims regarding the proposed paraphrasing attack.

Thank you for your comment. In our revised Appendix A, we add more results of our attacks with all the detectors on different domains – PubMedQA (a medical text dataset) and Kafkai (Deepfake text detection by Pu et al., based on reviewer JvAi’s suggestion) – and models (OPT-1.3B and GPT-2-Medium). Consistent with our previous results in the main paper, we are able to break all the detectors in all the new experimental settings that we consider.


Lack of baseline methods. In many of the experiments, the proposed attack is the only method being evaluated. Have the authors compared their attack with similar attacks?

Our work is the first to show the limitations of 4 different classes of detectors, namely watermarking, retrieval-based, zero-shot, and trained detectors. To the best of our knowledge, we are not aware of other baseline attacks that break these detectors. As we discussed in our paper, previous works proposed weaker attacks via span replacements. However, Kirchenbauer et al. (2023) show that watermarking is robust to such attacks. Please find more details regarding our contributions in our Global response.


I appreciate the human evaluation in the "Watermarked AI Text" section; however, this type of evaluation is missing from the other experimental sections. For example, Figure 5 claims a significant drop in detector accuracy with minimal degradation in text quality; however, it is unclear to me how significantly text quality degrades based on a perplexity increase from 6.15 to 13.55.

Thank you for this comment. As reviewer Ce9G notes, the results in Figure 5 were run on a shorter and smaller dataset when compared to the ones in the “Watermarked AI Text” section. Hence, it had a few very short text samples, which resulted in high perplexity scores. To be consistent with the “Watermarked AI Text” section (and obtain improved tradeoffs between our attack performance and text quality), we perform experiments on all the detectors with longer and larger datasets in the revision.

We also modified Figure 5 in the revision based on the new dataset. As shown in the Figure, the perplexity only increases from 7.6 to 9.3 (measured using a larger OPT-13B model), while the detection rate degrades from 100% to below 60%, showing a clearer tradeoff between our attack performance and text quality. We agree with the reviewer that having a complementary human evaluation could be insightful for all detectors. Unfortunately, performing such large-scale human studies with all the datasets and detectors would be expensive for us. However, we have provided several examples of paraphrased texts in Appendix B.2. More experimental results are also provided in Appendix A.


Although the authors show that more complex models can lead to smaller TV distances, the authors do not provide any empirical evidence that smaller TV distances actually lead to more difficult AI text detection.

Our theoretical result (Theorem 1 and Figure 7) shows that as the total variation between human and AI-generated text decreases, the performance of even the best possible detector also decreases. This implies that smaller TV distances will lead to more difficult AI text detection.

Another way of understanding this phenomenon could be by considering the true and false positive rates (TPR and FPR). For a given FPR (say, 1%), the goal of a detector would be to maximize the TPR. However, the difference between TPR and FPR is bounded by the total variation. Thus, if the total variation decreases, the best-possible TPR also decreases, making detection harder. With this in mind, it is sufficient to empirically show that the TV decreases as AI models become more sophisticated.
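For completeness, the standard bound this argument appeals to can be written as follows; this is our hedged restatement, not the paper's exact Theorem 1:

```latex
% Any detector corresponds to an acceptance region A (texts flagged as AI).
% Its TPR - FPR gap is bounded by the total variation between the AI text
% distribution P_AI and the human text distribution P_H:
\[
  \mathrm{TPR} - \mathrm{FPR}
    = \mathbb{P}_{\mathrm{AI}}(A) - \mathbb{P}_{\mathrm{H}}(A)
    \le \sup_{A'} \bigl| \mathbb{P}_{\mathrm{AI}}(A') - \mathbb{P}_{\mathrm{H}}(A') \bigr|
    = \mathrm{TV}\!\left(\mathbb{P}_{\mathrm{AI}}, \mathbb{P}_{\mathrm{H}}\right).
\]
```

So at any fixed FPR, the best achievable TPR is capped at FPR + TV, which is why a shrinking TV directly caps detector performance.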

Note that showing detection is difficult with respect to a particular detector does not imply the nonexistence of a better detector capable of distinguishing human and AI-generated texts. Our objective is to show a fundamental difficulty in detection. Thus, we first show that detection performance decreases with a decreasing TV (Theorem 1). Then, we provide empirical evidence that TV decreases as AI models become more sophisticated.


There is no empirical runtime evaluation of the proposed attack.

Thank you for the comment. We measure the runtime of our recursive paraphrasing attack (5 rounds) on AI passages of 300 tokens in length to be 36 seconds per passage. We add this to our revised Appendix A.1.

Comment

Dear reviewer reDY,

We thank you for your thoughtful comments and feedback. Since we are nearing the end of the discussion phase, we would like to know if you have any remaining concerns regarding our revised paper or additional experiments. We’d be happy to address them.

Comment

I thank the authors for their detailed response; however, after reading the response and the concerns brought up by other reviewers, I am inclined to retain my original score.

Comment

Once again, we thank you for your review and feedback.

Official Review
Rating: 3

The paper presents a study examining whether AI-generated text can be reliably detected. To this end, the authors perform several experiments that transform text through recursive paraphrasing and showcase the vulnerability of existing detection/defense algorithms. Further, the authors showcase spoofing attacks, which aim to make the defense algorithm label genuine human text as AI-generated.

Strengths

It is highly important to understand the limitations of current AI-generated text detection algorithms and to accurately identify the generated text to protect privacy and ethics.

Weaknesses

  • The biggest concern is with the paraphrasing task. Should the paraphrased text generated by an LLM itself be considered AI-generated? Does paraphrasing not destroy the inherent characteristics of the generating model? Have the original (human) texts also been paraphrased by the paraphrasers? What is the impact of paraphrasing genuine texts?
  • Is there an ablation study of different paraphrasers? What is the contribution here? Have the authors directly used existing algorithms for paraphrasing? Given the existence of several related studies [1, 2], the current paper makes a limited contribution concerning the sensitivity of LLM detectors.

[1] Krishna, K., Song, Y., Karpinska, M., Wieting, J., and Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. NeurIPS 2023.

[2] Kumarage, T., Sheth, P., Moraffah, R., Garland, J., and Liu, H. How Reliable Are AI-Generated-Text Detectors? An Assessment Framework Using Evasive Soft Prompts. EMNLP 2023.

  • The experimental setting is weak. It is not clear how many samples have been used (2000 or 1000).
  • Are these existing detectors trained using multiple augmentation strategies, such as data from multiple LLMs and/or paraphrased samples?
  • For retrieval-based detectors, only 100 samples have been used, reflecting inconsistency in the experimental setup. Figure 5 also shows a drastic change in the perplexity value.
  • A detailed experimental study is needed concerning the experiments in Sections 4 (i) and 4 (ii). Only sequences of 3 tokens are used, along with a single-layer LSTM network. Further, is the decrement shown in Figure 9 statistically significant?

Questions

Please check the weakness section.

--------------- Post Rebuttal --------------

Thanks for responding. However, in light of serious concerns such as the destruction of inherent characteristics of LLMs through paraphrasing, existing works on similar themes, and limited evaluation, I would like to retain my original rating.

The authors could conduct the analysis when human texts are also paraphrased, and these paraphrased texts could also be used when augmenting the data. If we use both original (human and AI) and augmented (paraphrased human and AI) data, will it still increase the type-1 (or type-2) error but decrease the other one?

Comment

We thank you for your review and for noting our research problem of analyzing the limitations of AI text detectors to be “highly important”. We address your comments below. We also invite you to read our global response, where we discuss common comments and new experiments in detail.


Have the original (human) texts also been paraphrased by the paraphrasers? What is the impact of paraphrasing genuine texts?

We only paraphrase the LLM generations and not the original human text. Paraphrases of original human text would be considered AI text, since they are generated by a paraphrasing language model. Hence, we do not paraphrase them with a language model, so that they remain genuine human texts.


The biggest concern is with the paraphrasing task. Should the paraphrased text generated by an LLM itself be considered AI-generated?

The paraphrased text should be considered as AI text by definition since it is output by a language model. For example, if a user queries a watermarked ChatGPT to paraphrase their essay, the paraphrased output will be watermarked and hence detected as AI-generated.


Does paraphrasing not destroy the inherent characteristics of the generating model?

Yes, it might destroy the inherent characteristics of the generating model. Intuitively, this might be the reason why our attack works. For example, paraphrasing a watermarked text might remove the watermark patterns, destroying the inherent characteristics of watermarked LLM text. This is what an attacker desires.
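To make the intuition concrete, a hypothetical sketch (ours, with made-up token counts) of the z-test used by Kirchenbauer et al. (2023) to detect the watermark: paraphrasing replaces many green-list tokens, pulling the z-score back toward chance level:

```python
# Illustrative sketch, not the authors' code: the Kirchenbauer et al. (2023)
# detector counts green-list tokens and computes a one-proportion z-score.
# Paraphrasing tends to replace green-list tokens, shrinking the score.
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    return (green_count - gamma * total_tokens) / math.sqrt(
        total_tokens * gamma * (1 - gamma)
    )

# Hypothetical watermarked passage: 180 of 300 tokens land on the green list.
print(watermark_z_score(180, 300))  # = 14.0, far above a typical threshold of z = 4
# After recursive paraphrasing, green tokens fall back toward the chance rate gamma.
print(watermark_z_score(85, 300))   # ~ 1.33, below the detection threshold
```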


What is the contribution here? Have the authors directly used existing algorithms for paraphrasing? Given the existence of several related studies, the current paper makes a limited contribution concerning the sensitivity of LLM detectors.

After receiving advice from the program chairs, we can note that a draft of our work had appeared on a public platform before the related papers were made public. To keep anonymity, we cannot reveal the name of the platform or give a link to it. Nevertheless, ours is the first work to comprehensively show the limitations of 4 different classes of text detectors – watermark-based, retrieval-based, zero-shot, and trained detectors. Though we use existing paraphraser models, we are the first to propose recursive paraphrasing attacks to effectively break the stronger retrieval-based detector by Krishna et al. (2023) and watermark-based detectors. We are the first to present spoofing attacks on text detectors that can potentially affect the reputation of LLM developers. We also present novel theoretical results that indicate that reliable AI text detection could get increasingly difficult as LLMs evolve. Please find our contribution details on page 3 (last paragraph).


Ablation study of different paraphrasers?

To our knowledge, DIPPER by Krishna et al. (2023) is the strongest open-source paraphraser that exists. We perform recursive paraphrase attacks only using DIPPER since weaker paraphrasers will generate texts of lower quality with recursion. We would like to highlight again that our empirical contributions here are the algorithms that we designed for the attacks (recursive paraphrasing and spoofing) and not the paraphraser model.


The experimental setting is weak. It is not clear how many samples have been used (2000 or 1000).

Thank you for the comment; we have revised the paper to make this clearer. We use a total of 2000 samples, 1000 each for the human and AI text classes.

We further perform more experiments on all the detectors with larger and longer samples in the revised Appendix A. We consider different domains – PubMedQA (a medical text dataset) and Kafkai (Deepfake text detection by Pu et al., based on reviewer JvAi's suggestion) – and multiple target LLMs (OPT-1.3B and GPT-2-Medium). We consider 1000 to 2000 samples per experiment in all these settings. Consistent with our previous results, we are able to break all the detectors with a slight tradeoff in text quality.


Are these existing detectors trained using multiple augmentation strategies, such as data from multiple LLMs and/or paraphrased samples?

To the best of our knowledge, the existing open-sourced trained detectors do not use augmented/paraphrased data for training. However, our Corollary 2 presents a fundamental tradeoff between type-1 and type-2 errors. It indicates that as detectors become more robust to paraphrases (by training on paraphrased samples), potentially more human passages will be wrongly flagged as AI text. Thus, reducing type-2 errors might lead to an increase in type-1 errors, which is not desirable.

Corollary 2 also indicates that as AI-paraphrasers become more human-like, this tradeoff will become more significant. Hence, the fundamentally hard task of text detection will become increasingly difficult as AI-paraphrasers evolve.

Comment

For retrieval-based detectors, only 100 samples have been used, reflecting inconsistency in the experimental setup. Figure 5 also shows a drastic change in the perplexity value.

Thanks for your comment. We acknowledge that the results in Figure 5 were run on a shorter and smaller dataset when compared to the ones in the “Watermarked AI Text” section. Hence, it had a few very short text samples, which resulted in high perplexity scores. To be consistent with the “Watermarked AI Text” section (and obtain improved tradeoffs between our attack performance and text quality), we perform experiments on all the detectors with longer and larger datasets in the revision.

We also modified Figure 5 in the revision based on the new dataset. As shown in the Figure, the perplexity only increases from 7.6 to 9.3 (measured using a larger OPT-13B model), while the detection rate degrades from 100% to below 60%, showing a clearer tradeoff between our attack performance and text quality.


A detailed experimental study is needed concerning the experiments in Sections 4 (i) and 4 (ii). Only sequences of 3 tokens are used, along with a single-layer LSTM network. Further, is the decrement shown in Figure 9 statistically significant?

Thanks for the comment. For 4(i), we add the plots for varying sequence lengths in the Appendix (Figure 14b). For 4(ii), Figure 9 shows TV estimates with sequence lengths of 3, 4, and 5. We have modified Figure 9 in the revision with error bars. In all these settings, we consistently observe that the TV estimates reduce as the model gets bigger. We are performing more experiments with different sequence lengths, and vocabulary combinations. However, they are computationally expensive, and hence, we will update the manuscript if we finish our experiments before the discussion period ends. Otherwise, we will add these results to the final version of the paper.

Comment

Dear reviewer Ce9G,

We thank you for your thoughtful comments and feedback. Since we are nearing the end of the discussion phase, we would like to know if you have any remaining concerns regarding our revised paper or additional experiments. We’d be happy to address them.

Comment

We thank you for engaging in the discussion. We have tried to address your concerns below. We hope our responses clarify your concerns.


serious concerns such as the destruction of inherent characteristics of LLMs through paraphrasing

As we discuss in our rebuttal response, we do not believe that destroying inherent characteristics of the target LLM outputs is a concern; rather, this is what an attacker desires. We want to highlight that the paraphrasing attacks we perform remove these inherent LLM signatures while maintaining the context, meaning, and quality of the text (please see our human study in Table 1 and Appendix B.1). Hence, an attacker can effectively evade detection via automated paraphrasing.

For instance, suppose an attacker uses a watermarked LLM to generate propaganda. They can paraphrase the watermarked propaganda to remove the watermark patterns (which is the inherent watermarked LLM characteristic) while maintaining the content and quality of the text. In this manner, the attacker can increase their chances of evading the watermark detector.

Please let us know if this explanation addresses your concern.


existing works on similar themes

Our work is the first to analyze the vulnerabilities of four different classes of existing detectors. As we mentioned previously, based on the advice we received from the Program Chair, we can reveal that a draft of our work was released on a public platform even before the related works (including Krishna et al. and Kumarage et al. mentioned by the reviewer) were made public. However, in order to preserve anonymity, we can not give more details or link to the draft.

In spite of this, ours is the first work to comprehensively analyze the limitations of 4 different categories of text detectors. We are the first to break and provide state-of-the-art attack results against the stronger watermarking (Kirchenbauer et al. 2023) and retrieval-based (Krishna et al. 2023) detectors. We are the first to reveal a new vulnerability of these text detectors to spoofing attacks where an adversarial human can write a text that is detected to be AI-generated. We are also the first to theoretically show results that indicate the hardness of AI text detection.


limited evaluation

Please note that based on your previous review, we added more experimental settings to Appendix A of our revised draft. Here, we analyze all the detectors with large and long datasets (datasets varying from 1000 to 2000 passages, with each passage 200 to 300 tokens in length). We also evaluate the text quality using perplexity and MTurk human studies. One can always add more datasets and models to a paper, but we feel that our experimental results are comprehensive enough to reveal the vulnerabilities of existing detectors.


The authors can conduct the analysis when human texts are also paraphrased, and while augmenting the data, these paraphrased texts can also be used.

If we understand the comment correctly, you are suggesting the use of data augmentations via paraphrased passages to train detectors. We believe this is an interesting idea for future work. But please note that our focus in this paper is to reveal vulnerabilities of existing detectors, and training new (and perhaps more robust) detectors is not in the scope of this paper.

Official Review
Rating: 5

The authors claim current methods for detecting AI-generated text from LLMs are ineffective and that their proposed recursive paraphrasing attack can bypass detectors. Watermarking techniques are also vulnerable and can be fooled by their proposed method into misidentifying human text as AI-generated. The authors claim the challenge of distinguishing AI from human text is fundamentally difficult, as evidenced by a proposed theoretical model.

Strengths

  • The paper is well written and structured, and the research tackles an increasingly important topic in the AI community.
  • The paper's focus on recursive paraphrasing attacks represents an innovative and practical contribution to the field of AI security by showing that these attacks can effectively remove watermarks from AI text.
  • The paper supports its practical experiments with theoretical proofs, providing a deep understanding of the problem space.

Weaknesses

The paper does not include testing on a diverse array of datasets, such as M4 or the Deepfake text detection benchmark, which encompass multi-generator, multi-domain, and multi-lingual data. Incorporating these datasets could provide a more comprehensive evaluation of the paraphrasing model's effectiveness across different text generation sources, domains, and languages.

Pu, J., Sarwar, Z., Abdullah, S. M., Rehman, A., Kim, Y., Bhattacharya, P., ... & Viswanath, B. (2023, May). Deepfake text detection: Limitations and opportunities. In 2023 IEEE Symposium on Security and Privacy (SP) (pp. 1613-1630). IEEE.

Wang, Yuxia, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse et al. "M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection." arXiv preprint arXiv:2305.14902 (2023).

Recursively paraphrased text could potentially suffer from semantic drift, where the meaning changes or degrades with each paraphrase iteration. How do you address the concern of maintaining semantic integrity and coherence with only perplexity metrics in text after multiple rounds of paraphrasing?

Questions

Question 1: Recursively paraphrased text could potentially suffer from semantic drift, where the meaning changes or degrades with each paraphrase iteration. How do you address the concern of maintaining semantic integrity and coherence in text after multiple rounds of paraphrasing?

Question 2: Could you elaborate on how perplexity (without semantic understanding) and other quality metrics have been validated to accurately reflect the readability and coherence of the paraphrased text?

Question 3: Given that detection methods are constantly evolving, how might adaptive detectors, which are designed to learn and counteract paraphrasing patterns over time, impact the effectiveness of the DIPPER paraphrasing model?

Comment

We thank you for your review and for noting our paper to be "well-written and structured". We are happy to find that the reviewer believes we tackle an "increasingly important topic in the AI community" and that our attacks and theory are "innovative and practical". We address your comments below. We also invite you to read our global response, where we discuss common comments and new experiments in detail.


The paper does not include testing on a diverse array of datasets... Incorporating these datasets could provide a more comprehensive evaluation of the paraphrasing model's effectiveness across different text generation sources, domains, and languages.

Thank you for your comment. In our revised Appendix A, we add more results of our attacks on different domains – PubMedQA (a medical text dataset) and Kafkai (Deepfake text detection by Pu et al., based on your review). We also evaluate the attacks on two target LLMs – OPT-1.3B and GPT-2-Medium. Consistent with our previous results in the main paper, we are able to break all the detectors in all the new experimental settings that we consider.


How do you address the concern of maintaining semantic integrity and coherence with only perplexity metrics in text after multiple rounds of paraphrasing? Could you elaborate on how perplexity (without semantic understanding) and other quality metrics have been validated to accurately reflect the readability and coherence of the paraphrased text?

As you rightly note, metrics in NLP, such as perplexity, have their limitations. To evaluate the semantic quality of our recursive paraphrasing framework, we perform MTurk human evaluations. As shown in Section 2.2 (Table 1) and Appendix B.1, the human evaluators scored 70% of our recursive paraphrases to have high-quality content preservation. 89% of the recursive paraphrases were scored to have high text quality or grammar.


Given that detection methods are constantly evolving, how might adaptive detectors, which are designed to learn and counteract paraphrasing patterns over time, impact the effectiveness of the DIPPER paraphrasing model?

This is an interesting question that can be answered using Corollary 2 (Appendix C.2) in our paper. Here we discuss a fundamental tradeoff of AI text detection in the presence of paraphrasing. Corollary 2 indicates that if a detector becomes more robust to AI paraphrasing, type-1 errors will increase, and more human passages will be wrongly flagged as AI text by the detector. This shows a tradeoff between type-1 and type-2 errors of AI text detectors.

Corollary 2 also indicates that as AI-paraphrasers become more human-like, this tradeoff will become more significant. Hence, the fundamentally hard task of text detection will become increasingly difficult as AI-paraphrasers evolve.

Comment

Dear reviewer JvAi,

We thank you for your thoughtful comments and feedback. Since we are nearing the end of the discussion phase, we would like to know if you have any remaining concerns regarding our revised paper or additional experiments. We’d be happy to address them.

Comment

We thank all the reviewers for their comments. All the reviewers find our work to be tackling an increasingly important topic in AI. We appreciate reviewers JvAi and reDY for commenting on our paper being well-written and our attacks being innovative and effective. Below we address some common concerns that the reviewers had.


New attack experiments

We add new experiment results to Appendix A, where we consider settings with longer and larger datasets (XSum, PubMedQA, Kafkai) and multiple target LLMs (OPT-1.3B and GPT-2-Medium). To be consistent in our experiments, we evaluate the performance of all the detectors with these new larger datasets. In all the experiments, we find that the existing detectors can be broken with our paraphrasing attacks. The results with PubMedQA (medical text dataset) and Kafkai (articles from 10 different domains such as cybersecurity, SEO, and marketing) datasets show the robustness of our attacks to distribution shifts. In all the experiments with XSum, we use 2000 samples, each 300 tokens in length. For all the other experiments, we consider datasets with 1000 samples that are 200 tokens long.


Effect of adaptive detectors to counteract paraphrase attacks

In Corollary 2 (Appendix C.2), we discuss a fundamental tradeoff of AI text detection in the presence of paraphrasing. Corollary 2 indicates that if a detector becomes more robust to AI paraphrasing, type-1 errors will increase, and more human passages will be wrongly flagged as AI text by the detector. This reveals a fundamental tradeoff between type-1 and type-2 errors. Hence, making adaptive detectors or training them to be robust to paraphrased text samples might not be desirable.

Corollary 2 also indicates that as AI-paraphrasers become more human-like, this tradeoff will become more significant. Hence, the fundamentally hard task of text detection will become increasingly difficult as AI-paraphrasers evolve.


Contributions

After receiving advice from the program chairs, we can note that a draft of our work had appeared on a public platform before the related papers were made public. To keep anonymity, we cannot reveal the name of the platform or give a link to it. Nevertheless, ours is the first work to comprehensively show the limitations of 4 different classes of text detectors – watermark-based, retrieval-based, zero-shot, and trained detectors. Though we use existing paraphraser models, we are the first to propose recursive paraphrasing attacks to effectively break the stronger retrieval-based detector by Krishna et al. (2023) and watermark-based detectors. We are the first to present spoofing attacks on text detectors that can potentially affect the reputation of LLM developers. We also present novel theoretical results that indicate that reliable AI text detection could get increasingly difficult as LLMs evolve. Please find our contribution details on page 3 (last paragraph).

AC Meta-Review

Given the somewhat large discrepancy among reviewers, I read the paper myself to also weigh in on the decision.

Overall, I think this paper addresses an important and timely problem, but the current set of results could be improved before publication.

In particular:

  • it is unclear how the attack would work if the attacker's paraphraser is much weaker than the model used for generation. In the paper, the attacker uses an 11B model to paraphrase the outputs of a 1B model. But if the attacker has access to such a model, they could just use it in the first place. It would be interesting here to show that a much smaller and weaker paraphrasing model can break the watermark of a stronger model.
  • it is similarly unclear how much the paraphrasing degrades performance. The authors did a human study for this which is commendable, but it is hard to interpret these results. In addition to this study, the paper could take a number of text benchmarks (e.g., question answering datasets) and see if their recursive paraphrasing harms performance.
  • reporting the best of the pp_i results (the i-th round of recursive paraphrasing) is somewhat misleading, as an attacker wouldn't know a priori which pp_i is best.
  • the spoofing attacks are interesting, but I wonder how practical these would be. E.g., the attack on the Kirchenbauer scheme requires 1M queries to the model which seems excessive.
  • The result on TV distance is not novel. It is well-known that distributions with low TV distance become harder to distinguish. This is a standard result in many cryptography textbooks for instance.

Why not a higher score

See meta-review above.

Why not a lower score

N/A

Final Decision

Reject