Large Language Models can be Guided to Evade AI-generated Text Detection
We propose a novel substitution-based in-context example optimization algorithm to automatically generate prompts which make LLMs evade AI-generated text detectors.
Abstract
Reviews and Discussion
This paper proposes SICO, a prompting-based approach to guide LLMs to evade machine-generated text detectors. SICO operates by extracting text features in natural language and optimizing in-context examples with token substitution. Experiments demonstrate that SICO outperforms paraphrasing-based baselines against six detectors across three tasks.
Strengths
- machine-generated text detection is an important research question
- the proposed approach is flexible and promising
Weaknesses
- The feature extraction part could be better motivated and investigated. The effectiveness of such a component relies on two assumptions: 1) LLMs could reliably summarize and extract the "features" of human-written texts from examples; 2) LLMs could use those style "features" to guide their generation. Both would need more support, in addition to showing that having the "feature extraction" component helps model performance.
- It is not guaranteed that the synonym-based token substitution process will lead to decreases in scores, right? Is the substitution performed only once for each passage? What are the stopping criteria? How often does this mechanism fail to find a new passage with a substantially lower score? In addition, I would not call this 'optimization', as the process is not differentiable; it simply tries different synonyms over and over.
- WordNet is used for token substitutions in this work. My understanding is that WordNet can be quite noisy, containing words and tokens that are very infrequent in human-written text. As a result, it is surprising that WordNet substitutions do not reduce readability. I wonder whether the authors conducted any filtering/selection of the WordNet candidates to mitigate these issues.
- I wonder whether the authors have analyzed the correlation between detector capability and SICO evasion rate. In other words, does SICO advance the state of the art more against the most capable detectors or against the less capable ones?
- I am not sure that Section 4.4 is of great scientific value. The likes and dislikes barely provide meaningful signals for research purposes, since social media platforms can be quite random. The 7.5% detection rate is arguably a very loose lower bound on the actual human detection rate, since it requires users to go out of their way to voice their suspicion. A more rigorous study could be conducted by asking human readers to explicitly distinguish human- and machine-generated text, similar to Section 4.3.
Questions
Please see above.
Q3. Token substitution constraints
WordNet is used for token substitutions in this work. My understanding is that WordNet can be quite noisy, containing words and tokens that are very infrequent in human-written text. As a result, it is surprising that WordNet substitutions do not reduce readability. I wonder whether the authors conducted any filtering/selection of the WordNet candidates to mitigate these issues.
Thanks for your thoughtful comment. Your concern about the potential noisiness of WordNet is indeed valid. Yes, we employ a masked language model (MLM) to filter out substitution words that do not fit the context. Specifically, we use the MLM to compute the probability of each candidate substitution given its context, and candidates whose probability falls below a predetermined threshold are discarded. The details are shown in Appendix A.1.3.
Additionally, it is important to highlight the inherent ability of LLMs to maintain readability: even if a prompt includes some uncommon or awkward phrases, the LLM-generated text still exhibits high readability.
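For readers' reference, below is a minimal sketch of the MLM filtering step described above. The model choice, threshold value, and function names are illustrative placeholders, not the exact implementation in Appendix A.1.3.

```python
# Illustrative sketch only: keep a WordNet candidate if a masked LM judges it
# plausible in context (model, threshold, and names are placeholders).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def keep_substitution(words, position, candidate, threshold=0.01):
    """Return True if the MLM probability of `candidate` at `position` exceeds the threshold.

    Assumes `candidate` is a single word in the MLM vocabulary.
    """
    masked = list(words)
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    probs = logits[0, mask_pos].softmax(dim=-1)
    cand_id = tokenizer.convert_tokens_to_ids(candidate.lower())
    return probs[cand_id].item() >= threshold

# Example: check whether "content" fits as a replacement for "happy".
print(keep_substitution("I am happy with the result".split(), 2, "content"))
```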
Q4. Correlation between detector performance and SICO evasion rate
I wonder whether the authors have analyzed the correlation between detector capability and SICO evasion rate. In other words, does SICO advance the state of the art more against the most capable detectors or against the less capable ones?
Thank you for your valuable question. We conducted an analysis and found that SICO performs better against more capable detectors.
Here are the details of the analysis: we use each detector's performance on the original AI-generated text to represent its base capability, and we measure SICO's improvement as the AUC difference between SICO and the best-performing paraphraser baseline. A Pearson correlation analysis yielded r = 0.47 with a p-value of 0.048, indicating a moderate positive correlation.
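For clarity, the analysis amounts to the following computation (a sketch with placeholder variable names; the actual AUC values are those reported in our experiments):

```python
# Sketch of the correlation analysis; inputs are placeholders for our measured values.
from scipy.stats import pearsonr

def correlation_analysis(detector_base_auc, sico_auc_improvement):
    # detector_base_auc: each detector's AUC on original AI-generated text (base capability)
    # sico_auc_improvement: AUC gap between SICO and the best paraphraser baseline
    r, p_value = pearsonr(detector_base_auc, sico_auc_improvement)
    return r, p_value  # we obtained r = 0.47, p = 0.048
```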
Q5. Concerns about Reddit test
I am not sure that Section 4.4 is of great scientific value. The likes and dislikes barely provide meaningful signals for research purposes, since social media platforms can be quite random. The 7.5% detection rate is arguably a very loose lower bound on the actual human detection rate, since it requires users to go out of their way to voice their suspicion. A more rigorous study could be conducted by asking human readers to explicitly distinguish human- and machine-generated text, similar to Section 4.3.
Thank you for your valuable feedback. While it is true that social media reactions can be somewhat random, the significantly higher number of likes than dislikes (40% vs. 2.5%) suggests that these responses are generally well received by real-world users.
Regarding the detection rate, we acknowledge your point that the 7.5% detection rate is likely a lower estimate of the human ability to identify AI text. To address this concern, we conducted a more rigorous experiment, as you suggested. We randomly selected 150 examples (50 each from AI, SICO-Para, and human responses) from the QA task, and four human annotators were asked to identify whether each answer was AI-generated or human-written. The results are as follows, with 'AI %' indicating the proportion of texts identified as AI-generated:
| Text | AI % |
|---|---|
| AI | 54% |
| SICO | 26% |
| Human | 10% |
One can see that SICO markedly reduces the probability of being recognized by humans. The data also reveal that human detection is not highly accurate compared with machine detectors: approximately 50% of AI-generated texts went undetected, and 10% of human-written texts were mistakenly identified as AI-generated.
I would like to thank the authors for the detailed response. I will closely monitor the interactions between the authors and other reviewers before proposing a final rating.
We sincerely appreciate your great efforts in reviewing this paper. If you have any further concerns, please let us know. We look forward to your final assessment and any further feedback.
Thank you for the feedback and the constructive questions. Below, we have responded to your concerns and questions. We hope our response addresses your concerns. If you have further questions that prevent you from raising your score, please let us know.
Q1: Concerns regarding feature extraction
1) LLMs could reliably summarize and extract the "features" of human-written texts from examples.
Thanks for your thoughtful comment. To test whether the LLM can reliably summarize and extract the "features" of human-written texts from examples, we ran a simple test with 3 separate feature extractions on different sets of human-written text. We then used each extracted feature set to guide LLM generation and measured the AUC drop relative to normally generated text, evaluated by GPT3.5-D on the writing task. One can see that the LLM consistently extracts features useful for detector evasion, even from different example sets.
| | Feature 1 | Feature 2 | Feature 3 |
|---|---|---|---|
| AUC Drop | -0.288 | -0.261 | -0.142 |
2) LLMs could use those style "features" to guide their generation.
In fact, our ablation study in Table 4 (Page 8) demonstrates the effectiveness of feature guidance: the "w/o ICE" variant uses only the extracted features to guide LLM generation, and its results show that the feature component helps detector evasion.
Q2. Concerns regarding token substitution
It is not guaranteed that the synonym-based token substitution process will lead to decreases in scores, right?
Yes, there is no theoretical guarantee. In practice, however, the token substitution process usually decreases the detector's AI score, because the greedy policy selects, for each position, the substitution that decreases the score the most.
Is the substitution performed only once for each passage?
If by "passage" you mean "in-context example", the answer is no: the optimization process repeats for a fixed number of iterations, as illustrated in Algorithm 2 (Page 4).
What are the stopping criteria?
The optimization process stops after running this fixed number of iterations.
How often does this mechanism fail to find a new passage that has substantially lower scores?
For each in-context example, the algorithm can always find a new in-context example with a lower score in the first three iterations. In the last three iterations, however, multiple experiments show that in approximately 20% of cases it cannot find one.
Nevertheless, since the prompt is composed of 8 in-context examples, the prompt text always changes during optimization.
In addition, I would not call this 'optimization', as the process is not differentiable; it simply tries different synonyms over and over.
We appreciate the reviewer's perspective but wish to clarify the term "optimize" as used in our context. While the reviewer focuses on differentiability, not all optimization requires a differentiable objective function; for instance, in discrete optimization problems such as the TSP, where solutions are drawn from a set of discrete choices, differentiability is not a prerequisite.
We formulate the problem as combinatorial optimization because text is discrete, and we employ a greedy search to navigate this discrete solution space.
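To make the greedy search over the discrete space concrete, here is an illustrative sketch; `detector_score`, `get_candidates`, and the iteration budget are placeholders rather than the exact procedure of Algorithm 2.

```python
# Illustrative greedy word-substitution loop (not the exact Algorithm 2).
def greedy_substitution(words, detector_score, get_candidates, n_iters):
    """Each pass greedily picks, for every position, the candidate word that
    lowers the detector's AI score the most; the loop runs for a fixed number
    of passes, which is the stopping criterion mentioned above."""
    for _ in range(n_iters):
        for i in range(len(words)):
            best_word, best_score = words[i], detector_score(words)
            for cand in get_candidates(words[i]):   # e.g. WordNet synonyms, MLM-filtered
                trial = words[:i] + [cand] + words[i + 1:]
                score = detector_score(trial)
                if score < best_score:              # keep only score-decreasing swaps
                    best_word, best_score = cand, score
            words = words[:i] + [best_word] + words[i + 1:]
    return words
```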
This work proposes to utilize prompts to steer (the style of) the generator and deceive AI-generated text detectors. A Substitution-based In-Context example Optimization method (SICO) is proposed to automatically construct such prompts. The methodology is feasible, but the work lacks motivation in terms of social benefit and lacks automatic evaluation of the quality of the generated text.
Strengths
- The technique for constructing prompts is interesting, although I am not an expert in this direction.
- The paper is well written and thus easy to follow.
Weaknesses
- This work proposes to deceive the detector from a malicious perspective. We are interested in how this work could help from a benign perspective, e.g., how it helps to train a more robust detector.
- Several major concerns about the evaluation limit the soundness of this work:
  - For the results in Table 1, several baselines already achieve reasonable results, e.g., Parrot against DetectGPT, since an AUC of about 0.5 should already show the outputs are quite obscured. Achieving a very low AUC is a negative signal for these experiments.
  - Regarding the quality of the generated outputs, only a human evaluation against DIPPER is considered, leaving several concerns: 1) how will it perform on other datasets? 2) how will it perform on other automatic evaluation metrics? 3) why are some scores higher than human outputs, e.g., on Writing?
- Considering the detectors are basically 'classifiers', finding adversarial examples is entirely possible. The main finding of this work is not excitingly novel to our community.
Questions
See the bullet points under Weaknesses.
Q4: Concerns about novelty of finding
Considering the detectors are basically 'classifiers', finding adversarial examples is entirely possible. The main finding of this work is not excitingly novel to our community.
We respectfully disagree with the remark that our work lacks novelty in finding adversarial examples (AEs). Our research introduces an approach that is considerably more challenging than traditional adversarial attack methods in both its goals and its techniques. Below, we list the unique aspects:
- Unique and Ambitious Goal: Our research aims to find a universal prompt that adapts to various task inputs to evade detectors, whereas a typical adversarial attack seeks an AE for one specific instance. Essentially, our work designs an algorithm that automatically constructs an AE generator.
- Direct Generation of AEs: Our method directly generates AEs from the task input, whereas typical adversarial attacks require iterative editing of existing instances. For example, given a question in the QA task, our method directly generates an answer that evades detectors; typical attacks instead edit a pre-existing, normal answer to achieve the adversarial goal. Direct generation is more challenging than iterative editing.
Our prompt optimization algorithm overcomes these challenges, successfully finding a universal prompt and turning the LLM into a flexible AE generator. Moreover, our experimental results reveal the vulnerabilities of current AI text detection systems, which is significant in the era of LLMs. We believe these contributions are novel and exciting for our community.
References
[1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, Rob Fergus: Intriguing properties of neural networks. ICLR 2014
[2] Nicolas Papernot, Patrick D. McDaniel, Ananthram Swami, Richard E. Harang: Crafting adversarial input sequences for recurrent neural networks. MILCOM 2016
[3] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh: Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. ACL 2018
[4] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn: DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. ICML 2023
Thank you for the feedback and the constructive questions. Below, we have responded to your concerns and questions. We hope our response addresses your concerns. Please let us know if you have further questions preventing you from raising your score.
Q1: Concerns about benign applications of our work
This work proposes to deceive the detector from a malicious perspective. We are interested in how this work could help from a benign perspective, e.g., how it helps to train a more robust detector.
Thanks for your instructive comments. We fully agree with you and acknowledge the importance of exploring the benign applications of our work, particularly in terms of developing more robust AI detection systems.
However, it is crucial to recognize that AI-generated text detection for LLMs is still at an early stage. The typical evolutionary path of such systems involves an initial development phase, followed by the emergence of various attack methods, as seen in other domains [1, 2, 3]; these attacks, in turn, drive the development of defensive strategies. Therefore, while our research primarily takes the attacker's perspective, it contributes to the field by highlighting potential weaknesses in existing AI detection systems. We believe that not providing defensive methods does not weaken the value of a stronger attack baseline. Additionally, our approach can potentially serve as an adversarial-example generator to help train detectors. We leave other defensive techniques for future research.
Q2: Concerns about attack evaluation
For the results in Table 1, several baselines already achieve reasonable results, e.g., Parrot against DetectGPT, since an AUC of about 0.5 should already show the outputs are quite obscured. Achieving a very low AUC is a negative signal for these experiments.
We respectfully disagree with this argument. First, we want to clarify that these detectors do not simply label text as 'AI-generated' or 'human-written'; instead, they assign an AI score that quantifies the likelihood that the text is AI-generated, with a higher score meaning a higher probability of AI generation.
Next, consider the cases where the AUC is below 0.5. This indicates that the detector often assigns a lower AI score to SICO-generated text than to actual human-written text. In that case, no matter which classification threshold is chosen, the detector cannot distinguish the two types of text and performs worse than random guessing. Therefore, a low AUC is not a negative signal; rather, it highlights the limitations and potential unreliability of those detectors.
For reference, we present the average AI scores assigned by DetectGPT and Log-Rank to the three types of text in the writing task. One can see that, on average, human-written text receives a higher AI score than SICO-generated text.
| Text | DetectGPT | Log-Rank |
|---|---|---|
| Orig. AI-Generated | 0.947 | -1.16 |
| SICO-Para | -0.233 | -2.10 |
| Human-Written | 0.396 | -1.62 |
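As a toy illustration of this point (the scores below are made up for demonstration, not taken from our experiments):

```python
# When every attacked text receives a lower AI score than every human text,
# the detector ranks them in the wrong order and AUC drops below 0.5.
from sklearn.metrics import roc_auc_score

human_scores = [0.40, 0.35, 0.50, 0.30]   # AI scores given to human-written text
attack_scores = [0.10, 0.20, 0.15, 0.25]  # AI scores given to evasive AI text

labels = [0] * len(human_scores) + [1] * len(attack_scores)  # 1 = AI-generated
scores = human_scores + attack_scores
print(roc_auc_score(labels, scores))  # 0.0: worse than random guessing (0.5)
```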
Several baselines already achieve reasonable results, e.g., Parrot against DetectGPT.
Some baseline methods work well in a few cases, but in many others they perform poorly. For example, averaged across the three datasets, Parrot scored 0.819 and 0.805 AUC against GPT3.5-D and GPT2-D, respectively. In contrast, our method consistently outperforms the other baselines in evasion, indicating that it is a stronger method for stress-testing detectors.
Q3: Concerns on human evaluation
Regarding the quality of the generated outputs, only a human evaluation against DIPPER is considered, leaving several concerns: 1) how will it perform on other datasets?
We considered all three datasets used in our experiments: writing, QA, and review generation. Please see the results in Table 2.
2) How will it perform on other automatic evaluation metrics?
Thanks for your thoughtful question. In our study, we used automatic evaluation to examine the semantic-preservation capabilities of the different methods; the detailed findings can be found in Table 12 (Page 18).
For quality metrics such as readability or task completion rate, the lack of a "gold reference" for LLM-generated text limits the effectiveness of traditional automatic evaluation. Moreover, the goal of automatic evaluation is to approximate human preferences about text quality, so we believe human evaluation is the best way to quantify text usability.
3) Why are some scores higher than human outputs, e.g., on Writing?
Thank you for your thoughtful question. There are two cases where the score of human outputs is lower than SICO-Gen; we explain each below.
- Readability on the review generation task. We checked the human-written reviews annotated with relatively low readability (score 2 or 3) and observed that they are very informal and contain grammar errors, which is reasonable since the reviews are sourced from the internet. Some examples:
  - Eg. 1: Basically the food here is great, everything else is not. a "pasty", apparently, is a shepherd's pie (or any other one of the 40 variations on a meat pie) ...
  - Eg. 2: ... It's not just a regular pizza place either, this is all organic - local - farm fresh - deliciousness! Guilt free carbs if you will ...
- Task completion rate on the academic writing task. We examined the human-written essays annotated as not completing the task and found that formatting errors led annotators to judge them inappropriate for academic purposes. These errors originate from our use of Wikipedia text, following the implementation of [4]. We will address these formatting issues in future human evaluations; this issue does not affect the validity of the results obtained with the other methods. Some examples for reference:
  - Eg. 1: ... but the court of St. Petersburg ignored their suggestions.:143 ...
  - Eg. 2: Antarctica (US English ... )[Note 1] is Earth's southernmost continent, containing ...
We sincerely appreciate your great efforts in reviewing this paper. We believe we have addressed your concerns and clarified the misunderstandings. Given the approaching deadline, we would be grateful if you could let us know if you have any further concerns. We are open to all possible discussions and value your input.
Thank you for your responses. I acknowledge they are taken into consideration.
This paper proposes to equip LLMs with prompts, rather than relying on an external paraphraser, to evaluate the vulnerability of AI text detectors. Specifically, SICO iteratively substitutes words and sentences within the in-context examples to provide more representative demonstrations for LLMs to generate text that cannot be detected. Besides quantitative experiments, human evaluation and validation on online social platform prove the effectiveness of the proposed method.
Strengths
- The proposed in-context example method is cost efficient and universally useful against a wide range of detectors.
- Quantitative experiments show that the proposed method outperforms existing paraphraser baselines across three datasets and multiple detectors.
- The evaluation on Reddit and relevant analysis make the effectiveness of the proposed method more vivid and realistic.
Weaknesses
Although SICO only requires 40 human-generated examples and a limited number of LLM inferences, it still costs approximately 1 USD using the GPT-3.5 API.
Questions
- Since each evasion only runs once during evaluation, what are the arguments for GPT-3.5 in this paper, such as the temperature and top-p values?
- As shown in Table 1, the proposed SICO and the human prompt obtain much better performance. DIPPER also achieves good performance on the QA and Review datasets. However, the third-best performance on the Writing dataset is Parrot, which is much better than DIPPER. I am wondering why, in Section 4.3, DIPPER is considered for the Writing dataset rather than Parrot.
Suggestions
1. For values in Table 2, since higher is better, perhaps you could mark the best performance per column in boldface.
2. In Section 4.4, it would be interesting if another competitive baseline (such as human-written text) were used to answer the same questions as SICO on Reddit. In that case, we may have a better idea of which one is preferred by users or which one is more easily recognized as ChatGPT.
Thank you for the feedback and the constructive questions. Below, we have responded to your concerns and questions. We hope our response addresses your concerns. Please let us know if you have further questions preventing you from raising your score.
Weakness: SICO still costs 1 USD
Although SICO only requires 40 human-generated examples and a limited number of LLM inferences, it still costs approximately 1 USD using the GPT-3.5 API.
We believe this cost concern can be alleviated in practice. In practical applications, users may not need to spend money constructing prompts on their own: prompt providers, such as ChatGPT plugin developers, can supply effective prompts, and users simply choose a prompt and combine it with their own inputs. We argue that a 1 USD expense for the prompt provider is negligible.
Furthermore, if users want to customize their own SICO prompts, the 5 USD free quota provided by OpenAI is enough for several runs. Users can also use open-source LLMs: our experiments with the typical open-source model Vicuna-13B demonstrate that SICO effectively evades most detectors (Table 5 on Page 9).
Q1: GPT-3.5 arguments
What are the arguments for GPT-3.5 in this paper, such as the temperature and top-p values?
Thanks for your thoughtful question. During evaluation, we set all arguments of the GPT-3.5 API to their defaults, as mentioned in the experimental settings of Section 4.1 (Page 6); specifically, temperature and top-p are kept at their default values. Other default settings can be found in the OpenAI documentation: https://platform.openai.com/docs/api-reference.
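For concreteness, the evaluation call looks roughly like the following (a legacy-style openai-python sketch with a placeholder prompt, shown only to make "default arguments" explicit):

```python
# Sketch of the evaluation call; temperature and top_p are simply not passed,
# so the API defaults apply. The prompt string here is a placeholder.
import openai

sico_prompt = "..."  # the constructed SICO prompt with the task input filled in

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": sico_prompt}],
    # no temperature / top_p arguments: the API defaults are used
)
generated_text = response["choices"][0]["message"]["content"]
```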
Q2: Why not consider Parrot in human evaluation?
As shown in Table 1, the proposed SICO and the human prompt obtain much better performance. DIPPER also achieves good performance on the QA and Review datasets. However, the third-best performance on the Writing dataset is Parrot, which is much better than DIPPER. I am wondering why, in Section 4.3, DIPPER is considered for the Writing dataset rather than Parrot.
Thanks for your thoughtful question. We agree with the reviewer's observation. The primary reason for selecting DIPPER over Parrot lies in their original training purposes: DIPPER was specifically trained to evade AI detectors, whereas Parrot, originally designed for paraphrasing, was only recently found to evade detectors as well.
To examine Parrot's text on the writing task under human evaluation, we conducted a small experiment: we randomly sampled 120 examples (40 per method) from the writing task and asked two human annotators to evaluate them. The results show that SICO still outperforms Parrot by a large margin.
| Method | Readability | Completion |
|---|---|---|
| Parrot | 3.35 | 70.0 |
| SICO-Para | 3.80 | 82.5 |
| SICO-Gen | 3.90 | 85.0 |
Suggestion 1: Mark the best performance in Table 2
For values in Table 2, since higher is better, perhaps you could mark the best performance per column in boldface.
Thanks for your thoughtful advice; we will modify Table 2 accordingly.
Suggestion 2: Add baselines to the Reddit test
In Section 4.4, it would be interesting if another competitive baseline (such as human-written text) were used to answer the same questions as SICO on Reddit. In that case, we may have a better idea of which one is preferred by users or which one is more easily recognized as ChatGPT.
We fully agree with your suggestion and recognize its potential to show which baseline is most favored by real-world users. However, the primary objective of the Reddit test was to verify the applicability of SICO in the real world, rather than to conduct comparative analyses against other baselines.
Additionally, the human evaluation results in Table 2 already indicate that human-written texts surpass SICO in usability, so we did not include it as a baseline in the Reddit test.
Furthermore, some questions sourced from Reddit are very challenging, such as the one depicted on the left of Figure 3, so hiring experts to answer them would be expensive.
We sincerely appreciate your great efforts in reviewing this paper. We believe we have addressed your concerns and clarified the misunderstandings. Given the approaching deadline, we would be grateful if you could let us know if you have any further concerns. We are open to all possible discussions and value your input.
Summary: This paper aims to evade existing text detection systems through in-context optimization, making AI-generated text more similar to human writing. The method initializes the in-context prompts with human-crafted pairs and gradually replaces words or sentences to minimize detectability. The final optimized in-context examples and task prompt are combined into a new prompt that generates undetectable text for a new query. Extensive experiments across diverse datasets and models validate the effectiveness of this novel attack over several existing baselines. The authors also claim that the resulting attack prompt generalizes to other detectors.
Strengths
Strengths: The attack on detectors is an important task, and the authors propose a novel prompt attack via in-context optimization. On the tested datasets and models, this approach works well compared with several baselines under their setting. The experiments are comprehensive and solid enough to support the main results, though some weaknesses remain. The paper is well written and easy to follow. The experimental settings are clear and easy to reproduce.
Weaknesses
Weaknesses: There are three main weaknesses.

The first weakness is diversity. I understand that the method aims at finding an optimized prompt to make the model learn human-style writing. However, the authors only show the detection results and readability. One crucial question is whether the generated text will lose diversity. It might simply learn the features provided in the prompt and universally apply those features to all generated text. Then, although detection performance drops and each single text remains readable, the overall generations might lack diversity among themselves. This would limit its overall utility.

The second main weakness is transferability over data. The proposed method requires in-context examples from a given task and its corresponding dataset. However, in practice, this setting is too idealized, since many users will use ChatGPT to generate various outputs under different conditions. Thus, the task and in-context examples are not given (how can you obtain 32 in-context examples for each new query?). Besides, the authors do not mention how the result on one dataset generalizes to other datasets. It makes sense for a prompt optimized on one dataset to beat other baselines, since a specific dataset and task have their own intrinsic similarity. But most other baselines work in general. So the comparison is not fair, and whether the approach really works is questionable.

The third weakness is that the experiments are only conducted on GPT-3.5. I would expect more models, including open-source models, for easy and cheap reproduction.

Other weaknesses: the method is not tested on non-English text, limiting its utility. Also, the method requires many in-context examples, which might exceed the context window or limit the generation length due to the context-window limit. The method does not report the effect of length on the generated text, which is crucial for detection.
Questions
See the weaknesses above.
Details of Ethics Concerns
n/a
Q4: Concerns about other LLMs
The third weakness is that the experiments are only conducted on GPT-3.5. I would expect more models, including open-source models, for easy and cheap reproduction.
Thanks for your thoughtful suggestion. We did conduct experiments on a typical open-source model, Vicuna-13B, and reported the results in Table 5 (Page 9). Please check those results.
Q5: Concerns about non-English text
The method is not tested on non-English text, limiting its utility.
Thanks for your thoughtful suggestion. We appreciate the importance of testing on non-English text. However, AI-generated text detection for LLMs is still at an early stage, so the majority of research, including ours, primarily relies on English corpora, owing to their richness and maturity [1-4]. We leave non-English text for future exploration.
Q6: Concerns about prompt length
The method requires many in-context examples, which might exceed the context window or limit the generation length due to the context-window limit.
As discussed in Q2, we suspect this concern stems from a misunderstanding of how SICO uses in-context examples. SICO actually uses only 8 in-context examples. Below we show the maximum/average token counts of the complete SICO prompts obtained during our experiments, which are far smaller than the context windows of popular LLMs.
| SICO Prompt (Max Tokens) | SICO Prompt (Avg Tokens) | LLaMA-1 Context Window | LLaMA-2 Context Window | GPT-3.5 Context Window |
|---|---|---|---|---|
| 1015 | 873 | 2048 | 4096 | 4096 |
Q7: Concerns about length effect
The method does not report the effect of length on the generated text, which is crucial for detection.
Thank you for your valuable suggestion. To address it, we conducted a series of experiments to explore the impact of text length.
We first analyzed how the detector's bias varies with text length. We sorted SICO-generated texts by word count and divided them into four groups, then calculated each group's average AI score using GPT3.5-D. The results, shown in the table below, indicate that the detector tends to assign a higher AI score to longer texts.
| Avg. Length (words) | Avg. AI Score |
|---|---|
| 94.7 | 0.022 |
| 114.9 | 0.044 |
| 130.8 | 0.064 |
| 156.5 | 0.063 |
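A sketch of this analysis (variable and function names are illustrative):

```python
# Sort SICO outputs by word count, split into four equal-sized groups, and
# report each group's average length and average detector AI score.
import numpy as np

def length_bias(texts, ai_scores, n_groups=4):
    order = np.argsort([len(t.split()) for t in texts])
    for group in np.array_split(order, n_groups):
        avg_len = np.mean([len(texts[i].split()) for i in group])
        avg_score = np.mean([ai_scores[i] for i in group])
        print(f"avg length {avg_len:.1f} -> avg AI score {avg_score:.3f}")
```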
To further explore whether SICO simply generates shorter text to evade detection, we analyzed the average word counts of AI-generated, human-written, and SICO-Para text across the three tasks. The results are presented in the following table:
| Task | AI | Human | SICO |
|---|---|---|---|
| Writing | 122.9 | 119.1 | 129.4 |
| QA | 93.2 | 111.7 | 104.9 |
| Review | 86.2 | 130.6 | 133.4 |
In the writing task, the three types of text have similar lengths. However, in the QA and review generation tasks, text from humans and SICO is significantly longer than the original AI-generated text. This implies that SICO does not evade detection by generating shorter text.
References
[1] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer: Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. NeurIPS 2023
[2] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn: DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. ICML 2023
[3] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, Soheil Feizi: Can AI-Generated Text be Reliably Detected? arXiv 2023
[4] Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang: MGTBench: Benchmarking Machine-Generated Text Detection. arXiv 2023
Q2: SICO needs different in-context examples for different tasks
How can you obtain 32 in-context examples for each new query?
Before addressing your questions in detail, we would like to first clarify some points.
It seems to us that this question stems from a misunderstanding of how SICO is used. After SICO optimization finishes, the content of the prompt, including the in-context examples, is fixed. The SICO prompt then serves as a template into which users put different task inputs, so users do not need to select different in-context examples from a dataset for each new input (query). Please refer to Figure 1 for an illustration: in the constructed prompt, the purple text is the only part that changes, while the other parts (feature, task instruction, in-context examples) are fixed after optimization.
Besides, SICO only needs 8 in-context examples, not 32; the number 32 refers to the task inputs used for prompt evaluation. The details can be found in Section 3 (Page 3) and the experimental settings in Section 4.1 (Page 6), and we also discuss this in Section 4.5 (Page 8). We kindly suggest referring to these sections.
The proposed method requires in-context examples from a given task and its corresponding dataset. However, in practice, this setting is too idealized, since many users will use ChatGPT to generate various outputs under different conditions. Thus, the task and in-context examples are not given.
We agree with the reviewer that SICO prompt construction requires examples for a given task. However, there are several points we want to clarify:
- First, as discussed in the general response, not every task is safety-critical and needs detection. For example, translating websites carries low misuse risk, whereas generating fake reviews, a profitable form of misuse, makes AI detection crucial. Although users will use ChatGPT to generate various outputs under different conditions, they only care about evasion for the safety-critical tasks that are actually under detection.
- Second, if it is necessary to deploy detectors for a task, then evading those detectors is valuable enough to justify collecting in-context examples.
- Third, for the practical application of SICO, as mentioned before, users may not need to collect data and construct SICO prompts themselves: prompt providers will construct the prompts, and users can utilize them to generate human-like text.
Q3: Concerns about transferability over tasks
It makes sense for a prompt optimized on one dataset to beat other baselines, since a specific dataset and task have their own intrinsic similarity. But most other baselines work in general. So the comparison is not fair, and whether the approach really works is questionable.
Thank you for your insightful feedback. The suggestion of a transferable prompt effective across tasks is valuable. We initially considered this approach but did not adopt it, in order to prioritize the usability of the generated text: a shared text style may harm usability across different tasks. For instance, academic essay writing typically requires formal and structured text, which does not align well with the more casual or informal style used in review generation.
To further test the transferability of our approach, we conducted an additional experiment. We constructed a prompt based on text from the Writing and QA tasks ("SICO-Para-T") and evaluated its evasion performance on the review generation task. As shown in the table, SICO-Para-T still significantly outperforms the other baselines. The gap between SICO-Para and SICO-Para-T is likely due to the distribution mismatch between the in-context examples and the test inputs. We did not test the API-based detectors, to keep the results consistent.
| Method | GPT3.5-D | GPT2-D | DetectGPT | Log-Rank |
|---|---|---|---|---|
| - | 0.925 | 0.952 | 0.808 | 0.982 |
| Parrot | 0.871 | 0.934 | 0.654 | 0.893 |
| DIPPER | 0.987 | 0.984 | 0.515 | 0.814 |
| SICO-Para | 0.465 | 0.264 | 0.270 | 0.300 |
| SICO-Para-T | 0.496 | 0.487 | 0.447 | 0.490 |
Thank you for the feedback. Below, we have responded to your concerns and questions. We hope our response addresses your concerns. Please let us know if you have further questions preventing you from raising your score.
Q1: Concerns about text diversity
One crucial question is whether the generated text will lose diversity. It might simply learn the features provided in the prompt and universally apply those features to all generated text. Then, although detection performance drops and each single text remains readable, the overall generations might lack diversity among themselves. This would limit its overall utility.
Diversity across tasks
First, we want to clarify that texts generated by SICO exhibit high diversity across different tasks, and the style features are coherent with the task context. Below are typical examples generated by SICO for the academic writing task and the review generation task. One can see that the generated review is informal and colloquial while the academic essay is written in a formal style, illustrating high diversity across tasks.
| Task | SICO Generated Text |
|---|---|
| Writing | With the initial publication of The Unicode Standard in 1991, The Consortium assumed its position as a leading authority on Unicode standards. To this day, the organization progresses to develop Unicode specifications ... |
| Review | Who doesn't love Earl's? Their hot sandwich was a delightfully mouthwatering experience, especially the turkey sandwich. The chicken noodle soup was warm and comforting and the PB&J sandwich was just the right amount of sweetness ... |
Diversity within tasks
We suspect you are referring to the style diversity among texts generated for the same task. While it is true that a single SICO prompt might generate text with limited diversity, this can be easily mitigated in practice.
- Utilization of Multiple Prompts: As discussed in Section 6 (Page 9), we believe the diversity issue can be easily addressed by constructing multiple SICO prompts; the low cost makes this approach feasible. Moreover, different SICO prompts can be obtained from the same data. As empirical evidence, we constructed two effective prompts with the same experimental settings and data and used them to generate two answers to the same questions. One can observe that Prompt-1 generates more conversational and informal answers than Prompt-2.

| Prompt | SICO Generated Text |
|---|---|
| Prompt-1 | We've all felt it - the emotional pain and heartache that comes with tough times. Am I right? This feeling is totally legit and caused by our bodies' release of stress hormones ... |
| Prompt-2 | The sensation commonly referred to as "heartache" or "emotional pain" is an outcome of a physiological process triggered by the release of stress hormones, including cortisol and adrenaline ... |

- Prompt Provider: In practical applications, users might not have to construct multiple prompts themselves. Prompt providers, such as ChatGPT plugin developers, can construct multiple prompts to generate diverse text, so users can simply use them for their tasks.
We sincerely appreciate your great efforts in reviewing this paper. We believe we have addressed your concerns and clarified the misunderstandings. Given the approaching deadline, we would be grateful if you could let us know if you have any further concerns. We are open to all possible discussions and value your input.
Thanks for the clarification. I will still stand by my original score, considering the lack of theoretical guidance and the insufficient coverage of open-source LLMs in the validation.
Thanks for your instructive feedback. We want to further address your new concerns.
Lack of theoretical guidance.
We acknowledge the importance of theoretical guidance. However, we wish to emphasize that prompt tuning is a relatively new area for LLMs and often relies more on experimental insights and practical effectiveness; many works in this field, such as the CoT prompt [1, 2], do not provide theoretical guidance. We believe our extensive experiments have demonstrated the effectiveness of our approach, and we will explore the theory in future work.
Open-source LLMs
We appreciate your thoughtful concern. To address it, we conducted extra experiments examining SICO-Para with two typical open-source LLMs, Vicuna-13B and WizardLM-13B. These LLMs have fewer parameters and are less capable than GPT-3.5. The experiments cover all three tasks.
The results are summarized in the tables below. They indicate that SICO with open-source LLMs still performs better than the paraphraser baselines in most cases. We hope this addresses your concern about the effectiveness of SICO with open-source LLMs.
Results on Writing task
| Method | GPT3-D | GPT2-D | DetectGPT | Log-Rank | AVG |
|---|---|---|---|---|---|
| - | 0.908 | 0.848 | 0.834 | 0.914 | 0.876 |
| Parrot | 0.666 | 0.645 | 0.502 | 0.577 | 0.598 |
| DIPPER | 0.736 | 0.907 | 0.550 | 0.684 | 0.719 |
| SICO-Wizard | 0.571 | 0.582 | 0.510 | 0.481 | 0.536 |
| SICO-Vicuna | 0.393 | 0.777 | 0.337 | 0.375 | 0.470 |
| SICO-GPT | 0.239 | 0.332 | 0.149 | 0.147 | 0.217 |
Results on QA task
| Method | GPT3-D | GPT2-D | DetectGPT | Log-Rank | AVG |
|---|---|---|---|---|---|
| - | 0.981 | 0.906 | 0.876 | 0.956 | 0.930 |
| Parrot | 0.922 | 0.837 | 0.689 | 0.806 | 0.814 |
| DIPPER | 0.888 | 0.962 | 0.604 | 0.782 | 0.809 |
| SICO-Wizard | 0.721 | 0.467 | 0.561 | 0.404 | 0.538 |
| SICO-Vicuna | 0.698 | 0.522 | 0.544 | 0.383 | 0.537 |
| SICO-GPT | 0.402 | 0.566 | 0.178 | 0.183 | 0.332 |
Results on Review task
| Method | GPT3-D | GPT2-D | DetectGPT | Log-Rank | AVG |
|---|---|---|---|---|---|
| - | 0.925 | 0.952 | 0.808 | 0.982 | 0.917 |
| Parrot | 0.871 | 0.934 | 0.654 | 0.893 | 0.838 |
| DIPPER | 0.875 | 0.984 | 0.515 | 0.814 | 0.797 |
| SICO-Wizard | 0.707 | 0.402 | 0.570 | 0.542 | 0.555 |
| SICO-Vicuna | 0.524 | 0.567 | 0.619 | 0.535 | 0.561 |
| SICO-GPT | 0.465 | 0.264 | 0.270 | 0.300 | 0.324 |
References
[1] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022
[2] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou: Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023
We sincerely thank the reviewers for taking the time to review our work. We would like to provide a general response regarding three points (usage, cost, and application of SICO) commonly raised by the reviewers. We will also respond to each reviewer's individual concerns during the response period.
We are glad that most of the reviewers pointed out the importance of evading AI detection systems (KWZu, iD8f) and recognized our prompt-based approach as "novel" (KWZu), "interesting" (nfun), and "promising" (iD8f). The reviewers also found our experiments "comprehensive and solid" (KWZu) and highlighted that our method "is cost efficient and universally useful against a wide range of detectors" (Jo4D) as one of its strengths.
Clarification of SICO Usage
We want to clarify the usage of the SICO prompt. Once the SICO prompt is constructed, the optimized in-context examples are fixed. The prompt acts as a template into which users put various task inputs, and the LLM then generates human-like responses. Please refer to Figure 1 for an illustration: in the constructed prompt, the purple text is the only part that changes, while the other parts (feature, task instruction, in-context examples) are fixed after optimization.
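A minimal sketch of this usage pattern (the field names below are ours for illustration; only the task input changes per query):

```python
# The optimized parts (feature description, task instruction, in-context
# examples) are frozen after optimization; only `task_input` varies per query.
SICO_PROMPT_TEMPLATE = """{feature_description}

{task_instruction}

{in_context_examples}

Input: {task_input}
Output:"""

def build_prompt(fixed_parts, task_input):
    # `fixed_parts` is a dict holding the frozen components of the SICO prompt.
    return SICO_PROMPT_TEMPLATE.format(**fixed_parts, task_input=task_input)
```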
Concerns regarding SICO cost
SICO requires only 40 examples and minimal LLM usage (about 1 USD for GPT-3.5) to build a prompt that is effective against various detectors. This cost efficiency makes SICO practical for real-life applications.
Additionally, we would like to emphasize that not every task is safety-critical and needs detection; we should focus on scenarios where the misuse of LLMs could lead to significant financial or safety harm. For example, translating web content carries a low risk of misuse, whereas generating fake reviews is a scenario where malicious users profit, making AI detection valuable and necessary. Therefore, if there is a need to deploy detectors for a specific task, evading detection in that task is also highly valuable, and the minimal cost of SICO is far lower than the potential cost of being detected.
Possible application scenario of SICO
We would like to point out that the Provider-User pattern is a notably efficient and practical way to deploy SICO. In this pattern, prompt providers (such as ChatGPT plugin developers) can efficiently create SICO prompts for multiple tasks; users then specify their requirements to the provider and utilize the appropriate SICO prompts.
Note that SICO prompts are universally effective against various detectors, so users do not need to figure out which detector they want to evade. Furthermore, the low cost of LLM calls and the small amount of required data make this approach highly feasible for providers.