NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 6.3 / 10 (individual scores: 5, 7, 7, 6; min 5, max 7, std 0.8)
Confidence: 4.3 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.3

BiScope: AI-generated Text Detection by Checking Memorization of Preceding Tokens

Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We propose BiScope, leveraging a novel bi-directional cross-entropy calculation method to detect AI-generated texts.

Abstract

Keywords
Large Language Models · AI-text Detection · Paraphrase · Trustworthy AI

Reviews and Discussion

Review
Rating: 5

This paper presents BiScope, a new algorithm for AI-generated text detection that leverages logits from a text expansion task. The algorithm proceeds in three steps:

  1. Given a candidate input text T, break it into two segments (seg1, seg2, such that seg1 + seg2 = T). A "text completion" prompt is created where given <summary(T), seg1>, the goal is to predict seg2. The rationale for this is to give the detection algorithm full context awareness of the candidate input text.

  2. These completion prompts are passed through several open-source LLMs with seg2 tokens present in a teacher forcing manner. The logits are used as features for subsequent downstream classification.

  3. The features are used to train a binary classifier to detect AI-generated text.

The authors perform several experiments and ablations in their paper, and find that BiScope is effective and efficient compared to many baselines (including GPTZero).
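For concreteness, here is a minimal illustrative sketch of the feature extraction in steps 2 and 3 above (an independent reading of the method, not the authors' code; the surrogate model, prompt handling, and feature statistics are simplified placeholders):

```python
# Minimal sketch of bi-directional cross-entropy features (illustrative only):
# teacher-force seg2 after the completion prompt through a surrogate causal LM
# and collect forward/backward per-token losses as classifier features.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in surrogate LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def bidirectional_ce_features(prefix: str, seg2: str):
    """Teacher-force seg2 after the prompt (prefix) and return summary
    statistics of per-token forward/backward cross-entropy losses."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    seg2_ids = tok(seg2, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, seg2_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, seq_len, vocab)
    start = prefix_ids.shape[1]
    fces, bces = [], []
    for i in range(start, input_ids.shape[1] - 1):       # teacher-forced seg2 positions
        # forward CE: prediction at position i vs. the next ground-truth token
        fces.append(F.cross_entropy(logits[0, i:i+1], input_ids[0, i+1:i+2]).item())
        # backward CE: prediction at position i vs. the token at position i,
        # probing how strongly the output still reflects the preceding token
        bces.append(F.cross_entropy(logits[0, i:i+1], input_ids[0, i:i+1]).item())
    stats = lambda xs: [min(xs), max(xs), sum(xs) / len(xs)]
    return stats(fces) + stats(bces)                     # features for the classifier
```

In the paper, features of this kind are aggregated over multiple split points and several surrogate LLMs before the binary classifier in step 3 is trained.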

Strengths

  1. The paper presents a novel algorithm for AI-generated text detection, a research area growing in importance due to the development of powerful LLMs. Utilizing bidirectional logit information to discriminate between LLM-generated and human-written text is an interesting idea, and the authors have done a good job figuring out the technical details for an effective implementation.

  2. The paper performs thorough experiments on five AI-generated text detection datasets, five popular commercial LLMs, and in both IID / OOD settings. The paper compares its method to both academic and commercial AI-generated text detectors, including GPTZero, a popular commercial software for AI-generated text detection. Additionally, the paper showcases the robustness of BiScope to paraphrasing attacks.

  3. The paper includes additional ablation studies on the choice of open-source LLMs for step 2 above, an efficiency analysis, and the choice of classifier features.

Weaknesses

  1. The out-of-distribution (OOD) experiments in the paper are quite limited, which puts the baselines at an unfair disadvantage. The authors experiment with two OOD settings (L280-290), making either the text domain or the generative LLM OOD, but not both. Hence, I don't think the OOD settings are fully OOD; this puts BiScope at an unfair advantage over all baselines, since many of them have not received any in-distribution training.

  2. The paper would be stronger with more baselines, ideally some watermarking methods too. I am curious to know how BiScope compares to newer methods like Binoculars [5], as well as text watermarking algorithms like KGW [1], EXP-Edit [2], SemStamp [3], or [4]?

  3. A few more ablation experiments would make the paper stronger. These were some additional questions I had: a) how much does the number of segments (n) in Section 3.5 matter? b) What is the performance like if a classifier is trained just on the feed-forward logits of the same open-source LLMs (using the same training data), without any completion prompts?

[1] - https://arxiv.org/abs/2301.10226
[2] - https://arxiv.org/abs/2307.15593
[3] - https://arxiv.org/abs/2310.03991
[4] - https://arxiv.org/abs/2305.08883
[5] - https://arxiv.org/pdf/2401.12070


After rebuttal: Thank you for the response. I've raised my score to 5 due to extra baselines like Ghostbusters and Binoculars.

Questions

  1. AI-generated text detectors typically need to operate in a low FPR setting, to minimize the risks of labeling innocent text as AI-generated. Given this, what are the Table 1 true positive rates at a low FPR of say 0-1% (or equivalently, an AUC-ROC curve in that range)? TPR at low FPR ranges (0-1%) is a standard metric for evaluating AI-generated text detectors which has been used in many previous papers and blogs: https://arxiv.org/abs/2303.13408, https://openreview.net/pdf?id=DEJIDCmWOz, https://arxiv.org/pdf/2401.12070, https://foundation.mozilla.org/en/blog/who-wrote-that-evaluating-tools-to-detect-ai-generated-text/

  2. How in-distribution is the RADAR training data compared to the evaluation data used? Since RADAR was not retrained, I'm worried the baseline may be put at an unfair disadvantage compared to BiScope.

  3. In Table 1, I'm a bit surprised that there's almost no drop in detection accuracy before/after paraphrasing for most baselines in the non-Essay settings (line 300-301). What is the reason for this? This seems in contradiction with findings in multiple prior works.

  4. In Table 2, there's a strong length bias in the two domains. Does that make the classification task easier than it would have been in a length-controlled setting (take first K tokens of AI-generated / human-generated text and classify them)?

Presentation / nits:

  • Figure 4 is hard to read, could it be converted to a table?

  • Move the baselines description (L112-117) to a dedicated subsection / paragraph in Section 4.1

  • In Section 3.4, add the list of open-source LLMs used in main body

Limitations

Yes, adequately addressed

Author Response

Thanks for your valuable review and suggestions. Here are our point-by-point responses:

W1: Regarding the OOD evaluation, previous studies [1, 5, 7, 8] shifted either the data (cross-dataset) or the generative models (cross-model). We strictly follow their settings in our OOD evaluation. Our comparison with baselines is fair because we retrained all the baselines (except RADAR, which is not open-source) and BiScope on the same datasets, and testing was performed on the same datasets as well. Please let us know if we misunderstand your question.

W2: Thank you for your suggestions. We have included three more baselines in Table 6 (in the submitted PDF file): Binoculars [5], GhostBuster [1], and the OpenAI Detector [6]. The results illustrate that BiScope outperforms all three new baselines in most cases on both the normal and paraphrased datasets. Watermarking techniques, on the other hand, enable detection by constraining generation to a special vocabulary; they require modifications to the decoding strategy of the generative model. This does not align with the application scenario of BiScope and the baselines, which do not require any modification to the LLM's generation process and can work in a black-box setting. Although the two are difficult to compare directly, we consider them complementary. We will cite [10-13] and include the above discussion.

W3: Thank you for your suggestion. We have implemented these two ablation experiments, and the results are shown in Table 7 and Table 8 in the uploaded PDF file, respectively.

Table 7 presents the ablation results with different segmentation strategies in the multi-point splitting of BiScope. We tested three strategies: splitting at every 50% of the text length, every 25%, and every 10% (as used in our paper). The results indicate that a more fine-grained splitting interval generally improves BiScope's performance, although in a small number of cases a smaller interval may degrade it. We chose 10% because it achieves the highest detection scores in most cases while incurring only minor degradation in corner cases.
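For illustration, a toy sketch of what these splitting intervals correspond to (a hypothetical helper, not the exact implementation):

```python
# Toy illustration of multi-point splitting: cut the text at every `interval`
# fraction of its length, yielding one (seg1, seg2) pair per split point.
def split_points(text: str, interval: float = 0.10):
    n = len(text)
    cuts = [int(n * interval * k) for k in range(1, int(round(1 / interval)))]
    return [(text[:c], text[c:]) for c in cuts if 0 < c < n]

# interval=0.50 -> 1 split point, 0.25 -> 3 split points, 0.10 -> 9 (finer-grained)
```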

Table 8 presents the comparison results when using and not using the completion prompt in BiScope. The results show that in 25 of 45 cases, using the completion prompt performs better. Additionally, the completion prompt is more compatible with the summary procedure. Thus, we chose to use the completion prompt in BiScope.

Q1: Thanks for the suggestion. We further present the AUC-ROC curve of our method in Figure 9 (in the submitted PDF file). The results show that BiScope reaches over 0.8 detection TPR on average when the FPR is only 0.01, outperforming all the baselines on data from all five generative models.
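For reference, the TPR-at-low-FPR numbers discussed here can be computed from detector scores roughly as follows (toy data; assumes scikit-learn and is not the evaluation code behind Figure 9):

```python
# Sketch of TPR at a fixed low FPR, computed from detector scores (toy data;
# assumes scikit-learn; not the actual evaluation code behind Figure 9).
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    fpr, tpr, _ = roc_curve(labels, scores)        # labels: 1 = AI-generated
    return float(np.interp(target_fpr, fpr, tpr))  # TPR interpolated at target FPR

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.30, 0.20, 0.80, 0.70, 0.90])
print(tpr_at_fpr(labels, scores))                  # 1.0: positives fully separated here
```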

Q2: RADAR is officially trained on the OpenWebText dataset, which contains over 8 million human-written texts and over 8 million AI-generated texts crafted by RADAR's authors. These texts are very similar to the data in the Yelp dataset. Since RADAR's authors do not provide their training code, we did not further fine-tune RADAR on our datasets, but when calculating the F1 score, we did search for the best threshold for RADAR's output on our datasets. Considering the substantial amount of RADAR's pre-training data and our in-distribution threshold search, the comparison in our paper does not put RADAR in an unfair position.
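For clarity, the best-threshold F1 search mentioned above amounts to something like the following (toy sketch with hypothetical scores, not the exact script used):

```python
# Toy sketch of searching the best decision threshold for a detector's scores
# when computing F1 (hypothetical data; not the exact script used in the paper).
import numpy as np
from sklearn.metrics import f1_score

def best_threshold_f1(labels, scores):
    # Try every distinct score as a threshold and keep the highest F1.
    return max(f1_score(labels, (scores >= t).astype(int)) for t in np.unique(scores))

labels = np.array([0, 0, 1, 1])
scores = np.array([0.2, 0.6, 0.5, 0.9])
print(best_threshold_f1(labels, scores))   # 0.8 here: best threshold is 0.5
```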

Q3: In Table 1, we present the results of BiScope on the paraphrased datasets under both the in-distribution (seen) setting and OOD (unseen) setting. Under the in-distribution setting, BiScope and all the baselines are trained and tested on the paraphrased dataset, resulting in a very small performance drop for BiScope (as mentioned in line 300-301). However, under the OOD setting, BiScope and all the baselines are trained on the normal dataset and tested on the paraphrased dataset, as shown in the right-most column in Table 1. We observe a significant F1 score drop for all methods, including BiScope. For example, BiScope’s detection F1 score drops by around 0.1 on the Arxiv dataset. Notably, BiScope’s performance drop is the smallest among all the detectors in most cases.

Q4: Thank you for pointing out the potential influence of text lengths. The length bias can be clearly identified in the Yelp, Essay, and Creative datasets, where the human-written texts can be twice as long as the AI-generated texts. We therefore re-ran BiScope on these three datasets using only the first K characters as input, where K equals the average text length of each individual dataset as shown in Table 2. The results are presented in the following table. For the Yelp dataset, the average detection F1 score drops by less than 0.01, while for the Creative and Essay datasets there is even a slight increase in the average detection F1 score. Therefore, the length bias in the dataset does not provide BiScope with an unfair advantage.

Detection F1 scores under the in-distribution setting; the first five model columns are the normal datasets, the "(Para.)" columns are the paraphrased datasets, and the last two columns are the normal and paraphrased averages.

| Dataset | Input | GPT-3.5-Turbo | GPT-4-Turbo | Claude-3-Sonnet | Claude-3-Opus | Gemini-1.0-Pro | GPT-3.5-Turbo (Para.) | GPT-4-Turbo (Para.) | Claude-3-Sonnet (Para.) | Claude-3-Opus (Para.) | Avg. (Normal) | Avg. (Para.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yelp | Clipped | 0.8968 | 0.9256 | 0.9574 | 0.9456 | 0.9436 | 0.9003 | 0.9329 | 0.9760 | 0.9711 | 0.9338 | 0.9451 |
| Yelp | Unclipped | 0.9023 | 0.9405 | 0.9652 | 0.9532 | 0.9486 | 0.9064 | 0.9473 | 0.9814 | 0.9789 | 0.9420 | 0.9535 |
| Creative | Clipped | 0.9980 | 0.9960 | 0.9954 | 0.9954 | 0.9975 | 0.9955 | 0.9955 | 0.9954 | 0.9945 | 0.9965 | 0.9952 |
| Creative | Unclipped | 0.9985 | 0.9950 | 0.9960 | 0.9930 | 0.9964 | 0.9955 | 0.9945 | 0.9955 | 0.9940 | 0.9958 | 0.9949 |
| Essay | Clipped | 1.0000 | 0.9990 | 0.9990 | 0.9975 | 1.0000 | 0.9990 | 0.9990 | 1.0000 | 0.9985 | 0.9991 | 0.9991 |
| Essay | Unclipped | 1.0000 | 0.9990 | 0.9985 | 0.9970 | 0.9994 | 0.9965 | 0.9990 | 0.9990 | 0.9980 | 0.9988 | 0.9981 |

Thank you for all the presentation suggestions. We will further modify our paper based on them.

Comment

Thank you for the response and extra experiments! I've raised my score to 5 due to extra baseline and ablation experiments in the PDF (Ghostbusters and Binoculars).

Comment

Thanks for your feedback and appreciation! We will include all the experiments in our next version and further polish our paper based on your suggestions.

Review
Rating: 7

The paper describes work on detecting machine-generated texts using a proposed method called BiScope, which exploits a model's states by considering both preceding-token and next-token information via a bi-directional cross-entropy calculation method. The proposed BiScope method does not require any additional fine-tuning; instead, it uses the calculated forward and backward cross-entropy losses as features for a binary classifier. The performance of said classifier can also be improved through the use of summaries. Results show that BiScope with summaries outperforms existing SOTA methods for detecting AI-generated text across aspects such as in- and out-of-distribution performance, intentional paraphrasing, and efficiency. Overall, I believe the study has the level of completeness, technical rigor, and impact required for a NeurIPS paper.
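For concreteness, a rough formalization of the two losses (the notation here is illustrative and may differ from the paper's exact definitions): teacher-forcing tokens $x_1, \dots, x_n$ through a surrogate model whose output logits at position $i$ are $z_i$,

$$\mathrm{FCE}_i = \mathrm{CE}(z_i,\, x_{i+1}), \qquad \mathrm{BCE}_i = \mathrm{CE}(z_i,\, x_i),$$

so the forward loss measures next-token prediction while the backward loss measures how strongly the output distribution at position $i$ still reflects the preceding token; summary statistics of both serve as the classifier features.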

Strengths

The paper is well-written, easy to follow, and has the level of completeness required for NeurIPS. The proposed bi-directional calculation of losses, which the authors hypothesize captures information from both the preceding and next tokens, is properly motivated with a clear research framing. The proposed method has also been extensively and rigorously compared against a number of SOTA baselines on a large compiled multi-domain dataset, and demonstrates its superiority over previous approaches.

Weaknesses

I do not see any strong cases of technical issues. However, some points can be considered to improve the overall quality of the work and support the realistic application of the method:

It might be best to further emphasize in the main paper what specific additions or changes were made to the compiled existing datasets (Arxiv, Yelp, code, etc.), and possibly give the resulting dataset a name. Moreover, you should also provide immediate details about the datasets used, particularly length (whether they are essay-length, paragraph-length, etc.), language coverage, and register or domain.

To strengthen the contribution of the study, I strongly suggest the authors run the same experiment as Tulchinskii et al (https://proceedings.neurips.cc/paper_files/paper/2023/hash/7baa48bc166aa2013d78cbdc15010530-Abstract-Conference.html) and evaluate whether the proposed BiScope method falsely identifies non-native written texts as machine-generated. If BiScope is still able to clearly differentiate non-native written texts from machine-generated texts, then this is an advantage for the authors. Moreover, this particular advantage may also give another favorable reason to use the model besides its performance.

Questions

  1. Is there a limit or a threshold for the number of approximator models as described in Step 2 for calculating the bi-directional cross-entropy losses? Moreover, are there criteria for selecting these models to ensure the loss values converge? Can users just go for open models? I think this part is not much discussed/clarified.
  2. Is the combination of FCE and BCE statistically significant over using just one of them? The values presented in Table 4 seem very close to each other, particularly on text-based datasets.
  3. Following #1, is there a particular model or combination of models that gives the best approximation of FCE and BCE loss values and particular text split combination for classification? Future researchers may want to only test with the best setup combinations for comparing SOTA or baselines.

Limitations

The paper would benefit from a much clearer discussion of its limitations, particularly as it has not emphasized aspects such as language coverage and the robustness of the proposed method on texts written by non-native speakers, as already explored by previous work (see cited work above).

Author Response

Thanks for your appreciation and suggestions. Here are our point-by-point responses:

W1: Thank you for pointing out the problem. We generated our datasets using five of the latest commercial LLMs, following the generation methods outlined in previous studies [1, 7, 8]. Due to the page limit of the initial submission, we placed the details of the generated datasets (e.g., data amount, length statistics) in Appendix C. We will add more details to the main text in our next version.

W2: Thank you for your suggestion. Due to time constraints, we were unable to generate non-native language data for our datasets to provide a fair comparison during the rebuttal period. However, the setting you recommended is certainly an important OOD setting for all detectors. We will include [9] in our paper's discussion and provide more detailed results in the next version.

Q1: Thank you for the question. We presented a detailed ablation study using various open-source surrogate LLMs in Table 5 in Appendix E.1. According to the results, the performance of BiScope improves when using more surrogate detection models. Notably, BiScope can maintain over 98% of the best (ensembled) performance even when only using a single surrogate model (e.g., Llama2-7B, Mistral-7B). This consistent performance of BiScope across various open-source LLMs illustrates its scalability and compatibility with different open-source LLMs, providing more flexible options for users. There is no limit or threshold for the number of surrogate models, and BiScope can be compatible with any combination of surrogate models.

Q2: We presented a more detailed comparison of the FCE-only, BCE-only, and FCE+BCE versions of BiScope in Table 4 of Appendix E.2. The combination of FCE and BCE outperforms the FCE-only and BCE-only versions in 64% of cases, with maximal detection F1 score improvements of >0.35 and >0.09, respectively. Such results demonstrate the necessity of this combination.

Q3: Thank you for your suggestion. For simplicity, we recommend using either Llama2-7B or Llama2-13B as the surrogate model, while the ensemble of more surrogate models is always welcomed. For the text split method, we recommend splitting the text at every 10% length, as used in our paper. We will open-source our code and datasets for future researchers.

Comment

Thanks for the response, authors. My questions have been clarified. Please ensure that this will be included in the main paper, particularly the selection and recommendation of what surrogate model to use, as this is one of the first things that came to me when reading the paper. I like the proposed BiScope method as it is simple to understand and seems to be effective compared to other baselines, as shown in the experiments, hence my favorable score.

Comment

Thanks for your appreciation. We will include all the clarifications in our next version.

Review
Rating: 7

This paper proposes extracting various features from the predictive distributions of surrogate LLMs to detect LLM-generated text. Relative to prior work, the main novelty appears to be the use of bi-directional cross-entropy losses to extract features. These features are then fed into a traditional supervised classifier, estimated on a labelled dataset of human- and machine-written text, to make predictions. Evaluations on a new dataset (an expanded version of existing datasets) show that the method is competitive with some prior approaches.

Strengths

  • The paper tackles an important problem: automatic detection of LLM-written text.

  • The paper contributes datasets that cover several state-of-the-art LLMs and multiple domains, including code.

  • The paper includes some results for the more realistic out-of-distribution condition, where novel LLMs / genres are introduced at test time relative to the training data. The approach appears to be quite robust to new LLMs, in some genres.

  • The approach appears to be quite robust to paraphrased text.

Weaknesses

  • There are some missing comparisons, e.g. https://arxiv.org/abs/2401.12070.

  • The performance in the most important setting (OOD) is mixed.

  • The presentation is somewhat confusing. For example, it's unclear from the exposition (e.g., Figure 3) which steps occur at training time and which steps occur at test time.

  • The proposed approach amounts to a supervised binary classifier. However, the obvious baseline (fine-tuned BERT) is not included in the comparisons, even though this approach (“OpenAI classifier”) is prominent and discussed in related work. Why not?

  • Important details are not included in the main text. For example, for the cross-model evaluation, which models are trained on and which models are held-out? Hopefully, the latest LLMs are held-out and previous generation LLMs (e.g., GPT-2) are used for training.

  • Why is F1 used as the metric? Usually, in detection scenarios we are interested in detection accuracy while maintaining a low false alarm rate, which suggests using ROC-based metrics, e.g. AUC restricted to the low false alarm region.

  • The bold-underlines in Table 1 seem a bit random. For example, there are cases where two identical values occur in the same column but both are not bolded (0.9955). Ideally, these would represent statistical tests of which relative improvements are significant.

  • The discussion of limitations is completely lacking. For example, it seems like the proposed approach requires evaluating multiple LLMs in parallel. Also, I’m unclear on why there is not an evaluation setting in which both data and model shift occurs, relative to the training data.

Questions

See "Weaknesses"

Limitations

No; see "Weaknesses"

Author Response

Thanks for your insightful review. Here are our point-by-point responses:

W1: We have included three more baselines in the comparison in Table 6 (in the submitted PDF file): Binoculars [5], GhostBuster [1], and OpenAI Detector [6]. The results show that BiScope outperforms all three baselines on both the normal and paraphrased datasets.

W2: We have presented the experimental results in both cross-model and cross-dataset OOD settings, as well as the evaluation on unseen paraphrased data, in the right three columns of Table 1 in our main text. Additionally, more detailed OOD results are provided in Table 3 in the appendix. We observe that BiScope performs the best or second best in more than 75% of cases.

W3: In Figure 3, the first three steps are used in both the classifier training and testing periods to extract features. During the classifier training period, the extracted features are used to train a classifier in step 4. In contrast, during the testing period, the classifier in step 4 is fixed, and we use this trained classifier to make predictions based on the extracted test sample’s features. We will further modify Figure 3 to make this process clearer.

W4: Thank you for pointing out the OpenAI Detector. We have included RADAR [7] in our main text’s results. RADAR is a fine-tuned RoBERTa, outperforming the OpenAI Detector [6] in most cases. Therefore, we initially chose not to include the OpenAI Detector in our paper. To further address your concern, we have included the OpenAI Detector in Table 6 (in the submitted PDF file). The results show that BiScope outperforms the OpenAI Detector with more than 0.2 F1 score improvement on average on both the normal and paraphrased datasets.

W5: In our cross-model OOD evaluation (Section 4.2), we trained the classifier on human data and AI-generated data from one LLM, then tested the trained classifier on AI-generated texts from the other four LLMs and calculated the average F1 score. This process was repeated for all five generative LLMs, and the average scores are reported in Table 1 in the main text. More detailed results are presented in Table 3 in our appendix. Results show that BiScope performs the best or second best in more than 75% of cases. Specifically, when we trained BiScope on data from the oldest model, GPT-3.5-Turbo, and tested it on data from the other four latest LLMs, the detection F1 score exceeded 0.92 on average.

W6: Thanks for pointing out additional metrics. We use the F1 score in our paper since it is a commonly used metric in previous papers [1, 8], considering both precision and recall. We further present the TPR-FPR (ROC) curve of BiScope in Figure 9 (in the submitted PDF file). We observe that BiScope reaches over 0.8 detection TPR on average when the FPR is only 0.01, outperforming all the baselines on data from all five generative models.

W7: Thanks for your recommendation. We will further polish our tables based on your suggestion.

W8: Thank you for pointing out the limitation problem. Due to the page limit of the submission, we had to place the limitation section in Appendix F in our initial submission. We will move it to the main text. Regarding the OOD evaluation, previous studies [1, 5, 7, 8] shifted either the data (cross-dataset) or the generative models (cross-model). We followed their settings to test our OOD performance.

Comment

Thanks for the response. My concerns are largely addressed and I have updated my recommendation accordingly. Nice work!

Comment

Thank you very much for your appreciation. We will include all the rebuttal experiments in our main paper and polish our paper based on your suggestions.

Review
Rating: 6

This paper develops an AI-generated text detection method called BiScope. The key idea is to formulate the detection task as a guided text completion task. The generated text and the original text are used to calculate two types of cross-entropy losses, which are used to extract features for classification.

Strengths

  • (S1) The idea of taking the preceding token into account for AI-generated text detection is interesting.
  • (S2) The paper is well-written and easy to follow.

Weaknesses

  • (W1) The feature extraction is computationally expensive, as it needs to run inference at least twice with LLMs for summary generation and text completion. It's not clear whether the feature extraction cost is worth the improvement. (Similar to the comment below.)
  • (W2) Missing references and comparisons, especially against the following methods. Among them, [Ref 2] should be a solid baseline for the classification-based approach. [Ref 4] is recent work and optional to compare, but it reports that simply using n-gram and POS features is sufficient to detect machine-generated text, so it would be interesting as a baseline.

Questions

Please respond to the weaknesses raised above.

Limitations

I believe it’s important to mention the computational cost in the body text as a limitation.

Author Response

Thanks for your insightful review. Here are our detailed point-by-point responses to your questions:

W1: As mentioned in Section 3.3, the summary generation is not necessary. We have two designs for generating the completion prompt: one with a summary and the other without. The latter has a significantly shorter processing time: as illustrated in Figure 5, this adjustment allows BiScope to achieve over 8.6x shorter processing time per sample. As shown in Table 1, BiScope without the summary incurs an average detection F1 score degradation of less than 0.015 and still substantially outperforms the baselines. Users can choose between these two designs to balance detection accuracy and efficiency according to their specific needs. We will modify our main text to clarify this.

W2: Thank you for suggesting the references and baselines. We present a comparison between BiScope and GhostBuster [1] in Table 6, demonstrating that BiScope outperforms GhostBuster with a 0.06 average F1 score increase on both the normal and paraphrased datasets. Additionally, GhostBuster is 3x slower than BiScope in processing a single sample and requires several hours to identify the optimal feature compositions. We did not compare with [2] due to the lack of open-source implementation. Instead, as mentioned in the global response, we compared with two other baselines, Binoculars [5] and OpenAI Detector [6], which have comparable or better results than [2-4]. We will reference [1-4] in our paper and discuss them in the related work.

Comment

Thank you for the clarification. Both my concerns have been addressed. I updated the score accordingly.

Comment

Thanks for your appreciation. We will include all the clarifications in our main text.

Global Author Response

We thank all the reviewers for your thoughtful comments! We are glad that the reviewers found our paper “tackles an important problem” with a novel idea. We also thank you for your appreciation of our dataset contribution, method’s robustness, and paper presentation.

To further address your concerns, we conducted more experiments and provided more detailed evidence to support our proposed method. Here is a summary of the supplementary information provided in the rebuttal materials:

  1. We compare BiScope with three more baselines recommended by the reviewers: Binoculars, GhostBuster, and the OpenAI Detector. We observe that BiScope outperforms all three baselines on all five datasets by more than 0.06 average detection F1 score, even when the data is intentionally paraphrased.

  2. We test BiScope under a length-controlled setting, showing BiScope's low sensitivity to input text length, with less than 0.01 detection F1 score degradation.

  3. We evaluate BiScope without any completion prompt and compare its performance with the original version in our paper, illustrating that the completion prompt allows BiScope to perform better in 56% of cases.

  4. We evaluate BiScope with different numbers of segments during the feature extraction step, justifying the necessity of our proposed multi-point splitting. The results show the trend that a finer-grained segmentation interval leads to a higher detection F1 score.

  5. We provided the TPR-FPR(ROC) curves of BiScope and compared them with the baseline methods’, showing that BiScope achieves the highest TPR (more than 0.8 on average) in a low FPR setting (FPR=0.01).

  6. We also present point-by-point responses to all the other questions and concerns from all the reviewers on the rebuttal page under each review.

Due to the length limit of each individual rebuttal, we present the most requested experimental results in the supplementary PDF file.

We also list the general references used across all the rebuttal materials here.


References

[1] Verma, Vivek, et al. "Ghostbuster: Detecting Text Ghostwritten by Large Language Models." NAACL. 2024.

[2] McGovern, Hope, et al. "Your Large Language Models Are Leaving Fingerprints." arXiv preprint arXiv:2405.14057 (2024).

[3] Koike, Ryuto, Masahiro Kaneko, and Naoaki Okazaki. "Outfox: Llm-generated essay detection through in-context learning with adversarially generated examples." AAAI. 2024.

[4] Mireshghallah, Niloofar, et al. "Smaller Language Models are Better Zero-shot Machine-Generated Text Detectors." EACL. 2024.

[5] Hans, Abhimanyu, et al. "Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text." ICML. 2024.

[6] Solaiman, Irene, et al. "Release strategies and the social impacts of language models." arXiv preprint arXiv:1908.09203. 2019.

[7] Hu, Xiaomeng, Pin-Yu Chen, and Tsung-Yi Ho. "Radar: Robust ai-text detection via adversarial learning." NeurIPS. 2023.

[8] Mao, Chengzhi, et al. "Raidar: geneRative AI Detection viA Rewriting." ICLR. 2024.

[9] Tulchinskii, Eduard, et al. "Intrinsic dimension estimation for robust detection of ai-generated texts." NeurIPS. 2023.

[10] Kirchenbauer, John, et al. "A watermark for large language models." ICML. 2023.

[11] Kuditipudi, Rohith, et al. "Robust distortion-free watermarks for language models." arXiv preprint arXiv:2307.15593. 2023.

[12] Hou, Abe, et al. "SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation." NAACL. 2024.

[13] Yang, Xi, et al. "Watermarking text generated by black-box language models." arXiv preprint arXiv:2305.08883. 2023.

Final Decision

This paper proposes BiScope, a method for detecting AI-generated text. The core idea is to leverage bi-directional cross-entropy calculations from surrogate LLMs to capture unique patterns in AI-generated text. The paper demonstrates strong empirical results, outperforming several existing baselines, including commercial tools. While reviewers had some concerns, the authors addressed them effectively during the rebuttal.

The scope and breadth of the experiments are a real strength of the paper. These experiments include in-distribution and out-of-distribution evaluations and ablation studies. The method shows strong detection accuracy and robustness compared to various baselines across multiple datasets and LLMs. The method also shows strong resilience to paraphrasing attacks.

The reviewers had concerns about some strong, recent baselines that were missing, but these experiments were added in the rebuttal. There were also concerns about efficiency, but this was addressed adequately as well (the method is still slow but in line with prior work and a speed-accuracy trade-off can be made that only marginally hurts performance).

Overall, this is a complete paper that makes a solid contribution to this increasingly important research area as the internet becomes more and more flooded with AI-generated text.