IPAD: Inverse Prompt for AI Detection - A Robust and Interpretable LLM-Generated Text Detector
This paper presents IPAD, a novel AI-detection framework that predicts the prompts that could have generated the input text. IPAD outperforms the baselines by 9.73% on in-distribution data and 12.65% on OOD data.
Abstract
Reviews and Discussion
The paper proposes a methodology for detecting AI-generated text involving 3 components. The first is a prompt inverter that generates the prompt P most likely to have generated the input text T (the text to be classified as AI-generated or not). The underlying model is based on Phi3-medium-128k-Instruct & is trained using datasets containing (prompt, text) pairs. The second component, also based on Phi3 medium Instruct, is a fine-tuned classifier that verifies whether T could have been generated using P. This is followed by a third component that first uses an LLM (gpt-3.5-turbo) to generate text T’ using the (generated) prompt P. Thereafter, a fine-tuned classifier (based on Phi3 medium Instruct) verifies whether T & T’ could have been generated using the same prompt P. The final classification depends on a weighted combination of the probabilities obtained from the two verification classifier components.
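For concreteness, the described inference flow can be sketched as follows; the helper functions and the weight/threshold values are illustrative stand-ins (stubs), not the paper's actual implementation.

```python
# Hypothetical sketch of the three-component IPAD-style flow summarized above.
def invert_prompt(text: str) -> str:
    return "stub: prompt P predicted by the fine-tuned prompt inverter"

def ptcv_score(prompt: str, text: str) -> float:
    return 0.9  # stub: P(T could have been generated from P)

def regenerate(prompt: str) -> str:
    return "stub: text T' generated by gpt-3.5-turbo from P"

def rc_score(text: str, regen: str) -> float:
    return 0.8  # stub: P(T and T' could come from the same prompt)

def detect(text: str, w: float = 0.5, threshold: float = 0.5) -> bool:
    p = invert_prompt(text)                    # component 1: prompt inversion
    s1 = ptcv_score(p, text)                   # component 2: consistency check
    s2 = rc_score(text, regenerate(p))         # component 3: regenerate & compare
    return w * s1 + (1 - w) * s2 >= threshold  # weighted combination

print(detect("some input text"))  # True -> classified as AI-generated
```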
In comparison with related baselines, the method shows significant improvement in AUROC on both in & OOD datasets. The paper includes experiments to demonstrate the necessity of each of the verifier components for the task, and also attempts to individually measure the performance of each of the 3 components against specific baselines.
Strengths and Weaknesses
Strengths:
- The paper is well-written & clearly outlines the methodology and experiments conducted for the task.
- The proposed approach is novel, and the paper contains detailed experiments on both in & OOD datasets for the problem, with current baselines used for comparison. Experiments have also been included to highlight the necessity of each of the verifier components with the prompt inverter (Sec 3.2.1). Further, attempts have been made to compare each individual component separately against relevant baselines, which is commendable.
Weaknesses:
- The motivation behind using Phi3 medium Instruct as the underlying model for all the components is not provided. Were other models considered and discarded?
- The statistics of the datasets used are not detailed. It would be useful to understand the proportion of human generated & AI generated texts in them, to better comprehend the results.
- The experimental results focus on AUROC & Recall as metrics. To take into account dataset imbalance, F1 scores should also have been included.
- Sec 3.2.2 Lines 200-209 & Sec 3.3 would benefit from further clarification of the experiment setup. (Please refer to the Questions section below for details.)
Questions
- Lines 200-209: When comparing against SBERT & BART-large-cnn, were the thresholds determined using a validation dataset? Which dataset(s) did this come from?
- Lines 200-209: In comparing with ChatGPT, how did the authors arrive at the prompt used (Appendix D)? Were other less convoluted prompts tried as well, especially those that break the task down into steps and involve querying the LLM at each step? The gap in performance reported in Table 6 between ChatGPT & IPAD is large, and given that a) IPAD was fine-tuned, b) a validation set was used for threshold selection, it would be a fairer comparison to try to use a more optimal prompt with ChatGPT & perhaps report the average performance across a variety of prompts.
- Sec 2.2: Some clarity on the validation set here would be helpful
- Lines 152-156: Could you elaborate on why F1 scores were not also reported?
- Sec 3.3: What was meant by IPAD version 2 in line 219? Also, w.r.t. the study here, it would help to know the human judgements obtained by the baselines for each of the explainability dimensions. Was the rating on a 1-5 scale? Further, how were judgements from the 10 participants combined for the final rating?
Limitations
Yes
Final Justification
I appreciate the detailed results presented by the authors in the rebuttal. I already have assigned a positive score (Accept) that remains unchanged at this time.
Formatting Issues
N/A
Thank you for your valuable feedback and the time dedicated to reviewing our work. We address your concerns and questions as follows.
[W1] The motivation behind using Phi3 medium Instruct as the underlying model for all the components is not provided. Were other models considered and discarded?
- Low computational cost compared with other similar-sized models. As discussed in Section 2.3, the inference procedure of our framework is designed to be computationally efficient and accessible:
  - It involves three calls to `phi-3-medium-128k-instruct`, a lightweight, open-sourced LLM.
  - It includes one call to GPT-3.5-turbo via API, which introduces fixed latency but no local compute cost.
  - The total computational complexity is bounded by O(3 · L · n² · d + OpenAI_API), where `L = 32` is the number of layers and `d = 3072` is the hidden dimension of phi-3. To support this, we compare the per-layer parameter cost (L × d) of phi-3 with other similarly sized models:
| Model | Layers (L) | Hidden_Dim (d) | L × d |
|---|---|---|---|
| Phi-3-14B-medium | 32 | 3072 | 98,304 |
| LLaMA-3-8B | 32 | 4096 | 131,072 |
| Qwen2-7B | 28 | 3584 | 100,352 |
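A quick back-of-the-envelope check of the table and the complexity bound above; the sequence length `n` is an assumed value for illustration.

```python
# Per-layer parameter cost L*d and the O(L * n^2 * d) self-attention term
# per call, using the figures from the table above.
models = {"Phi-3-14B-medium": (32, 3072),
          "LLaMA-3-8B": (32, 4096),
          "Qwen2-7B": (28, 3584)}
n = 1024  # assumed sequence length
for name, (L, d) in models.items():
    print(f"{name}: L*d = {L * d:,}, attention cost ~ {L * n**2 * d:.2e}")
```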
[W2 Q4] The experimental results focus on AUROC & Recall as metrics. To take into account dataset imbalance, F1 scores should also have been included.
- For Table 1, we include the F1 scores below; we followed OUTFOX, which does report the F1-score in their results (OUTFOX Table 3).
| Original Generator | Re-Generator | HumanRec | MachineRec | AvgRec | AUROC | F1 |
|---|---|---|---|---|---|---|
| gpt-3.5-turbo | gpt-3.5-turbo | 98.50 | 100.00 | 99.25 | 100.00 | 98.50 |
| gpt-4 | gpt-4 | 98.70 | 100.00 | 99.35 | 100.00 | 98.70 |
| gpt-4 | gpt-3.5-turbo | 96.10 | 100.00 | 98.05 | 99.96 | 96.10 |
| Qwen-turbo | Qwen-turbo | 98.60 | 99.80 | 99.20 | 99.96 | 98.40 |
| Qwen-turbo | gpt-3.5-turbo | 98.40 | 99.50 | 98.50 | 99.86 | 97.91 |
| LLaMA-3-70B | LLaMA-3-70B | 98.70 | 100.00 | 99.35 | 100.00 | 98.70 |
| LLaMA-3-70B | gpt-3.5-turbo | 98.60 | 100.00 | 99.30 | 100.00 | 98.60 |
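For reference, a minimal sketch of how the reported metrics can be computed with scikit-learn, assuming label 1 = LLM-generated (LGT) and 0 = human-written (HWT); HumanRec/MachineRec are the per-class recalls.

```python
from sklearn.metrics import f1_score, recall_score, roc_auc_score

def report(y_true, y_score, threshold=0.5):
    y_pred = [int(s >= threshold) for s in y_score]
    human_rec = 100 * recall_score(y_true, y_pred, pos_label=0)    # HumanRec
    machine_rec = 100 * recall_score(y_true, y_pred, pos_label=1)  # MachineRec
    return {"HumanRec": human_rec,
            "MachineRec": machine_rec,
            "AvgRec": (human_rec + machine_rec) / 2,
            "AUROC": 100 * roc_auc_score(y_true, y_score),
            "F1": 100 * f1_score(y_true, y_pred)}
```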
- For Tables 2 and 3, we referred to DetectRL, which does not include the F1-score in their results (their Table 5). Therefore, we reimplemented the baseline results in Tables 2 and 3 using the datasets described in our response to reviewer uVgU (A2): 500 LGT and 500 HWT per dataset.
| Method | Arxiv AUROC | Arxiv F1 | Xsum AUROC | Xsum F1 | Writing AUROC | Writing F1 | Review AUROC | Review F1 |
|---|---|---|---|---|---|---|---|---|
| DetectLLM (LRR) | 48.15 | 69.70 | 49.20 | 67.70 | 58.67 | 74.38 | 58.21 | 75.42 |
| DetectLLM (NPR) | 53.80 | 73.12 | 56.95 | 76.05 | 54.90 | 73.13 | 50.12 | 68.02 |
| Binoculars | 83.99 | 91.60 | 88.88 | 94.66 | 94.37 | 97.18 | 89.91 | 94.73 |
| Fast-DetectGPT | 42.00 | 60.81 | 40.90 | 66.91 | 51.09 | 71.98 | 54.61 | 73.72 |
| Rob-Base | 81.14 | 90.28 | 90.04 | 94.97 | 86.24 | 93.07 | 87.85 | 93.54 |
| IPAD(Ours) | 100.00 | 100.00 | 99.85 | 99.92 | 99.40 | 99.70 | 98.25 | 99.10 |
| Method | Prompt Attack AUROC | F1 | Paraphrase Attack AUROC | F1 | Perturbation Attack AUROC | F1 |
|---|---|---|---|---|---|---|
| DetectLLM (LRR) | 50.12 | 70.68 | 49.20 | 67.70 | 53.97 | 72.88 |
| DetectLLM (NPR) | 75.03 | 86.02 | 56.95 | 76.05 | 6.79 | 59.52 |
| Binoculars | 78.11 | 89.49 | 88.88 | 94.66 | 76.91 | 87.64 |
| Fast-DetectGPT | 43.79 | 64.86 | 40.90 | 66.91 | 44.39 | 71.94 |
| Rob-Base | 92.16 | 96.02 | 90.04 | 94.97 | 92.04 | 95.88 |
| IPAD(Ours) | 97.30 | 98.59 | 96.00 | 98.04 | 98.10 | 99.10 |
- For Table 4, we only have LGT for the three structured datasets; therefore, we only include the MachineRec score.
[Q1] Validation set in lines 200-209
- We use the in-distribution set of 1,000 LGT (generated by gpt-3.5-turbo) and 1,000 HWT, and split each into a 500-sample validation set and a 500-sample test set. We used the validation set to determine the threshold.
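A minimal sketch of validation-based threshold selection; maximizing F1 over candidate thresholds is our illustrative assumption, as the text above only states that the validation split was used to set the threshold.

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(val_scores, val_labels):
    val_scores = np.asarray(val_scores)
    candidates = np.unique(val_scores)                        # every observed score
    f1s = [f1_score(val_labels, val_scores >= t) for t in candidates]
    return candidates[int(np.argmax(f1s))]                    # reused at test time
```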
[Q2] In comparing with ChatGPT, how did the authors arrive at the prompt used (Appendix D)?
- The prompt in IPAD's Appendix D is from DPIC's Appendix B Table 5.
[Q2] Were other less convoluted prompts tried as well, especially those that break the task down into steps and involve querying the LLM at each step?
- Yes, we did explore alternative prompts beyond the original DPIC one. In particular, we evaluated three prompts: RPE [1], DP1 [2], and DP2 [2]. The results show that IPAD's Prompt Inverter performs better.
- Prompts:

| Source | Prompt |
|---|---|
| [1] RPE | You are an expert prompt detective. Given the following LLM response, guess a likely prompt that could have produced it. Return only the prompt. <INPUT> |
| [2] DP1 | What question are you asked if you can generate the following answer? <INPUT> |
| [2] DP2 | Assume you are the prompt reconstructor who can help the user generate the prompts based on the given answers. Can you generate the original prompts for the following answer? <INPUT> |
| DPIC | I want you to play the role of the questioner. I will type an answer in English, and you will ask me a question based on the answer in the same language. Don’t write any explanations or other text, just give me the question. <INPUT> |

- LGT Results:

| Metric | RPE | DP1 | DP2 | DPIC | Prompt Avg | IPAD |
|---|---|---|---|---|---|---|
| BART-large-CNN ↓ | -1.98 | -2.43 | -2.18 | -2.12 | -2.1775 | -1.84 |
| Sentence-BERT ↑ | 0.58 | 0.41 | 0.43 | 0.46 | 0.47 | 0.69 |
| BLEU ↑ | 7.28E-03 | 3.46E-05 | 9.32E-05 | 5.61E-05 | 1.87E-03 | 0.24 |
| ROUGE-1 ↑ | 0.17 | 0.09 | 0.08 | 0.04 | 0.095 | 0.51 |

- HWT Results:

| Metric | RPE | DP1 | DP2 | DPIC | Prompt Avg | IPAD |
|---|---|---|---|---|---|---|
| BART-large-CNN ↓ | -2.78 | -3.03 | -2.85 | -2.47 | -2.7825 | -2.22 |
| Sentence-BERT ↑ | 0.47 | 0.35 | 0.39 | 0.42 | 0.4075 | 0.57 |
| BLEU ↑ | 8.42E-04 | 3.65E-06 | 3.25E-05 | 8.75E-06 | 2.22E-04 | 0.13 |
| ROUGE-1 ↑ | 0.12 | 0.12 | 0.04 | 0.06 | 0.085 | 0.39 |
[1] Li, H., & Klabjan, D. (2024). Reverse Prompt Engineering. arXiv e-prints, arXiv-2411.
[2] Sha, Z., & Zhang, Y. (2024). Prompt Stealing Attacks Against Large Language Models. CoRR.
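A sketch of how the similarity metrics in the tables above can be computed between a predicted and a reference prompt; the SBERT checkpoint is an assumption, and the BART-large-CNN score would come from an additional model-based scorer not shown here.

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT encoder
rouge = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def prompt_similarity(predicted: str, reference: str) -> dict:
    return {
        "BLEU": sentence_bleu([reference.split()], predicted.split()),
        "ROUGE-1": rouge.score(reference, predicted)["rouge1"].fmeasure,
        "Sentence-BERT": float(util.cos_sim(sbert.encode(predicted),
                                            sbert.encode(reference))),
    }
```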
[Q5] Was the rating on a 1-5 scale? Further, how were judgements from the 10 participants combined for the final rating?
- Yes, it is on a 1-5 scale. The assessment questions are listed as follows:

| Aspects | Questions | Rates (1-5) |
|---|---|---|
| 1. Transparency, Scrutability, and Education | Do you think IPAD provides clearer and more understandable explanations for its decisions? | |
| 2. Trust and Persuasiveness | Do you find IPAD’s outputs more trustworthy and convincing? | |
| 3. Satisfaction, Effectiveness, and Efficiency | Do you think IPAD is more effective and efficient in performing its detection tasks? | |
| 4. Debugging and Error Handling | Does IPAD allow you to identify and correct its mistakes more easily? | |
[Q5] What was meant by IPAD version 2 in line 219?
- Sorry, this is a typo. It should be just "IPAD". Thank you for pointing it out. We will revise it accordingly.
I appreciate the detailed results presented by the authors in the rebuttal. My already positive score remains unchanged at this time.
This paper designs an AI-detection pipeline centered around a reverse-prompting approach. It involves training three separate components based on a lightweight open model — a Prompt Inverter, a Prompt-Text Consistency Verifier (PTCV), and a Regeneration Comparator (RC) — which are then integrated into a unified detection pipeline. This approach reveals the individual judgments made by each of the three components during the detection process, thereby improving the interpretability of the pipeline and supporting its overall reliability.
Strengths and Weaknesses
Strengths:
- The design of the proposed pipeline effectively improves the interpretability of the detection process by making the decisions of individual components more transparent.
- The integration of prompt inversion with the detection pipeline is an interesting and promising idea that enriches the overall methodology.
Weaknesses:
- The experimental comparison is not sufficiently convincing. The baselines selected are quite outdated, and stronger baselines from 2024 and 2025 should be included for a more meaningful evaluation. Besides, the related work section is far from well-studied and lacks sufficient depth and coverage of recent literature.
- The core idea of the paper—reverse prompting—is very similar to existing work like DPIC, as the paper also mentions. Although the authors briefly discuss differences in their Prompt Inverter (e.g., fine-tuning rather than zero-shot) and the introduction of an additional module, they do not provide sufficient evidence that these modifications address the major limitations of DPIC. The comparison mainly focuses on results, without clarifying how the proposed approach overcomes the known weaknesses of DPIC. Given that fine-tuned models are expected to outperform zero-shot models, the results alone do not convincingly demonstrate the superiority of the method.
- The writing quality of the paper is poor in several key aspects. For example, the experimental section should present the main results in a clear and direct way.
问题
Questions for the authors:
- Could you please provide results with more recent and stronger baselines, for example from 2024 or 2025, to better demonstrate the advantages of your approach?
- The ablation study mainly focuses on data structures. Would you consider adding an ablation on the effects of the Prompt Inverter and the new modules compared to DPIC to strengthen your analysis?
- Could you provide more details about the fine-tuning data and any analysis on OOD scenarios or datasets, to better support the claims of generalization and robustness? I could not clearly find this information in the current manuscript.
Limitations
yes
Final Justification
All my concerns have been addressed. I'd like to raise my score to 4.
Formatting Issues
no
Thank you for your valuable feedback and the time dedicated to reviewing our work. We address your concerns and questions as follows.
[W1][Q1] The experimental comparison is outdated.
- We have already included two 2024 baselines (Binoculars [1] and Fast-DetectGPT [2]) in Tables 2–4.
- We have expanded the comparisons to include five recent baselines from 2024 and 2025.
| Method | Arxiv AUROC | Arxiv F1 | Xsum AUROC | Xsum F1 | Writing AUROC | Writing F1 | Review AUROC | Review F1 |
|---|---|---|---|---|---|---|---|---|
| DNA-GPT (2024) [3] | 81.45 | 90.27 | 76.06 | 86.22 | 75.93 | 88.20 | 66.24 | 80.30 |
| RAIDAR (2024) [4] | 68.03 | 82.16 | 60.38 | 67.31 | 79.82 | 88.84 | 60.19 | 77.95 |
| FAID (2025) [5] | 79.21 | 89.07 | 75.34 | 86.27 | 78.92 | 89.25 | 68.99 | 82.67 |
| Advacheck (germgr) (2025) [6] | 86.28 | 93.02 | 84.20 | 95.92 | 89.68 | 94.67 | 79.21 | 89.07 |
| Glimpse+Fast-DetectGPT (2025) [7] | 93.16 | 96.55 | 96.23 | 98.11 | 90.25 | 94.90 | 89.17 | 94.28 |
| IPAD (Ours) | 100.00 | 100.00 | 99.85 | 99.92 | 99.40 | 99.70 | 98.25 | 99.10 |
| Method | PromptAUROC | PromptF1 | ParaphraseAUROC | ParaphraseF1 | PerturbationAUROC | PerturbationF1 |
|---|---|---|---|---|---|---|
| DNA-GPT (2024) | 79.92 | 89.38 | 80.09 | 89.61 | 78.67 | 88.80 |
| RAIDAR (2024) | 63.27 | 78.42 | 63.48 | 81.34 | 49.20 | 72.78 |
| FAID (2025) | 72.57 | 86.49 | 72.91 | 85.60 | 72.02 | 84.51 |
| Advacheck (germgr) (2025) | 92.67 | 96.21 | 86.46 | 93.11 | 84.99 | 92.09 |
| Glimpse+Fast-DetectGPT (2025) | 90.29 | 95.01 | 93.24 | 96.53 | 86.27 | 92.77 |
| IPAD (Ours) | 97.30 | 98.59 | 96.00 | 98.04 | 98.10 | 99.10 |
| Method | LongWriter | Code-Feedback | Math |
|---|---|---|---|
| DNA-GPT (2024) | 65.9 | 40.3 | 50.6 |
| RAIDAR (2024) | 52.3 | 48.7 | 67.9 |
| FAID (2025) | 59.8 | 47.2 | 52.8 |
| Advacheck(germgr) (2025) | 89.7 | 87.2 | 77.1 |
| Glimpse+Fast-DetectGPT (2025) | 90.2 | 87.7 | 89.2 |
| IPAD (Ours) | 97.5 | 92.7 | 95.6 |
- These results show that the updated baselines perform reasonably well, but still fall short of IPAD.
[W2][Q2] Sufficient Evidence and Ablation Study Comparing with DPIC.
- Both IPAD and DPIC consist of a Prompt Inverter (PI) and a corresponding classifier, and they differ clearly at both stages.
- PI:
  - DPIC uses a fixed prompt for zero-shot prompt generation without any supervision signal, and we found that zero-shot prompt generation performs significantly worse than IPAD's fine-tuned PI (see Table 5).
  - IPAD fine-tunes the PI on prompt-generated-text pairs, using both general and essay-style data, which results in more plausible prompts that benefit AI-detection performance (a minimal fine-tuning sketch follows the ablation tables below).
- PI-Based AI Detection:
  - DPIC simply adopts a DNN model with text embeddings as input to distinguish human-written from AI-generated text.
  - IPAD uses two fine-tuned LLMs to perform AI detection:
    - (a) Prompt-Text Consistency Verifier (PTCV): measures the alignment between the prompt predicted by IPAD's PI and the input raw text, i.e., "Can an LLM generate [text] from the prompt [IPAD's predicted prompt]?".
    - (b) Regeneration Comparator (RC): measures the alignment between the text regenerated from IPAD's predicted prompt and the input raw text, i.e., "[Text generated using IPAD's predicted prompt] is LLM-generated; determine whether [text] is also LLM-generated.".
  - By using both PTCV and RC, IPAD achieves superior AI-detection performance with promising OOD performance and robustness.
- As the reviewer notes, a fine-tuned Prompt Inverter is always expected to outperform the zero-shot one used in DPIC. To show the effectiveness of our AI-detection modules PTCV and RC, we therefore compare our method against a modified DPIC that uses our fine-tuned PI in place of its original zero-shot PI.
| Prompt Inverter (PI) | PI-Based AI Detection | Arxiv | Xsum | Writing | Review |
|---|---|---|---|---|---|
| IPAD | DPIC | 54.3 | 49.8 | 82.3 | 57.2 |
| IPAD | PTCV (IPAD w/o RC) | 92.4 | 98.3 | 98.7 | 95.6 |
| IPAD | RC (IPAD w/o PTCV) | 98.7 | 97.3 | 94.5 | 96.7 |
| IPAD | PTCV+RC (IPAD) | 100 | 99.85 | 99.4 | 98.2 |
| PI | PI-Based AI Detection | PromptAttack | ParaphraseAttack | PerturbationAttack |
|---|---|---|---|---|
| IPAD | DPIC | 46.2 | 42.3 | 41.5 |
| IPAD | PTCV | 91.1 | 96.9 | 95.7 |
| IPAD | RC | 91.3 | 93.1 | 90.3 |
| IPAD | PTCV+RC (IPAD) | 97.3 | 96.0 | 98.1 |
| PI | PI-Based AI Detection | LongWriter | Code-Feedback | Math |
|---|---|---|---|---|
| IPAD | DPIC | 57.2 | 62.8 | 49.3 |
| IPAD | PTCV | 87.5 | 92.4 | 93.8 |
| IPAD | RC | 94.7 | 91.8 | 89.2 |
| IPAD | PTCV+RC (IPAD) | 97.5 | 92.7 | 95.6 |
- From the above, IPAD's PTCV and RC each bring significant gains, and their combination as the complete IPAD method achieves the best results.
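As referenced above, a minimal sketch of what supervised fine-tuning of the PI on (generated text, original prompt) pairs could look like; the instruction template, data, and hyperparameters are illustrative assumptions rather than the exact recipe in the paper.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/Phi-3-medium-128k-instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

class InversionPairs(Dataset):
    """Condition on the text; compute the LM loss on the prompt tokens only."""
    def __init__(self, pairs, max_len=1024):
        self.pairs, self.max_len = pairs, max_len
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, i):
        text, prompt = self.pairs[i]
        src = f"Recover the prompt that generated this text:\n{text}\nPrompt:"
        src_ids = tok(src, truncation=True, max_length=self.max_len).input_ids
        tgt_ids = tok(" " + prompt, add_special_tokens=False).input_ids
        tgt_ids.append(tok.eos_token_id)
        return {"input_ids": torch.tensor(src_ids + tgt_ids),
                "labels": torch.tensor([-100] * len(src_ids) + tgt_ids)}

pairs = [("an LLM-generated essay ...", "Write an essay about ...")]  # toy data
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for batch in DataLoader(InversionPairs(pairs), batch_size=1):
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward(); opt.step(); opt.zero_grad()
```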
[Q3] Details and analysis on fine-tuning data and OOD data.
- Generalization: To simulate biased training-data selection, each training sample, represented by the embedding of the input text obtained from `gte-Qwen1.5-7B-instruct`, is assigned a sampling probability obtained by passing a scaled score of the embedding through the sigmoid function, with the scale parameter a controlling the strength of the bias.
- We randomly sample 5,000 training samples according to these probabilities, and use a fixed set of 1,000 test samples for evaluation when training both PTCV and RC.
| Method | Arxiv AUROC | Arxiv F1 | Xsum AUROC | Xsum F1 | Writing AUROC | Writing F1 | Review AUROC | Review F1 |
|---|---|---|---|---|---|---|---|---|
| a=5 | 91.74 | 97.18 | 92.98 | 96.72 | 91.32 | 95.20 | 92.99 | 95.30 |
| a=15 | 91.08 | 96.71 | 92.13 | 96.52 | 90.39 | 95.47 | 93.95 | 95.28 |
| Random | 92.28 | 97.52 | 93.75 | 96.85 | 91.40 | 95.70 | 93.21 | 96.10 |
- We additionally introduce label noise by randomly flipping 5% and 15% of the training labels within the above sampled training sets to evaluate the robustness of our method.
| Method | Arxiv AUROC | Arxiv F1 | Xsum AUROC | Xsum F1 | Writing AUROC | Writing F1 | Review AUROC | Review F1 |
|---|---|---|---|---|---|---|---|---|
| Flip 5 % | 89.12 | 92.48 | 87.26 | 91.35 | 91.72 | 94.28 | 90.83 | 94.20 |
| Flip 15 % | 88.01 | 91.28 | 86.53 | 90.83 | 88.27 | 93.04 | 87.88 | 91.7 |
| Random | 92.28 | 97.52 | 93.75 | 96.85 | 91.40 | 95.70 | 93.21 | 96.10 |
- The results indicate that our method generalizes well even under biased sampling strategies, which induce distributional shifts. Moreover, the model remains robust under label noise, with only moderate degradation at 5% and 15% noise levels.
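A sketch of the two training-data perturbations described above: sigmoid-based biased sampling (the exact scoring function of the embedding is our assumption, as only the sigmoid and the parameter a are stated) and random label flipping.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_sample(embeddings, a, k=5000):
    w = rng.normal(size=embeddings.shape[1])         # assumed scoring direction
    p = 1.0 / (1.0 + np.exp(-a * (embeddings @ w)))  # sigmoid sampling scores
    p /= p.sum()
    return rng.choice(len(embeddings), size=k, replace=False, p=p)

def flip_labels(labels, rate):
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    labels[idx] = 1 - labels[idx]                    # flip 0 <-> 1 at the given rate
    return labels
```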
[W3] Poor writing quality
- Thanks for pointing out this issue. We will integrate the main results into a clear table at the beginning of Section 3.
[W1] Related Work
Thank you for your kind suggestion! We will add the following related work in our revised version. The state-of-the-art AI detection methods can be broadly divided into four categories:
- Probability-based approaches treat a language model's output distribution as a signal. DetectGPT [8] and Fast-DetectGPT [2] note that LGT tends to lie in regions of negative curvature of a language model's log-probability function. Lastde [9] mines token-probability sequences using time-series analysis to capture temporal dynamics. Glimpse [7] extends white-box detectors to proprietary models by predicting full distributions from partial observations.
- Regeneration-based methods generate alternative versions of the text and compare them with the original. RAIDAR [4] and MAGRET [10] observe that LLMs modify human-written text more than AI-generated text when tasked with rewriting. DNA-GPT [3] generates the remaining part of a truncated text and compares n-gram differences. TOCSIN [11] measures semantic difference after random deletion of tokens and combines this signal with an existing zero-shot detector.
- Syntax-based methods leverage structural linguistic cues rather than probability estimates. PRDetect [12] notes that HWT and LGT differ in their syntactic structures, so it extracts syntax-tree features to achieve high accuracy and robustness. Text Fluoroscopy [13] extracts the layer with the largest distribution difference between the first and last layers to detect synthetic texts across domains.
- Contrastive-learning approaches train detectors to learn writing styles or domain-invariant features. FAID [5] collects a multilingual, multi-domain dataset and introduces a fine-grained classification framework. DeTeCtive [14] argues that detection should distinguish writing styles rather than simply classify LGT versus HWT. Advacheck's multi-task system [6] shares a transformer encoder among several heads, with one binary head for classification and auxiliary heads for domain classification.
References
[1] Hans et al. Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text. ICML 2024.
[2] Bao et al. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. ICLR 2024.
[3] Yang et al. DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text. ICLR 2024.
[4] Mao et al. Raidar: generative ai detection via rewriting. ICLR 2024.
[5] Ta et al. FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning. arXiv 2025.
[6] Gritsai et al. Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking. 2025.
[7] Bao et al. Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection. ICLR 2025.
[8] Mitchell et al. DetectGPT: zero-shot machine-generated text detection using probability curvature. ICML 2023.
[9] Xu et al. Training-free LLM-generated Text Detection by Mining Token Probability Sequences. ICLR 2025.
[10] Huang et al. MAGRET: Machine-generated Text Detection with Rewritten Texts. ACL 2025.
[11] Ma et al. Zero-Shot Detection of LLM-Generated Text using Token Cohesiveness. EMNLP 2024.
[12] Li et al. PRDetect: Perturbation-Robust LLM-generated Text Detection Based on Syntax Tree. NAACL 2025.
[13] Yu et al. Text fluoroscopy: Detecting LLM-generated text through intrinsic features. EMNLP 2024.
[14] Guo et al. Detective: Detecting ai-generated text via multi-level contrastive learning. NeurIPS 2024.
We also supplement an updated overview of Prompt Inverter methods:
- Optimization-Based Methods aim to reconstruct prompts by directly optimizing input tokens or embeddings to minimize the difference between generated outputs and the target. Techniques such as Prompt Inversion Attack (PIA) [3] and SODA [4] use gradient-based optimization or discrete search to reconstruct prompts from internal model representations or output probabilities.
- Embedding-Matching & Generative Methods utilize learned mappings from LLM outputs, hidden states, or embeddings to prompt text. Methods like Logit2Prompt [1], Output2Prompt [2], PILS [5], and generative embedding inversion [8] train encoders or decoders to predict prompts from distributions or embeddings.
- Contrastive Decoding Methods analyze how slight variations in prompts affect outputs to infer salient prompt features. ROSE [10] uses contrastive prompts to identify and suppress harmful behaviors, while prompt classification methods like [7] extract prompt styles by contrasting against known patterns. Though typically not reconstructing verbatim prompts, these methods are valuable for recovering structural or intent-level information, often in low-resource or stealthy scenarios.
- Diffusion-Based Approaches remain largely conceptual in textual prompt inversion. While diffusion methods are well-established in image prompt recovery, they have yet to gain traction for discrete language tasks. Nonetheless, some conceptual work (e.g., ROSE [10]) hints at iterative refinement or contrastive correction mechanisms that could serve as proto-diffusion analogues for text, suggesting an interesting future direction.
We also explored the effect of three alternative prompts (in addition to DPIC's original prompt) on the performance of the Prompt Inverter.
Prompts
| Source | Prompt |
|---|---|
| [11] RPE | You are an expert prompt detective. Given the following LLM response, guess a likely prompt that could have produced it. Return only the prompt. <INPUT> |
| [7] DP1 | What question are you asked if you can generate the following answer? <INPUT> |
| [7] DP2 | Assume you are the prompt reconstructor who can help the user generate the prompts based on the given answers. Can you generate the original prompts for the following answer? <INPUT> |
| DPIC | I want you to play the role of the questioner. I will type an answer in English, and you will ask me a question based on the answer in the same language. Don’t write any explanations or other text, just give me the question. <INPUT> |
LGT Results
| Metric | RPE | DP1 | DP2 | DPIC | Prompt Avg | IPAD |
|---|---|---|---|---|---|---|
| BART-large-CNN ↓ | -1.98 | -2.43 | -2.18 | -2.12 | -2.1775 | -1.84 |
| Sentence-BERT ↑ | 0.58 | 0.41 | 0.43 | 0.46 | 0.47 | 0.69 |
| BLEU ↑ | 7.28E-03 | 3.46E-05 | 9.32E-05 | 5.61E-05 | 1.87E-03 | 0.24 |
| ROUGE-1 ↑ | 0.17 | 0.09 | 0.08 | 0.04 | 0.095 | 0.51 |
HWT Results
| Metric | RPE | DP1 | DP2 | DPIC | Prompt Avg | IPAD |
|---|---|---|---|---|---|---|
| BART-large-CNN ↓ | -2.78 | -3.03 | -2.85 | -2.47 | -2.7825 | -2.22 |
| Sentence-BERT ↑ | 0.47 | 0.35 | 0.39 | 0.42 | 0.4075 | 0.57 |
| BLEU ↑ | 8.42E-04 | 3.65E-06 | 3.25E-05 | 8.75E-06 | 2.22E-04 | 0.13 |
| ROUGE-1 ↑ | 0.12 | 0.12 | 0.04 | 0.06 | 0.085 | 0.39 |
[1] Morris et al. Language Model Inversion. ICLR 2024.
[2] Zhang et al. Extracting Prompts by Inverting LLM Outputs. EMNLP 2024.
[3] Qu et al. Prompt Inversion Attack against Collaborative Inference of Large Language Models. SP 2025.
[4] Skapars et al. GPT, But Backwards: Exactly Inverting Language Model Outputs. In The 2nd Workshop on Reliable and Responsible Foundation Models at ICML 2025.
[5] Nazir et al. Better Language Model Inversion by Compactly Representing Next-Token Distributions. arXiv:2506.17090.
[6] Gao et al. DORY: Deliberative Prompt Recovery for LLM. Findings of ACL 2024.
[7] Sha et al. Prompt Stealing Attacks Against Large Language Models. CoRR 2024.
[8] Li et al. Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. Findings of ACL 2023.
[9] Huang et al. InversionView: A General-Purpose Method for Reading Information from Neural Activations. NeurIPS 2024.
[10] Zhong et al. ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding. Findings of ACL 2024.
[11] Li, H., & Klabjan, D. (2024). Reverse Prompt Engineering. arXiv e-prints, arXiv-2411.
Dear Reviewer LAhb,
As the discussion deadline approaches, we are wondering whether our responses have properly addressed your concerns? Your feedback would be extremely helpful to us. If you have further comments or questions, we hope for the opportunity to respond to them.
Many thanks,
28109 Authors
Thank you for your detailed rebuttal. Most of my concerns are addressed. However, some concerns still remain. Specifically, I don't get the motivation behind using two fine-tuned LLMs instead of a DNN model. Also, such an operation is not theoretically grounded. After checking other reviews and considering all of them, I would like to keep my current score.
Dear reviewer,
I would really benefit from an updated response and engagement with the authors' rebuttal. In their response, they seem to have addressed several of your concerns, but I would really like to have your input on whether there are still important missing data or insights that you would like expanded in the current reviewer-author discussion phase.
The discussion period has been extended 2 days, but it would be great to see some engagement today.
Thank you, Your AC.
Dear Reviewer LAhb,
Thank you for your feedback and we are happy that most of your concerns are addressed.
Specifically, I don't get the motivation behind using two finetuned LLMs instead of using a DNN model.
We respectfully believe that we have thoroughly discussed the motivation for using two mutually beneficial fine-tuned LLMs, but we add further experiments to fully address your concern.
- Unlike DPIC, which simply adopts a DNN model on text embeddings, IPAD uses two fine-tuned LLMs to perform AI detection; most importantly, they are mutually beneficial (we also validate this via extensive experiments):
  - (a) Prompt-Text Consistency Verifier (PTCV): measures the alignment between the prompt predicted by IPAD's PI and the input raw text, i.e., "Can an LLM generate [text] from the prompt [IPAD's predicted prompt]?".
  - (b) Regeneration Comparator (RC): measures the alignment between the text regenerated from IPAD's predicted prompt and the input raw text, i.e., "[Text generated using IPAD's predicted prompt] is LLM-generated; determine whether [text] is also LLM-generated.".
- Importantly, the two LLMs perform AI detection from different perspectives: LLM 1 (PTCV) focuses on "whether a prompt could have generated a given text", while LLM 2 (RC) focuses on "whether the regenerated text is the same as the raw text", which makes them mutually beneficial.
| Method | Arxiv | Xsum | Writing | Review |
|---|---|---|---|---|
| DNN | 54.3 | 49.8 | 82.3 | 57.2 |
| LLM 1 (PTCV) | 92.4 | 98.3 | 98.7 | 95.6 |
| LLM 2 (RC) | 98.7 | 97.3 | 94.5 | 96.7 |
| LLM 1 + LLM 2 (=IPAD) | 100 | 99.85 | 99.4 | 98.2 |
| Method | PromptAttack | ParaphraseAttack | PerturbationAttack |
|---|---|---|---|
| DNN | 46.2 | 42.3 | 41.5 |
| LLM 1 (PTCV) | 91.1 | 96.9 | 95.7 |
| LLM 2 (RC) | 91.3 | 93.1 | 90.3 |
| LLM 1 + LLM 2 (=IPAD) | 97.3 | 96.0 | 98.1 |
| Method | LongWriter | Code-Feedback | Math |
|---|---|---|---|
| DNN | 57.2 | 62.8 | 49.3 |
| LLM 1 (PTCV) | 87.5 | 92.4 | 93.8 |
| LLM 2 (RC) | 94.7 | 91.8 | 89.2 |
| LLM 1 + LLM 2 (=IPAD) | 97.5 | 92.7 | 95.6 |
- The average performance of the DNN is only 54.29%, while using LLM 1 and LLM 2 separately achieves 94.24% and 93.76%, respectively, and using the two LLMs simultaneously achieves 97.47% - these results show that using LLMs to detect LLMs (that is, AI for AI) brings very strong performance gains!
- To fully support our claim that LLMs significantly outperform DNNs, we add experiments training DNNs with varying numbers of layers on the same training data.
- For a fair comparison, we train all DNNs and LLMs with the same encoder (`gte-Qwen1.5-7B-instruct`).
| Model | Training Datasets | GPT-3.5 | GPT-4 | Qwen-turbo | LLaMA-3-70B | Arxiv | XSum | Writing | Yelp | Prompt Attack | Paraphrase Attack | Perturbation Attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-layer DNN | PTCV | 44.1 | 67.4 | 69.8 | 50.3 | 49.5 | 50.3 | 81.9 | 58.1 | 52.3 | 34.7 | 35.2 |
| | RC | 67.2 | 47.8 | 72.1 | 58.2 | 57.3 | 48.7 | 56.3 | 47.2 | 47.1 | 48.3 | 39.5 |
| | PTCV + RC | 65.2 | 58.3 | 67.3 | 64.9 | 54.3 | 49.8 | 82.3 | 57.2 | 46.2 | 42.3 | 41.5 |
| 5-layer DNN | PTCV | 62.8 | 62.3 | 69.2 | 52.6 | 53.7 | 51.4 | 49.8 | 56.1 | 54.1 | 55.2 | 58.0 |
| | RC | 52.9 | 58.9 | 72.4 | 60.3 | 59.2 | 58.2 | 48.3 | 58.4 | 49.7 | 56.0 | 57.2 |
| | PTCV + RC | 69.4 | 61.6 | 62.8 | 59.5 | 58.9 | 60.0 | 52.4 | 58.3 | 53.5 | 57.8 | 59.2 |
| LLM | PTCV | 98.4 | 99.9 | 99.4 | 99.6 | 92.4 | 98.3 | 98.7 | 95.6 | 91.1 | 96.9 | 95.7 |
| | RC | 97.8 | 98.5 | 93.6 | 94.6 | 98.7 | 97.3 | 94.5 | 96.7 | 91.3 | 93.1 | 90.3 |
| | PTCV + RC | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.9 | 99.4 | 98.5 | 97.3 | 96.0 | 98.1 |
- As shown, even the best-performing DNN falls short across in-distribution, OOD, and attacked settings.
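For reference, a minimal sketch of the kind of shallow DNN-on-embeddings baseline compared here; layer sizes and training details are illustrative, not the exact configuration used.

```python
import torch
import torch.nn as nn

class EmbeddingDNN(nn.Module):
    """A 3-layer MLP over frozen text embeddings, as in the DNN rows above."""
    def __init__(self, dim=4096, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # logit for P(LLM-generated)
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

model = EmbeddingDNN()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 4096)                     # dummy embedding batch
y = torch.randint(0, 2, (8,)).float()        # dummy labels
opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
```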
Also, such an operation is not theoretically grounded.
We respectfully note that, to the best of our knowledge, the community of AI detection is not theoretically grounded, so theoretically showing that LLMs outperform DNNs is of limited value. Although we feel it is not strictly necessary, we still add theory to show that simultaneously using two LLMs with different perspectives has a tighter error bound than using just one LLM.
First of all, we formulate the problem and provide all the required definitions.
- We consider a binary classification problem with input feature space $\mathcal{X}$, output space $\mathcal{Y} = \{0, 1\}$, and true data distribution $\mathcal{D}$.
- $p(x)$ is the conditional probability that the label is positive given input feature $x$.
- Let $f_1(x)$ and $f_2(x)$ be the output probabilities of two classifiers $f_1$ and $f_2$, respectively.
- The weighted average classifier $f_w$ has output probability $f_w(x) = w f_1(x) + (1-w) f_2(x)$, where the weight $w \in [0, 1]$.
- We use the squared loss to measure the generalization error of a classifier: $\mathcal{E}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[(f(x) - y)^2]$.
- Let $f^*$ be the ideal classifier, whose output probability equals $p(x)$.
- The Bayes risk $\mathcal{E}^*$ is the generalization error of $f^*$.
We can theoretically show that, based on two diverse classifiers, the weighted average classifier has a smaller generalization error.

Theorem 1. Define the model biases $b_i(x) = f_i(x) - p(x)$ and the reducible model errors $R_i = \mathbb{E}[b_i(x)^2]$ for $i = 1, 2$. Let $f_1$ and $f_2$ be two different classifiers that meet the following diversity condition: $C := \mathbb{E}[b_1(x)\,b_2(x)] < \min(R_1, R_2)$.

Then there must exist a weight $w \in (0, 1)$ such that the reducible error of the weighted average classifier is strictly less than the reducible error of the better of the two single classifiers. Formally, $R(w) := \mathbb{E}[(w\,b_1(x) + (1-w)\,b_2(x))^2] < \min(R_1, R_2)$. This theorem means that there exists a weighted average classifier that is closer to the Bayes-optimal classifier than either single component.

Proof. The model bias of the weighted average classifier is $b_w(x) = f_w(x) - p(x) = w\,b_1(x) + (1-w)\,b_2(x)$. Expanding the square, we can compute the reducible error as $R(w) = w^2 R_1 + (1-w)^2 R_2 + 2w(1-w)C$, where $C = \mathbb{E}[b_1(x)\,b_2(x)]$ is the covariance of the biases.

$R(w)$ is a quadratic function of the weight $w$ (a parabola opening upwards, since its leading coefficient $R_1 + R_2 - 2C > 0$ under the diversity condition). To find the weight minimizing $R(w)$, we take its derivative and set it to zero: $R'(w) = 2wR_1 - 2(1-w)R_2 + 2(1-2w)C = 0$. Solving this equation gives the optimal weight $w^* = \frac{R_2 - C}{R_1 + R_2 - 2C}$.

Since $C < R_1$ and $C < R_2$, this optimal weight lies between 0 and 1. To prove that the combined model may be superior to a single model, we do not need to calculate the minimum error value; we only need to show that there is a lower point between the two endpoints ($w = 0$ and $w = 1$) of the parabola.

Let us examine the derivatives at the endpoints: $R'(0) = 2(C - R_2)$ and $R'(1) = 2(R_1 - C)$. Since $C < \min(R_1, R_2)$, we have $R'(0) < 0$ and $R'(1) > 0$. As a result, $R(w)$ decreases at $w = 0$ and increases at $w = 1$, so its minimum is strictly smaller than the values at the two endpoints, i.e., $R(w^*) < \min(R(0), R(1)) = \min(R_2, R_1)$.

As $\mathcal{E}(f_w) = R(w) + \mathcal{E}^*$, the decrease of $R(w)$ directly decreases the generalization error $\mathcal{E}(f_w)$, bringing it closer to the irreducible Bayes risk $\mathcal{E}^*$.

This ends the proof of Theorem 1.
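A quick numerical sanity check of Theorem 1 on synthetic biases (all values illustrative): when the bias covariance is below both reducible errors, the combined error dips strictly below the better single classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
b1 = rng.normal(0.3, 0.2, 100_000)    # bias of classifier 1 across inputs
b2 = rng.normal(-0.2, 0.3, 100_000)   # bias of classifier 2 (diverse)
R1, R2, C = (b1**2).mean(), (b2**2).mean(), (b1 * b2).mean()
assert C < min(R1, R2)                # diversity condition of Theorem 1
w = (R2 - C) / (R1 + R2 - 2 * C)      # optimal weight from the proof
Rw = ((w * b1 + (1 - w) * b2) ** 2).mean()
print(f"R1={R1:.3f}, R2={R2:.3f}, C={C:.3f}, R(w*)={Rw:.3f}")  # Rw < min(R1, R2)
```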
- True Error $\epsilon(h)$: For a hypothesis $h$, its true generalization error is the probability of misclassifying a random example drawn from $\mathcal{D}$. Formally, $\epsilon(h) = \Pr_{(x,y)\sim\mathcal{D}}[h(x) \neq y]$.
- Hypothesis Weight $w(h)$: Each hypothesis is assigned a weight based on its true error, determined by a parameter $\eta > 0$. The weight is defined as $w(h) = e^{-\eta\,\epsilon(h)}$.
- True Log-Ratio $l(x)$: For any instance $x$, the true log-ratio is a weighted comparison of the hypotheses that predict $+1$ versus those that predict $-1$. It is defined as $l(x) = \ln \frac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}$.
- Averaged Classifier $\bar{h}$: The final prediction of the averaged classifier is the sign of the true log-ratio, $\bar{h}(x) = \mathrm{sign}(l(x))$.

Theorem 2. Let the hypothesis class be partitioned into a single best hypothesis $h^*$ with error $\epsilon^* = \epsilon(h^*)$, and a set of suboptimal hypotheses $\mathcal{H}'$.

Let the following conditions hold:
- All suboptimal hypotheses have the same true error, $\epsilon(h) = \epsilon_{\mathrm{sub}}$ for all $h \in \mathcal{H}'$, where $\epsilon_{\mathrm{sub}} > \epsilon^*$.
- For any instance $x$, the set of suboptimal hypotheses is collectively balanced. Specifically, a constant fraction $f$ of the hypotheses in $\mathcal{H}'$ correctly classifies $x$, where $f > 1/2$. That is, for any example $(x, y)$, $|\{h \in \mathcal{H}' : h(x) = y\}| = f\,|\mathcal{H}'|$.
- The number of suboptimal hypotheses is sufficiently large relative to the errors and the parameter $\eta$, such that $|\mathcal{H}'| > \frac{e^{\eta(\epsilon_{\mathrm{sub}} - \epsilon^*)}}{2f - 1}$.

Then the generalization error of the averaged classifier is zero. Thus, its performance is strictly superior to the best single hypothesis, i.e., $\epsilon(\bar{h}) = 0 < \epsilon^* = \epsilon(h^*)$.
Proof. The error of the averaged classifier is $\epsilon(\bar{h}) = \Pr[\bar{h}(x) \neq y]$. A prediction is incorrect if $\mathrm{sign}(l(x)) \neq y$, which is equivalent to $y \cdot l(x) < 0$. To prove that the error is zero, we must show that $\bar{h}(x) = y$ for any example $(x, y)$ drawn from the distribution $\mathcal{D}$.

Writing $W_{\mathrm{correct}} = \sum_{h:\,h(x) = y} w(h)$ and $W_{\mathrm{incorrect}} = \sum_{h:\,h(x) \neq y} w(h)$, the prediction is correct exactly when the ratio $W_{\mathrm{correct}} / W_{\mathrm{incorrect}}$ inside the logarithm is greater than 1. We analyze this ratio by considering two disjoint cases that cover all possibilities for any given example $(x, y)$.

We first consider the case where the best hypothesis is correct (i.e., $h^*(x) = y$).

In this case, the set of hypotheses that correctly predict $y$ includes $h^*$ and the fraction $f$ of hypotheses from $\mathcal{H}'$ that are correct by assumption. The set of incorrect hypotheses consists of the remaining fraction $(1-f)$ of $\mathcal{H}'$. The sum of weights of the correct hypotheses is $W_{\mathrm{correct}} = e^{-\eta \epsilon^*} + f\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}$.

The sum of weights of the incorrect hypotheses is $W_{\mathrm{incorrect}} = (1-f)\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}$. The ratio of weights is $\frac{W_{\mathrm{correct}}}{W_{\mathrm{incorrect}}} = \frac{e^{-\eta \epsilon^*}}{(1-f)\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}} + \frac{f}{1-f}$.

By the theorem's condition, $f > 1/2$, which implies $\frac{f}{1-f} > 1$; since the first term is positive, the entire ratio is strictly greater than 1. Thus $l(x)$ has the same sign as $y$, the prediction is correct, and $\bar{h}(x) = y$.

Then we consider the case where the best hypothesis is incorrect (i.e., $h^*(x) \neq y$).

This is the critical case where the averaged classifier must outperform the best single hypothesis. Here, the set of correct hypotheses consists only of the fraction $f$ of hypotheses from $\mathcal{H}'$. The set of incorrect hypotheses includes $h^*$ and the remaining $(1-f)$ of $\mathcal{H}'$.

The sum of weights of the correct hypotheses is:
$W_{\mathrm{correct}} = |\{h \in \mathcal{H}' : h(x) = y\}| \cdot w(h_{\mathrm{sub}}) = f\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}$.

The sum of weights of the incorrect hypotheses is:
$W_{\mathrm{incorrect}} = w(h^*) + |\{h \in \mathcal{H}' : h(x) \neq y\}| \cdot w(h_{\mathrm{sub}}) = e^{-\eta \epsilon^*} + (1-f)\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}$.

For the prediction to be correct, the ratio of weights must be greater than 1, i.e., $f\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}} > e^{-\eta \epsilon^*} + (1-f)\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}}$.

Rearranging the terms to solve for $|\mathcal{H}'|$ and simplifying the left-hand side gives $(2f - 1)\,|\mathcal{H}'|\,e^{-\eta \epsilon_{\mathrm{sub}}} > e^{-\eta \epsilon^*}$.

Since $f > 1/2$, the term $(2f - 1)$ is positive, so we can divide without changing the inequality's direction: $|\mathcal{H}'| > \frac{e^{\eta(\epsilon_{\mathrm{sub}} - \epsilon^*)}}{2f - 1}$.

This is precisely the condition required by the theorem. Under this condition, the ratio of weights is greater than 1, meaning $l(x)$ has the same sign as $y$, and the prediction is correct.

This finishes the proof of Theorem 2.
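A quick numerical check of the Theorem 2 condition (all constants illustrative): once $|\mathcal{H}'|$ exceeds the stated bound, the weighted vote stays correct even in the worst case where $h^*$ errs.

```python
import math

eps_star, eps_sub, eta, f = 0.05, 0.20, 2.0, 0.7
H = math.ceil(math.exp(eta * (eps_sub - eps_star)) / (2 * f - 1)) + 1
w_star = math.exp(-eta * eps_star)      # weight of the best hypothesis
w_sub = math.exp(-eta * eps_sub)        # weight of each suboptimal hypothesis
W_correct = f * H * w_sub               # worst case: only H' members vote right
W_incorrect = w_star + (1 - f) * H * w_sub
assert W_correct > W_incorrect          # averaged classifier is still correct
print(H, W_correct, W_incorrect)
```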
Please let us know if we have resolved your concerns – thank you!
Dear reviewer LAhb,
Since the discussion period will end in a few hours, we will be online waiting for your feedback on our rebuttal, which we believe has fully addressed your concerns.
We would highly appreciate it if you could take into account our response when updating the rating and having discussions with AC and other reviewers.
Thank you so much for your time and efforts. Sorry for our repetitive messages, but we're eager to ensure everything is addressed.
Authors of # 28109
The authors propose IPAD (Inverse Prompt for AI Detection), a framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input texts align with the predicted prompts. Their experiments show that IPAD outperforms the strongest baselines on in-distribution data, out-of-distribution (OOD), and attacked data.
Strengths and Weaknesses
Strengths:
- There are experiments testing the method's performance in several settings: in-distribution, out-of-distribution, structured prompts, as well as adversarial inputs. This lends more credibility and depth to the technique.
- The results are pretty solid and demonstrate the effectiveness of this method compared to existing detection methods.

Weaknesses:
- Not necessarily a weakness, but a limitation of the proposed method is that it requires fine-tuning multiple models, a costly task that makes it less accessible for easy adoption. Besides that, the method requires multiple calls to LLMs for every example, which again is a costly operation.
Questions
- Does the OOD distribution only contain HWT samples? I would recommend clarifying the composition of this dataset in the paper.
- In Figure 3, while there is a general trend of the models performing worse when evaluated under adversarial attacks, there are some instances where the model actually performs better with the attack than without. Is this an evaluation error, or is this expected?
Limitations
Yes
Final Justification
The authors clarified most of the questions I had regarding the results and the latency impacts of their method. I stand by my original ratings, as I feel they are still reflective of my assessment of the paper.
Formatting Issues
None
Thank you for your valuable feedback and the time dedicated to reviewing our work. We address your concerns and questions as follows.
[W] Cost of Fine-tuning & Multi-Call Overhead
- Single-time training: We only need to train our model once, and it can generalize across various LLMs. For instance, in addition to the four common LLMs shown in Table 1, we also include results from two more models used in [1] to support this argument:
| Original Generator | Re-Generator | HumanRec | MachineRec | AvgRec | AUROC | F1 |
|---|---|---|---|---|---|---|
| Claude Sonnet 3.5 | Claude Sonnet 3.5 | 98.20 | 100.00 | 99.10 | 98.20 | 99.09 |
| Claude Sonnet 3.5 | GPT-3.5-Turbo | 97.50 | 99.60 | 98.55 | 97.11 | 98.53 |
| Gemini-1.5-Pro | Gemini-1.5-Pro | 98.40 | 100.00 | 99.20 | 98.40 | 99.19 |
| Gemini-1.5-Pro | GPT-3.5-Turbo | 96.30 | 99.20 | 97.75 | 95.53 | 97.72 |
- Efficient runtime: Our model achieves high runtime efficiency compared to similarly sized models. As discussed in Section 2.3, the inference procedure of our framework is designed to be computationally efficient and accessible:
  - It involves three calls to `phi-3-medium-128k-instruct`, a lightweight, open-sourced LLM.
  - It includes one call to GPT-3.5-turbo via API, which introduces fixed latency but no local compute cost.
  - The total computational complexity is bounded by O(3 · L · n² · d + OpenAI_API), where `L = 32` is the number of layers and `d = 3072` is the hidden dimension of phi-3. To support this, we compare the per-layer parameter cost (L × d) of phi-3 with other similarly sized models:
| Model | Layers (L) | Hidden Dim (d) | L × d |
|---|---|---|---|
| Phi-3-14B-medium | 32 | 3072 | 98,304 |
| LLaMA-3-8B | 32 | 4096 | 131,072 |
| Qwen2-7B | 28 | 3584 | 100,352 |
- Better than zero-shot methods:
On average, our fine-tuned approach consistently outperforms zero-shot baselines.
| Setting | Arxiv AUROC | Arxiv F1 | XSum AUROC | XSum F1 | Writing AUROC | Writing F1 | Review AUROC | Review F1 | Prompt Attack AUROC | Prompt Attack F1 | Paraphrase Attack AUROC | Paraphrase Attack F1 | Perturbation Attack AUROC | Perturbation Attack F1 | LongWriter MachineRec | Code-Feedback MachineRec | Math MachineRec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 67.22 | 80.60 | 67.54 | 81.83 | 72.15 | 84.09 | 66.92 | 80.63 | 68.65 | 81.98 | 67.54 | 81.83 | 56.60 | 78.05 | 53.69 | 49.61 | 56.91 |
| Fine-tuned | 82.21 | 90.79 | 83.14 | 91.23 | 84.95 | 92.33 | 78.68 | 88.43 | 85.80 | 92.90 | 83.14 | 91.23 | 83.01 | 90.83 | 77 | 74.53 | 70.60 |
[Q1] Does the OOD Distribution Only Contain HWT Samples?
- The OOD distributions in Tables 2 and 3 contain both HWT and LGT, while Table 4 contains only LGT. Below we detail the dataset composition for OOD evaluation:
| Table / Figure | Original Generator | Re-Generator | HWT Samples | LGT Samples | Source (empty means same as the line above) |
|---|---|---|---|---|---|
| Table 1 | GPT-3.5-Turbo | GPT-3.5-Turbo | 1000 | 1000 | From OUTFOX |
| | GPT-4 | GPT-4 | 1000 | 1000 | Prompts from OUTFOX; generated with the corresponding LLM |
| | GPT-3.5-Turbo | GPT-4 | 1000 | 1000 | |
| | Qwen-Turbo | Qwen-Turbo | 1000 | 1000 | |
| | Qwen-Turbo | GPT-3.5-Turbo | 1000 | 1000 | |
| | LLaMA-3-70B | LLaMA-3-70B | 1000 | 1000 | |
| | LLaMA-3-70B | GPT-3.5-Turbo | 1000 | 1000 | |
| | Claude Sonnet 3.5 | Claude Sonnet 3.5 | 1000 | 1000 | |
| | Claude Sonnet 3.5 | GPT-3.5-Turbo | 1000 | 1000 | |
| | Gemini-1.5-Pro | Gemini-1.5-Pro | 1000 | 1000 | |
| | Gemini-1.5-Pro | GPT-3.5-Turbo | 1000 | 1000 | |
| Figure 2 | GPT-3.5-Turbo | GPT-3.5-Turbo | 500 | 500 | From OUTFOX |
| | GPT-4 | GPT-4 | 500 | 500 | |
| | GPT-3.5-Turbo | GPT-3.5-Turbo + DIPPER Attack | 500 | 500 | From OUTFOX |
| | GPT-3.5-Turbo | GPT-3.5-Turbo + OUTFOX | 500 | 500 | From OUTFOX |
| Table 2 | ArXiv, XSum, Writing, Review | | 500 each | 500 each | From DetectRL |
| Table 3 | Prompt, Paraphrase, and Perturbation Attack | | 500 each | 500 each | From DetectRL |
| Table 4 | LongWriter, Code-Feedback, Math | | 0 | 500 each | From the corresponding datasets |
| Figure 3 | GPT-3.5-Turbo | GPT-3.5-Turbo | 1000 | 1000 | Same as Table 1, line 1 |
| Tables 5–6 | GPT-3.5-Turbo | GPT-3.5-Turbo | 1000 | 1000 | Same as Table 1, line 1 |
[Q2] Figure 3: Why Does Adversarial Attack Sometimes Improve Performance?
- This is not an evaluation error. We observed that Prompt Attack and Paraphrase Attack datasets yield slightly higher AUROC than the non-attacked Yelp dataset under the FT Prompt setting.
- The arxiv, xsum, writing, and review datasets represent standard OOD.
- The prompt, paraphrase, and perturbation datasets are attacked OOD.
- These categories are orthogonal, so only averages across all settings are comparable.
Most importantly, the average AUROC without attacks is higher than with attacks, which aligns with expected adversarial degradation.
Thanks for the clarification about the latency concerns, and for sharing the results on more extensive experiments.
Regarding the question "Figure 3: Why Does Adversarial Attack Sometimes Improve Performance?": I actually quoted the wrong figure number. I meant to refer to Figure 2, where you have results for gpt3.5-turbo, gpt4, gpt3.5-turbo + DIPPER, and gpt3.5-turbo + OUTFOX. If we compare the gpt3.5-turbo results, shouldn't the results with the DIPPER and OUTFOX attacks be lower than those without the attacks? But I see some cases where a higher score is achieved in the presence of an attack.
Thank you for your valuable feedback and kind recognition of our work. We further address your questions as follows.
Figure 2 results.
Even though attacks like DIPPER and OUTFOX are generally designed to reduce detection accuracy, we observe a few cases where the detector’s scores actually increase under attack.
| Detector | Attacker | Metric | Score (attacked > non-attacked) |
|---|---|---|---|
| OUTFOX | DIPPER | HumanRec | 98.0 > 97.8 |
| OUTFOX | DIPPER | RoBERTa-large | 97.0 > 90.0 |
| OUTFOX | OUTFOX | HumanRec | 98.8 > 97.8 |
- The OUTFOX detector is specially designed to anticipate and counteract attacks, especially those from the OUTFOX attacker itself. By including adversarial examples in its prompts, it becomes more robust and sometimes even performs better on attacked texts than on clean ones.
- In the case of the DIPPER attack, although its goal is to evade detection by paraphrasing, it can inadvertently alter text in ways that certain detectors (like OUTFOX) find easier to identify.
Dear reviewer uVgU,
Since the discussion period will end in a few hours, we will be online waiting for your feedback on our rebuttal, which we believe has fully addressed your concerns.
We would highly appreciate it if you could take into account our response when updating the rating and having discussions with AC and other reviewers.
Thank you so much for your time and efforts. Sorry for our repetitive messages, but we're eager to ensure everything is addressed.
Authors of # 28109
The paper proposes a method for LLM-generated text detection that extends previous work on inverse-prompt regeneration for the same task (already proposed by DPIC) with two distinguishers: 1) a Prompt-Text Consistency Verifier (PTCV) for assessing the alignment between the predicted prompt and the text to be verified, and 2) a Regeneration Comparator (RC), which measures the consistency between the input text and a text regenerated from the inverted prompt using gpt-3.5-turbo, a fast closed-source model that is nevertheless powerful for everyday tasks, especially those in the evaluation datasets. The proposed method has some similarities with DPIC, which is used as a baseline, but the performance is improved, mainly because the prompt inverter is fine-tuned / trained, and also due to the improved distinguishers.
The main strengths of the paper are the clear approach, which extends and adds some degree of novelty to DPIC, and a good experimental setup focusing on in-domain, OOD, and adversarial-attack settings for LLM-generated text detection. The paper also has good ablations, and the authors have provided additional data in the rebuttal that can be used to improve the paper; this also increased the score of at least one reviewer. I expect these new results to be added to the camera-ready.
At the same time, the paper has the following weaknesses. On my side, the main problem is that it does not use several existing datasets for the detection of LLM-generated texts to show that IPAD is better. I would have liked to see results using at least one of RAID (Dugan et al., 2024), M4GT (Wang et al., 2024a), or MAGE (Li et al., 2024). Some of these datasets actually have out-of-domain data, e.g., completely different prompts from different domains; I feel that the OOD data chosen in the paper is used a lot for LLM alignment and thus may not be representative of real-world usage, and it might exhibit important biases. Second, the degree of novelty compared to DPIC is not substantial. Third, the experimental setup is difficult to follow at times, and the paper has some presentation issues, as also signaled by at least one reviewer.
Given all these points and the feedback from the reviewers, I feel the paper is borderline above the acceptance threshold and I hope the authors are able to improve it for camera ready.