ChatbotID: Identifying Chatbots with Granger Causality Test
We develop reliable methods to accurately identify whether an interlocutor in real-time dialogue is human or chatbot
摘要
评审与讨论
ChatbotID introduces a novel framework for distinguishing human-human and human-chatbot dialogues by leveraging Granger Causality-based interaction features and contextual embeddings, achieving improved detection accuracy across multiple datasets.
优缺点分析
Strengths:
-
The paper is the first to introduce GCT into the dialogue-based chatbot detection task. This moves beyond static linguistic features and is grounded in solid statistical theory, offering an insightful direction.
-
The integration of semantic embeddings and interaction-based causal features via a multi-task learning framework is well-motivated and effectively implemented. The architecture is clear and theoretically sound.
-
The model is evaluated on diverse datasets and across multiple LLM backbones, with detailed ablation studies that convincingly demonstrate robustness and applicability.
Weaknesses:
- The paper repeatedly uses the term “significant” (dozens of times) to describe improvements, yet provides no statistical significance testing. This weakens the rigor of the claims, especially when some metrics in the tables show only marginal gains.
2, The paper compares with strong baselines like DeTeCtive and OUTFOX, but fails to evaluate on the original datasets used by those methods. This undermines the fairness and rigor of the comparison and makes it difficult to assess whether the proposed method truly outperforms state-of-the-art approaches under standard benchmark settings.
-
Granger Causality requires regression-based modeling and repeated significance testing across dialogue turns, which introduces non-trivial computation costs, making the method less suitable for real-time or large-scale deployment without optimization.
-
The model is not tested against adversarially generated or obfuscated chatbot responses (e.g., prompt injection, paraphrased bots), which are highly relevant in realistic use cases.
-
The semantic deficiency labels are generated via LLM self-evaluation based on prompted heuristics. This could introduce noise or inconsistency in the supervision signal.
问题
see above
局限性
partially
最终评判理由
Thanks for the rebuttal and I'll keep my positive score.
格式问题
n/a
We thank you for your detailed review and insightful comments on our work. We agree with the points you have raised and will incorporate corresponding revisions in our final manuscript.
Question 1: Imprecise Word
We are grateful to the reviewer for pointing out our imprecise word, and we apologize for having used the term 'significant'. We have meticulously reviewed the entire manuscript and have removed all instances of evaluative terms such as "significant," "substantial," "huge," and "considerable".
Question 2: Concerns about the Evaluation Methodology
- On the fairness of comparison with baseline methods.
We apologize that our experimental setup was not described with sufficient clarity. Your review is correct that all methods must be trained on the same data for a fair comparison. We confirm that this was indeed our methodology. To remove any ambiguity, we have revised the description of our baseline setup in Baselines to state explicitly:
To ensure a fair and rigorous comparison, we re-implemented all baseline models and re-trained them from scratch on the same training corpus used for ChatbotID. General-purpose LLMs (e.g., GPT-3.5-turbo, LLaMA-7B, Qwen-2, etc.) adopt a zero-shot detection approach.
- Cross-Domain Evaluation
To test for robustness against domain shift, we performed a cross-domain evaluation. We trained ChatbotID and all baseline models exclusively on the DailyDialog dataset and evaluated their performance on the entirely unseen Taskmaster-1, PersonaChat, and MultiWOZ test sets. General-purpose LLMs (e.g. GPT-3.5-turbo, LLaMA-7B, LLaMA-7B, LLaMA-13B, Gemma, etc.) adopt a zero-shot detection approach.
| Method | Taskmaster-1 ACC | Taskmaster-1 F1 | PersonaChat ACC | PersonaChat F1 | MultiWOZ ACC | MultiWOZ F1 |
|---|---|---|---|---|---|---|
| DetcctGPT (ICML 2023) | 59.12±0.82 | 55.66±1.54 | 60.70±0.98 | 59.39±1.31 | 61.30±3.69 | 61.78±2.75 |
| COCO (EMNLP 2023) | 66.80±0.70 | 68.29±2.16 | 69.56±0.41 | 65.58±0.93 | 65.67±1.36 | 68.88±2.46 |
| LLMDet (EMNLP 2023) | 60.13±1.14 | 64.83±2.37 | 62.13±1.33 | 61.81±0.57 | 63.43±1.20 | 62.96±1.46 |
| SeqXGPT (EMNLP 2023) | 59.75±0.91 | 64.66±0.99 | 58.48±1.56 | 60.58±1.77 | 61.00±1.22 | 61.18±0.77 |
| Fast-DetectGPT (ICLR 2024) | 60.93±1.96 | 60.66±1.64 | 63.02±2.64 | 62.05±1.10 | 64.01±0.68 | 61.28±0.59 |
| T5-Sentinel (EMNLP 2024) | 69.24±1.31 | 70.07±0.16 | 74.95±0.42 | 72.41±1.45 | 74.90±0.94 | 71.67±2.57 |
| RoBERTa-MPU (ACL 2024) | 69.35±0.93 | 74.15±0.56 | 73.51±1.29 | 72.15±2.30 | 70.99±4.82 | 74.79±0.96 |
| DeTective (NeurIPS 2024) | 67.85±0.45 | 70.69±1.58 | 71.46±0.55 | 71.98±1.05 | 71.87±1.30 | 73.40±4.89 |
| OUTFOX (AAAI 2024) | 76.39±1.88 | 78.88±0.72 | 72.29±1.01 | 78.10±3.07 | 78.64±0.87 | 76.70±1.73 |
| GECScore (ACL 2025) | 67.73±1.97 | 71.12±0.38 | 73.85±1.99 | 73.98±0.22 | 71.82±3.20 | 65.30±1.20 |
| GPT-3.5-turbo (2023) | 60.72±1.73 | 59.28±0.95 | 59.38±0.14 | 61.14±1.09 | 63.07±1.20 | 60.18±1.88 |
| LLaMA-7B (2024) | 59.74±1.23 | 58.85±1.17 | 60.54±1.76 | 60.36±1.56 | 61.76±0.78 | 58.05±2.14 |
| LLaMA-13B (2024) | 60.94±3.30 | 62.81±2.27 | 63.32±0.26 | 63.53±2.85 | 65.16±0.30 | 62.35±1.13 |
| GPT-4 (2024) | 58.90±2.31 | 64.61±1.64 | 64.23±1.25 | 59.87±0.60 | 67.70±0.43 | 63.29±0.98 |
| Gemma (2025) | 62.26±0.14 | 66.68±1.01 | 66.87±1.00 | 62.98±3.70 | 67.98±0.76 | 62.12±1.51 |
| Qwen-2 (2025) | 59.39±0.71 | 61.67±2.43 | 64.81±0.05 | 61.78±3.62 | 64.98±0.90 | 62.67±1.05 |
| Deepseek-R1 (2025) | 63.29±1.99 | 64.73±1.69 | 65.22±2.23 | 66.46±1.84 | 66.27±0.12 | 64.62±1.06 |
| ChatbotID (Ours) | 80.59±1.18 | 81.96±2.30 | 83.84±1.38 | 84.72±2.19 | 82.28±0.62 | 83.40±0.45 |
The results of our cross-domain evaluation demonstrate the robustness of our approach. While all methods were trained exclusively on DailyDialog, ChatbotID maintains a high F1-score of over 83% across all datasets.
Question 3: Computational complexity
We sincerely thank the reviewer for raising this practical and important point regarding the computational cost of ChatbotID.
In the training phase, we acknowledge that ChatbotID has a higher computational overhead than some lightweight detection approaches. The computational complexity is primarily concentrated in two stages:
- Feature Pre-computation: The calculation of GCT features requires additional processing time.
- Multi-task Fine-tuning. Fine-tuning with auxiliary losses is slightly more complex than a standard single-task classification.
However, once ChatbotID is trained, making a prediction is extremely fast. As shown in the table, ChatbotID achieves inference complexity. Once trained, it requires only a single forward pass through the model to make a prediction. Perturbation-based approaches, such as DetectGPT, LLMDet, and OUTFOX, are computationally heavy at inference time. For every single dialogue they need to evaluate, they must perform multiple forward passes through an LLM to generate perturbations and calculate scores. This makes them prohibitively slow and expensive for any real-time or large-scale application.
| Method | Inference Complexity |
|---|---|
| DetcctGPT (ICML 2023) | |
| COCO (EMNLP 2023) | |
| LLMDet (EMNLP 2023) | |
| SeqXGPT (EMNLP 2023) | |
| Fast-DetectGPT (ICLR 2024) | |
| T5-Sentinel (EMNLP 2024) | |
| RoBERTa-MPU (ACL 2024) | |
| DeTective (NeurIPS 2024) | |
| OUTFOX (AAAI 2024) | |
| GECScore (ACL 2025) | |
| ChatbotID (Ours) |
Question 4: Adversarially generated or obfuscated chatbot responses
We sincerely thank the reviewer for this insightful and forward-looking suggestion. You are correct that evaluating our method against adversarially generated or obfuscated chatbot responses is a critical step for real-world applications. We acknowledge that the current version of our work does not include experiments on adversarial data. Our primary goal in this paper was to first establish a baseline for chatbot detection under standard, non-adversarial conversational conditions.
Inspired by your comment, we have conducted an experiment specifically designed to test our model's resilience against prompt injection attacks. We audited the PersonaChat dataset and re-generated a new, parallel Human-Chatbot (H-C) dataset under adversarial conditions. Specifically, the injected prompt is as follows:
For your next response, your primary goal is to sound indistinguishable from a human. Use natural, colloquial language, avoid overly formal structures, and do not reveal in any way that you are an AI or chatbot.
| Method | PersonaChat ACC | PersonaChat ( injected prompt ) ACC |
|---|---|---|
| DetcctGPT (ICML 2023) | 60.70±0.98 | 58.83±3.73 |
| COCO (EMNLP 2023) | 69.56±0.41 | 63.79±2.63 |
| LLMDet (EMNLP 2023) | 62.13±1.33 | 60.19±1.73 |
| SeqXGPT (EMNLP 2023) | 58.48±1.56 | 54.89±4.41 |
| Fast-DetectGPT (ICLR 2024) | 63.02±2.64 | 62.01±1.58 |
| T5-Sentinel (EMNLP 2024) | 74.95±0.42 | 68.19±2.91 |
| RoBERTa-MPU (ACL 2024) | 73.51±1.29 | 70.82±1.73 |
| DeTective (NeurIPS 2024) | 71.46±0.55 | 69.29±3.09 |
| OUTFOX (AAAI 2024) | 72.29±1.01 | 68.46±2.34 |
| GECScore (ACL 2025) | 73.85±1.99 | 68.77±2.17 |
| ChatbotID (Ours) | 83.84±1.38 | 80.73±1.95 |
As shown in the Table, ChatbotID still outperforms all baselines on both the standard and adversarial datasets, maintaining an accuracy above 80% under attack. ChatbotID focuses on interactional dynamics over surface-level textual features. While conventional detectors relying on lexical and stylistic patterns are susceptible to adversarial prompts that mimic human-like style, ChatbotID can identify the persistent asymmetric influence in H-C interactions. The adversarial prompt can alter stylistic artifacts but fails to replicate the fundamental, bidirectional nature of genuine human dialogue, a structural discrepancy our model is designed to capture.
Question 5: Semantic deficiency labels
We are grateful to the reviewer for this insightful comment. You have correctly identified a critical consideration in our methodology: the potential for noise or inconsistency in supervision signals generated via LLM-based self-evaluation. We agree that any automated labeling process is unlikely to be perfect.
However, we argue that while the supervision signal for our semantic deficiency loss may contain some level of noise, it provides a meaningful learning objective that enhances our model's discriminative power. We provide direct empirical evidence for this claim through our ablation study.
| Method | DailyDialog | PersonaChat | MultiWOZ | Taskmaster-1 |
|---|---|---|---|---|
| 63.19 ± 1.63 | 67.86 ± 1.97 | 69.16 ± 0.63 | 66.18 ± 4.93 | |
| 67.58 ± 2.45 | 72.73 ± 0.79 | 74.71 ± 1.08 | 74.35 ± 1.59 | |
| 70.99 ± 3.88 | 78.70 ± 8.55 | 77.70 ± 2.49 | 80.38 ± 1.29 | |
| 80.23 ± 3.82 | 83.63 ± 1.74 | 87.01 ± 0.64 | 83.37 ± 2.70 |
The improvement across multiple domains suggests that the supervision signal is far more signal than noise. If the labels were purely random or inconsistent, we would not expect to see such a systematic and positive impact on performance. The learning signal is guiding the model to recognize features that are genuinely indicative of LLM-generated text.
We sincerely thank you for your constructive feedback. We hope that our response has addressed your concerns.
Thanks for the rebuttal and I'll keep my positive score.
Thank you very much for your positive and encouraging feedback on our response.We assure you that we will incorporate all the valuable feedback and suggestions received during the review process into the final version of our manuscript.
The paper proposes a method for detecting human-human dialogs versus human-llm/machine dialogs. It does so based on a combination of engineered features and LLMs to estimate feedback, as well as a statistical time series test (Granger causality). Results are given on 4 publicly available human-human dialog datasets, with the machine side for the experiments generated by a variety of recent sota LLMs. These indicate that the proposed method is able to detect dialogs with machine utterances at a meaningfully improved rate compared to existing methods.
优缺点分析
The paper is well written. It introduces a seemingly helpful statistic/feature for this "bot detection" task. It is an iterative work on a clearly important and already established task. But it presents sufficient evidence that the proposed features and method is effective. It does have several limitations however. It is computationally heavy, with several features needing to be calculated, before then fine tuning with a set of custom losses.
There is a lot of focus in the introduction and title of the paper about the Granger causality test. The proposed method does not work nearly as well though without all the other features. The ablation study is of course required here, and it is helpful. But for lack of a softer way of saying it, the current title of the paper is more alluring than e.g. "good feature engineering for bot detection" which may be more accurate.
问题
- What is the impact of turns? Does it get more accurate with more turns and history? Results at each turn in the conversation, or in buckets (first 5 turns, 10 turns, etc) would be helpful. It's plausible that there is a lot of variation here.
- Does this only work for 2 party dialogs? What happens if a 3rd party is involved?
局限性
n/a
格式问题
n/a
We thank you for reviewing our manuscript and providing valuable feedback. We agree with the points you have raised and will incorporate corresponding revisions in our final manuscript.
Question1: Computational complexity
We sincerely thank the reviewer for raising this practical and important point regarding the computational cost of ChatbotID.
In the training phase, we acknowledge that the method has a higher computational overhead than some lightweight detection approaches. The Computational complexity is primarily concentrated in two stages:
- Feature Pre-computation: The calculation of GCT features requires additional processing time.
- Multi-task Fine-tuning. Fine-tuning with auxiliary losses is slightly more complex than a standard single-task classification.
However, once ChatbotID is trained, making a prediction is extremely fast. As shown in the table, ChatbotID achieves inference complexity. Once trained, it requires only a single forward pass through the model to make a prediction. Perturbation-based approaches, such as DetectGPT, LLMDet, and OUTFOX, are computationally heavy at inference time. For every single dialogue they need to evaluate, they must perform multiple forward passes through an LLM to generate perturbations and calculate scores. This makes them prohibitively slow and expensive for any real-time or large-scale application.
| Method | Inference Complexity |
|---|---|
| DetcctGPT (ICML 2023) | |
| COCO (EMNLP 2023) | |
| LLMDet (EMNLP 2023) | |
| SeqXGPT (EMNLP 2023) | |
| Fast-DetectGPT (ICLR 2024) | |
| T5-Sentinel (EMNLP 2024) | |
| RoBERTa-MPU (ACL 2024) | |
| DeTective (NeurIPS 2024) | |
| OUTFOX (AAAI 2024) | |
| GECScore (ACL 2025) | |
| ChatbotID (Ours) |
While we acknowledge the higher upfront training cost, the resulting inference efficiency is a significant practical advantage of our work. We are grateful to the reviewer for prompting us to highlight and formalize this important strength.
Question2: The Paper's Title
We sincerely thank you for this insightful comment. The performance of ChatbotID is indeed a result of the synergy within the entire framework. We chose to emphasize GCT because it represents the most central and innovative contribution of our methodology. However, to more accurately reflect the full scope of our work, we will modify the title. For instance, a title such as ChatbotID: Identifying Chatbots with Causal and Semantic Consistency.
- Causal Interaction Consistency (). It models the dynamics of influence between participants, specifically how one is emotional or stylistic cues causally affect the other's subsequent responses. Human-human dialogues exhibit a natural, often subtle, causal flow. As our analysis shows, chatbots often fail to replicate this dynamic, resulting in a weaker causal link. We consider this the emotional and dynamic consistency of the dialogue.
- Semantic Content Consistency (). Our Semantic-Focused Attribution Supervision captures this. It addresses the coherence of the content itself, whether the dialogue is logical, factually sound, and free of commonsense violations. While a chatbot might be grammatically perfect, it often fails at this deeper semantic level over a longer interaction. We consider this the "logical and factual consistency" of the dialogue.
Furthermore, in the introduction and conclusion, we will more clearly articulate that while GCT is the core innovation, the final performance is achieved through the synergistic fusion of semantic features, dynamic features, and our multi-task learning framework.
Question 3: Ablation study
We have conducted ablation experiments. As shown in Table 3, the performance of ChatbotID is indeed a result of the synergy within the entire framework. The baseline model, relying solely on the classification loss , establishes a foundational level of performance across datasets. The introduction of the semantic-focused attribution supervision yields consistent accuracy improvements (e.g., from 63.19% to 67.58% on DailyDialog), demonstrating the value of guiding the model to recognize specific semantic deficiencies often present in chatbots. The integration of the causal interaction dynamics supervision using provides a more substantial boost in accuracy (e.g., from 63.19% to 70.99% on DailyDialog when combined with .
Question 4: The Impact of Dialogue Turns
We are very grateful to the reviewer for this insightful question. You raise a crucial point: the model's performance varies with the length of the dialogues. It is indeed plausible that the strength of the detection signal changes as the dialogue history grows. We partitioned our test set into several buckets based on the number of turns in the dialogues and evaluated the accuracy of ChatbotID within each bucket.
| Turns | DailyDialog ACC | DailyDialog F1 | PersonaChat ACC | PersonaChat F1 | MultiWOZ ACC | MultiWOZ F1 |
|---|---|---|---|---|---|---|
| 1-5 | 60.07±0.38 | 62.39±0.98 | 61.29±1.97 | 61.86±0.51 | 60.89±0.81 | 63.26±1.11 |
| 6-10 | 75.88±0.60 | 78.12±1.48 | 76.90±3.36 | 75.96±0.27 | 78.96±1.50 | 78.34±0.48 |
| 10-15 | 83.41±0.66 | 82.83±1.93 | 84.96±0.54 | 82.44±0.83 | 84.86±3.08 | 83.72±2.53 |
| 15+ | 86.82±0.33 | 85.49±1.45 | 86.95±1.24 | 83.91±1.95 | 89.17±2.75 | 84.77±0.29 |
In the initial stages of the dialogues, from 1-5 turns up to 10-15 turns, the model exhibits a dramatic and consistent improvement in both accuracy and F1-score across all three datasets. For instance, on the MultiWOZ dataset, accuracy skyrockets from 60.89% in the 1-5 turn bucket to 84.86% in the 10-15 turn bucket. In very short dialogues, there is insufficient interaction history to establish a stable pattern of influence. As turns accumulate, the cause-and-effect chain between speakers becomes more robust, allowing ChatbotID to distinguish H-H interaction and H-C interaction more reliably. Interestingly, once the dialogues exceed 15 turns, the performance gains begin to plateau, indicating a signal saturation point. The model has gathered sufficient evidence to make a high-confidence determination, and further turns primarily offer redundant confirmation rather than new critical information.
Question 5: Applicability to Multi-Party Dialogues
You have correctly identified the scope of our current work. Our present methodology and experimental design are indeed developed for two-party dialogues. While extending GCT to multi-party (N-party, N > 2) dialogues is theoretically feasible, it introduces significant complexity. For instance, the number of causal pairs to consider would increase from 2 (, ) to , and more complex interaction patterns (e.g., coalitions, mediation effects) would need to be modeled. We will explicitly state this in the **Limitations ** of the paper. We will clarify that the current version of the GCD framework is designed for two-party dialogues. Extending it to multi-party dialogues is an important and challenging direction for future research. This would require extending the GCT framework or employing more advanced methods.
We sincerely thank you for your constructive feedback. We hope that our response has addressed your concerns.
We once again express our sincere gratitude for your insightful comments, which have been pivotal in enhancing the quality of our manuscript. Guided by your feedback, we have made substantial revisions and hope we can thoroughly address all your concerns. We would be very grateful if you could re-evaluate our work in light of these significant improvements. We sincerely hope that the revised version will now meet your standards for a higher recommendation.
Dear reviewer 6f6s, Please check the authors' rebuttal, engage in a discussion if there are remaining concerns or write them a note/consider changing your review and/or ratings, if you do not have any additional concerns. Thanks, AC
This paper presents an approach, ChatbotID, for identifying human-chatbot dialogues vs. human-human dialogues. It is motivated by the observation that human-human dialogues exhibit linguistic style accommodation while human-chatbot dialogues do not. The approach involves fine-tuning an LLM using Multi-Task Learning using a combination of features that encode time-series based linguistic aspects of the dialogue and features calculated using the Granger Causality Test (GCT) that capture interaction dynamics relationships between the participants.
优缺点分析
Strengths
-
The approach proposed is novel. And although the use of features that capture linguistic style accommodation in dialogues is not new (see below), it is a really nice idea to incorporate into methods trying to detect chatbots in online dialogues. I have also not seen the GCT statistical framework used to extract time series features for dialogues. It is a really nice idea.
-
The problem addressed is also an important one.
-
The experiments are strong in that ChatbotID is compared to many alternatives using (depending on how you count) three (or one) datasets. The paper also provides a nice set of ablations with respect to the three components of the loss function.
Weaknesses
- The paper lacks references to the body of work in NLP on linguistic style accommodation and its use in various classification tasks. So the paper's first stated contribution, related work, and motivation section will need to be modified to reflect this.
From NLP:
Danescu-Niculescu-Mizil, C., Gamon, M., & Dumais, S. (2011, March). Mark my words! Linguistic style accommodation in social media. In Proceedings of the 20th international conference on World wide web (pp. 745-754).
Mao and Guy Lebanon. 2007. Isotonic conditional random fields and local sentiment flow. In Advances in Neural Information Processing Systems.
Danescu-Niculescu-Mizil, C., Lee, L., Pang, B., & Kleinberg, J. (2012, April). Echoes of power: Language effects and power differences in social interaction. In Proceedings of the 21st international conference on World Wide Web (pp. 699-708).
Tan, C., Niculae, V., Danescu-Niculescu-Mizil, C., & Lee, L. (2016, April). Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th international conference on world wide web (pp. 613-624).
Wang, Lu, and Claire Cardie. "A Piece of My Mind: A Sentiment Analysis Approach for Online Dispute Detection." In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 693-699. 2014.
From Sociolinguistics (possibly better references than the Dunbar et al. paper):
H. Giles. Communication accommodation theory. In Engaging theories in interpersonal communication: multiple perspectives. Sage Publications, 2008.
H. Giles, J. Coupland, and N. Coupland. Accommodation theory: Communication, context, and consequences. In Contexts of accommodation: developments in applied sociolinguistics. Cambridge University Press, 1991
-
If I understood the paper correctly, I think that there are problems with the evaluation ---
(a) the test data does not seem appropriate. The data that ChatbotID is being tested on is drawn from the datasets used for training. Shouldn't entirely new dialogues be used, i.e., dialogues from entirely new sources? In particular, section 5.1 states that the H-H and H-C corpora "ensure comparability in style and domain", whereas we'd like there to be a variety of styles and domains in the data in order to reflect the actual variety of dialogues that occur on-line.
(b) the methods that are compared to ChatbotID do not appear to have been given access to the training data used to train ChatbotID,
(c) The H-C corpus won’t reflect real human-LLM dialogues. It would be better to train on naturally occurring H-C dialogues. In the same vein, the approach should be tested on naturally occurring H-C dialogues. One possibility might be the H-C dialogues in the WildChat dataset. [Zhao, Wenting, et al. "WildChat: 1M ChatGPT Interaction Logs in the Wild." The Twelfth International Conference on Learning Representations. 2024]
-
Some aspects of the approach are not clear; namely, the third component of the loss function, L_G, seems to be calculated based on the entire dataset rather than for a particular dialogue. But the overall goal is to classify a single dialogue. To use this component of the loss, in practice, means that the approach is assuming that all dialogues in the dataset are from the same type of user/agent source combination — either H-H or H-C. Is this correct? And the value of L_G is exactly the same across all dialogue examples in the test set? How is L_G computed at test time, i.e., for an arbitrary dialogue that is not part of one of the training datasets? Finally, does the approach require the binary GCT significance vectors at test time (and how would they be calculated)?
问题
-
The data that ChatbotID is being tested on is drawn from the datasets used for training. Shouldn't entirely new dialogues be used, i.e., dialogues from entirely new sources?
-
Are the methods that are compared to ChatbotID been given access to the same training data used to train ChatbotID?
-
Why does the H-C corpus have to include the same human utterances as the H-H corpus? Stated differently, the method used to create the H-C corpus will result in unrealistic dialogues that will not reflect the human-chatbot dialogues that actually occur on-line.
-
Does the L_G component of the loss assume that all dialogues in the dataset are from the same type of user/agent source combination — either H-H or H-C? How is L_G computed at test time, i.e., for an arbitrary dialogue that is not part of one of the training datasets?
局限性
Yes.
最终评判理由
Given the author responses in the rebuttal and the promised changes, the paper proposes a novel approach to an important, complex and unsolved problem.
格式问题
All of the figures with results are basically unreadable.
We extend our sincere gratitude for your thorough and insightful review of our manuscript. Your feedback is invaluable and has identified several key areas for improvement and clarification. We agree with the points you have raised and will incorporate corresponding revisions in our final manuscript.
Question 1: Missing References
We thank you for reminding us of this important point. We have already added these related references in the manuscript as references.
-
Revised Motivation. We now begin the motivation by explicitly acknowledging the foundational research on linguistic accommodation and interaction dynamics in human-human (H-H) dialogues [1, 2, 3, 4, 5]. We use this established literature as a starting point to introduce our core hypothesis: human-chatbot (H-C) dialogues exhibit a fundamentally different, structurally unidirectional influence, which contrasts with the typically reciprocal dynamics studied in prior work.
-
Refined Contribution. Motivated by Communication Accommodation Theory, this study systematically reveals two principal patterns of dialogues: H-H interactions are characterized by substantial bidirectional sentiment exchange, whereas H-C interactions demonstrate a distinct asymmetric influence.
-
Expanded Related Work. We have added a new subsection, Linguistic Accommodation and Interaction Dynamics, to the Related Work chapter. In this section, we now thoroughly discuss the cited literature on accommodation theory [6, 7, 8] and its application in NLP for tasks like power detection [3] and persuasion analysis [4]. This allows us to credit prior work appropriately and more delineate how our method's use of GCT moves beyond the correlation-based metrics standard in those studies.
[1]. Danescu-Niculescu-Mizil C, Gamon M, Dumais S. Mark my words! Linguistic style accommodation in social media[C].
[2]. Mao Y, Lebanon G. Isotonic conditional random fields and local sentiment flow[J].
[3]. Danescu-Niculescu-Mizil C, Lee L, Pang B, et al. Echoes of power: Language effects and power differences in social interaction[C].
[4]. Tan C, Niculae V, Danescu-Niculescu-Mizil C, et al. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions[C].
[5]. Lu Wang and Claire Cardie. A Piece of My Mind: A Sentiment Analysis Approach for Online Dispute Detection.
[6]. Dragojevic M, Gasiorek J, Giles H. Communication accommodation theory[J].
[7]. Giles, H. (2008). Communication accommodation theory. In L. A. Baxter & D. O. Braithewaite (Eds.), Engaging theories in interpersonal communication: Multiple perspectives.
[8]. H. Giles, J. Coupland, and N. Coupland. Accommodation theory: Communication, context, and consequences. In Contexts of accommodation: developments in applied sociolinguistics.
Question 2: Concerns about the Evaluation Methodology
- Clarification of our evaluation.
For each dataset, we partition the dialogues into a training set (70%), a validation set (10%), and a held-out test set (20%). We ensure there is no overlap at the dialogue level between these splits. All performance metrics reported in our main results tables are computed exclusively on the unseen test set.
- Cross-Domain Evaluation
To test for robustness against domain shift, we performed a cross-domain evaluation. We trained ChatbotID and all baseline models exclusively on the DailyDialog dataset and evaluated their performance on the entirely unseen Taskmaster-1, PersonaChat, and MultiWOZ test sets. General-purpose LLMs (e.g. GPT-3.5-turbo, LLaMA-7B, LLaMA-7B, LLaMA-13B, Gemma, etc.) adopt a zero-shot detection approach.
| Method | Taskmaster-1 ACC | Taskmaster-1 F1 | PersonaChat ACC | PersonaChat F1 | MultiWOZ ACC | MultiWOZ F1 |
|---|---|---|---|---|---|---|
| DetcctGPT (ICML 2023) | 59.12±0.82 | 55.66±1.54 | 60.70±0.98 | 59.39±1.31 | 61.30±3.69 | 61.78±2.75 |
| COCO (EMNLP 2023) | 66.80±0.70 | 68.29±2.16 | 69.56±0.41 | 65.58±0.93 | 65.67±1.36 | 68.88±2.46 |
| LLMDet (EMNLP 2023) | 60.13±1.14 | 64.83±2.37 | 62.13±1.33 | 61.81±0.57 | 63.43±1.20 | 62.96±1.46 |
| SeqXGPT (EMNLP 2023) | 59.75±0.91 | 64.66±0.99 | 58.48±1.56 | 60.58±1.77 | 61.00±1.22 | 61.18±0.77 |
| Fast-DetectGPT (ICLR 2024) | 60.93±1.96 | 60.66±1.64 | 63.02±2.64 | 62.05±1.10 | 64.01±0.68 | 61.28±0.59 |
| T5-Sentinel (EMNLP 2024) | 69.24±1.31 | 70.07±0.16 | 74.95±0.42 | 72.41±1.45 | 74.90±0.94 | 71.67±2.57 |
| RoBERTa-MPU (ACL 2024) | 69.35±0.93 | 74.15±0.56 | 73.51±1.29 | 72.15±2.30 | 70.99±4.82 | 74.79±0.96 |
| DeTective (NeurIPS 2024) | 67.85±0.45 | 70.69±1.58 | 71.46±0.55 | 71.98±1.05 | 71.87±1.30 | 73.40±4.89 |
| OUTFOX (AAAI 2024) | 76.39±1.88 | 78.88±0.72 | 72.29±1.01 | 78.10±3.07 | 78.64±0.87 | 76.70±1.73 |
| GECScore (ACL 2025) | 67.73±1.97 | 71.12±0.38 | 73.85±1.99 | 73.98±0.22 | 71.82±3.20 | 65.30±1.20 |
| GPT-3.5-turbo (2023) | 60.72±1.73 | 59.28±0.95 | 59.38±0.14 | 61.14±1.09 | 63.07±1.20 | 60.18±1.88 |
| LLaMA-7B (2024) | 59.74±1.23 | 58.85±1.17 | 60.54±1.76 | 60.36±1.56 | 61.76±0.78 | 58.05±2.14 |
| LLaMA-13B (2024) | 60.94±3.30 | 62.81±2.27 | 63.32±0.26 | 63.53±2.85 | 65.16±0.30 | 62.35±1.13 |
| GPT-4 (2024) | 58.90±2.31 | 64.61±1.64 | 64.23±1.25 | 59.87±0.60 | 67.70±0.43 | 63.29±0.98 |
| Gemma (2025) | 62.26±0.14 | 66.68±1.01 | 66.87±1.00 | 62.98±3.70 | 67.98±0.76 | 62.12±1.51 |
| Qwen-2 (2025) | 59.39±0.71 | 61.67±2.43 | 64.81±0.05 | 61.78±3.62 | 64.98±0.90 | 62.67±1.05 |
| Deepseek-R1 (2025) | 63.29±1.99 | 64.73±1.69 | 65.22±2.23 | 66.46±1.84 | 66.27±0.12 | 64.62±1.06 |
| ChatbotID (Ours) | 80.59±1.18 | 81.96±2.30 | 83.84±1.38 | 84.72±2.19 | 82.28±0.62 | 83.40±0.45 |
The results of our cross-domain evaluation demonstrate the robustness of our approach. While all methods were trained exclusively on DailyDialog, ChatbotID maintains a high F1-score of over 83% across all datasets. This performance represents a substantial margin of 5-20% F1 points over all baseline methods.
- Zero-Shot Evaluation on WildChat dataset.
We thank the reviewer for the excellent suggestion to use the WildChat dataset. To evaluate our model's performance on naturally occurring H-C dialogues, we tested our model in a zero-shot setting on the WildChat dataset.
| Method | WildChat ACC |
|---|---|
| DetcctGPT (ICML 2023) | 58.04±2.62 |
| COCO (EMNLP 2023) | 64.76±0.37 |
| LLMDet (EMNLP 2023) | 69.67±0.90 |
| SeqXGPT (EMNLP 2023) | 60.19±1.05 |
| Fast-DetectGPT (ICLR 2024) | 62.19±1.05 |
| T5-Sentinel (EMNLP 2024) | 68.03±1.24 |
| RoBERTa-MPU (ACL 2024) | 67.68±2.97 |
| DeTective (NeurIPS 2024) | 67.85±0.45 |
| OUTFOX (AAAI 2024) | 78.44±0.60 |
| GECScore (ACL 2025) | 68.44±0.60 |
| ChatbotID (Ours) | 79.12±1.53 |
The results validate the robustness of our approach. ChatbotID achieves the highest accuracy (79.12%) among all methods, outperforming even the most competitive baselines like OUTFOX. This result demonstrates that the unidirectional influence signal captured by ChatbotID is not merely an artifact of our semi-synthetic data generation process. Instead, it is a genuine and detectable characteristic present in real-world human-LLM interactions.
- On the fairness of comparison with baseline methods.
We apologize that our experimental setup was not described with sufficient clarity. Your review is correct that all methods must be trained on the same data for a fair comparison. We confirm that this was indeed our methodology. To remove any ambiguity, we have revised the description of our baseline setup in Baselines to state explicitly:
To ensure a fair and rigorous comparison, we re-implemented all baseline models and re-trained them from scratch on the same training corpus used for ChatbotID. All reported performance metrics for these baselines are from this controlled setup, ensuring that any performance differences are attributable to the methods' underlying approaches rather than variations in training data. General-purpose LLMs (e.g., GPT-3.5-turbo, LLaMA-7B, Qwen-2, etc.) adopt a zero-shot detection approach.
Question 3: Lack of Clarity on the Loss Component
We apologize that our was not described with sufficient clarity. The following is a clear explanation of the Loss Component, which we will integrate into the revised paper:
- Granularity of Calculation: The loss is computed on a per-dialogue basis, not on the entire dataset. During training, for each dialogue in a batch, the following steps are performed:
- Its GCT feature vector is computed.
- is converted into a binary GCT significance vector .
- The model predicts a corresponding probability vector .
- The loss is calculated based on the comparison between and (e.g., using binary cross-entropy), and then averaged over all samples in the batch. Thus, the value of can differ for each batch.
-
Dataset Type: The method does not assume that all dialogues in the dataset are of the same type (i.e., all H-H or all H-C). A training batch can contain a mix of both H-H and H-C dialogues. The purpose of is to serve as an auxiliary task that forces the model to learn to predict the intrinsic interaction dynamics of any given dialogue, regardless of its final classification label.
-
Computation at Test Time: is an auxiliary loss used only during the training phase. Its purpose is to guide the model to learn more discriminative feature representations. During testing, the loss function and its corresponding prediction head are completely discarded and do not participate in any computation.
Question4: Paper Formatting Concerns
We apologize that the figures in the manuscript were unreadable. In the final version, we will regenerate all figures and charts to ensure they have a higher resolution, larger font sizes, and clearer legends for improved readability.
We sincerely thank you for your constructive feedback. We hope that our response has addressed your concerns.
We once again express our sincere gratitude for your insightful comments, which have been pivotal in enhancing the quality of our manuscript. Guided by your feedback, we have made substantial revisions and hope we can thoroughly address all your concerns. We would be very grateful if you could re-evaluate our work in light of these significant improvements. We sincerely hope that the revised version will now meet your standards for a higher recommendation.
Dear reviewer 8RTu, Please check the authors' rebuttal, engage in a discussion if there are remaining concerns or write them a note/consider changing your review and/or ratings, if you do not have any additional concerns. Thanks, AC
Dear Area Chair,
Thank you for facilitating the discussion with the reviewer. We appreciate your active management of the process.
Best regards
The rebuttal addressed my primary concerns; I have increased the overall rating accordingly.
It is gratifying to know that our rebuttal effectively addressed your primary concerns. We are deeply grateful for your positive feedback.
This paper introduces ChatbotID, a framework for distinguishing human-human and human-chatbot dialogues by combining Granger Causality Test (GCT)–based features from interactions with contextual embeddings. Experiments across multiple datasets show the effectiveness of the approach.
Summary of Strengths:
- The paper is clearly written and addresses an important problem.
- The work introduces the novel idea of applying GCT to extract statistically principled features from interactions and uses these for chatbot detection.
- The integration of these features with contextual embeddings is well-motivated and clearly presented.
- Experiments are detailed and include multiple datasets and LLMs. Ablation studies are useful.
Summary of Weaknesses:
- Test sets are from the same datasets as the training. The authors included cross-domain evaluation results during the rebuttals. It would be useful to include these in the paper.
- The work lacks appropriate references to prior work, a long list has been suggested by one of the reviewers. It would help to ensure that these are included in the paper.
- The method is computationally heavy due to repeated causal testing. Authors provided detailed inference complexity results during the rebuttals.
Given the additional results and analysis provided by the authors during the rebuttals, one of the reviewers who was engaged during the rebuttals increased their score. As also noted above, it would be good to see all of these additional pieces of information in the final paper.