PaperHub

Score: 5.5/10 | Decision: Rejected (4 reviewers)
Ratings: 3, 3, 4, 4 (min 3, max 4, std 0.5); average confidence: 3.0
Novelty: 2.5 | Quality: 2.5 | Clarity: 2.3 | Significance: 2.3
NeurIPS 2025

Towards Anomaly Detection on Text-Attributed Graphs

OpenReview | PDF
Submitted: 2025-05-10 | Updated: 2025-10-29
TL;DR

The first anomaly detection framework for text-attributed graphs.

Keywords
Graph anomaly detection, text-attributed graph, low-resource learning

Reviews and Discussion

Review (Rating: 3)

This paper proposes an anomaly detection method on graphs where each node is associated with textual information. To detect global anomalies, the proposed method uses three sources: bag-of-words representations of node texts, encoded textual features, and the graph structure, and computes an anomaly score based on reconstruction errors and contrastive losses. To detect local anomalies, it uses discrepancies between a node’s ego-graph and the text graph formed by its neighborhood. The final anomaly score is obtained as a weighted sum of the global and local scores. The effectiveness of the proposed method is demonstrated on three benchmark datasets.

Strengths and Weaknesses

Strengths

  • Anomaly detection on graphs with textual information has a wide range of practical applications.
  • The combination of LLMs and GNNs is an interesting and promising approach.
  • The proposed method outperforms various baseline methods in the experiments.

Weaknesses

  • The experimental evaluation appeared insufficient. Only three datasets are used, and it is unclear what kind of textual information is associated with each node. All anomalies are synthetically generated. Evaluation on more realistic datasets would be desirable.
  • The proposed method involves many hyperparameters (α, β, λ, etc.), but it is unclear how these are selected in a fully unsupervised (zero-shot) setting. Appendix C.3 mentions the use of early stopping, but in an unsupervised setting where no labeled data is available, what criterion was used to determine the stopping?
  • From a methodological standpoint, the proposed method seems an ad hoc combination of several existing techniques (e.g., reconstruction error, contrastive learning, etc.). The theoretical motivation behind this combination is unclear. For example, it is not well explained why both the BoW representations and embeddings from LMs are used to represent the node text.

Questions

Please see the comments in "Strengths And Weaknesses".

Limitations

Yes

Justification for Final Rating

My concern regarding the hyperparameters has been resolved by the author response, so I will raise the score for Quality.

However, I still have some reservations about the technical novelty of the proposed method. I did not find a breakthrough beyond a combination of existing techniques. Therefore, I would like to keep the overall Rating unchanged.

Formatting Concerns

No concerns.

Author Response

We appreciate your thoughtful comments and constructive feedback. We are pleased to address any questions and clarify our manuscript accordingly.

W1: The experimental evaluation appeared insufficient. Only three datasets are used, and it is unclear what kind of textual information is associated with each node. All anomalies are synthetically generated. Evaluation on more realistic datasets would be desirable.

A1: We conduct experiments using the real-world Yelp review dataset under unsupervised settings. We use the review comments as the text attribute of the nodes. The AUC results are as follows. For the baseline methods, we use the BOW feature here. We can observe that our method significantly outperforms other baselines in this real-world dataset. We are conducting further experiments using LM-based features and under few-shot conditions, and will include those results in the revised version.

| Method     | AUC           |
|------------|---------------|
| SCAN       | 0.505 ± 0.002 |
| Radar      | --            |
| ANOMALOUS  | --            |
| DOMINANT   | 0.374 ± 0.001 |
| AnomalyDAE | --            |
| GAD-NR     | --            |
| CONAD      | 0.383 ± 0.002 |
| NLGAD      | --            |
| COLA       | --            |
| TAGAD      | 0.568 ± 0.041 |

W2: The proposed method involves many hyperparameters (α, β, λ, etc.), but it is unclear how these are selected in a fully unsupervised (zero-shot) setting. Appendix C.3 mentions the use of early stopping, but in an unsupervised setting where no labeled data is available, what criterion was used to determine the stopping?

A2: For the selection of hyperparameters, we perform a grid search over predefined ranges of values and choose the combination based on the loss function. The determination of early stopping also relies on the loss function: if the loss shows no improvement for many epochs, training stops.
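For concreteness, here is a minimal sketch of what such loss-based early stopping and grid search can look like (our illustration, not the paper's code; `build_model`, `train_epoch`, the grid values, and the patience are placeholder assumptions, and a PyTorch-style model with `state_dict()` is assumed):

```python
import itertools

def train_with_early_stopping(model, train_epoch, patience=50, max_epochs=1000):
    """Stop once the unsupervised training loss has not improved for `patience`
    epochs; no labels are needed because the criterion is the loss itself."""
    best_loss, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        loss = train_epoch(model)  # one optimization pass; returns the scalar loss
        if loss < best_loss - 1e-6:
            best_loss, best_state, stale = loss, model.state_dict(), 0
        else:
            stale += 1
            if stale >= patience:
                break
    model.load_state_dict(best_state)
    return model, best_loss

def grid_search(build_model, make_train_epoch,
                grid={"alpha": [0.1, 0.5, 1.0], "beta": [0.1, 0.5, 1.0]}):
    """Pick the hyperparameter combination whose final training loss is lowest."""
    best_loss, best_hp = float("inf"), None
    for values in itertools.product(*grid.values()):
        hp = dict(zip(grid.keys(), values))
        _, loss = train_with_early_stopping(build_model(hp), make_train_epoch(hp))
        if loss < best_loss:
            best_loss, best_hp = loss, hp
    return best_hp
```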

W3: From a methodological standpoint, the proposed method seems an ad hoc combination of several existing techniques (e.g., reconstruction error, contrastive learning, etc.). The theoretical motivation behind this combination is unclear. For example, it is not well explained why both the BoW representations and embeddings from LMs are used to represent the node text.

A3: We would like to emphasize that, to the best of our knowledge, this is the first work to explore the anomaly detection problem on the text-attributed graphs. Our method is not an ad hoc combination of existing techniques, but rather a novel framework that leverages the text features in graphs.

In the global module, we explore a two-stage algorithm for anomaly detection on text-attributed graphs: first aligning the text embedding and the graph structure, and then using a reconstruction loss to detect the global anomaly nodes. In the local module, we introduce a subgraph comparison mechanism that computes local anomaly scores from the difference between the text graph and the ego graph.

Regarding the combination of BoW and LM embeddings: the two capture different aspects of text. BoW represents text as word-frequency vectors, emphasizing global contextual statistics without considering word order or semantics. In contrast, LMs capture word order and semantics but may ignore global word-distribution patterns. The two representations are therefore complementary.
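As a toy illustration of this complementarity (not the paper's pipeline; the commented-out sentence-transformers checkpoint is an arbitrary choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["graph anomaly detection on citation networks",
        "detecting anomalies in citation networks with graphs",
        "a recipe for chocolate cake"]

# BoW: word-frequency vectors -- order-free, vocabulary-level statistics.
bow = CountVectorizer().fit_transform(docs)
print(cosine_similarity(bow))  # docs 0 and 1 overlap lexically; doc 2 does not

# An LM embedding would capture word order and semantics instead, e.g.:
# from sentence_transformers import SentenceTransformer
# lm = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
# print(cosine_similarity(lm))
```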

Comment

Thank you for your response and additional experiments. Some of my concerns have been resolved, but I still have concerns about the technical novelty of this method. I would also like to know the specific loss functions used for early stopping and hyperparameter search.

Comment

Thanks for your response.

Q1: I still have concerns about the technical novelty of this method.

A4: We first clarify the technical novelty of our paper. The existing specialized GAD methods can be broadly categorized into four types.

(1) The methods addressing imbalanced label distributions, such as PC-GNN [1].

(2) The methods using the graph spectral filter to associate graph anomalies with high frequency spectral distribution, such as AMNet [2], BWGNN [3].

(3) The methods reconstructing the graph, such as DOMINANT [4], GAD-NR [5].

(4) The methods adopting the graph augmentation strategy, such as CONAD [6].

To the best of our knowledge, no prior work detects anomalous nodes by jointly leveraging both global and local perspectives. Our approach introduces two key innovations:

In the global module, we propose a first‑align‑then‑reconstruct strategy that effectively aligns the graph structure and the text representations before reconstruction. This design is significantly different from existing reconstruction‑only methods.

In the local module, we introduce a novel subgraph‑comparison mechanism to detect anomalies by contrasting the local graph. Such localized comparison has not been explored in prior GAD methods.

Q2: I would also like to know the specific loss functions used for early stopping and hyperparameter search.

A5: We provide a detailed description of the parameter selection and the early stopping method under unsupervised settings. For hyperparameter selection, we follow the existing loss function of the unsupervised methods in GAD [4, 5, 6], using the summary anomaly score in Eq.(10) as the objective to choose hyperparameters.

For the early stopping, in the global module, we employ the loss function Eq.(6) as the criterion. In the local module, we compute the anomaly score without a training phase. Therefore, early stopping is not applicable.

[1] Yang Liu, Xiang Ao, Zidi Qin, Jianfeng Chi, Jinghua Feng, Hao Yang, and Qing He. Pick and choose: a GNN-based imbalanced learning approach for fraud detection. Proceedings of the web conference, 2021, 3168–3177.

[2] Ziwei Chai, Siqi You, Yang Yang, Shiliang Pu, Jiarong Xu, Haoyang Cai, and Weihao Jiang. Can abnormality be detected by graph neural networks? IJCAI, 2022, 1945–1951.

[3] Jianheng Tang, Jiajin Li, Ziqi Gao, and Jia Li. Rethinking graph neural networks for anomaly detection. International Conference on Machine Learning, 2022, 21076–21089.

[4] Kaize Ding, Jundong Li, Rohit Bhanushali, and Huan Liu. Deep anomaly detection on attributed networks. Proceedings of the 2019 SIAM international conference on data mining, 2019, 594–602.

[5] Amit Roy, Juan Shu, Jia Li, Carl Yang, Olivier Elshocht, Jeroen Smeets, and Pan Li. GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024, 576–585.

[6] Zhiming Xu, Xiao Huang, Yue Zhao, Yushun Dong, and Jundong Li. Contrastive attributed network anomaly detection with data augmentation. Pacific-Asia conference on knowledge discovery and data mining. Springer, 2022, 444–457.

Comment

Thank you very much for the detailed explanation of the loss functions. My concern regarding the hyperparameters has been resolved, so I will raise the score for Quality. However, I still have some reservations about the technical novelty of the proposed method. Therefore, I would like to keep the overall Rating unchanged.

Comment

Thank you for your valuable feedback and for the time you’ve dedicated to multiple rounds of discussion. We truly appreciate your efforts, which have significantly contributed to improving the quality of our work.

We will update the manuscript based on your suggestions. Additionally, we would like to highlight the novelty of our work. First, our research addresses the problem of anomaly detection on text-attributed graphs, which has not been addressed in prior research. We believe this is a significant contribution. Second, we propose a dual-perspective approach that integrates both global and local modules for text-attributed graphs. This innovative approach has not been explored in previous anomaly detection methods.

We hope these clarifications further emphasize the novelty of our approach. We are grateful for your thoughtful consideration and engagement, with our highest respect.

Review (Rating: 3)

This paper proposes TAGAD to detect anomalies on text-attributed graphs. TAGAD primarily consists of a global GAD module and a local GAD module. The global GAD module uses a contrastive learning strategy to jointly train the graph-text model and an autoencoder to compute global anomaly scores, while the local GAD module computes local anomaly scores based on an ego graph and a text graph constructed for each node.

Strengths and Weaknesses

S1: Anomaly detection in text-attributed graphs represents a more application-relevant challenge.

S2: The paper attempts to fully utilize the text information to detect anomalies in the graph.

W1: More design choices should be discussed, including the way to combine the text and structure and the model used to encode the text.

W2: The intuition behind computing the global anomaly score and the local score should be conveyed more clearly.

Questions

  1. There are no significant differences between text features and other attributes once specific encoders convert them into embeddings. What is the key point of calling it a text-attributed graph?

  2. The paper claims that it is the first work on anomaly detection in text-enhanced graphs, which is somewhat overclaimed. The experimental studies are performed on the Cora, Arxiv, and Pubmed graphs, which have been widely used before.

  3. What is the rationale behind the contrastive loss on the hidden state and the output of the text encoder? The hidden state preserves both structure and content similarity, while the output of the text encoder only handles content similarity. Does this mean that your method mainly handles homophilic graphs?

  4. The two subgraphs in Lines 209-213 both represent features of the local subgraph. Please explain intuitively how the differences between the two subgraphs can signal a local anomaly.

  5. The interaction between the global and local scores is not clear. The global anomaly score is at the graph level. How does this score aid in computing node-level anomalies? I cannot find the relationship between the two parts in the main figures. There are also two algorithms in the appendix. It is not encouraged to study two subproblems in one paper.

  6. The experimental results in Table 3 show that the larger model can produce worse results in accuracy. Have you tried recent LLMs? Please discuss the results more.

Limitations

Yes.

Justification for Final Rating

The authors provide a clearer explanation, and I raise the score for Clarity.

However, the basic idea was not conveyed properly in the original paper. They repeatedly claim that this is the first anomaly detection work on text-attributed graphs. Considering all these factors, I keep the scores for the other metrics unchanged.

Formatting Concerns

The paper formatting meets the requirements.

Author Response

We thank you for taking the time to review our paper and for your valuable comments. We are happy to clarify our manuscript in response to the reviewer's questions.

W1: More design choices should be discussed, including the way to combine the text and structure, the model used to encode the text.

A1: In Section 5.3, we have already conducted experiments using alternative language models for text encoding, such as DeBERTa-large, e5-v2-base, and e5-v2-large, showing the robustness of our method across different pretrained models. We will explore other encoder-based language models in the revised version, such as ALBERT [1], LUKE [2].

For the way to combine the text and structure, we will conduct experiments using other GNN models, including GAT [3], GraphSAGE [4], and GAE [5], to assess different graph encoders.

W2 & Q4: The intuition behind the computing global anomaly score and the local score should be conveyed intuitively. Please explain intuitively that the differences between two subgraphs can serve the local anomaly.

A2: Thank you for the valuable suggestion. We will add a figure in the revised version that visually illustrates the computation process.

Intuitively, inspired by [6], the patterns of global anomaly nodes deviate significantly from the majority and cannot be accurately reconstructed. Therefore, after aligning the BOW space and the LM space, the reconstruction procedure can detect these anomaly nodes.

For a local anomaly node, its features typically differ from those of its neighbors. Therefore, in the text graph constructed by semantic similarity, it has few edges to its neighbors, leading to a notable difference between the text graph and the ego graph. Accordingly, the local anomaly score is computed by measuring the discrepancy between the text graph and the ego graph.
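For concreteness, a minimal numpy sketch of such a comparison (our simplification for illustration: the similarity threshold and the edge-overlap discrepancy are placeholders, not the paper's exact formulation):

```python
import numpy as np

def local_anomaly_score(adj, emb, v, sim_threshold=0.5):
    """Compare node v's ego graph (its structural neighbors) with its text graph
    (the neighbors whose text embeddings are semantically similar to v's)."""
    neighbors = np.flatnonzero(adj[v])                    # 1-hop ego graph
    if neighbors.size == 0:
        return 0.0
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine-normalize
    sims = z[neighbors] @ z[v]                            # similarity of v to each neighbor
    # Discrepancy: fraction of structural edges with no semantic counterpart.
    return 1.0 - (sims > sim_threshold).mean()

# Toy example: node 0 cites three papers on a different topic.
adj = np.zeros((4, 4))
adj[0, 1:] = adj[1:, 0] = 1
emb = np.zeros((4, 8))
emb[0, 0] = 1.0   # node 0's text is about topic A
emb[1:, 1] = 1.0  # its neighbors' texts are about topic B
print(local_anomaly_score(adj, emb, 0))  # 1.0 -> strong local anomaly
```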

Q1: There are no significant differences between text feature and other attributes, if specific encoders convert them into the embedding. What is the key point if you say the text-attributed graph?

A3: The key distinction is the rich information of the text feature. Text features contain various content that can capture both global context patterns and local semantic relationships. In contrast, numeric or categorical attributes are often sparse and lack the compositional structure of text. Thus, text-attributed graphs require specialized models [7] that leverage the semantic hierarchy and contextual nuance of language.

Q2: The paper claim that it is the first work on the anomaly detection in the text-enhanced graph. It is somewhat over claimed. The experimental studies are performed on graph Cora, Arxiv, Pubmed, which are widely used before.

A4: To the best of our knowledge, this work is indeed the first to focus on anomaly detection in text-attributed graphs. While there are existing models designed for graph anomaly detection (GAD), they do not specifically address text-attributed graphs, which have unique characteristics and challenges. Although the datasets are widely used in the literature, they are primarily for node classification tasks, rather than graph anomaly detection. We inject anomalies into these datasets for graph anomaly detection tasks, distinguishing our approach from previous research.

Q3: What is the rationale behind the contrastive loss on the hidden state and the result of the text encoder? The hidden state preserves the structure and the content similarity, while the output of the text encoder only handles the content similarity. Do it means that your method mainly handles the homophilic graphs.

A5: The contrastive loss aims to find the global anomaly nodes that deviate from the majority distribution. For global anomaly nodes, the features change substantially due to GNN message passing and are thus hard to align. In heterophilic graphs, where the neighbors of anomaly nodes are mainly normal nodes, the features change even more, making the anomaly nodes easier to find.

Q5: The interaction between the global and local score are not clear. The score of the global anomaly score is at the graph-level. How does this score aid to compute the node-level anomaly. I cannot find the relationship between two parts in the main figures. There are also two algorithms in the appendix. It is not encouraged to studied two subproblems in one paper.

A6: Although the global GAD module learns from the whole graph structure, the output is still node-level anomaly scores. Each node’s global anomaly score is computed by the alignment loss between its graph-based representation and its semantic feature.

The global and local modules are not independent, but rather interdependent. In the local module, the representation in the ego graph is computed by the alignment procedure in the global module. This ensures that the ego graph combines the global embedding, rather than ignoring the graph structure. In the final anomaly score, the global and local anomaly scores are combined complementarily. We can also observe from the ablation studies that removing either module will significantly reduce the performance of the model.

Q6: The experimental results in Table 3 show that the larger model can produce worse results in accuracy. Have you tried recent LLM? Please discuss the results more.

A7: We clarify that the results of large and small pretrained language models are quite similar, typically within 1% of each other in most cases. This suggests that model size does not strongly affect performance, and these small differences may be attributed to dataset-specific noise. In our work, the language model is used for encoding the text. However, most recent LLMs, such as GPT-4 and Gemini, are based on decoder-only transformer architectures, which are not directly compatible with our method.

[1] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942, 2019.

[2] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020.

[3] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.

[4] Will Hamilton, Zhitao Ying, Jure Leskovec. Inductive Representation Learning on Large Graphs. Advances in neural information processing systems, 2017, 30.

[5] Thomas N. Kipf, Max Welling. Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308, 2016.

[6] Kaize Ding, Jundong Li, Rohit Bhanushali, Huan Liu. Deep anomaly detection on attributed networks. Proceedings of the 2019 SIAM international conference on data mining, 2019, 594-602.

[7] Hao Yan, Chaozhuo Li, Ruosong Long, Chao Yan, Jianan Zhao, Wenwen Zhuang, Jun Yin, Peiyan Zhang, Weihao Han, Hao Sun, Weiwei Deng, Qi Zhang, Lichao Sun, Xing Xie, Senzhang Wang. A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking. Advances in Neural Information Processing Systems, 2023, 17238-17264.

Comment

I appreciate the authors' detailed feedback. The basic idea of the proposed method is conveyed more clearly now. I will raise the Clarity score.

I am still not convinced by the feedback to Q1 and Q2. As you said, text-attributed graphs require specialized models. However, the model itself is not your major contribution.

Comment

We thank the reviewer for the response and the improved Clarity score. We would like to emphasize that our contributions are two‑fold and that the model design is indeed a key part of our work.

(1) Due to the rich context and semantic information in the text feature, we introduce a new problem, anomaly detection on text-attributed graphs. To the best of our knowledge, this is the first work to define and address the anomaly detection problem on text-attributed graphs.

(2) We design a novel GAD model on text-attributed graphs, which integrates both global and local perspectives.

In the global module, different from existing reconstruction‑only methods, we propose a first‑align‑then‑reconstruct strategy that aligns the graph structure and the text representations before reconstruction, thereby enhancing the model's ability to jointly capture the text representations and the graph structure.

In the local module, we introduce a new subgraph‑comparison mechanism to detect anomaly nodes by contrasting the local graph. Such localized comparison has not been explored in prior GAD methods.

We also conduct an experiment on the real-world Yelp dataset, where the review comments are the text attributes of the nodes. The AUC results are as follows. We observe that our method significantly outperforms other baselines in the practical scenario.

| Method     | AUC           |
|------------|---------------|
| SCAN       | 0.505 ± 0.002 |
| Radar      | --            |
| ANOMALOUS  | --            |
| DOMINANT   | 0.374 ± 0.001 |
| AnomalyDAE | --            |
| GAD-NR     | --            |
| CONAD      | 0.383 ± 0.002 |
| NLGAD      | --            |
| COLA       | --            |
| TAGAD      | 0.568 ± 0.041 |

We will make these points more explicit in the revised version.

Comment

Thanks for your response.

The core issue is that I cannot see the impact of the text-attributed graph on anomaly detection compared to other-attributed graphs (like images), except for the text encoding model (the "model" in my last comment).

Comment

Thanks for your response.

In fact, research on non-numeric attributed graphs mainly focuses on text-attributed graphs, while relatively few studies explore other modalities, such as images. Text-attributed graphs represent an early attempt at non-numeric attributed graph learning. Therefore, in this work, we focus on text-attributed graphs as a representative case for anomaly detection.

The main difference between text-attributed graphs and other-attributed graphs is the nature of text attributes. Text inherently contains both global context information and local semantic meaning, whereas other attributes typically lack this dual characteristic. Both types of information can be disrupted by anomalies. Our method leverages this property of text by integrating both a global alignment-reconstruction module and a local subgraph-comparison module, thereby enabling effective detection of anomalies that affect global and local aspects of the text.

We will clarify this difference and add a discussion of other‑attributed graphs in the revised version.

Comment

Thanks for your response.

Let us check the previous feedback: "different from existing reconstruction-only methods, we propose a first-align-then-reconstruct strategy that aligns the graph structure and the text representations before reconstruction". Things would be similar if "we propose a first-align-then-reconstruct strategy that aligns the graph structure and the image (or other) representations before reconstruction". Thus, the text attributes do not seem closely related to the GAD model, which may weaken the first claim.

Comment

Thank you for your response to our paper.

We agree that the internal idea of our model, such as the first-align-then-reconstruct strategy, could be adapted for anomaly detection on other attributed graphs, such as image-attributed ones. However, our model is not directly applicable to other attributed graphs without modifications.

In our model, the BOW encoder extracts global context features from the entire document. These global context features are well suited to integrate with GNNs, which process the entire graph to generate graph embeddings.

In contrast, attributes like images are generally local features and lack a clear global context. Feeding these local features into the global GNNs would disrupt their representation, resulting in poor-quality embeddings from both the global and local perspectives. This would degrade the performance of both the global and local modules.

We next give a potential extension to the image attributed graphs. In this approach, we first feed each image into an encoder to obtain the local embedding of each node. Then, for each image, we randomly replace a small portion with patches from other images and encode the modified image. After repeating this replacement process multiple times, we use the average embedding of these modified images as the global embedding of the node. Finally, our method can be applied using both the local and global embeddings of each node.
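A rough sketch of this proposed extension (illustrative only: `encode` is a hypothetical image encoder, and the patch size and number of rounds are arbitrary assumptions):

```python
import numpy as np

def global_image_embedding(image, others, encode, n_rounds=8, patch=16, rng=None):
    """Perturb an image with patches from other images several times and average
    the embeddings of the perturbed copies; `encode` is a hypothetical image
    encoder mapping an HxWxC array to a vector."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    embs = []
    for _ in range(n_rounds):
        noisy = image.copy()
        donor = others[rng.integers(len(others))]  # patch source from another node
        y, x = rng.integers(h - patch + 1), rng.integers(w - patch + 1)
        noisy[y:y + patch, x:x + patch] = donor[y:y + patch, x:x + patch]
        embs.append(encode(noisy))
    return np.mean(embs, axis=0)  # "global" embedding; encode(image) is the local one
```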

We will clarify this distinction and describe the potential extensions to other attributed graphs in the revised version.

Comment

Dear Reviewer Jxjr,

Thank you again for your feedback. We hope the above clarifications and the additional experiments in the revised version sufficiently addressed your concerns. If you are satisfied, we kindly request you to consider updating the score to reflect the newly added results and discussion. We remain committed to addressing any remaining points you may have during the discussion phase.

Best, Authors

Review (Rating: 4)

The paper presents TAGAD, a framework for detecting anomalous nodes in graphs whose vertices hold raw text. TAGAD uses three parallel encoders: a bag-of-words layer for shallow context, a frozen transformer for semantic content, and a graph convolutional network (GCN) for structure. A global module aligns the semantic and structural spaces with a contrastive loss and then applies a graph auto-encoder to produce reconstruction-based anomaly scores. A local module builds, for each node, (i) an ego graph drawn from the original topology and (ii) a text-similarity graph built by thresholding cosine similarity; discrepancies between the two give local scores. The final anomaly score is a weighted sum of the global and local scores. Experiments on Cora, PubMed, and a 169k-node ArXiv citation graph show higher AUC than nine unsupervised and eight few-shot baselines. Ablation studies examine the role of alignment, reconstruction, and sub-graph comparison.

Strengths and Weaknesses

Strengths

The study addresses a practical gap: most graph anomaly detectors drop raw text or reduce it to shallow features, whereas real applications such as fraud screening need both text and network cues. The split into global and local scoring, combined with lightweight training (the transformer remains frozen), allows the method to scale to graphs with more than one hundred thousand nodes. The authors provide zero-shot and few-shot results, full hyperparameter tables, run-time data, and ablations, which support reproducibility.

Weaknesses

  • The global module feeds Bag-of-Words vectors into a Graph Convolutional Network (GCN) and forces that output to match a frozen language-model space. Because the two spaces differ in distribution and dimensionality, I am afraid that training may suffer from unstable gradients and scaling issues, thus leading to suboptimal performance.
  • When the anomaly rate is not low, subtracting the global mean embedding may dampen true signals because the mean itself drifts toward anomalous directions.
  • The tables give mean Area-under-ROC but omit variances or confidence intervals, even though the appendix states that error bars were produced. This hides run-to-run instability.

Questions

  1. Can you provide standard deviations for the reported AUC scores and add precision-recall or top-K precision so readers can judge practical screening performance?

  2. Could the mixing weight λ be learned, perhaps by optimizing a small validation set during training, to remove the need for manual tuning?

  3. How does subtracting the mean embedding affect detection when the anomaly proportion exceeds ten percent?

  4. Please show qualitative failure cases and explain which component (global or local) is responsible, so the community can understand the limits of the approach.

  5. In line 26, “lecture” instead of “literature”?

Limitations

The approach assumes that every node holds enough text to yield a reliable transformer embedding; very short texts may violate this.

Justification for Final Rating

The authors adequately addressed my concerns during the rebuttal phase.

Formatting Concerns

Table 3 spills over the margin in the provided PDF and is captioned at the bottom.

Author Response

We thank you for taking the time to review our paper and for your valuable comments. We are happy to clarify our manuscript in response to the reviewer's questions.

W1: The global module feeds Bag-of-Words vectors into a Graph Convolutional Network (GCN) and forces that output to match a frozen language-model space. Because the two spaces differ in distribution and dimensionality, I am afraid that training may suffer from unstable gradients and scaling issues, thus leading to suboptimal performance.

A1: We have taken into account that the two spaces differ in distribution. To address this, we first align the BOW embedding space and the language-model embedding space. This ensures that gradient updates are stable and not influenced by other objectives. After the two spaces are aligned, we introduce the reconstruction loss to detect anomaly nodes more effectively. We illustrate this with the following training loss values on the Cora dataset, which show smooth convergence.

| Epoch | 1    | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    |
|-------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Loss  | 6.77 | 5.120 | 4.291 | 3.665 | 3.169 | 2.747 | 2.366 | 2.046 | 1.794 | 1.540 |
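For illustration, a minimal PyTorch sketch of the first-align-then-reconstruct idea (our simplification, not the paper's implementation: plain linear layers stand in for the GCN and the LM projection, features are random, and the InfoNCE temperature and epoch split are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy setup: random stand-ins for the real inputs.
n, d_bow, d_lm, d_h = 100, 500, 768, 64
bow = torch.randn(n, d_bow)              # BoW node features (real counts in practice)
lm = torch.randn(n, d_lm)                # frozen LM embeddings of the node texts
graph_enc = torch.nn.Linear(d_bow, d_h)  # stand-in for the GCN over the graph + BoW
text_proj = torch.nn.Linear(d_lm, d_h)   # projects the LM space into the shared space
decoder = torch.nn.Linear(d_h, d_bow)    # reconstructs BoW features

params = [*graph_enc.parameters(), *text_proj.parameters(), *decoder.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for epoch in range(200):
    h_g, h_t = graph_enc(bow), text_proj(lm)
    # Alignment: each node's graph-side view should match its own text-side view
    # (InfoNCE-style contrastive loss over the node-to-node similarity matrix).
    logits = F.normalize(h_g, dim=1) @ F.normalize(h_t, dim=1).T / 0.1
    align = F.cross_entropy(logits, torch.arange(n))
    # Reconstruction: nodes that reconstruct poorly receive high global scores.
    recon = F.mse_loss(decoder(h_g), bow)
    # First align, then add reconstruction once the two spaces roughly agree.
    loss = align if epoch < 100 else align + recon
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    global_score = (decoder(graph_enc(bow)) - bow).pow(2).sum(dim=1)  # per-node score
```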

W2 & Q3: When the anomaly rate is not low, subtracting the global mean embedding may dampen true signals because the mean itself drifts toward anomalous directions. How does subtracting the mean embedding affect detection when the anomaly proportion exceeds ten percent?

A2: As established in prior work [1], [2], anomaly rates are generally low in graph anomaly detection problems. Among the ten real-world datasets in the benchmark [3], only one exhibits a relatively high anomaly rate (21.8%), while two have anomaly rates slightly above 10%. The anomaly rates of the other seven datasets are all below 10%. Therefore, in most practical scenarios, the impact of anomaly nodes on the global mean embedding is minimal.

However, in rare cases with a high anomaly ratio, the mean embedding may shift toward anomalous features, potentially degrading the model's performance. We will include a discussion of this limitation in the revised version.
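A toy numpy illustration of this effect (our own construction, not from the paper): as the anomaly rate grows, the mean drifts toward the anomalies and the post-centering score gap between anomalous and normal points shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gap_after_centering(anomaly_rate, n=10_000, d=16, shift=3.0):
    """Mean anomaly score minus mean normal score after mean-centering."""
    k = int(n * anomaly_rate)
    x = rng.normal(size=(n, d))
    x[:k] += shift                     # anomalies shifted in every dimension
    x -= x.mean(axis=0)                # the mean itself drifts toward the anomalies
    score = np.linalg.norm(x, axis=1)  # distance to the mean as a crude score
    return score[:k].mean() - score[k:].mean()

for rate in (0.05, 0.10, 0.20, 0.30):
    print(rate, round(score_gap_after_centering(rate), 2))  # gap shrinks as rate grows
```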

W3: The tables give mean Area-under-ROC but omit variances or confidence intervals, even though the appendix states that error bars were produced. This hides run-to-run instability.

A3: We have conducted experiments on the Cora dataset to report the variance of our model and the baseline models under zero-shot settings with 95% confidence intervals. The results are shown in the following table and demonstrate that our method achieves high performance with low variance. We are conducting further experiments under few-shot conditions with other datasets and will include those results in the revised version.

| Method     | BOW           | LM            |
|------------|---------------|---------------|
| SCAN       | 0.705 ± 0.098 | 0.705 ± 0.088 |
| Radar      | 0.578 ± 0.029 | 0.566 ± 0.010 |
| ANOMALOUS  | 0.550 ± 0.036 | 0.582 ± 0.124 |
| DOMINANT   | 0.775 ± 0.176 | 0.618 ± 0.047 |
| AnomalyDAE | 0.773 ± 0.012 | 0.737 ± 0.062 |
| GAD-NR     | 0.658 ± 0.132 | 0.739 ± 0.011 |
| CONAD      | 0.827 ± 0.073 | 0.583 ± 0.092 |
| NLGAD      | 0.665 ± 0.021 | 0.676 ± 0.001 |
| COLA       | 0.536 ± 0.008 | 0.633 ± 0.021 |
| TAGAD      | 0.905 ± 0.018 | 0.905 ± 0.018 |

Q1: Can you provide standard deviations for the reported AUC scores and add precision-recall or top-K precision so readers can judge practical screening performance?

A4: We conduct experiments on the Cora dataset using the BOW embedding to evaluate our model on two additional metrics, F1-macro and Rec@k. The results are as follows. We can observe that TAGAD also emerges as the winner on these metrics. We are conducting additional experiments using LM-based features and under few-shot conditions with other datasets, and will include those results in the revised version.

| Method     | F1-macro       | Rec@k         |
|------------|----------------|---------------|
| SCAN       | 0.323 ± 0.205  | 0.278 ± 0.211 |
| Radar      | 0.011 ± 0.004  | 0.139 ± 0.059 |
| ANOMALOUS  | 0.022 ± 0.001  | 0.145 ± 0.056 |
| DOMINANT   | 0.163 ± 0.210  | 0.357 ± 0.182 |
| AnomalyDAE | 0.512 ± 0.028  | 0.518 ± 0.004 |
| GAD-NR     | 0              | 0.371 ± 0.002 |
| CONAD      | 0.391 ± 0.214  | 0.443 ± 0.153 |
| NLGAD      | 0.4778 ± 0.001 | 0.133 ± 0.018 |
| COLA       | 0.4775 ± 0.001 | 0.077 ± 0.051 |
| TAGAD      | 0.683 ± 0.002  | 0.548 ± 0.005 |

Q2: Could the mixing weight λ be learned, perhaps by optimizing a small validation set during training, to remove the need for manual tuning?

A5: We treat λ ∈ [0, 1] as a tunable hyperparameter and determine its value via grid search. Specifically, we select the value of λ that minimizes the cross-entropy loss on a small validation set. This allows us to set λ automatically rather than tuning it manually. We will add the details of hyperparameter tuning in the revised version.
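A minimal sketch of such a selection procedure (our illustration; it assumes the global and local scores are normalized to [0, 1] so the cross-entropy is well defined):

```python
import numpy as np

def pick_lambda(global_score, local_score, val_idx, val_labels,
                grid=np.linspace(0.0, 1.0, 21)):
    """Grid-search the mixing weight on a small labeled validation set."""
    def bce(p, y, eps=1e-7):  # binary cross-entropy
        p = np.clip(p, eps, 1 - eps)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    losses = [bce(lam * global_score[val_idx] + (1 - lam) * local_score[val_idx],
                  val_labels) for lam in grid]
    return grid[int(np.argmin(losses))]
```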

Q4: Please show qualitative failure cases and explain which component (global or local) is responsible, so the community can understand the limits of the approach.

A6: We agree that providing failure cases is crucial. In the revised version, we will add a case study discussing specific failure scenarios on the real-world Yelp dataset, providing insights into the limitations and future work for improvement.

Q5 & Concerns: In line 26, “lecture” instead of “literature”? Table 3 spills over the margin in the provided PDF and is captioned at the bottom.

A7: Thank you for pointing out these errors. We have corrected them in the revised version.

[1] Yang Liu, Xiang Ao, Zidi Qin, Jianfeng Chi, Jinghua Feng, Hao Yang, Qing He. Pick and Choose: A GNN-based Imbalanced Learning Approach for Fraud Detection. Proceedings of the web conference, 2021, 3168-3177.

[2] Ge Zhang, Jia Wu, Jian Yang, Amin Beheshti, Shan Xue, Chuan Zhou and Quan Z. Sheng. FRAUDRE: Fraud Detection Dual-Resistant to Graph Inconsistency and Imbalance. 2021 IEEE international conference on data mining, 2021, 867-876.

[3] Jianheng Tang, Fengrui Hua, Ziqi Gao, Peilin Zhao, Jia Li. GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection. Advances in Neural Information Processing Systems, 2023, 29628-29653.

Comment

Thank you for addressing my concerns. I appreciate the positive evaluation, and I will raise the overall score.

Comment

Thank you for the positive assessment of our response and for updating the score. We greatly appreciate it.

Review (Rating: 4)

This paper introduces TAGAD, making the first attempt at anomaly detection on text-attributed graphs. TAGAD consists of two modules: a global module, which detects anomalies via autoencoder reconstruction and contrastive-learning-based alignment between graph neural networks and language models, and a local module, which detects anomalies via the difference between the ego subgraph and the text subgraph, motivated by the contextual and semantic features of text-attributed graphs. The authors demonstrate that TAGAD outperforms existing baselines and ablations on synthetic TAG datasets under both zero-shot and few-shot settings.

Strengths and Weaknesses

Strengths:

The paper is generally well organized and easy to follow.

The authors propose the first dedicated framework for anomaly detection on text-attributed graphs, addressing a novel and important problem in the field.

Experimental results on synthetic datasets demonstrate that TAGAD consistently outperforms existing baselines and ablated variants in both zero-shot and few-shot settings.

Weaknesses:

The paper presents experimental results only on synthetic datasets. However, real-world data distributions may differ significantly from those of synthetic datasets, which could impact the effectiveness and generalizability of the proposed method.

The proposed method is implemented with DeBERTa-base. In the current era of large language models, integrating more advanced or larger language models could further enhance the completeness of the experiments.

For Table 3, it appears that not only fine-tuned models but also larger language models result in worse performance. This outcome is quite counterintuitive, yet the authors offer only a brief explanation, stating that "the pretrained LM has already learned rich semantic representations." A deeper exploration and more thorough analysis are needed to provide reliable and explainable results.

Questions

See strengths and weaknesses.

Limitations

Yes.

Justification for Final Rating

To the best of my knowledge, this paper makes the first attempt in the domain of text-attributed graph anomaly detection. It proposes a unified framework that jointly learns anomalous signals in text-attributed graphs from contextual features, semantic features, and the graph structure. I believe this work has the potential to open promising research directions in this domain and provide benefits to the community. I will keep my rating of borderline accept for this paper.

Formatting Concerns

  • Line 26, I think you mean "In the GAD literature".
  • The caption for Table 3 should be above instead of below the table.
Author Response

We appreciate the reviewer's positive feedback. We also greatly appreciate your recognition and support for our work! We are happy to clarify our manuscript in response to the reviewer's questions.

W1: The paper presents experimental results only on synthetic datasets. However, real-world data distributions may differ significantly from those of synthetic datasets, which could impact the effectiveness and generalizability of the proposed method.

A1: We conduct experiments using the real-world Yelp review dataset under unsupervised settings. We use the review comments as the text attribute of the nodes. The AUC results are as follows. For the baseline methods, we use the BOW feature here. We can observe that our method significantly outperforms other baselines in this dataset. We are conducting further experiments using LM-based features and under few-shot conditions, and will include those results in the revised version.

| Method     | AUC           |
|------------|---------------|
| SCAN       | 0.505 ± 0.002 |
| Radar      | --            |
| ANOMALOUS  | --            |
| DOMINANT   | 0.374 ± 0.001 |
| AnomalyDAE | --            |
| GAD-NR     | --            |
| CONAD      | 0.383 ± 0.002 |
| NLGAD      | --            |
| COLA       | --            |
| TAGAD      | 0.568 ± 0.041 |

W2: The proposed method is implemented with DeBERTa-base. In the current era of large language models, integrating more advanced or larger language models could further enhance the completeness of the experiments.

A2: In our work, the language model is used for encoding the text. However, recent large language models are generally based on decoder-only transformers, which cannot be directly integrated into our method.

W3: For Table 3, it appears that not only fine-tuned models but also larger language models result in worse performance. This outcome is quite counterintuitive, yet the authors offer only a brief explanation, stating that "the pretrained LM has already learned rich semantic representations." A deeper exploration and more thorough analysis are needed to provide reliable and explainable results.

A3: First, we clarify that the results of large and small pretrained language models are quite similar, typically within 1% of each other in most cases. This suggests that model size does not strongly affect performance, and these small differences may be attributed to dataset-specific noise.

Regarding the drop in performance for fine-tuned models, our analysis indicates that this issue is caused by the mismatch between the pretrained LM and the randomly initialized GNN during early training. Since the LM already encodes rich semantic information, introducing noise from the under-trained GNN during joint optimization can cause the LM’s representation quality to degrade. This mismatch leads to a decrease in overall performance compared to keeping the LM frozen.

Concerns: Line 26, I think you mean "In the GAD literature". The caption for Table 3 should be above instead of below the table.

A4: Thank you for pointing out these errors. We have corrected them.

Comment

I appreciate authors' response and extra experiments on Yelp dataset. Most of my concerns have been addressed. I would like to keep my score positive.

Comment

Thank you for your valuable feedback. We greatly appreciate your review as your engagement has significantly enhanced the quality of our work.

Final Decision

(a) This paper introduces TAGAD, making the first attempt at anomaly detection on text-attributed graphs. TAGAD consists of two modules: a global module, which detects anomalies via autoencoder reconstruction and contrastive-learning-based alignment between graph neural networks and language models, and a local module, which detects anomalies via the difference between an ego subgraph and a text subgraph. Empirical work shows TAGAD outperforms existing baselines and ablations on synthetic datasets from Cora, Arxiv, and PubMed.

(b) Text-attributed graphs haven't been directly addressed before. The text encoder is frozen, so the method is efficient. Good experimentation. Good presentation. The combination of global and local perspectives.

(c) Error analysis of the results was needed, which the authors provided, and it looked OK. Real datasets should be experimented on, and the authors gave results on Yelp; more would help, though. The contribution is smaller since it is really a combination of known methods on a fairly restricted class of graphs.

(d) The authors responded well with additional experiments, but the issue of a smaller contribution remains. The paper is a borderline one, but it is good enough to publish, and I would recommend it for a "findings" section of the conference if one existed. Due to NeurIPS's competitive nature, I recommend reject.

(e) The authors responded well, addressing many of the issues, but the main novelty concern was upheld by several reviewers.