Rethinking Reconstruction-based Graph-Level Anomaly Detection: Limitations and a Simple Remedy
We investigate reconstruction-based graph neural networks for the graph-level anomaly detection task.
Abstract
Reviews and Discussion
This paper investigates the reconstruction flip phenomenon in graph autoencoders, whose understanding may be beneficial for graph-level anomaly detection. The paper then proposes a novel graph autoencoder model, MUSE, which simply represents a graph as multifaceted summaries of its reconstruction errors and achieves appreciable improvement.
Strengths
This article takes an in-depth look at the problem of reconstruction loss in graph autoencoders and presents an impressive theoretical analysis.
The logic of this paper is quite clear. The theoretical analysis is also very rigorous and inspiring. The similarity of the graph structure in the test set to the training set leads to lower reconstruction errors in the test set, which is consistent with common sense.
Weaknesses
I think the title is a bit long and carries redundant information. Please change the title if you can.
The primary patterns mentioned in line 122 currently include community structures and cycles. Could there be a more systematic definition or categorization based on graph theory? I believe this may be a future direction for the research. As it stands, the community structure can probably be summarized as the case where the degree distribution is uneven, and a cycle as the case where every node has degree 2.
Using AUROC alone as a metric may be biased. For anomaly detection tasks, there tend to be few anomaly labels, which leads to large differences between precision and recall and, in turn, to an inflated AUROC. At the very least, the reasons for using AUROC should be explained.
Questions
Please see the Weaknesses. Thanks!
Limitations
The authors mention the greater complexity of MUSE and the challenges of running it on large-scale graphs.
In addition, MUSE is limited when anomalies have error distributions similar to those of the majority of graphs. However, this limitation may rarely arise in practice, since current real-world datasets generally do not exhibit such error distributions.
Dear Reviewer H5pG,
We deeply appreciate your positive reviews of our research. Your suggestions have greatly inspired our future research direction. Below, we provide detailed responses to your questions.
We provide detailed references in the Global References section of the global rebuttal.
R4.1 - W1 [Title modification]
Thank you for your suggestion regarding our title. We are considering the following modifications:
- On the Limitations of Reconstruction-based Graph-Level Anomaly Detection: Analysis and a Simple Solution.
- Rethinking Reconstruction-based Graph-Level Anomaly Detection: Limitations and a Remedy.
If the reviewer has any additional suggestions or ideas regarding the titles, we would be grateful to hear them.
R4.2 - W2 [Future direction]
Thank you for your suggestion regarding our future direction.
[Challenge: Diversity of potential patterns] Please note that our theoretical analysis focused on specific intuitive cases, which are community structures and cycles, to help readers better understand our analysis concept. However, real-world datasets may exhibit a wider variety of patterns, including those related to node attributes.
- [Example: Homophily] For example, our new analysis of the protein dataset [12] reveals that both normal and anomalous graphs exhibit homophilic patterns (i.e., edges tend to join nodes with the same attribute), with anomalies displaying a stronger pattern than normal graphs. Further details are provided in {R2.2 of our response to Reviewer EQox}.
[Importance of categorization] We agree that categorizing patterns is crucial for systematically understanding our results and generalizing them to a broader range of potential patterns. Unfortunately, however, due to the aforementioned challenges posed by the diversity of potential patterns (including those related to node attributes), categorization is not straightforward. Despite the challenges, we are actively developing a categorization that can include a significant portion of potential patterns.
R4.3 - W3 [Other metrics]
Thank you for your suggestion regarding the diverse evaluation. As verified by our additional experiments, MUSE’s empirical superiority withstands the test of different metrics (AP score and Precision@10).
- [Additional experiments with other metrics] Compared to the two strongest baselines, GLAM [9] and OCGTL [8], MUSE outperforms them in 7/10 and 8/10 datasets in terms of AP score and Precision@10, respectively.
- AP (Average Precision) score summarizes the precision-recall curve, which considers both precision and recall.
- Precision@10 measures the number of true anomalies among the top-10 anomaly score data.
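For concreteness, both metrics can be computed from a list of anomaly scores in a few lines. The sketch below is a generic implementation (not tied to MUSE's code), where the top-k metric follows the count-based definition above; for distinct scores, the AP value matches scikit-learn's `average_precision_score`.

```python
def average_precision(scores, labels):
    """AP: mean of precision at each rank where a true anomaly (label 1)
    appears, equivalent to the area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / rank  # precision at this recall point
    return ap / sum(labels)

def anomalies_in_top_k(scores, labels, k=10):
    """Number of true anomalies among the k highest-scored graphs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(labels[i] for i in order[:k])
```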
[Tables: Results on other metrics]
| AP score | DD | Protein | NCI1 | AIDS | Reddit | IMDB | Mutag | DHFR | BZR | ER |
|---|---|---|---|---|---|---|---|---|---|---|
| GLAM | 74.7 (1.7) | 71.1 (2.0) | 73.6 (9.0) | 95.4 (2.4) | 83.2 (1.6) | 78.7 (3.0) | 75.9 (1.5) | 75.4 (3.2) | 76.9 (1.2) | 65.0 (0.6) |
| OCGTL | 84.9 (3.1) | 78.1 (1.8) | 73.6 (1.6) | 95.3 (3.0) | 88.0 (1.8) | 81.0 (2.2) | 72.7 (1.9) | 87.6 (4.1) | 88.6 (0.9) | 60.3 (1.1) |
| MUSE | 88.1 (1.3) | 86.5 (1.3) | 81.8 (1.2) | 99.7 (0.6) | 81.7 (2.0) | 81.7 (2.7) | 79.6 (1.4) | 78.9 (2.8) | 79.1 (2.1) | 70.1 (0.4) |
| Precision@10 | DD | Protein | NCI1 | AIDS | Reddit | IMDB | Mutag | DHFR | BZR | ER |
|---|---|---|---|---|---|---|---|---|---|---|
| GLAM | 5.1 (0.9) | 6.0 (0.7) | 5.1 (1.0) | 9.9 (0.4) | 8.4 (0.7) | 5.9 (0.8) | 4.9 (0.8) | 4.1 (0.6) | 5.5 (0.6) | 4.7 (0.5) |
| OCGTL | 5.2 (0.7) | 3.5 (0.7) | 4.4 (0.8) | 9.8 (0.2) | 3.5 (0.6) | 3.9 (0.6) | 5.1 (1.2) | 3.4 (0.9) | 4.0 (0.6) | 5.1 (0.4) |
| MUSE | 7.7 (1.3) | 7.9 (1.3) | 7.8 (1.3) | 10.0 (0.0) | 5.9 (1.7) | 6.6 (1.6) | 8.5 (1.1) | 5.2 (1.1) | 4.8 (1.3) | 5.3 (0.9) |
- [Intuition behind AUROC] We used AUROC for its convenience in evaluation, as it is independent of class thresholding. Due to this convenience, many anomaly detection methods across various domains (e.g., images [1, 2, 3, 4] and graphs [6, 7, 8, 9]) utilize AUROC as their main evaluation metric.
Thanks to the authors for their reply. I can understand the challenge of pattern categorization. So even though this work doesn't give a clear categorization, I still think it's an enlightening work (from my point of view). I have decided to keep my score.
Thank you for acknowledging the contribution of our research. We will incorporate the results discussed in our responses into the revised manuscript.
We deeply appreciate your invaluable feedback on our research.
Warm regards, The Authors
The authors propose a novel approach to enhance feature representation by leveraging multifaceted summaries beyond the mere average of reconstruction errors, addressing a limitation prevalent in current reconstruction-based methods. This motivation is timely and relevant, aiming to enrich the feature space for improved anomaly detection capabilities. The experimental setup is comprehensive, and the theoretical framework is solid; however, these two parts are somewhat disjointed.
Strengths
Reconstruction Flip Phenomenon: The illustration of the reconstruction flip in Figure 3 is particularly interesting, offering new insights into the behavior of reconstruction-based models under varying conditions.
Enlightening Visualization in Figure 5: This figure effectively demonstrates the inadequacy of relying solely on the average reconstruction error, highlighting cases where two distributions may have identical averages yet exhibit completely different characteristics. Such visualizations are crucial for understanding the nuances of reconstruction-based methods and their limitations.
Robust Experiments and Theoretical Foundation: The experimental setup is comprehensive, and the theoretical framework is solid.
Weaknesses
Disjointed Analysis and Methodology Presentation: There is a noticeable disconnect between the insightful analysis presented in Section 3 and the methodology described in Section 5. While the former provides a deep dive into the limitations of current approaches, the latter fails to directly reflect or build upon these findings. It is recommended that the authors integrate the analytical insights more coherently into the development of their method, ensuring a seamless flow of ideas from problem identification to solution proposal.
Characterization of Primary Patterns in Datasets: The paper utilizes ten datasets, but lacks a detailed exploration of the primary patterns that characterize these datasets. Specifically, is it possible that so-called primary patterns are indeed distinct across anomaly and normal graphs? Including an analysis of these patterns would enhance the necessity and the effectiveness of the proposed approach.
Questions
- What's the connection between the analysis and the method? Is the purpose of the analysis just to prove that the average reconstruction error is not enough to detect anomalies? Does the analysis provide some insights on how to solve this issue?
- What are the primary patterns for anomalous and normal graphs in the ten datasets?
Limitations
The authors have included the limitation and broader impact section.
Dear Reviewer EQox,
We sincerely appreciate the reviewer’s valuable time and efforts. The reviewer’s questions and suggestions have greatly enhanced the alignment between our theoretical analysis and our proposed method. Below, we provide detailed responses to the reviewer’s questions.
We provide detailed references in the G6 section of the global rebuttal.
R2.1 - W1 + Q1 [Gap between analysis and method]
We thank the reviewer for the constructive feedback. Fortunately, we believe the reviewer’s concerns largely stem from our lack of clarity. We claim that our method, MUSE, builds directly upon the analyses in Sec. 3 of our manuscript. Let us clarify.
The central motivation behind MUSE is three-fold:
- [M1] Error distribution serves as a good (i.e., distinguishable) representation for graph-level anomaly detection;
- The {claims and Theorems in Sec. 3} imply that the two classes (normal and anomaly) have different error distributions. Reconstruction flip in itself indicates that the error distributions differ for the normal and anomalies.
- {Fig. 5} further buttresses this idea.
- [M2] A naïve use of error distribution, however, can lead to ineffective graph-level anomaly detection;
- The {claims and Theorems in Sec. 3} imply that considering the graphs with higher mean reconstruction error as anomalies (like many other methods) can be ineffective.
- [M3] Using diverse aspects of error distribution can lead to effective graph-level anomaly detection.
- {Fig. 5} exemplifies a case where the shape of error distributions helps distinguish two graphs with similar mean reconstruction errors.
- [Proposed method: MUSE] Motivated by [M1,2,3], MUSE represents each graph by using the multifaceted summaries of its reconstruction errors.
- MUSE pools a graph’s reconstruction errors (M1) using multiple aggregation functions (M3) and uses a vector of these aggregated errors as the graph’s representation (M2).
[M1,2,3] directly supports the design principles of our method, MUSE. We, thus, argue that our analytic insights and method design principles are coherently and tightly connected. We will revise our manuscript to improve clarity.
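To make [M1], [M2], and [M3] concrete, the sketch below (our illustrative simplification, not the exact MUSE implementation) pools a graph's reconstruction errors with several aggregation functions. Two graphs whose error distributions share a mean but differ in shape, as in Fig. 5, receive different representations:

```python
import numpy as np

def multifaceted_summary(errors):
    """Represent a graph by multiple summary statistics of its
    reconstruction errors (an illustrative choice of aggregations)."""
    e = np.asarray(errors, dtype=float)
    return np.array([e.mean(), e.std(), e.min(), e.max(), np.median(e)])

# Same mean error (0.5), different error distributions:
flat = multifaceted_summary([0.5, 0.5, 0.5, 0.5])
spread = multifaceted_summary([0.0, 1.0, 0.0, 1.0])
# Mean alone cannot separate them, but the full summary vectors differ.
```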
R2.2 - W2 + Q2 [Real-world dataset analysis]
Thank you for your suggestion. We will include the below analysis in our revised manuscript.
- [Summary] We conducted additional analysis to address the reviewer’s concern. Cases discussed in our analysis (Sec. 3) are also observed in real-world datasets [12, 13]. Moreover, our theoretical findings are in line with our real-world experimental results.
- [Case 1] Anomalous graphs have distinct primary patterns from normal graphs.
- [Dataset analysis] The mutagenicity dataset [13] belongs to this case.
- [Pattern 1] Anomalies tend to contain NH2 and/or NO2 compounds.
- [Pattern 2] Normal graphs tend to lack such compounds.
- [Empirical results] In line with our analysis, the reconstruction flip does not occur, as shown in Table 3.
- Both the existing reconstruction-based anomaly detectors [5, 6, 7] and our method MUSE perform significantly better than the random guess. Specifically, [5, 6, 7] show 0.56, 0.59, and 0.55 test AUROC respectively, and MUSE shows 0.68 test AUROC.
- [Case 2] Anomalous graphs have similar primary patterns to normal graphs but with different strengths.
- [Dataset analysis] The protein dataset [12] belongs to this case.
- [Shared pattern] In all the proteins, the same type of amino acids (nodes) tend to be joined by edges.
- [Different strength] Such edge homophily patterns tend to be stronger in anomalies than normal graphs. Specifically, the average likelihood of edge homophily in normal graphs is 0.63, while that of anomalous graphs is 0.70.
- [Empirical results] In line with our analysis, the reconstruction flip occurs, as shown in Table 4 of our manuscript.
- Existing reconstruction-based anomaly detectors [5, 6, 7] perform worse than random guessing. Specifically, [5, 6, 7] show 0.41, 0.48, and 0.46 test AUROC, respectively.
- In contrast, our method MUSE performs significantly better than random guessing, outperforming all the baseline methods. Specifically, MUSE shows a 0.78 test AUROC.
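The edge-homophily likelihood used in this analysis can be computed as the fraction of edges joining same-attribute nodes; a minimal sketch on a hypothetical toy graph (the node attributes and edge list are illustrative, not from the protein dataset):

```python
def edge_homophily(edges, node_attr):
    """Fraction of edges whose two endpoints share the same attribute."""
    same = sum(node_attr[u] == node_attr[v] for u, v in edges)
    return same / len(edges)

# Toy 4-node graph with two attribute classes (hypothetical data):
attrs = {0: "A", 1: "A", 2: "B", 3: "B"}
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]  # two homophilic edges, two not
```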
I appreciate the efforts from the authors during the rebuttal phase. I think the method in Section 5 is largely based on Figure 5 (a good case, though not sufficient on its own), rather than on Section 3. I suppose that, given the theoretical analysis, a possible direct solution could be to explicitly define the "strength" and add constraint terms or modify equations to avoid or alleviate reconstruction flip, instead of deriving solutions from a case study. However, I acknowledge the contribution of the phenomenon discovery, hence I'd like to retain my relatively positive score.
Thank you for recognizing our efforts in the rebuttal.
During the method development stage, we explored a similar approach to what the reviewer suggested. However, we ultimately developed our proposed method, MUSE, due to the following challenges we encountered:
[Definition of strength] Developing a clear definition of pattern strength that applies to a variety of real-world datasets is challenging.
- [Real-world datasets] As we explained in {R4.2 of our response to Reviewer H5pG}, real-world datasets often display a wide range of patterns with varying strengths.
- [Challenge] Therefore, creating a single definition that can adequately capture and explain the diversity of these patterns and their strengths is exceedingly difficult.
[Regularization] Introducing an additional regularizer for the reconstruction model can negatively impact detector performance, especially when the reconstruction flip does not occur.
- [When reconstruction flip does not occur] As we discussed in {Sec. 3} and demonstrated with the mutagenicity dataset, anomalies that exhibit distinct patterns from normal data are less likely to cause a reconstruction flip.
- [Requirement] In such cases, the detector must thoroughly learn the normal data patterns to effectively identify anomalies.
- [Challenge] However, the regularizer designed for mitigating a reconstruction flip can obstruct the model's ability to learn these patterns, resulting in suboptimal detection performance.
- [Practical challenge] Furthermore, since it is impractical to predict whether a reconstruction flip will occur in real-world scenarios, selectively applying such a regularizer is challenging for practical use.
[Clarification] We would like to emphasize that our solution, MUSE, outperforms existing graph-level anomaly detectors in 8 out of 10 real-world graph datasets, without encountering the reconstruction flip issue in any of the 10 datasets utilized.
Once again, we sincerely appreciate your invaluable feedback on our work.
Warm regards,
The Authors
This paper introduces a new GLAD method. The authors identify and analyze a phenomenon termed "reconstruction flip", where Graph-AEs sometimes reconstruct unseen graphs with certain patterns more accurately than the training graphs themselves. Based on this analysis, the authors propose MUSE (Multifaceted Summarization of Reconstruction Errors), which leverages multifaceted summaries of reconstruction errors to enhance anomaly detection performance. MUSE has been shown to outperform existing methods on various datasets.
Strengths
- This paper is well-organized, and identifying the reconstruction flip phenomenon is novel and provides new insights into existing GLAD methods.
- This paper provides relevant theoretical analysis to elaborate on the reconstruction flip phenomenon.
- The authors conducted extensive experiments in comparison with several SOTA baselines and demonstrated the effectiveness of the proposed MUSE. Besides, the code is released for reproducibility.
Weaknesses
- The primary contribution of this paper is the introduction of multifaceted summaries of reconstruction errors for anomaly detection. However, the methods employed for error representation and summarization are conventional and do not demonstrate significant innovation. Furthermore, the authors utilize this feature to train a one-class classifier, which is also a traditional two-stage method. Consequently, the originality and novelty of this contribution appear rather limited.
- Another important concern of the reviewer is the theoretical analysis provided in the paper, which primarily demonstrates the existence of the reconstruction flip phenomenon. However, it does not significantly contribute to understanding or improving the anomaly detection capabilities of the proposed method. The practical relevance of the theoretical findings to the performance improvements in anomaly detection is not convincingly demonstrated.
- This paper does not evaluate MUSE on large-scale graph benchmarks to demonstrate the feasibility of large-scale applications.
- There is a lack of analysis of the computational complexity of MUSE compared to other methods. Such an analysis would help establish the practical advantages of the proposed method in real-world scenarios.
Questions
Please refer to the weaknesses.
Limitations
This paper does not have any potential negative societal impact.
Dear Reviewer SHGf,
We deeply appreciate the time the reviewer dedicated to reviewing our paper. Your questions have improved the presentation of the connection between our analysis and method. Below, we provide detailed responses to the reviewer’s questions.
We provide detailed experimental results in the PDF attached in Sec. G7 of global rebuttal, and references in Sec. G6 of global rebuttal.
R3.1 - W1 [Novelty]
We argue that our contribution is novel and valid for 3 reasons:
[Novel method] Among graph-level anomaly detectors [5, 6, 7], MUSE has novel and effective representation learning components.
- [Feature reconstruction loss] We are the first to use cosine loss to stabilize learning in graph-level anomaly detectors.
- [5] doesn’t reconstruct features, and [6, 7] use the Frobenius norm.
- [Data augmentation] We use graph augmentation to mitigate overfitting.
- [5, 6, 7] don’t augment graphs.
- [Ablation study] Performance drops in 8/10 datasets when we skip augmentation or change the cosine loss to the Frobenius norm, proving their necessity.
- [Result details] Please see Table 1 of the attached PDF.
[Novel findings] We are the first to report and analyze the reconstruction flip issue in graphs, which can hinder anomaly detection, and to propose a solution: represent a graph with multifaceted summaries of its reconstruction errors.
- [Related work] Recent anomaly works [1, 2, 3, 4] share the conventional two-stage framework. However, their key innovations, like ours, were in proposing a good representation, instead of an unconventional framework.
[Other reviewers] We kindly suggest that the reviewer consider the other reviewers’ evaluations:
- [jxne] “authors proposed MUSE, a novel method”
- [EQox] “authors propose a novel approach”
- [H5pG] “paper proposes a novel graph autoencoder model MUSE”
R3.2 - W2 [Gap between analysis and method]
We argue that our theoretical analysis (Sec. 3) well supports the detection capability of MUSE.
[Motivation] Three motivations, based on our analysis, explain MUSE’s detection capability.
- [M1] Error distribution serves as a good (i.e., distinguishable) representation for anomaly detection;
- {claims & Theorems in Sec. 3} show that normal and anomalous graphs have distinct error distributions. Reconstruction flip in itself implies that the error distributions differ for normal and anomaly.
- [M2] A naïve use of error distribution can be ineffective;
- {claims & Theorems in Sec. 3} show that simply using higher mean error as an anomaly indicator (like other methods) is ineffective.
- [M3] Using diverse aspects of error distribution enhances detection;
- {Fig. 5} shows how the shape of error distributions helps distinguish graphs with similar mean errors.
- [MUSE] It represents each graph with the multifaceted summaries of its reconstruction errors;
- It pools a graph’s reconstruction errors (M1) using multiple aggregation functions (M3) and uses a vector of these aggregated errors as the graph’s representation (M2).
[Practical relevance] Our additional data analysis supports the practical relevance of our theoretical findings.
- [Reconstruction flips in real-world data] Our new analysis verified that the reconstruction flip, the key concept of our theoretical analysis, indeed occurs in real-world data.
- [Patterns in real-world data] The patterns in the protein dataset [12] affect anomaly detection performance in line with our theoretical analysis. Details are in {R2.2 of our response to Reviewer EQox}.
- [Empirical results] Reconstruction flips occur in the protein dataset (see Table 4 of the paper). Existing reconstruction-based methods [5, 6, 7] perform worse than random guessing due to the flips, while MUSE outperforms all baselines.
R3.3 - W3 [Large graphs]
[New large graphs] We additionally used MalNetTiny [10] and OVCAR-8 [11].
- Graph statistics are in Table 2 of the attached PDF.
[Detection performance] In these new large graphs, MUSE outperforms the two strongest baselines, GLAM [9] and OCGTL [8].
- [Result details] Please see Table 3 of the attached PDF
[Complexity & runtime] Despite having higher complexity than them, MUSE shows practical runtime in large graphs, which we detail in Sec. R3.4 below.
- [Improving complexity] A new variant, MUSE-Sample, having the same complexity as the two baselines, outperforms them in large graphs. MUSE-Sample is detailed in Sec. R3.4 below.
R3.4 - W4 [Complexity]
We argue that MUSE’s greater complexity does not severely harm its practical value, evidenced by its practical runtime.
[Runtime analysis] Despite higher complexity than the two strongest baselines GLAM and OCGTL, MUSE's runtime (from encoding to anomaly score computation) is practical.
- [Result details] Please see Table 4 of the attached PDF
- [Large graphs] MUSE shows practical runtime in large graphs.
- [MalNetTiny] Handling 200 large graphs (avg. 2215 nodes each) takes 7.2 seconds in total, or 0.036 seconds per graph on average.
- [OVCAR-8] Handling 40516 graphs (avg. 26 nodes each) takes less than 3 minutes in total, or 0.004 seconds per graph on average.
- [Compared to others] MUSE is faster than OCGTL and, though slower, still comparable in speed to GLAM.
- [Why MUSE is faster than OCGTL] MUSE uses a single GNN, unlike the ensemble model OCGTL.
- [Accuracy] Recall that MUSE outperforms them in 8/10 datasets.
- [Improving complexity] We present MUSE-Sample, a variant with the same complexity as these baselines. It is more accurate than them in large graphs (see Sec. R3.3 above).
- [Detail] It samples entries from an adjacency matrix by the number of edges and reconstructs these entries.
- [Why MUSE-Sample is slower than MUSE] The sampling time exceeds the time saved in reconstruction.
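The sampling step described for MUSE-Sample can be sketched as follows (our illustrative reading of the description above, not the authors' code): draw as many adjacency-matrix entries as there are edges and evaluate the reconstruction error only on those entries, reducing the cost from O(n²) to O(m):

```python
import random

def sample_entries(n, m, seed=0):
    """Sample m distinct entries (i, j) of an n x n adjacency matrix."""
    rng = random.Random(seed)
    return [(idx // n, idx % n) for idx in rng.sample(range(n * n), m)]

def sampled_error(A, A_hat, entries):
    """Squared reconstruction error restricted to the sampled entries."""
    return sum((A[i][j] - A_hat[i][j]) ** 2 for i, j in entries)
```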
Thanks to the authors for (1) reclaiming the novelty of this work; (2) including more experiments; (3) further analyzing the complexity. I have accordingly updated my rating in response to the authors' efforts during the rebuttal.
We are glad that our responses have adequately addressed your concerns. We will ensure that the results discussed in our responses are incorporated into the revised manuscript.
We greatly appreciate your invaluable feedback on our work.
Warm regards, The Authors
Graph autoencoders aim to learn graph representations by reconstructing them, with a key application in graph-level anomaly detection (GLAD). GLAD methods identify graphs with unusual structures or node features by considering those with high reconstruction errors as anomalies. However, the paper reports counter-examples where this assumption fails, a phenomenon named reconstruction flip. The authors investigate when this assumption holds and what its limitations are. Based on these observations, the authors propose MUSE, a novel method that uses multifaceted summaries of reconstruction errors for better anomaly detection. MUSE outperforms 14 methods across 10 datasets, achieving state-of-the-art performance in GLAD.
Strengths
- This paper proposes a new method for graph anomaly detection that is motivated by empirical studies and supported by theoretical analysis.
- The paper reports an interesting reconstruction flip phenomenon that conflicts with previous assumptions on graph reconstruction in the graph anomaly detection problem.
- The proposed method is shown, through extensive experiments, to be simple and effective in detecting anomalies compared to previous methods.
Weaknesses
- The proposed method is two-stage instead of end-to-end, so a straightforward question is how to guarantee that the noise/mistakes from the first stage, i.e., representation learning, will not impact the second, classification stage. A follow-up question is: what is the impact of each stage in detecting anomalies?
- Only AUROC is used as the evaluation metric. Previous methods, e.g., [37], also used other metrics to verify anomaly detection performance, including Precision, Recall, and F1.
- Although the computational complexity has been analyzed, since the complexity is quadratic in the number of nodes, it would be helpful to understand the efficiency by comparing the running time of the proposed MUSE method.
Questions
- Question about the two-stage model architecture (see details above).
- Performance w.r.t. other evaluation metrics.
- Running time comparison.
Limitations
The authors have adequately addressed the limitations.
Dear Reviewer jxne,
We deeply appreciate the reviewer’s dedication to the review process. Your questions helped us better evaluate MUSE from various perspectives. Below, we provide detailed responses to your questions.
We provide detailed references in the G6 section of the global rebuttal.
R1.1 - W1 + Q1 [Role of each stage]
- The impact of each stage in anomaly detection:
- [Impact of the 1st stage] MUSE first obtains graph (error) representations to effectively distinguish normal graphs from anomalies. The better their representations are distinguished, the better MUSE can detect anomalies.
- [Impact of the 2nd stage] MUSE, then, detects a graph anomaly by determining if its representation deviates from the distribution of normal graph representations. The better the classifier learns the (error) distribution of normal graph representations, the better it can detect anomalies.
- The impact of noises/mistakes from the 1st stage on the 2nd stage classification:
- [Impact of noise] For most two-stage anomaly detection methods (including MUSE), noises/mistakes in the 1st stage may disrupt the classification in the 2nd stage.
- Noises/mistakes may lead the normal and anomalous (error) representations to be less distinguishable, which may hinder the one-class classifier from learning normal graph error distribution.
- While this is a general challenge in widely used two-stage anomaly detectors [1, 2, 3, 4], many of them achieve SOTA performance in many domains.
- [Robustness analysis] We empirically showed that noise in the 1st stage does not significantly harm MUSE performance. Specifically, when the training set is contaminated with anomalies that cause noise, MUSE shows the least performance drop compared to the baselines.
- As shown in Fig. 6, MUSE's performance drops by less than 3% in AUROC, even when the training set is contaminated with up to 30% anomalies.
- However, the two strongest graph-level anomaly detector baselines [8, 9], which are end-to-end methods, show up to a 10% performance drop.
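To illustrate how the two stages interact, here is a minimal stand-in for the second stage (a centroid-based one-class scorer; a hypothetical simplification, not the classifier MUSE actually uses): it fits only on normal graphs' representations and flags representations that deviate from them.

```python
import numpy as np

class CentroidDetector:
    """Toy one-class classifier: anomaly score = distance to the centroid
    of the normal representations; threshold = max training distance."""
    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.center = X.mean(axis=0)
        self.thresh = float(np.linalg.norm(X - self.center, axis=1).max())
        return self

    def score(self, x):
        return float(np.linalg.norm(np.asarray(x, dtype=float) - self.center))

    def is_anomaly(self, x):
        return self.score(x) > self.thresh
```

In this toy picture, noise in the first stage shifts the centroid and widens the threshold, which is exactly how first-stage mistakes can propagate to the second stage.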
R1.2 - W2 + Q2 [Other evaluation metrics]
MUSE’s empirical superiority withstands the test of different metrics (AP score and Precision@10).
- [Additional experiments with other metrics] Compared to the two strongest baselines, GLAM [9] and OCGTL [8], MUSE outperforms them in 7/10 and 8/10 datasets in terms of AP score and Precision@10, respectively.
[Tables: Results on other metrics]
| AP score | DD | Protein | NCI1 | AIDS | Reddit | IMDB | Mutag | DHFR | BZR | ER |
|---|---|---|---|---|---|---|---|---|---|---|
| GLAM | 74.7 (1.7) | 71.1 (2.0) | 73.6 (9.0) | 95.4 (2.4) | 83.2 (1.6) | 78.7 (3.0) | 75.9 (1.5) | 75.4 (3.2) | 76.9 (1.2) | 65.0 (0.6) |
| OCGTL | 84.9 (3.1) | 78.1 (1.8) | 73.6 (1.6) | 95.3 (3.0) | 88.0 (1.8) | 81.0 (2.2) | 72.7 (1.9) | 87.6 (4.1) | 88.6 (0.9) | 60.3 (1.1) |
| MUSE | 88.1 (1.3) | 86.5 (1.3) | 81.8 (1.2) | 99.7 (0.6) | 81.7 (2.0) | 81.7 (2.7) | 79.6 (1.4) | 78.9 (2.8) | 79.1 (2.1) | 70.1 (0.4) |
| Precision@10 | DD | Protein | NCI1 | AIDS | Reddit | IMDB | Mutag | DHFR | BZR | ER |
|---|---|---|---|---|---|---|---|---|---|---|
| GLAM | 5.1 (0.9) | 6.0 (0.7) | 5.1 (1.0) | 9.9 (0.4) | 8.4 (0.7) | 5.9 (0.8) | 4.9 (0.8) | 4.1 (0.6) | 5.5 (0.6) | 4.7 (0.5) |
| OCGTL | 5.2 (0.7) | 3.5 (0.7) | 4.4 (0.8) | 9.8 (0.2) | 3.5 (0.6) | 3.9 (0.6) | 5.1 (1.2) | 3.4 (0.9) | 4.0 (0.6) | 5.1 (0.4) |
| MUSE | 7.7 (1.3) | 7.9 (1.3) | 7.8 (1.3) | 10.0 (0.0) | 5.9 (1.7) | 6.6 (1.6) | 8.5 (1.1) | 5.2 (1.1) | 4.8 (1.3) | 5.3 (0.9) |
R1.3 - W3 + Q3 [Computational complexity]
We argue that MUSE’s greater complexity does not severely undermine its practical value, evidenced by its acceptable runtime.
- [Runtime analysis] Despite having a relatively higher theoretical complexity than the two strongest baselines (OCGTL and GLAM), MUSE's runtime is acceptable in many practical use cases. Specifically, we measured the time from graph encoding to final anomaly score computation.
- [Large-scale data] In large-scale real-world graphs, MUSE shows acceptable runtime.
- [Additional datasets] We additionally used two datasets: MalNetTiny [10] and OVCAR-8 [11]. Graph statistics are provided below.
- [Results in MalNetTiny] For 200 large-scale graphs, with an average of 2,215 nodes in each, MUSE processes them in 7.2 seconds total, or 0.036 seconds per graph on average.
- [Results in OVCAR-8] For 40,516 graphs, with an average of 26 nodes in each, MUSE processes them in less than 3 minutes total, or 0.004 seconds per graph on average.
- [Comparison with strongest competitors] MUSE was faster than OCGTL and, although slower, still comparable in speed to GLAM.
- [Why MUSE is faster than OCGTL] Unlike the ensemble model OCGTL, which uses multiple GNNs, MUSE uses a single GNN, resulting in faster runtime.
- [Anomaly detection performance] Recall that MUSE outperforms these baselines in 8/10 datasets, demonstrating the accuracy of MUSE.
[Table: Dataset statistics]
| | Avg. # of nodes | Avg. # of edges | # of graphs |
|---|---|---|---|
| Largest values among the utilized datasets | 429 | 497 | 7,697 |
| MalNetTiny | 2,215 | 4,537 | 200 |
| OVCAR-8 | 26 | 28 | 40,516 |
[Table: Runtime results (seconds)]
| | DD | Protein | NCI1 | AIDS | IMDB | Mutag | DHFR | BZR | ER | | MalNetTiny | OVCAR-8 | Forward pass complexity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLAM | 0.003 | 0.002 | 0.002 | 0.002 | 0.005 | 0.002 | 0.002 | 0.003 | 0.002 | 0.002 | 0.012 | 0.003 | |
| OCGTL | 0.005 | 0.004 | 0.004 | 0.004 | 0.007 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.015 | 0.006 | |
| Ours (MUSE) | 0.005 | 0.003 | 0.003 | 0.003 | 0.007 | 0.003 | 0.003 | 0.004 | 0.003 | 0.003 | 0.036 | 0.004 | |
Here, n and m denote the number of nodes and edges, respectively.
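The runtime measurement protocol described above (timing from graph encoding through final anomaly-score computation) can be sketched as follows. Here `toy_detector` and the lists of per-node errors standing in for graphs are illustrative placeholders, not the actual MUSE model:

```python
import time

def measure_runtime(model, graphs):
    """Wall-clock time for scoring a set of graphs, reported as total
    seconds and average seconds per graph (sketch of the protocol)."""
    start = time.perf_counter()
    scores = [model(g) for g in graphs]  # encode + score each graph
    total = time.perf_counter() - start
    return total, total / len(graphs), scores

# Toy stand-in detector: score a "graph" (a list of per-node errors)
# by its mean reconstruction error.
toy_detector = lambda g: sum(g) / len(g)
total, per_graph, scores = measure_runtime(toy_detector, [[0.1, 0.2], [0.3, 0.5]])
```

In practice, `model` would be the trained detector and `graphs` the evaluation set, with the same start/stop points as in our measurements.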
Thanks for the detailed responses, especially the new results with more evaluation metrics. My concerns have been addressed. Thus, I would like to increase my rating.
We are pleased that our responses have satisfactorily addressed your concerns. We will incorporate the results provided in our responses into the revised manuscript.
We sincerely appreciate your invaluable feedback on our work.
Warm regards, The Authors
General responses
We thank the reviewers for the invaluable time and effort they spent reviewing our paper. Your questions and comments helped us clarify and improve our work, and we will incorporate your comments in our revised manuscript.
In this general response, we provide (1) a summary of the shared concerns among reviewers and (2) our brief responses.
G1. The connection between theoretical analysis and the proposed method MUSE
- [Implications of our analysis] As discussed in Sec. 4 of the manuscript, our theoretical analysis regarding the reconstruction flip demonstrates the limitations of the existing reconstruction-based anomaly detectors [5, 6, 7].
- [Our method] This analysis coherently motivates the proposed method MUSE, which overcomes these limitations.
- [Key innovation] MUSE employs multifaceted summaries of graph reconstruction errors to address the theoretically analyzed limitations of the existing reconstruction-based anomaly detectors [5, 6, 7].
- [Experimental results] Table 4 of the manuscript shows that MUSE outperforms all baselines on the Protein dataset [12], while the existing reconstruction-based anomaly detectors [5, 6, 7] perform worse than random guessing due to the reconstruction flip.
- [Details] Their details are in {R2.1 of our response to Reviewer EQox} and {R3.2 of our response to Reviewer SHGf}.
G2. How realistic is our theoretical analysis
- [Reconstruction flips in real-world datasets] Through our additional real-world data analysis, we demonstrated that the reconstruction flip, which is the key concept of our theoretical analysis, indeed occurs in real-world datasets.
- [Patterns in real-world datasets] Moreover, the patterns in the datasets affect anomaly-detection performance in a manner consistent with our theoretical analysis.
- [Details] Analysis details are in {R2.2 of our response to Reviewer EQox}.
G3. The novelty of MUSE
- [Error representation] The key novelty of the proposed method MUSE lies in its error representation: each graph is represented by multifaceted summaries of its reconstruction errors.
- [Details on novelty] To our knowledge, we are the first to represent data with multifaceted summaries of its reconstruction errors.
- [Technical novelty in representation learning] Moreover, compared to existing reconstruction-based anomaly detection methods [5, 6, 7], we have two novel representation learning techniques (i.e., feature reconstruction loss and graph augmentation). These components are crucial for performance improvement, as demonstrated by our additional experiments.
- [Details] Their details are in {R3.1 of our response to Reviewer SHGf}.
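As a concrete illustration of the error-representation idea, the following sketch represents a graph by simple summary statistics (mean, standard deviation, min, max) of its per-node reconstruction errors and scores anomalies by distance to the average training profile. The choice of statistics and the distance-based scorer are illustrative placeholders, not the exact MUSE design:

```python
from statistics import mean, pstdev

def error_representation(node_errors):
    """Multifaceted summaries (mean, std, min, max) of a graph's
    per-node reconstruction errors (illustrative sketch)."""
    return [mean(node_errors), pstdev(node_errors),
            min(node_errors), max(node_errors)]

def anomaly_score(graph_repr, normal_repr):
    # Euclidean distance to the average representation of
    # (assumed-normal) training graphs.
    return sum((a - b) ** 2 for a, b in zip(graph_repr, normal_repr)) ** 0.5

# Average summary profile of two "normal" training graphs.
normal = [error_representation([0.5, 0.6, 0.4]),
          error_representation([0.5, 0.4, 0.6])]
mu = [mean(col) for col in zip(*normal)]

# A graph with uniformly low errors still deviates from the profile.
print(anomaly_score(error_representation([0.05, 0.04, 0.06]), mu))
```

Note that a graph with uniformly low errors (as under a reconstruction flip) still deviates from the normal summary profile, so it can be flagged even though its raw reconstruction error is small.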
G4. Evaluation with additional metrics
- [Experiments] MUSE outperforms the two strongest graph-level anomaly detector baselines (GLAM [9] and OCGTL [8]) in 7/10 and 8/10 datasets in terms of the AP and Precision@K metrics, respectively.
- [Details] Their details are in {R1.2 of our response to Reviewer jxne and R4.3 of our response to Reviewer H5pg}.
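For reference, Precision@K in this setting is the fraction of true anomalies among the K samples with the highest anomaly scores; a minimal, standard implementation (not code from the paper) is:

```python
def precision_at_k(scores, labels, k):
    """Fraction of true anomalies (label 1) among the k highest-scoring
    samples; a standard metric for imbalanced anomaly detection."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

scores = [0.9, 0.1, 0.8, 0.3, 0.7]
labels = [1,   0,   0,   0,   1  ]  # 1 = anomaly
print(precision_at_k(scores, labels, k=3))  # prints 0.6666666666666666
```

Because anomaly labels are scarce, Precision@K complements AUROC by directly measuring how many of the top-ranked detections are true anomalies.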
G5. Runtime of MUSE
- [Runtime analysis] Despite MUSE's relatively higher theoretical complexity compared to the two strongest baselines (GLAM and OCGTL), we claim that MUSE's runtime is acceptable in practical use cases.
- [Large-scale dataset 1: MalNetTiny [10]] For 200 large-scale real-world graphs with an average of 2,215 nodes in each, MUSE forward passes them in only 7.2 seconds total, or 0.036 seconds per graph on average.
- [Large-scale dataset 2: OVCAR-8 [11]] Additionally, for 40,516 graphs with an average of 26 nodes in each, MUSE forward passes them in less than 3 minutes total, or 0.004 seconds per graph on average.
- [Comparison with others] Compared to the baselines (OCGTL and GLAM), MUSE was faster than OCGTL and, although slower, still comparable in speed to GLAM.
- [Further scalability] In the rebuttal, we further introduce a variant of MUSE that has the same complexity as the baselines (OCGTL and GLAM), while achieving greater accuracy than them in large-scale graphs.
- [Anomaly detection performance] Recall that MUSE outperforms these baselines in 8/10 datasets, by 0.0631 AUROC on average, demonstrating MUSE is a more accurate detector.
- [Details] Their details are in {R1.3 of our response to Reviewer jxne} and {R3.3 and R3.4 of our response to Reviewer SHGf}.
We believe our additional experiments and clarifications have addressed many of the reviewers’ major concerns. Please let us know if you have any further questions. We deeply appreciate the reviewers’ dedication to the review process.
Sincerely,
Authors of submission 14690.
G6. Global references
[1] Sohn et al., ICLR 2021
[2] Li et al., CVPR 2021
[3] Mirzaei et al., ICLR 2022
[4] Reiss and Hoshen, AAAI 2023
[5] Kipf et al., NeurIPS 2016
[6] Luo et al., Scientific Reports 2022
[7] Niu et al., ECML-PKDD 2023
[8] Qiu et al., IJCAI 2022
[9] Zhao et al., ICDM 2022
[10] Freitas et al., NeurIPS 2021
[11] Yan et al., SIGMOD 2008
[12] Borgwardt et al., Bioinformatics 2005
[13] Kazius et al., Journal of Medicinal Chemistry 2005
G7. PDF containing experimental results for Reviewer SHGf
Dear Reviewers,
Thank you for your thoughtful and insightful feedback. We have carefully addressed your concerns and provided detailed responses. If you have any further questions, we are happy to discuss them.
Warm regards,
The Authors
We sincerely appreciate the invaluable feedback provided by all the reviewers. Your comments have been instrumental in improving our manuscript, and we will incorporate all the results discussed in our rebuttal into the revised version.
Below, we present a summary of the rebuttal.
[Overview] We are pleased that all the reviewers expressed satisfaction with our rebuttals and acknowledged our efforts.
- [Reviewer jxne] “My concerns have been addressed”.
- [Reviewer EQox] “appreciate the efforts from authors during the rebuttal phase”.
- [Reviewer SHGf] “updated my rating in response to the authors' efforts during the rebuttal”.
- [Reviewer H5pg] “think it's an enlightening work”.
[Details] In the rebuttal, we primarily addressed the following points:
- The connection between our theoretical analysis and the proposed method MUSE
- Our theoretical analysis, particularly regarding the reconstruction flip, directly motivates the proposed method MUSE.
- For details, please refer to {R2.1 of our response to Reviewer EQox} and {R3.2 of our response to Reviewer SHGf}.
- Real-world dataset analysis regarding the reconstruction flip
- The reconstruction flip is observed in real-world datasets, and our real-world analysis of it aligns with our theoretical findings.
- For details, please refer to {R2.2 of our response to Reviewer EQox}.
- The novelty of the proposed method MUSE
- While MUSE’s main novelty lies in its error representation, its approach to representation learning also contains novel components.
- For details, please refer to {R3.1 of our response to Reviewer SHGf}.
- Additional evaluation metrics
- MUSE’s empirical superiority holds across different evaluation metrics.
- For details, please refer to {R1.2 of our response to Reviewer jxne} and {R4.3 of our response to Reviewer H5pg}.
- Runtime of the proposed method MUSE
- MUSE processes 200 large-scale graphs in just a few seconds, implying that its speed can be acceptable in practical use cases.
- For details, please refer to {R1.3 of our response to Reviewer jxne} and {R3.3 and R3.4 of our response to Reviewer SHGf}.
Once again, we deeply appreciate the reviewers for their constructive feedback on our research.
Warm regards,
The Authors
This paper investigates the reconstruction flip in graph autoencoders and its implications for graph-level anomaly detection. Based on this analysis, the paper proposes MUSE, which takes multifaceted summaries of reconstruction errors as graph features for GLAD. While reviewers generally appreciated the paper's theoretical analysis and experimental validation, concerns were raised about the method's novelty and the disjointed presentation of the analysis and methodology, which the authors addressed in their rebuttals. In addition, the authors should also pay attention to the following two issues.
- The title of the paper is quite long and hard to parse; a shorter and more precise title would be helpful to future readers.
- The main paper has no discussion of the limitations of the work. The authors need to add a discussion of limitations in the final version, taking into account the potential limitations raised by reviewers.
I highly encourage the authors to pay attention to the comments in the final revision.