PaperHub

Rating: 5.5/10 (Poster; 4 reviewers; individual scores 5, 7, 5, 5; min 5, max 7, std 0.9)
Confidence: 4.0 | Correctness: 2.8 | Contribution: 2.8 | Presentation: 2.8

NeurIPS 2024

Revisiting Score Propagation in Graph Out-of-Distribution Detection

OpenReview | PDF
Submitted: 2024-05-15 | Updated: 2025-01-05

Abstract

Keywords
Graph Neural Network, Out-of-distribution Detection

Reviews and Discussion

Review (Rating: 5)

In this paper, the authors attempt to tackle the task of detecting Out-of-Distribution (OOD) nodes in graph data. To this end, the authors propose an augmentation method, namely Graph-Augmented Score Propagation (GRASP). GRASP performs the task of OOD detection in graphs by increasing the ratio of intra-edges (i.e., edges only connecting in-distribution nodes or OOD nodes). Theoretical analysis is also provided to support the effectiveness of the proposed approach. Experimental results show that GRASP can outperform existing OOD detection methods in various datasets.

Strengths

  1. The proposed method is introduced in an informative manner.
  2. The paper is easy to follow.
  3. The comparative experiments are comprehensive.
  4. The hyperparameter sensitivity analysis is thorough, providing readers with valuable insights.

Weaknesses

  1. At the beginning of the article, the authors may overemphasize the contribution of this work, to some extent giving readers the impression that it is the first study on graph OOD detection. However, there have already been several papers in this area. For instance: 1) "Learning on Graphs with Out-of-Distribution Nodes," KDD 2022 (cited); 2) "Generalizing Graph Neural Networks on Out-of-Distribution Graphs," TPAMI 2023 (not cited); 3) "GOOD-D: On Unsupervised Graph Out-of-Distribution Detection," WSDM 2023 (cited). While their core ideas differ from this paper's, their contributions should not be overlooked. Given these observations, it is highly suggested that the authors conduct a broader investigation of existing works, discuss the differences between the proposed method and other related methods, and adjust the stated contributions of the paper.
  2. The baselines selected in the experiments may not be sufficiently novel, which makes the experimental results less convincing.
  3. The proposed method needs more motivation from both theoretical and practical aspects. For example, are there any observations or previous studies supporting the assumption that edges follow a Bernoulli distribution? Are there concrete (or real-world) examples that can demonstrate the importance of the number of intra-edges and inter-edges in OOD detection?

Questions

  1. The authors emphasize multiple times the importance of the number of intra-edges and inter-edges in OOD detection, providing theoretical proof. However, could a concrete example be provided to illustrate this concept more intuitively?
  2. Does the assumption that edges follow a Bernoulli distribution receive support in most real-world datasets, practical applications, or previous studies?
  3. In Table 3, the FPR on the dataset reddit2 is significantly lower than that of its comparison models. Could the author provide further explanation for this remarkable improvement?
  4. Is there any underlying pattern in the ratio of intra-edges to inter-edges when achieving the best OOD detection performance across different datasets?
  5. Can the authors conduct broader investigations of recent related works, explicitly discuss the differences between the proposed method and other related methods, and adjust the contributions of the paper (if possible)?

Limitations

The authors have discussed the limitations of the proposed method in the Appendix.

Comment

W3&Q2. Justification regarding using Bernoulli distribution to model edge weights.

Fair concern! As we are dealing with discrete graphs, where edges either exist or do not, the adjacency-matrix entries are binary, taking values 0 or 1. This naturally aligns with the Bernoulli distribution, which is frequently employed in graph structure learning, as exemplified by several pertinent references [A, B, C].

[A] Learning Discrete Structures for Graph Neural Networks. ICML'19.

[B] Variational Inference for Graph Convolutional Networks in the Absence of Graph Data and Adversarial Settings. NeurIPS'20.

[C] Data Augmentation for Graph Neural Networks. AAAI'21.
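To make this modeling choice concrete, here is a minimal sketch (ours, not from the paper or the cited works; the probability matrix `P` is an assumed example) of how a 0/1 adjacency matrix arises from independent Bernoulli draws:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.1, 0.1, 0.0]])              # assumed edge probabilities p_ij
A = (rng.random(P.shape) < P).astype(int)    # each entry is a Bernoulli(p_ij) draw
A = np.triu(A, 1)                            # keep upper triangle
A = A + A.T                                  # symmetrize for an undirected graph
print(A)                                     # binary, symmetric, zero diagonal
```

The observed 0/1 entries of a discrete graph are exactly the support of a Bernoulli variable, which is why this distribution is the standard choice in the graph structure learning literature cited above.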

W3-2. Examples that demonstrate the importance of the number of intra-edges and inter-edges.

We present detailed evidence in Table 14, which outlines the relationship between the ratio of intra-edges and OOD detection performance across 10 real datasets. The data illustrate that a higher number of intra-edges is crucial for effective OOD detection.

Q1. Request for an intuitive example showing the importance of intra-edges and inter-edges in OOD detection.

Figure 2 provides a clear, intuitive example demonstrating how variations in the number of intra-edges versus inter-edges impact OOD detection. The two graphs in Figure 2 have the same nodes, but due to the differing numbers of intra-edges and inter-edges, they yield completely opposite results after score propagation. Specifically, score propagation enhances OOD detection performance only when intra-edges dominate; otherwise, it may in turn hurt the performance. This conclusion is supported by the empirical results on the real datasets we used, as shown in Table 2, Table 3, and Table 14.
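The same effect can be reproduced numerically. The toy sketch below is our own illustration (not the paper's code), using a simplified row-normalized propagation S' = D^{-1}(A + I)S as a stand-in for the propagation operator: on identical nodes, intra-edge-only propagation preserves the ID/OOD score gap, while inter-edge-only propagation flips it.

```python
import numpy as np

def propagate(A, s):
    A_hat = A + np.eye(len(s))            # add self-loops
    return (A_hat @ s) / A_hat.sum(1)     # row-normalized score averaging

s = np.array([1.0, 1.0, 0.0, 0.0])        # nodes 0,1 are ID (high score), 2,3 are OOD

# Graph with only intra-edges: ID-ID and OOD-OOD connections.
A_intra = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], float)
# Same nodes, only inter-edges: every edge crosses the ID/OOD boundary.
A_inter = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [1, 1, 0, 0],
                    [1, 1, 0, 0]], float)

gap = lambda v: v[:2].min() - v[2:].max() # ID-vs-OOD score separation
print(gap(propagate(A_intra, s)))         # positive: propagation keeps the gap
print(gap(propagate(A_inter, s)))         # negative: propagation flips the ranking
```

With intra-edges, high scores circulate among ID nodes and low scores among OOD nodes; with inter-edges, each propagation step mixes the two populations and destroys the separation.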

Q3. Concern about lower FPR on the dataset Reddit2 compared to other models.

The notable decrease in False Positive Rate (FPR) observed with GRASP on the Reddit2 dataset can be attributed to an increased proportion of intra-edges, as detailed in Table 14. After propagating scores along $A_+$, the high scores of ID nodes are transferred more to $\mathcal{V}_{uid}$ than to $\mathcal{V}_{uood}$, resulting in higher scores for $\mathcal{V}_{uid}$ than for $\mathcal{V}_{uood}$, thus widening the gap between the two sets and reducing the misidentification of OOD nodes. Figure 1 in the attached PDF (please refer to the Common Responses section titled "Author Rebuttal by Authors": https://openreview.net/forum?id=jb5qN3212b&noteId=HVpGXjNQ0B) shows the score distributions of $\mathcal{V}_{uid}$ and $\mathcal{V}_{uood}$ on Reddit2 before and after using GRASP, respectively. The shaded area with diagonal lines represents the FPR, visually illustrating the aforementioned reasons.

Q4. Patterns in the ratio of intra-edges to inter-edges for better OOD detection performance.

The following table systematically presents the impact of the intra- to inter-edge ratio ($\eta_{intra}/\eta_{inter}$) on OOD detection performance across various datasets. The results indicate that the number of intra-edges plays a crucial role: when intra-edges are increased to the point of dominating, score propagation achieves excellent OOD detection performance across different datasets.

| Datasets | $\eta_{intra}/\eta_{inter}$ (before GRASP) | AUROC (before) | $\eta_{intra}/\eta_{inter}$ (after GRASP) | AUROC (after) |
|---|---|---|---|---|
| cora | 4.65 | 87.52 | 11.87 | 93.50 |
| amazon | 14.24 | 96.27 | 28.07 | 96.68 |
| coauthor | 12.66 | 95.82 | 34.46 | 97.75 |
| chameleon | 1.03 | 50.42 | 3.72 | 76.93 |
| squirrel | 0.65 | 35.88 | 1.67 | 61.09 |
| reddit2 | 0.68 | 31.99 | 46.62 | 98.50 |
| ogbn-products | 4.55 | 85.66 | 14.53 | 93.79 |
| arxiv-year | 0.72 | 35.30 | 4.60 | 81.24 |
| snap-patents | 0.53 | 27.35 | 2.56 | 72.13 |
| wiki | 1.52 | 60.32 | 3.91 | 77.97 |
Comment

Dear Authors,

    Thanks very much for your detailed responses. I will keep my positive score on this paper.
Comment

Dear Reviewer hXZy,

We are delighted that our responses have effectively addressed your concerns. Your recognition of our efforts in the paper is greatly appreciated!

Warm regards,

The Authors of Submission 12283.

Author Response

We appreciate the reviewer's constructive feedback and insightful questions. Below, we address each point in detail:

W1&Q5. Suggestion on providing more discussion w.r.t. the existing graph OOD works, and comments on the positioning of the paper's contributions.

We appreciate your comments and the opportunity to clarify and further discuss the positioning!

Firstly, we acknowledge the contributions of previous works in this field and have made it clear from the introduction (line 29) that our approach is inspired by prior research [72]. If any part of the text inadvertently suggested that we were the first to address graph OOD detection, we welcome advice on how to eliminate possible confusion and better position our contributions.

Regarding the discussion with references and existing works on graph OOD:

  1. OOD Generalization vs. OOD Detection. The uncited paper [A] suggested by the reviewer addresses a different problem -- OOD generalization rather than OOD detection. This distinction is crucial as OOD generalization focuses on correctly classifying domain-shifted data, which involves a different kind of "OOD" than the semantic shift concerns of OOD detection. Our work specifically targets scenarios where data does not conform to any in-domain class.
  2. Graph-level vs. node-level. We also note a significant body of research [B-G] that focuses on graph-level OOD detection. Our work, however, contributes to the less-explored area of node-level OOD detection, where current literature remains sparse [H-L]. Our findings help fill this gap by providing new insights and methodologies.
  3. Training-required vs. post hoc method. Methodologically, our approach differs significantly from those in existing literature, which often require extensive re-training [I-L]. Our post hoc solution (demonstrated in Tables 2, 3, 15, and 16) enables effective and efficient OOD detection, facilitating easier deployment and practical application in real-world scenarios.
  4. New theoretical contribution. Our paper introduces fundamental theoretical analysis, a first in the realm of node-level graph OOD detection. We elucidate key factors such as the importance of intra-edge dominance, providing a new theoretical framework that aids in understanding and addressing the unique challenges of OOD detection on graphs.

We hope these clarifications and expansions will address the concerns raised and better articulate the value and positioning of our work in the literature.

[A] Generalizing graph neural networks on out-of-distribution graphs. TPAMI'23

[B] GraphDE: A Generative Framework for Debiased Learning and Out-of-Distribution Detection on Graphs. NeurIPS'22

[C] Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective. ICML'22

[D] On Estimating the Epistemic Uncertainty of Graph Neural Networks using Stochastic Centering. ICML'23

[E] GOOD-D: On Unsupervised Graph Out-Of-Distribution Detection. WSDM'23

[F] A Data-centric Framework to Endow Graph Neural Networks with Out-Of-Distribution Detection Ability. KDD'23

[G] HGOE: Hybrid External and Internal Graph Outlier Exposure for Graph Out-of-Distribution Detection. MM'23

[H] Energy-based out-of-distribution detection for graph neural networks. ICLR'23

[I] Uncertainty aware semi-supervised learning on graph data. NeurIPS'20

[J] Graph posterior network: Bayesian predictive uncertainty for node classification. NeurIPS'21

[K] Learning on graphs with out-of-distribution nodes. KDD'22

[L] Bounded and Uniform Energy-based Out-of-distribution Detection for Graphs. ICML'24

W2. The baselines selected in the experiments may not be sufficiently novel, which makes the experimental results less convincing.

To address this concern, we have incorporated two recent baselines, NODESafe [A] and fDBD [B], into our experiments on both common datasets (5 datasets) and large-scale datasets (3 datasets). NODESafe aims to reduce the generation of extreme scores when training ID models on graphs, and fDBD detects OOD samples based on their feature distances to decision boundaries. Our results reveal that the performance of these methods lags significantly behind ours due to their inability to increase the proportion of intra-edges.

Each cell shows AUROC / FPR95.

| Method | cora | amazon | coauthor | chameleon | squirrel |
|---|---|---|---|---|---|
| NODESafe | 84.09 / 68.73 | 68.65 / 91.51 | 80.26 / 71.74 | 50.72 / 92.29 | 49.26 / 94.60 |
| fDBD | 56.77 / 81.56 | 73.31 / 51.87 | 63.10 / 59.68 | 50.85 / 89.85 | 53.17 / 94.78 |
| GRASP | 93.50 / 29.70 | 96.68 / 14.38 | 97.75 / 7.84 | 76.93 / 66.88 | 61.09 / 85.59 |

| Method | arxiv-year | snap-patents | reddit2 |
|---|---|---|---|
| NODESafe | 47.41 / 94.92 | 51.51 / 93.02 | 43.60 / 97.11 |
| fDBD | 48.87 / 95.82 | 43.27 / 94.99 | 55.72 / 89.78 |
| GRASP | 81.24 / 73.93 | 72.13 / 75.22 | 98.50 / 2.41 |

[A] Bounded and Uniform Energy-based Out-of-distribution Detection for Graphs. ICML'24

[B] Fast Decision Boundary based Out-of-Distribution Detector. ICML'24

Review (Rating: 7)

This study investigates the effectiveness of score propagation in graph Out-of-Distribution (OOD) detection. It explores the conditions under which score propagation can be beneficial and proposes an edge augmentation strategy (GRASP) to improve its performance. The authors provide theoretical guarantees for GRASP and demonstrate its performance compared to existing OOD detection methods on several graph datasets.

Strengths

  1. The study delves into the mechanisms of score propagation and derives conditions for its effectiveness. This theoretical foundation is solid and extends the understanding of graph OOD detection.

  2. The proposed GRASP method is a practical and efficient solution for improving OOD detection performance. The results show that GRASP achieves SOTA performance.

  3. The paper is generally well-written and easy to follow.

Weaknesses

  1. It is strongly suggested to release the source code of the experiments.

  2. Considering the improvement is marginal in Tables 2 and 3, it is also suggested that the authors provide standard errors with a statistical-significance analysis of the results.

Questions

  1. How does the performance of GRASP vary with different OOD distributions (e.g., uniform, normal, outliers)?

  2. Can the authors provide further insights into how GRASP works and how it interprets the graph structure?

  3. How does the performance of GRASP scale with the size of the graph and the number of nodes? Is there any systematic analysis?

Limitations

I did not find significant limitations in this study. One point that concerns me is that the effectiveness of GRASP relies on the connectivity and structure of the graph; in scenarios with random connectivity patterns, the method may not be as effective.

Comment

Limitation. GRASP may not be as effective in scenarios with random connectivity patterns

We agree with the reviewer's point that when a graph has random connectivity, all methods will perform poorly, because such a graph lacks any meaningful edge information, making it indeed impossible for them to work effectively. To validate this, we conduct a simple experiment on Cora. Specifically, we randomly shuffle the original edges of the graph while retaining the original features and labels for training the ID model. We then evaluate the performance of each baseline on the trained model. The results, presented in the table below, confirm that all methods achieve AUROC scores around 50%, comparable to random guessing. However, random connectivity is unlikely in real-world graphs such as social networks: connected ID nodes naturally tend to be linked together. Therefore, networks with random connectivity are of limited practical concern.

| Method | AUROC | FPR95 |
|---|---|---|
| MSP | 53.75 | 94.00 |
| Energy | 53.91 | 93.91 |
| KNN | 50.29 | 94.56 |
| ODIN | 54.19 | 93.73 |
| Mahalanobis | 54.27 | 93.86 |
| GNNSafe | 51.97 | 93.27 |
| GRASP | 51.37 | 92.73 |
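The shuffling protocol described above can be sketched as follows. This is our own minimal sketch (the paper's exact rewiring procedure may differ): the edge list is assumed to be a 2 x E array, and each endpoint column is permuted independently, which keeps the edge count but destroys any meaningful connectivity.

```python
import numpy as np

def shuffle_edges(edge_index, rng):
    """Independently permute source and target endpoints, keeping the
    edge count intact while destroying the graph's structure."""
    src, dst = edge_index
    return np.stack([rng.permutation(src), rng.permutation(dst)])

rng = np.random.default_rng(0)
edges = np.array([[0, 1, 2, 3],
                  [1, 2, 3, 0]])           # toy 2 x E edge list
shuffled = shuffle_edges(edges, rng)
print(shuffled.shape)                      # still 2 x E
```

Features and labels are left untouched, so the ID model trains on the same node information but with randomized connectivity.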
Comment

Q1. How does the performance of GRASP vary with different OOD distributions (e.g., uniform, normal, outliers)?

We thank the reviewer for the suggestion! We manually generate 2D sample points as graph nodes, with ID samples drawn from a standard Gaussian distribution and OOD samples drawn from three different distributions: a 2D uniform distribution, a normal distribution different from the ID one, and an outlier distribution. The outlier samples are randomly selected points outside the region of the ID sample points. We calculate the RBF-kernel similarity between pairs of sample points and construct edges between nodes with high similarity to simulate the similarity between node embeddings after training a GNN.

The experimental results are shown in the table below. The results indicate that GRASP performs best when the OOD distribution is normal, worst when it is uniform, and intermediate when it is the outlier distribution.

| OOD distribution | AUROC | FPR |
|---|---|---|
| Before GRASP | 56.20 | 92 |
| uniform + GRASP | 74.08 | 62 |
| normal + GRASP | 97.98 | 0 |
| outlier + GRASP | 85.81 | 47 |

We are happy to provide more experimental results if the reviewer is interested!
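The graph construction described above can be sketched as follows. This is our own sketch, not the authors' code: the Gaussian shift, the RBF bandwidth `gamma`, and the similarity threshold are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_id = rng.standard_normal((100, 2))        # ID nodes: standard 2D Gaussian
X_ood = rng.normal(loc=4.0, size=(100, 2))  # OOD nodes: shifted Gaussian (assumed shift)
X = np.vstack([X_id, X_ood])

gamma = 0.5                                 # assumed RBF bandwidth
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)               # RBF-kernel similarity matrix

A = (K > 0.5).astype(float)                 # connect high-similarity pairs
np.fill_diagonal(A, 0.0)                    # no self-edges

labels = np.array([0] * 100 + [1] * 100)    # 0 = ID, 1 = OOD
intra = A[labels[:, None] == labels[None, :]].sum()
print("intra-edge ratio:", intra / A.sum())
```

Because the two Gaussians are well separated, nearly all high-similarity pairs fall within the same population, so the resulting graph is strongly intra-edge dominated, which is the regime where score propagation helps.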

Q2. Can the authors provide further insights into how GRASP works and how it interprets the graph structure?

Thank you for your inquiry about the rationale behind GRASP! Here's a succinct overview to aid your understanding:

  • Core Principle: The effectiveness of GRASP in enhancing graph OOD detection hinges on increasing the proportion of intra-edges before performing score propagation.
  • Implementation: Based on this principle, GRASP identifies a subset $G$ within the graph, characterized by having more connections to ID data than to OOD data. Additional edges within $G$ are then added to enhance score propagation for OOD detection.

Regarding your question on how GRASP interprets the graph structure:

  • GRASP is designed to amplify the score difference between $\mathcal{V}_{uid}$ (nodes to be identified as ID) and $\mathcal{V}_{uood}$ (nodes to be identified as OOD). By reinforcing connections within the subset $G$ that has stronger ties to ID data, and then conducting score propagation, it exerts a greater influence on $\mathcal{V}_{uid}$ than on $\mathcal{V}_{uood}$. This structured modification and targeted propagation are pivotal for enhancing the discriminative capability of the network against OOD nodes.

If there is any aspect of the question that we have misunderstood, or if further clarification is required, please do not hesitate to let us know!
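The augment-then-propagate procedure can be condensed into a toy sketch. This is our own highly simplified reading of the description above, not the authors' implementation: `k`, `alpha`, `steps`, and the residual-style propagation form are all assumed for illustration.

```python
import numpy as np

def grasp_like(A, scores, k=2, alpha=0.5, steps=2):
    """Densify edges among the top-k scoring nodes (a stand-in for the
    subset G), then run a simple iterative score propagation."""
    n = len(scores)
    top = np.argsort(scores)[-k:]           # nodes most confidently ID
    A = A.copy()
    for i in top:                           # add intra-edges within the subset
        for j in top:
            if i != j:
                A[i, j] = 1.0
    A_hat = A + np.eye(n)                   # self-loops
    P = A_hat / A_hat.sum(1, keepdims=True) # row-normalized propagation matrix
    for _ in range(steps):                  # residual-style propagation (assumed form)
        scores = alpha * scores + (1 - alpha) * P @ scores
    return scores

A = np.zeros((4, 4))                        # toy graph with no original edges
s = np.array([0.9, 0.8, 0.2, 0.1])          # MSP-like scores: nodes 0, 1 look ID
print(grasp_like(A, s))                     # high scores cohere among likely-ID nodes
```

The added edges only connect nodes that already look ID, so propagation redistributes high scores within that subset while leaving low-scoring nodes largely untouched, widening the ID/OOD gap.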

Q3. How does the performance of GRASP scale with the size of the graph and the number of nodes? Is there any systemic analysis?

The 10 datasets used in our experiments cover various scales, from small to large. As shown in Tables 1-3, GRASP's performance is not significantly affected by the scale or size of the datasets; instead, it is more influenced by the proportion of inter-edges in the original datasets.

| Datasets | #Nodes | #Edges | Intra-edge ratio (%) | Inter-edge ratio (%) | AUROC |
|---|---|---|---|---|---|
| cora | 2,708 | 5,429 | 92.23 | 7.77 | 93.50 |
| amazon | 7,650 | 238,162 | 96.56 | 3.44 | 96.68 |
| coauthor | 18,333 | 163,788 | 97.18 | 2.82 | 97.75 |
| chameleon | 2,277 | 31,421 | 78.81 | 21.19 | 76.93 |
| squirrel | 5,201 | 198,493 | 62.54 | 37.46 | 61.09 |
| reddit2 | 232,965 | 23,213,838 | 97.90 | 2.10 | 98.50 |
| ogbn-products | 2,449,029 | 61,859,140 | 93.56 | 6.44 | 93.79 |
| arxiv-year | 169,343 | 1,166,243 | 82.15 | 17.85 | 81.24 |
| snap-patents | 2,923,922 | 13,975,788 | 71.94 | 28.06 | 72.13 |
| wiki | 1,925,342 | 303,434,860 | 79.64 | 20.36 | 77.97 |
Author Response

We appreciate the reviewer's constructive feedback and insightful questions. Below, we address each point in detail:

W1. It is strongly suggested to release the source code of the experiments.

Sure! We have made the source code available at the following link: https://anonymous.4open.science/r/GRASP-EEA3/README.md.

W2. Considering the improvement is marginal in Tables 2 and 3, it is also suggested that the authors provide standard errors with a statistical-significance analysis of the results.

We appreciate your observations regarding the improvements in Tables 2 and 3! We'd like to emphasize that the overall improvement of our method is substantial. Nevertheless, your suggestion about including the standard error (STD) to provide statistical-significance analysis is well taken. We have updated the manuscript to include standard errors for all datasets. It is important to note that the FPR95 metric displays a relatively high standard error across methods on the Cora, Amazon, and Chameleon datasets due to their small sizes, which make the outcomes sensitive to variations in data splits. Nonetheless, our method consistently remains competitive across all five runs, as detailed in the updated table.

Each cell shows AUROC ± std / FPR95 ± std.

| Method | cora | amazon | coauthor | chameleon | squirrel |
|---|---|---|---|---|---|
| MSP | 84.56 ± 5.39 / 70.86 ± 15.88 | 89.34 ± 3.49 / 49.26 ± 10.51 | 94.34 ± 0.41 / 28.82 ± 1.94 | 57.96 ± 3.31 / 85.70 ± 7.09 | 48.51 ± 0.46 / 94.68 ± 1.01 |
| Energy | 85.47 ± 4.98 / 67.54 ± 22.98 | 90.28 ± 3.42 / 42.13 ± 9.96 | 95.67 ± 0.25 / 20.29 ± 1.49 | 59.20 ± 4.31 / 88.06 ± 7.50 | 45.07 ± 1.68 / 93.98 ± 1.42 |
| KNN | 70.94 ± 5.62 / 90.20 ± 4.35 | 84.71 ± 3.28 / 65.19 ± 7.26 | 90.13 ± 0.50 / 51.24 ± 1.83 | 57.90 ± 6.48 / 93.38 ± 5.48 | 54.68 ± 2.25 / 94.72 ± 2.84 |
| ODIN | 84.98 ± 5.59 / 68.41 ± 18.48 | 89.90 ± 3.65 / 44.06 ± 10.69 | 95.27 ± 0.33 / 22.59 ± 1.76 | 57.94 ± 3.75 / 85.31 ± 7.64 | 44.08 ± 0.35 / 94.17 ± 0.44 |
| Mahalanobis | 85.48 ± 1.69 / 69.68 ± 14.60 | 75.58 ± 7.97 / 96.49 ± 5.96 | 84.98 ± 0.58 / 85.71 ± 1.82 | 53.19 ± 4.30 / 95.55 ± 2.36 | 54.99 ± 0.70 / 94.90 ± 0.51 |
| GKDE | 86.27 ± 2.69 / 63.71 ± 14.36 | 77.26 ± 5.54 / 81.29 ± 3.36 | 95.13 ± 0.29 / 25.48 ± 1.48 | 50.14 ± 5.50 / 92.93 ± 4.89 | 49.38 ± 3.58 / 96.71 ± 0.67 |
| GPN | 82.93 ± 11.20 / 58.45 ± 31.98 | 82.63 ± 5.87 / 72.95 ± 19.77 | 93.82 ± 2.63 / 34.11 ± 22.46 | 68.20 ± 6.70 / 82.25 ± 6.55 | 48.38 ± 4.43 / 95.58 ± 1.65 |
| OODGAT | 53.63 ± 5.13 / 94.59 ± 6.38 | 66.95 ± 16.02 / 71.34 ± 15.34 | 52.18 ± 8.26 / 96.53 ± 3.39 | 59.67 ± 6.37 / 94.43 ± 3.43 | 46.13 ± 3.10 / 95.27 ± 1.00 |
| GNNSafe | 87.52 ± 6.16 / 54.71 ± 31.41 | 96.27 ± 0.31 / 22.39 ± 4.90 | 95.82 ± 0.28 / 16.64 ± 1.90 | 50.42 ± 0.65 / 100.00 ± 0.00 | 35.88 ± 0.24 / 100.00 ± 0.00 |
| GRASP | 93.50 ± 1.65 / 29.70 ± 12.25 | 96.68 ± 0.28 / 14.38 ± 6.63 | 97.75 ± 0.18 / 7.84 ± 0.58 | 76.93 ± 4.18 / 66.88 ± 6.48 | 61.09 ± 1.49 / 85.59 ± 3.61 |

| Method | reddit2 | ogbn-products | arxiv-year | snap-patents | wiki |
|---|---|---|---|---|---|
| MSP | 46.61 ± 0.66 / 96.59 ± 0.14 | 70.19 ± 0.92 / 86.87 ± 0.35 | 47.24 ± 3.70 / 95.03 ± 1.46 | 46.99 ± 0.83 / 94.31 ± 0.30 | 54.70 ± 0.68 / 95.46 ± 0.32 |
| Energy | 44.13 ± 0.14 / 96.77 ± 0.03 | 68.13 ± 0.38 / 85.09 ± 0.45 | 51.35 ± 5.91 / 94.10 ± 2.76 | 46.03 ± 4.59 / 96.82 ± 1.07 | 29.02 ± 2.78 / 97.31 ± 1.54 |
| KNN | 66.74 ± 0.55 / 90.78 ± 0.75 | 73.58 ± 1.21 / 84.22 ± 2.00 | 57.96 ± 2.19 / 95.35 ± 0.92 | 53.45 ± 0.93 / 90.54 ± 1.09 | 43.69 ± 4.83 / 93.43 ± 2.57 |
| ODIN | 44.69 ± 0.24 / 96.74 ± 0.07 | 68.95 ± 0.52 / 85.65 ± 0.31 | 47.36 ± 3.46 / 95.06 ± 1.47 | 45.20 ± 0.87 / 94.27 ± 0.30 | 29.91 ± 0.47 / 97.88 ± 0.18 |
| Mahalanobis | 74.89 ± 1.01 / 71.73 ± 1.55 | OOM | 59.57 ± 1.27 / 88.60 ± 1.27 | 58.50 ± 0.81 / 96.03 ± 0.22 | 67.95 ± 1.56 / 72.33 ± 2.15 |
| GKDE | OOT | OOM | OOM | OOM | OOM |
| GPN | OOM | OOM | 50.97 ± 14.98 / 95.62 ± 3.29 | OOM | OOM |
| OODGAT | OOM | OOM | 59.38 ± 3.44 / 92.90 ± 0.94 | OOM | OOM |
| GNNSafe | 31.99 ± 0.26 / 99.49 ± 0.07 | 85.66 ± 1.16 / 77.86 ± 1.09 | 35.30 ± 0.06 / 100.00 ± 0.00 | 27.35 ± 0.18 / 99.92 ± 0.18 | 60.32 ± 4.51 / 72.63 ± 2.05 |
| GRASP | 98.50 ± 0.02 / 2.41 ± 0.09 | 93.79 ± 0.24 / 39.77 ± 1.25 | 81.24 ± 0.39 / 73.93 ± 0.60 | 72.13 ± 0.06 / 75.22 ± 0.09 | 77.97 ± 1.38 / 58.49 ± 1.07 |
Review (Rating: 5)

The paper proposes a methodology called Graph-Augmented Score Propagation to improve OOD detection performance on graphs. The key idea of the paper is an edge augmentation strategy which selectively adds edges to a subset of training nodes, which is combined with score propagation for the OOD node detection task on graphs. Theoretical analysis is provided which links OOD score propagation to the intra-edge vs. inter-edge ratios between ID and OOD samples. Experimental results are provided on several real-world graph datasets with the measurement of OOD detection metrics, demonstrating the effectiveness of their method.

Strengths

  1. The paper is generally well written, well motivated and easy to follow.

  2. Theoretical analysis of the setting when OOD score propagation will be effective is helpful for future work in this direction, and for developing OOD detection methods for graphs.

  3. The key idea of the work for OOD detection is presented as a post hoc strategy, hence the practical applicability of the method is good.

  4. Results in Table 2 and 3 are convincing.

Weaknesses

  1. The creation of the subset G depends on the selection of S_id and S_ood examples, which uses MSP as a measure for creating these sets. However, it has been well discussed that MSP suffers from several practical issues such as overconfidence [Hendrycks and Gimpel, ICLR '17], poor generalization [Lee et al., NeurIPS '18], and calibration issues [Guo et al., ICML '17]. Because of GRASP's dependence on MSP for creating the set G, the overall method can be non-robust.

  2. Error bars are missing from all the results, which is important for assessing the consistency of the proposed method, since the data augmentation could be significantly affected by the randomness.

Questions

How would the authors ensure the robustness of their detection method when the confidence could vary? (See weakness pt. 1 above.) Did the authors do any experiments to evaluate the robustness of their method, and how would they address it?

Limitations

The authors mention the limitations of their work in Appendix section E and societal impact in Appendix Section F.

Comment
Each cell shows AUROC ± std / FPR95 ± std.

| Method | cora | amazon | coauthor | chameleon | squirrel |
|---|---|---|---|---|---|
| MSP | 84.56 ± 5.39 / 70.86 ± 15.88 | 89.34 ± 3.49 / 49.26 ± 10.51 | 94.34 ± 0.41 / 28.82 ± 1.94 | 57.96 ± 3.31 / 85.70 ± 7.09 | 48.51 ± 0.46 / 94.68 ± 1.01 |
| Energy | 85.47 ± 4.98 / 67.54 ± 22.98 | 90.28 ± 3.42 / 42.13 ± 9.96 | 95.67 ± 0.25 / 20.29 ± 1.49 | 59.20 ± 4.31 / 88.06 ± 7.50 | 45.07 ± 1.68 / 93.98 ± 1.42 |
| KNN | 70.94 ± 5.62 / 90.20 ± 4.35 | 84.71 ± 3.28 / 65.19 ± 7.26 | 90.13 ± 0.50 / 51.24 ± 1.83 | 57.90 ± 6.48 / 93.38 ± 5.48 | 54.68 ± 2.25 / 94.72 ± 2.84 |
| ODIN | 84.98 ± 5.59 / 68.41 ± 18.48 | 89.90 ± 3.65 / 44.06 ± 10.69 | 95.27 ± 0.33 / 22.59 ± 1.76 | 57.94 ± 3.75 / 85.31 ± 7.64 | 44.08 ± 0.35 / 94.17 ± 0.44 |
| Mahalanobis | 85.48 ± 1.69 / 69.68 ± 14.60 | 75.58 ± 7.97 / 96.49 ± 5.96 | 84.98 ± 0.58 / 85.71 ± 1.82 | 53.19 ± 4.30 / 95.55 ± 2.36 | 54.99 ± 0.70 / 94.90 ± 0.51 |
| GKDE | 86.27 ± 2.69 / 63.71 ± 14.36 | 77.26 ± 5.54 / 81.29 ± 3.36 | 95.13 ± 0.29 / 25.48 ± 1.48 | 50.14 ± 5.50 / 92.93 ± 4.89 | 49.38 ± 3.58 / 96.71 ± 0.67 |
| GPN | 82.93 ± 11.20 / 58.45 ± 31.98 | 82.63 ± 5.87 / 72.95 ± 19.77 | 93.82 ± 2.63 / 34.11 ± 22.46 | 68.20 ± 6.70 / 82.25 ± 6.55 | 48.38 ± 4.43 / 95.58 ± 1.65 |
| OODGAT | 53.63 ± 5.13 / 94.59 ± 6.38 | 66.95 ± 16.02 / 71.34 ± 15.34 | 52.18 ± 8.26 / 96.53 ± 3.39 | 59.67 ± 6.37 / 94.43 ± 3.43 | 46.13 ± 3.10 / 95.27 ± 1.00 |
| GNNSafe | 87.52 ± 6.16 / 54.71 ± 31.41 | 96.27 ± 0.31 / 22.39 ± 4.90 | 95.82 ± 0.28 / 16.64 ± 1.90 | 50.42 ± 0.65 / 100.00 ± 0.00 | 35.88 ± 0.24 / 100.00 ± 0.00 |
| GRASP | 93.50 ± 1.65 / 29.70 ± 12.25 | 96.68 ± 0.28 / 14.38 ± 6.63 | 97.75 ± 0.18 / 7.84 ± 0.58 | 76.93 ± 4.18 / 66.88 ± 6.48 | 61.09 ± 1.49 / 85.59 ± 3.61 |

| Method | reddit2 | ogbn-products | arxiv-year | snap-patents | wiki |
|---|---|---|---|---|---|
| MSP | 46.61 ± 0.66 / 96.59 ± 0.14 | 70.19 ± 0.92 / 86.87 ± 0.35 | 47.24 ± 3.70 / 95.03 ± 1.46 | 46.99 ± 0.83 / 94.31 ± 0.30 | 54.70 ± 0.68 / 95.46 ± 0.32 |
| Energy | 44.13 ± 0.14 / 96.77 ± 0.03 | 68.13 ± 0.38 / 85.09 ± 0.45 | 51.35 ± 5.91 / 94.10 ± 2.76 | 46.03 ± 4.59 / 96.82 ± 1.07 | 29.02 ± 2.78 / 97.31 ± 1.54 |
| KNN | 66.74 ± 0.55 / 90.78 ± 0.75 | 73.58 ± 1.21 / 84.22 ± 2.00 | 57.96 ± 2.19 / 95.35 ± 0.92 | 53.45 ± 0.93 / 90.54 ± 1.09 | 43.69 ± 4.83 / 93.43 ± 2.57 |
| ODIN | 44.69 ± 0.24 / 96.74 ± 0.07 | 68.95 ± 0.52 / 85.65 ± 0.31 | 47.36 ± 3.46 / 95.06 ± 1.47 | 45.20 ± 0.87 / 94.27 ± 0.30 | 29.91 ± 0.47 / 97.88 ± 0.18 |
| Mahalanobis | 74.89 ± 1.01 / 71.73 ± 1.55 | OOM | 59.57 ± 1.27 / 88.60 ± 1.27 | 58.50 ± 0.81 / 96.03 ± 0.22 | 67.95 ± 1.56 / 72.33 ± 2.15 |
| GKDE | OOT | OOM | OOM | OOM | OOM |
| GPN | OOM | OOM | 50.97 ± 14.98 / 95.62 ± 3.29 | OOM | OOM |
| OODGAT | OOM | OOM | 59.38 ± 3.44 / 92.90 ± 0.94 | OOM | OOM |
| GNNSafe | 31.99 ± 0.26 / 99.49 ± 0.07 | 85.66 ± 1.16 / 77.86 ± 1.09 | 35.30 ± 0.06 / 100.00 ± 0.00 | 27.35 ± 0.18 / 99.92 ± 0.18 | 60.32 ± 4.51 / 72.63 ± 2.05 |
| GRASP | 98.50 ± 0.02 / 2.41 ± 0.09 | 93.79 ± 0.24 / 39.77 ± 1.25 | 81.24 ± 0.39 / 73.93 ± 0.60 | 72.13 ± 0.06 / 75.22 ± 0.09 | 77.97 ± 1.38 / 58.49 ± 1.07 |
Comment

Thanks for the rebuttal and for sharing the additional results. I would encourage the authors to include them in the paper for a clearer indication of the statistical significance of their results and contribution.

Given my positive outlook on the paper, I would like to keep my score.

Comment

Dear Reviewer xmGd,

Thank you for taking the time to read our rebuttal and for your positive feedback. We appreciate your suggestion and will incorporate these results in the revised version of our paper.

Best regards,

The Authors of Submission 12283.

Author Response

We appreciate the reviewer's constructive feedback and insightful questions. Below, we address each point in detail:

W1&Q1. Concern about using the MSP score to select S_id and S_ood.

Thank you for this insightful question! We acknowledge the concern that the model can sometimes exhibit overconfidence on certain OOD nodes. However, this does not undermine the applicability of GRASP when ID nodes' MSP scores are generally higher than those of OOD nodes. As demonstrated in Figure 4, we selectively utilize the highest MSP scores for ID and the lowest for OOD. This selection strategy results in more accurate estimations of $S_{id}$ and $S_{ood}$.

We empirically support this approach with results presented in Tables 2, 3, and 5, across five datasets (Squirrel, Reddit2, arXiv-Year, snap-patents, and Wiki). These results show significant performance improvements with GRASP, particularly where the original MSP scores were ineffective, ensuring more accurate distinctions between $S_{id}$ and $S_{ood}$ in each iteration.

Additionally, our method offers high flexibility—it can integrate any existing OOD scoring function. While we use MSP in our primary demonstrations to explain GRASP's principles, substituting MSP with other well-known OOD scoring methods in our framework also yields competitive results, as shown in Table 7. This highlights GRASP’s adaptability and robustness.

W2. Suggestion of including error bars.

This is an excellent suggestion! We have now included the main results with standard errors for all datasets in the revised manuscript and the table below. It is important to note that the FPR95 metric displays a relatively high standard error across methods on the Cora, Amazon, and Chameleon datasets due to their small sizes, which make the outcomes sensitive to variations in data splits. Nonetheless, our method consistently remains competitive across all five runs, as detailed in the updated table in the second part of the response.

Review (Rating: 5)

This work aims to detect out-of-distribution (OOD) nodes on a graph by exploring useful OOD score propagation methods. It introduces a novel edge augmentation strategy, with a theoretical guarantee. The approach's superiority is empirically demonstrated, outperforming OOD detection baselines in various scenarios and settings.

Strengths

S1. The paper conducts an in-depth analysis and empirically validates the beneficial conditions for OOD score propagation in a graph.

S2. The paper introduces the GRASP methodology, which addresses the limitations of previous score propagation methods and demonstrates superior performance in node-level OOD detection tasks.

Weaknesses

W1. Does the strategy of utilizing GRASP with additional edges compromise classification accuracy within the distribution? In other words, how can you balance the homogeneity of graph classification with the homogeneity of OOD detection tasks when adding edges?

W2. While this method is suitable for node-level OOD detection, it appears to be more challenging to adapt it to subgraph-level tasks or graph-level tasks.

Questions

Q1. Is the classification accuracy in Appendix D.1 assessed before or after employing GRASP? Is there a comparison between the two?

Q2. Do similar cases discussed in Figure 2 exist within real-world datasets?

Q3. What is the accuracy of the edge-adding method presented in the text? Specifically, how many of the added edges are intra-edges, and how many are inter-edges? What impact would incorrectly added edges have on the results?

Limitations

Yes.

Author Response

We appreciate the reviewer's constructive feedback and insightful questions. Below, we address each point in detail:

W1. Relationship between GRASP and the in-distribution classification accuracy.

Fair concern! GRASP is a post hoc OOD detection method that does not interfere with the classification process. Therefore, the classification accuracy of in-distribution (ID) data remains unchanged.

W2. Extension to subgraph-level tasks or graph-level tasks.

We appreciate the reviewer's suggestion to extend our work to different tasks! While our primary focus is on the node-level setting, we are open to discussing how GRASP can be adapted for subgraph-level or graph-level tasks. One potential approach is to treat a subgraph or graph as a single node, with the similarity between subgraphs or graphs represented by the weights of edges. This adaptation would enable the application of our method to these broader contexts.

Q1. Is the classification accuracy in Appendix D.1 assessed before or after employing GRASP? Is there a comparison between the two?

(Related to W1) The classification accuracy presented in Appendix D.1 is from the pretrained ID model before employing GRASP. Since our method is post hoc, the classification accuracy remains the same across all post hoc methods.

Q2. Do similar cases discussed in Figure 2 exist within real-world datasets?

Yes, they do. Figure 2 is designed to intuitively illustrate our theoretical findings, which are authentically reflective of real-world scenarios. Our theory suggests that score propagation enhances OOD detection performance when intra-edges are predominant. This behavior is captured in real-world datasets, as supported by the empirical evidence presented in Table 14.

Q3. What is the accuracy of the edge-adding method; specifically, how many of the added edges are intra-edges and how many are inter-edges? What impact would incorrectly added edges have on the results?

The accuracy of the added edges on the common datasets is shown in the table below:

| Datasets | Added Edges' ACC |
|---|---|
| cora | 0.94 |
| amazon | 0.94 |
| coauthor | 0.99 |
| chameleon | 0.80 |
| squirrel | 0.80 |

We investigate the impact on GRASP's performance by adding varying proportions of incorrect edges. The results are presented in the table below, with the "ratio" column indicating the proportion of incorrect edges added. The results show that GRASP's performance deteriorates significantly when the proportion of incorrect edges exceeds 0.3.

| ratio | cora AUROC | cora FPR95 | amazon AUROC | amazon FPR95 | coauthor AUROC | coauthor FPR95 | chameleon AUROC | chameleon FPR95 | squirrel AUROC | squirrel FPR95 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98.78 | 0.42 | 99.85 | 0.21 | 99.79 | 0.00 | 95.06 | 5.79 | 92.27 | 7.99 |
| 0.1 | 87.04 | 54.57 | 90.86 | 47.73 | 90.03 | 39.10 | 77.91 | 55.09 | 70.04 | 66.20 |
| 0.3 | 57.33 | 77.98 | 65.04 | 80.70 | 66.63 | 79.13 | 63.89 | 76.94 | 49.63 | 91.78 |
| 0.5 | 42.52 | 80.94 | 37.09 | 90.60 | 40.75 | 88.16 | 45.79 | 90.72 | 23.43 | 96.29 |
| 0.7 | 38.88 | 86.97 | 30.27 | 95.63 | 34.64 | 92.78 | 28.41 | 97.34 | 12.20 | 99.91 |
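As a concrete illustration of this corruption protocol, the sketch below replaces a given fraction of correctly added intra-edges with random ID–OOD inter-edges. The function and variable names are hypothetical and not taken from the paper's implementation:

```python
import random

def corrupt_added_edges(intra_edges, id_nodes, ood_nodes, ratio, seed=0):
    """Replace a fraction `ratio` of correctly added intra-edges with
    incorrect inter-edges (one ID endpoint, one OOD endpoint).

    Illustrative sketch of the robustness experiment described above;
    names are hypothetical, not from the paper's code.
    """
    rng = random.Random(seed)
    n_bad = int(round(ratio * len(intra_edges)))
    kept = intra_edges[n_bad:]                       # remaining correct intra-edges
    bad = [(rng.choice(id_nodes), rng.choice(ood_nodes))
           for _ in range(n_bad)]                    # injected incorrect inter-edges
    return kept + bad
```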
Comment

Thank you for your reply.

I have concerns about your claim in the paper that "GRASP still works when raw MSP fails," based on the new experiment involving the addition of incorrect edges. The results in Tables 2 and 3 indicate that MSP's performance is poor, which suggests that using MSP to select S_id and S_ood may introduce numerous errors. This could lead to incorrect edges being added to the augmented graph, which, given the evidence that "GRASP's performance deteriorates significantly when the proportion of incorrect edges exceeds 0.3," could adversely impact the effectiveness of GRASP.

It is unclear under what conditions your algorithm is effective, specifically on which kinds of graph datasets and with which base models.

Comment

Dear Reviewer hQev,

Thank you for raising the concerns regarding our claim that "GRASP still works when raw MSP fails." We appreciate the opportunity to further clarify this aspect of our study.

To delve deeper into this, we conducted a visualization analysis of the MSP score distribution, similar to Figure 4, for each dataset. Due to the limitations of the OpenReview platform, we are unable to upload these new figures here (we will include them in the revision). In these figures, we observe multiple distinct "heaps" in the score distribution. It is important to note that since MSP is measured across the entire test set, it does not effectively capture all local characteristics. Notably, our results indicate that selections made using S_id, which prioritizes the highest MSP scores, are composed more of ID nodes than OOD nodes. This tendency is further reinforced with additional propagation iterations.

We also wish to highlight additional evidence supporting the effectiveness of GRASP:

  1. GRASP is also compatible with scoring functions beyond MSP. Substituting MSP with other well-known OOD scoring methods in our framework also yields competitive results, as shown in the table (AUROC) below.

| Method | cora | amazon | coauthor | chameleon | squirrel |
|---|---|---|---|---|---|
| MSP | 84.56 | 89.34 | 94.34 | 57.96 | 48.51 |
| MSP+prop | 88.02 | 95.32 | 97.15 | 50.35 | 36.21 |
| MSP+GRASP | 93.50 | 96.68 | 97.75 | 76.93 | 61.09 |
| Energy | 85.47 | 90.28 | 95.67 | 59.20 | 45.07 |
| Energy+prop | 87.52 | 96.27 | 95.82 | 50.42 | 36.49 |
| Energy+GRASP | 88.34 | 96.35 | 96.64 | 62.04 | 60.66 |
| KNN | 70.94 | 84.71 | 90.13 | 57.90 | 54.68 |
| KNN+prop | 73.70 | 92.36 | 95.47 | 49.76 | 53.99 |
| KNN+GRASP | 91.48 | 97.43 | 96.52 | 76.32 | 60.24 |
  2. Enhancement of intra-edge ratios during propagation. By utilizing the S_id and S_ood estimates, GRASP significantly increases the proportion of intra-edges, thereby improving the overall accuracy of edge addition. The ablation results below show the increased accuracy of added edges when GRASP is employed, compared to when it is not:

| Datasets | w/o GRASP | w/ GRASP |
|---|---|---|
| cora | 0.89 | 0.94 |
| amazon | 0.92 | 0.94 |
| coauthor | 0.93 | 0.99 |
| chameleon | 0.51 | 0.80 |
| squirrel | 0.39 | 0.80 |
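For reference, the "+prop" rows in the AUROC table above correspond to plain score propagation over the graph. A minimal sketch of such propagation, assuming a symmetrically normalized adjacency with a residual connection (which may differ from the paper's exact operator), could look like:

```python
import numpy as np

def propagate_scores(adj, scores, k=8, alpha=0.5):
    """Smooth per-node OOD scores over the graph structure.

    adj    : (n, n) dense 0/1 symmetric adjacency matrix
    scores : (n,) initial OOD scores (e.g., MSP or energy)
    k      : number of propagation steps
    alpha  : residual weight kept on the original scores

    Generic sketch of score propagation; the paper's exact
    propagation operator may differ.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # symmetrically normalized adjacency D^{-1/2} A D^{-1/2}
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    s = scores.astype(float).copy()
    for _ in range(k):
        s = alpha * scores + (1 - alpha) * (a_norm @ s)
    return s
```

When intra-edges dominate, this averaging sharpens the separation between ID and OOD scores; when inter-edges dominate, it mixes the two populations, which is the failure mode the theory in the paper characterizes.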
Comment
| Backbone | Method | cora FPR95 | cora AUROC | amazon FPR95 | amazon AUROC | coauthor FPR95 | coauthor AUROC | chameleon FPR95 | chameleon AUROC | squirrel FPR95 | squirrel AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| H2GCN | MSP | 67.00 | 86.50 | 59.23 | 86.88 | 99.37 | 40.35 | 91.00 | 62.79 | 94.34 | 57.21 |
| H2GCN | Energy | 68.06 | 86.84 | 57.05 | 86.21 | 97.85 | 51.65 | 92.66 | 63.24 | 96.75 | 53.18 |
| H2GCN | KNN | 80.00 | 79.68 | 63.85 | 80.54 | 60.66 | 77.25 | 95.13 | 56.89 | 95.62 | 57.45 |
| H2GCN | ODIN | 65.21 | 87.10 | 56.25 | 86.97 | 99.43 | 41.58 | 91.07 | 63.52 | 95.08 | 55.69 |
| H2GCN | Mahalanobis | 81.67 | 80.55 | 86.26 | 77.33 | 97.92 | 61.02 | 97.62 | 58.29 | 96.36 | 53.54 |
| H2GCN | GNNSafe | 43.97 | 88.83 | 33.40 | 90.87 | 93.00 | 43.23 | 100.00 | 50.35 | 100.00 | 36.26 |
| H2GCN | GRASP | 33.54 | 92.63 | 16.57 | 96.48 | 14.23 | 96.08 | 66.38 | 74.72 | 86.04 | 60.83 |
| MixHop | MSP | 83.94 | 78.60 | 53.56 | 90.97 | 48.66 | 90.91 | 92.95 | 56.77 | 95.60 | 49.07 |
| MixHop | Energy | 83.67 | 77.15 | 57.04 | 89.28 | 28.49 | 94.67 | 94.10 | 57.21 | 95.61 | 48.87 |
| MixHop | KNN | 93.36 | 69.93 | 65.41 | 86.45 | 62.40 | 85.91 | 89.52 | 57.64 | 93.44 | 54.00 |
| MixHop | ODIN | 83.14 | 79.10 | 50.00 | 91.25 | 41.39 | 92.65 | 93.45 | 56.48 | 95.68 | 47.58 |
| MixHop | Mahalanobis | 82.35 | 80.04 | 90.05 | 81.85 | 47.41 | 91.67 | 93.93 | 56.30 | 91.33 | 56.56 |
| MixHop | GNNSafe | 66.86 | 83.77 | 39.72 | 93.54 | 33.83 | 92.46 | 100.00 | 50.35 | 100.00 | 36.42 |
| MixHop | GRASP | 32.11 | 92.77 | 10.07 | 96.99 | 9.41 | 97.31 | 76.92 | 66.12 | 85.92 | 60.69 |
| GPR-GNN | MSP | 64.90 | 87.44 | 62.84 | 87.66 | 23.96 | 95.64 | 96.09 | 47.65 | 95.78 | 44.62 |
| GPR-GNN | Energy | 72.85 | 83.86 | 64.23 | 85.28 | 16.42 | 96.50 | 93.78 | 49.09 | 95.16 | 42.63 |
| GPR-GNN | KNN | 74.24 | 81.46 | 48.47 | 90.48 | 38.83 | 92.31 | 94.39 | 55.31 | 94.18 | 51.74 |
| GPR-GNN | ODIN | 62.58 | 88.13 | 55.49 | 88.41 | 17.24 | 96.51 | 96.16 | 47.50 | 95.51 | 42.32 |
| GPR-GNN | Mahalanobis | 79.56 | 84.53 | 97.25 | 69.75 | 49.93 | 91.56 | 87.01 | 55.95 | 87.24 | 61.10 |
| GPR-GNN | GNNSafe | 51.65 | 85.91 | 13.63 | 96.46 | 14.73 | 95.96 | 100.00 | 50.32 | 100.00 | 36.25 |
| GPR-GNN | GRASP | 26.71 | 94.02 | 5.30 | 97.14 | 8.28 | 97.70 | 76.53 | 72.43 | 85.40 | 61.33 |
| GCNII | MSP | 72.85 | 83.02 | 51.72 | 88.13 | 23.18 | 95.21 | 96.03 | 55.46 | 94.13 | 49.46 |
| GCNII | Energy | 83.15 | 75.24 | 48.28 | 88.78 | 17.72 | 96.03 | 95.87 | 56.75 | 94.61 | 48.63 |
| GCNII | KNN | 83.99 | 76.02 | 59.25 | 86.74 | 36.05 | 93.43 | 94.72 | 52.86 | 94.65 | 53.47 |
| GCNII | ODIN | 71.49 | 83.31 | 49.44 | 88.35 | 19.44 | 95.75 | 95.61 | 56.63 | 94.73 | 48.34 |
| GCNII | Mahalanobis | 73.90 | 82.01 | 77.63 | 80.87 | 44.01 | 92.63 | 96.68 | 46.57 | 91.66 | 53.62 |
| GCNII | GNNSafe | 66.70 | 83.12 | 27.08 | 93.13 | 17.87 | 94.47 | 100.00 | 50.35 | 100.00 | 36.32 |
| GCNII | GRASP | 27.92 | 93.51 | 23.53 | 93.72 | 8.82 | 97.61 | 76.79 | 66.44 | 86.27 | 60.62 |
Comment

Regarding the impact of other base models, we have shown in Table 13 that GRASP achieves optimal performance on various base models in the literature. For the reviewer's convenience, we have included the relevant results below and in the third part of the response, which clearly attests to the versatility and practicality of GRASP:

| Backbone | Method | cora FPR95 | cora AUROC | amazon FPR95 | amazon AUROC | coauthor FPR95 | coauthor AUROC | chameleon FPR95 | chameleon AUROC | squirrel FPR95 | squirrel AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GAT | MSP | 55.33 | 88.82 | 29.88 | 94.39 | 28.15 | 94.26 | 91.27 | 61.94 | 95.21 | 47.50 |
| GAT | Energy | 80.71 | 79.16 | 26.48 | 95.24 | 20.96 | 95.65 | 92.71 | 61.11 | 96.47 | 45.69 |
| GAT | KNN | 71.14 | 81.28 | 46.42 | 90.74 | 42.51 | 91.51 | 89.02 | 61.13 | 95.37 | 53.16 |
| GAT | ODIN | 55.27 | 89.06 | 26.92 | 94.89 | 24.61 | 94.95 | 90.83 | 62.89 | 96.11 | 45.68 |
| GAT | Mahalanobis | 67.92 | 86.37 | 14.28 | 95.80 | 26.27 | 94.46 | 95.35 | 50.65 | 91.36 | 57.67 |
| GAT | GNNSafe | 58.97 | 85.64 | 29.12 | 93.16 | 25.41 | 93.91 | 100.00 | 50.39 | 100.00 | 36.21 |
| GAT | GRASP | 22.76 | 94.28 | 14.21 | 96.79 | 8.59 | 97.51 | 70.15 | 73.40 | 85.84 | 61.18 |
| GCNJK | MSP | 81.33 | 80.40 | 32.45 | 94.64 | 26.43 | 94.44 | 86.42 | 68.19 | 94.93 | 51.83 |
| GCNJK | Energy | 96.56 | 70.16 | 40.90 | 93.80 | 18.75 | 95.75 | 91.92 | 65.16 | 95.36 | 49.68 |
| GCNJK | KNN | 90.98 | 73.81 | 64.47 | 85.18 | 50.95 | 89.98 | 94.45 | 59.04 | 94.64 | 53.49 |
| GCNJK | ODIN | 81.04 | 80.68 | 28.35 | 95.13 | 21.12 | 95.41 | 86.03 | 68.58 | 95.04 | 50.64 |
| GCNJK | Mahalanobis | 60.84 | 86.20 | 61.61 | 87.11 | 83.04 | 87.34 | 87.23 | 66.61 | 91.52 | 57.24 |
| GCNJK | GNNSafe | 65.01 | 83.11 | 22.41 | 96.28 | 13.27 | 96.47 | 100.00 | 50.40 | 100.00 | 36.21 |
| GCNJK | GRASP | 29.69 | 92.98 | 12.66 | 96.86 | 8.03 | 97.74 | 59.61 | 75.78 | 86.02 | 60.70 |
| GATJK | MSP | 69.56 | 84.51 | 47.21 | 91.32 | 24.66 | 95.37 | 94.39 | 55.43 | 94.67 | 50.98 |
| GATJK | Energy | 62.27 | 85.75 | 34.75 | 92.89 | 17.23 | 96.38 | 91.11 | 59.01 | 95.61 | 48.76 |
| GATJK | KNN | 82.54 | 74.32 | 70.98 | 83.48 | 38.95 | 92.56 | 92.21 | 61.14 | 95.20 | 54.32 |
| GATJK | ODIN | 64.25 | 85.21 | 39.29 | 92.19 | 18.16 | 96.30 | 93.56 | 56.10 | 95.24 | 48.62 |
| GATJK | Mahalanobis | 79.60 | 79.33 | 52.79 | 88.53 | 34.60 | 93.68 | 91.59 | 52.38 | 91.52 | 56.19 |
| GATJK | GNNSafe | 44.43 | 90.01 | 22.46 | 95.45 | 17.54 | 95.32 | 100.00 | 50.39 | 100.00 | 36.15 |
| GATJK | GRASP | 29.04 | 92.57 | 14.78 | 96.70 | 8.32 | 97.70 | 78.65 | 71.09 | 85.88 | 61.17 |
| APPNP | MSP | 59.37 | 89.01 | 64.64 | 86.51 | 18.38 | 96.45 | 94.24 | 48.87 | 94.41 | 50.91 |
| APPNP | Energy | 81.82 | 81.21 | 62.87 | 84.36 | 14.57 | 97.01 | 90.55 | 55.75 | 90.91 | 53.04 |
| APPNP | KNN | 75.33 | 81.21 | 49.55 | 89.76 | 38.44 | 91.71 | 92.14 | 54.19 | 94.12 | 53.14 |
| APPNP | ODIN | 56.72 | 89.47 | 60.67 | 86.76 | 15.02 | 96.98 | 94.63 | 50.71 | 94.41 | 50.60 |
| APPNP | Mahalanobis | 73.64 | 86.02 | 98.75 | 62.13 | 30.20 | 93.91 | 92.38 | 58.15 | 93.29 | 56.65 |
| APPNP | GNNSafe | 59.70 | 85.45 | 19.26 | 95.08 | 12.10 | 96.60 | 100.00 | 50.45 | 100.00 | 36.24 |
| APPNP | GRASP | 26.45 | 94.16 | 5.69 | 97.11 | 8.69 | 97.59 | 83.41 | 63.02 | 86.42 | 60.76 |
Author Response

Dear Reviewers and ACs,

We are grateful for the insightful comments and valuable suggestions from all reviewers. In the following, we would like to summarize the contributions and revisions of this paper. As abbreviations, we refer to Reviewer hQev as R1, Reviewer xmGd as R2, Reviewer aWhs as R3, and Reviewer hXZy as R4 respectively.

Contributions:

  • We study score propagation in depth, theoretically elucidating the conditions under which it is effective. Multiple reviewers value the theoretical contribution of our paper: "conducts an in-depth analysis" (R1); "theoretical analysis is helpful for future work in this direction, and for developing OOD detection methods for graphs" (R2); "theoretical foundation is solid and extends the understanding of graph OOD detection" (R3).
  • We propose a graph augmentation method that increases the ratio of intra-edges to enhance OOD detection performance, without training from scratch and without requiring knowledge of the true OOD nodes. Multiple reviewers recognized the effectiveness and practical value of our method: "demonstrates superior performance" (R1); "results are convincing" (R2); "achieves the SOTA performance" (R3); "the practical applicability is good" (R2); "the solution is practical and efficient" (R3); "experiments are comprehensive" and "the hyperparameter sensitivity analysis is thorough, providing readers with valuable insights" (R4).
  • Our paper is "well motivated" (R2), "introduced in an informative manner" (R4), well written (R2, R3), and easy to follow (R2, R3, R4).

Responses and Revisions:

  • For Reviewer R1's concerns:

    • We clarify the relationship between GRASP and ID classification accuracy.
    • We discuss the extension of GRASP to subgraph-level and graph-level tasks.
    • We elucidate the applicability of Figure 2 in real-world datasets.
    • We investigate the impact of incorrectly added edges on the results.
  • For Reviewer R2's concerns:

    • We explain the effectiveness and robustness of our method in using MSP.
    • We add error bars to the results.
  • For Reviewer R3's concerns:

    • We release the source codes and pre-trained checkpoints for the reproducibility of our work.
    • We add error bars to the results.
    • We conduct additional experiments to analyze how GRASP's performance varies with different OOD distributions.
    • We provide further insights into the rationale behind GRASP and how it interprets the graph structure.
    • We analyze the relationship between GRASP and the graph size.
    • We explain the effect of random connectivity on GRASP.
  • For Reviewer R4's concerns:

    • We provide a more detailed discussion of existing graph-OOD works and position our paper's contributions.
    • We add comparative experiments with two latest baselines.
    • We justify the use of the Bernoulli distribution to model edge weights.
    • We present a concrete and intuitive example to illustrate the motivation behind our approach.
    • We offer further explanation for the remarkable improvement of GRASP on FPR in reddit2.
    • We highlight the key factors in achieving optimal OOD detection performance.

Thanks again for your efforts in reviewing our work, and we hope our responses can address any concerns about this work.

Warm regards,

The Authors of Submission 12283.

Final Decision

This paper develops a theoretical understanding of the propagation mechanism that leads to performance degradation in OOD detection on graph data and presents a simple yet effective post hoc method called GRASP to address this issue. The approach focuses on increasing the intra-edge ratio, based on the observation and theoretical insight that OOD detection becomes more difficult as the inter-edge ratio between ID and OOD nodes increases. By dividing unlabeled nodes into ID and OOD sets based on maximum softmax probability, the method identifies a subset with a high connection ratio to the ID set and links the nodes within this subset. Extensive experiments demonstrate the effectiveness of the proposed method. The reviewers initially raised the following strong points and concerns.

Strong points:

  • The paper conducts an in-depth analysis and empirically validates the beneficial conditions for OOD score propagation in a graph.

  • The key idea of the work for OOD detection is presented as a post hoc strategy, hence the practical applicability of the method is good.

  • The study delves into the mechanisms of score propagation and derives conditions for its effectiveness. This theoretical foundation is solid and extends the understanding of graph OOD detection.

Concerns:

  • Several issues such as overconfidence and poor generalization can cause degradation of GRASP performance.

  • Standard error with statistical significance analysis should be provided.

  • The baselines selected in the experiments may not be sufficiently novel.

The authors have adequately addressed the reviewers' concerns, but one concern remains. The performance of the latest baseline, NODESafe [1], presented in the authors' rebuttal differs from the results reported in the original NODESafe paper by approximately 9 in AUROC and 39 in FPR. Similarly, the baseline GNNSafe [2] included in the main table shows a discrepancy of about 5 in AUROC and 24 in FPR compared to its original paper. Although the settings are slightly different (e.g., in the Cora dataset, the paper used 3 classes as ID and 4 classes as OOD, while NODESafe used 4 classes as ID and 3 classes as OOD, with the same GCN backbone), the performance gap seems notable even considering this. It is unclear why such a significant performance difference has occurred.

[1] Yang, Shenzhi, et al. "Bounded and Uniform Energy-based Out-of-distribution Detection for Graphs." Forty-first International Conference on Machine Learning.

[2] Wu, Qitian, et al. "Energy-based Out-of-Distribution Detection for Graph Neural Networks." The Eleventh International Conference on Learning Representations.