PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 3, 5, 5, 5 (min 3, max 5, std. dev. 0.9)
Confidence: 4.0
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.5
NeurIPS 2025

Out-of-Distribution Generalized Graph Anomaly Detection with Homophily-aware Environment Mixup

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29

Abstract

Keywords
Graph Anomaly Detection · Out-of-Distribution Generalization

Reviews and Discussion

Review
Rating: 3

The paper addresses the challenge of graph anomaly detection under structural distribution shifts due to selection bias. It proposes a novel method called HEM (Ego-Neighborhood Disentangled Encoder with Homophily-aware Environment Mixup) to discover invariant patterns and dynamically adjust edge weights to handle varying homophilic structures. The approach involves disentangling feature and structure embeddings, and performing adversarial training to improve generalization capabilities.

Strengths and Weaknesses

Strengths:

  1. Innovative decoupling framework.
  2. Dynamic and continuous environment augmentation for distribution shifts.

Weaknesses:

  1. Outdated baselines, missing recent frameworks.
  2. Lacks mathematical proofs for invariant pattern stability.
  3. Overly complex descriptions.

Questions

  1. How does the disentangled encoder theoretically ensure prediction stability by separating features and structures?
  2. Is there a unified mathematical representation of "stability" across distributions for invariant patterns, and how is it proven?
  3. Could you elaborate on the specific criteria or methods used to distinguish and partition the datasets into these two domains?
  4. The method seems to primarily address a unidirectional distribution shift. How does it ensure the model’s stability in scenarios where homophily either intensifies or remains nearly constant?

Limitations

  1. The baselines are outdated, as they do not include recent OOD detection methods (like energy-based models, uncertainty-aware models) or some recent anomaly detection frameworks. Notably, there have already been frameworks that decouple attribute and structure, such as GOOD-D proposed in WSDM23.
  2. Since this is a typical binary classification problem, using only AUPRC as the evaluation metric is insufficient to convince readers.
  3. The paper does not elaborate on how General GNNs (e.g., GCN, GAT) are adapted for anomaly detection tasks.

Final Justification

I appreciate the thoughtful rebuttal addressing most of my concerns. I decided to maintain my original score.

Formatting Issues

N/A

Author Response

Q1&Q2

Notation

Let $h$ denote the homophily ratio and $Y$ the node label, where $Y=1$ for anomalous nodes and $Y=0$ for normal nodes. The proportion of anomalous nodes is $\pi$, so that $P(Y=1)=\pi$.

The proportion of anomalous neighbors of a node $u$, denoted $r_u$, is defined as
$$r_u = \frac{1}{\deg(u)}\sum_{(u,v)\in E} Y_v.$$
We assume a constant node degree $d$, and write $\mu_1=\mathbb{E}[X\mid Y{=}1]$ and $\mu_0=\mathbb{E}[X\mid Y{=}0]$, with the assumption that $\Delta\mu=\mu_1-\mu_0\neq 0$.

Theorem 1: Under a homophily distribution shift, the optimal parameter $w^\star$ learned on a training set with homophily $h_{tr}$ is no longer optimal for a test set with a different homophily $h_{te}$.

Key statistical properties:

  • $\operatorname{Cov}(Y,r)=\pi(1-\pi)(2h-1)$
  • $\operatorname{Var}(Y)=\pi(1-\pi)$
  • $\operatorname{Var}(r\mid Y{=}y)=\dfrac{h(1-h)}{d}$
  • $\operatorname{Var}(r)=\dfrac{h(1-h)}{d}+\pi(1-\pi)(2h-1)^2$
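
These identities are easy to sanity-check numerically. Below is a minimal Monte-Carlo check, assuming the simple generative model implied by the notation above (constant degree $d$, each neighbor sharing the node's label with probability $h$); parameter values and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
pi, h, d, n = 0.1, 0.8, 10, 500_000  # anomaly rate, homophily ratio, degree, #nodes

# Labels Y ~ Bernoulli(pi). Each of the d neighbors shares the node's label with
# probability h, so the anomalous-neighbor fraction r is Binomial(d, p) / d with
# p = h for anomalous nodes and p = 1 - h for normal nodes.
Y = rng.binomial(1, pi, size=n)
p = np.where(Y == 1, h, 1.0 - h)
r = rng.binomial(d, p) / d

print("Cov(Y, r):", np.cov(Y, r)[0, 1], "theory:", pi * (1 - pi) * (2 * h - 1))
print("Var(r):   ", r.var(), "theory:", h * (1 - h) / d + pi * (1 - pi) * (2 * h - 1) ** 2)
```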

For a one-layer message-passing GNN followed by a linear classifier, the node embedding $Z_v$ is
$$Z_v=(AX)_v=r_v\mu_1+(1-r_v)\mu_0=\mu_0+r_v\Delta\mu.$$

The final prediction $\hat{Y}_v$ is then
$$\hat{Y}_v = w^\top Z_v = w^\top\mu_0+\underbrace{(w^\top\Delta\mu)}_{\alpha}\,r_v.$$

The optimal coefficient $\alpha^\star$ that minimizes the mean squared error (MSE) is given by linear regression:
$$\alpha^\star=\frac{\operatorname{Cov}(Y,r)}{\operatorname{Var}(r)}=\frac{\pi(1-\pi)(2h-1)}{\frac{h(1-h)}{d}+\pi(1-\pi)(2h-1)^2}.$$
Since $\alpha^\star$ is a function of the homophily ratio $h$, the optimal weight vector $w^\star$ also depends on $h$. Consequently, a parameter $w^\star$ learned with $h_{tr}$ is suboptimal for a test distribution with $h_{te}\neq h_{tr}$, causing model instability.
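
As a quick numeric illustration of this dependence (with hypothetical values of $\pi$ and $d$), the optimal coefficient changes substantially, and can even flip sign, as $h$ moves:

```python
def alpha_star(h, pi=0.1, d=10):
    # alpha* = Cov(Y, r) / Var(r), as derived above
    cov = pi * (1 - pi) * (2 * h - 1)
    var = h * (1 - h) / d + pi * (1 - pi) * (2 * h - 1) ** 2
    return cov / var

print(alpha_star(0.9))   # high-homophily training graph  -> ~ 1.08
print(alpha_star(0.4))   # lower-homophily test graph     -> ~ -0.65 (sign flips)
```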

Theorem 2: If there exists an $\epsilon$ such that $|h_{te}-h_{tr}|\leq\epsilon$, then the gap between the empirical loss and the optimal loss on the test set is bounded; specifically, $L(h_{te};\alpha)-L^\star(h_{te})\leq V_{\max}L_f^2\epsilon^2$, where $L_f$ is the Lipschitz constant of $f$.

Define
$$C(h)=\operatorname{Cov}(Y,r)=\pi(1-\pi)(2h-1),\qquad V(h)=\operatorname{Var}(r)=\frac{h(1-h)}{d}+\pi(1-\pi)(2h-1)^2.$$

The MSE loss is
$$L(h;\alpha)=\mathbb{E}\!\left[(Y-\hat{Y})^2\mid h\right]=V(h)-2\alpha C(h)+\alpha^2 V(h).$$

The optimal $\alpha$ for a given $h$ is $\alpha^\star=f(h)=\frac{C(h)}{V(h)}$, with corresponding optimal loss $L^\star(h)=V(h)-\frac{C(h)^2}{V(h)}$.

We assume that HEM trains on the interval $\mathcal{H}_\epsilon=[h_{tr}-\epsilon,\,h_{tr}+\epsilon]$, which encompasses $h_{te}$.

Define

  • $f_{\min}=\inf_{h\in\mathcal{H}_\epsilon} f(h)$
  • $f_{\max}=\sup_{h\in\mathcal{H}_\epsilon} f(h)$
  • $V_{\max}=\sup_{h\in\mathcal{H}_\epsilon} V(h)$

The loss can be rewritten as
$$L(h;\alpha)=L^\star(h)+V(h)\,(\alpha-f(h))^2.$$

Therefore, we can bound the amount by which the empirical loss exceeds the optimal loss:
$$L(h;\alpha)-L^\star(h)\;\leq\;\sup_{h\in\mathcal{H}_\epsilon} V(h)\,(\alpha-f(h))^2\;\leq\;V_{\max}\,\sup_{h\in\mathcal{H}_\epsilon}(\alpha-f(h))^2.$$

Since the objective of the adversarial training process in HEM is to minimize the loss in the worst case, we further assume HEM minimizes this upper bound. This implies that $\alpha$ is chosen as the midpoint of the range of $f(h)$:
$$\alpha=\frac{f_{\min}+f_{\max}}{2},$$
which gives
$$L(h;\alpha)-L^\star(h)\;\leq\;V_{\max}\left(\frac{f_{\max}-f_{\min}}{2}\right)^2.$$

If $f$ is Lipschitz continuous,
$$f_{\max}-f_{\min}\;\leq\;L_f\bigl((h_{tr}+\epsilon)-(h_{tr}-\epsilon)\bigr)=2L_f\epsilon.$$

Finally,
$$L(h;\alpha)-L^\star(h)\;\leq\;V_{\max}\left(\frac{2L_f\epsilon}{2}\right)^2=V_{\max}(L_f\epsilon)^2=V_{\max}L_f^2\epsilon^2.$$

Theorem 2 shows that HEM ensures stability through adversarial training on disentangled embeddings, yielding a bounded test loss; traditional methods lack this guarantee and may thus suffer an unbounded loss.
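
The bound can also be checked numerically on a grid over $\mathcal{H}_\epsilon$; below is a minimal sketch using illustrative values of $\pi$, $d$, $h_{tr}$, and $\epsilon$, with $L_f$ estimated as the maximum slope of $f$ on the grid.

```python
import numpy as np

pi, d, h_tr, eps = 0.1, 10, 0.8, 0.1  # illustrative values

def C(h): return pi * (1 - pi) * (2 * h - 1)
def V(h): return h * (1 - h) / d + pi * (1 - pi) * (2 * h - 1) ** 2
def f(h): return C(h) / V(h)

hs = np.linspace(h_tr - eps, h_tr + eps, 10_001)
alpha = (f(hs).min() + f(hs).max()) / 2           # midpoint choice from the proof
gap = np.max(V(hs) * (alpha - f(hs)) ** 2)        # sup_h [ L(h; alpha) - L*(h) ]
L_f = np.max(np.abs(np.gradient(f(hs), hs)))      # (numerical) Lipschitz constant of f

print(gap, "<=", V(hs).max() * (L_f * eps) ** 2)  # V_max * L_f^2 * eps^2
```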

Q3

The dataset is divided by the homophily ratio of nodes, split into 50%/10%/20%/20% for training, validation, in-distribution test, and out-of-distribution (OOD) test sets, respectively. The 20% of nodes with the lowest homophily ratios form the OOD test set.
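
A minimal sketch of such a split, assuming per-node homophily ratios have already been computed (function and variable names are illustrative, not the authors' released code):

```python
import numpy as np

def split_by_homophily(homophily, train=0.5, val=0.1, id_test=0.2, seed=0):
    """Bottom (1 - train - val - id_test) fraction of nodes by homophily ratio
    forms the OOD test set; the remaining nodes are shuffled into
    train / validation / in-distribution test."""
    n = len(homophily)
    order = np.argsort(homophily)                      # ascending homophily ratio
    n_ood = n - int(round(n * (train + val + id_test)))
    ood_idx = order[:n_ood]                            # lowest-homophily nodes
    rest = np.random.default_rng(seed).permutation(order[n_ood:])
    n_tr, n_val = int(round(n * train)), int(round(n * val))
    return rest[:n_tr], rest[n_tr:n_tr + n_val], rest[n_tr + n_val:], ood_idx
```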

Q4

To ensure that the model remains stable under minimal distributional shift, we report its performance on the test sets without/with distribution shift (w/o DS and w/ DS). As shown, our model achieves state-of-the-art or near-SOTA performance even on the w/o DS test set.

L1

(1) We add a new baseline [ICML 2024] BAT [1], an uncertainty-aware framework.

| | amazon (w/o DS) | amazon (w/ DS) | yelp (w/o DS) | yelp (w/ DS) | tfinance (w/o DS) | tfinance (w/ DS) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| BAT | 0.9403 (0.0204) | 0.7226 (0.0116) | 0.6219 (0.0261) | 0.5493 (0.0049) | 0.9277 (0.0067) | 0.5635 (0.0175) |
| HEM | 0.9197 (0.0074) | 0.7491 (0.0049) | 0.6142 (0.0056) | 0.5845 (0.0101) | 0.9476 (0.0060) | 0.6604 (0.0029) |

(2) Key differences between HEM and GOOD-D:

  1. Problem setting:
  • GOOD-D focuses on unsupervised graph-level out-of-distribution (OOD) detection, where the goal is to distinguish whether an entire graph is in-distribution or OOD.
  • HEM is designed for node-level graph anomaly detection under structural distribution shift, specifically addressing the performance degradation caused by varying levels of homophily between training and test data.
  2. Disentanglement strategy:
  • In GOOD-D, disentanglement is implemented through a perturbation-free augmentation strategy that generates a structure view using pre-computed structural encodings and contrasts it with a feature view.
  • In HEM, we propose an ego-neighborhood disentangled encoder that explicitly separates the modeling of node features and neighborhood structures through an MLP and a GNN sharing the same encoder, allowing more targeted learning of invariant patterns for node-level prediction.
  3. Optimization objectives and mechanisms:
  • GOOD-D uses a hierarchical contrastive learning framework to capture in-distribution patterns and defines OOD scores based on multi-level inconsistencies.
  • HEM introduces a homophily-aware environment mixup that dynamically adjusts edge weights to create structurally diverse environments, and optimizes the encoder in an adversarial training setup to improve robustness across distribution shifts (a schematic illustration of continuous edge re-weighting is sketched below).
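
As referenced above, here is a schematic sketch of continuous edge re-weighting via a learned attention score. This illustrates the general idea only, not the authors' actual mixup module; class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class EdgeReweighter(nn.Module):
    """Score each edge from its endpoint embeddings and squash to (0, 1); the
    resulting continuous weights replace binary adjacency entries, so no extra
    augmented graphs need to be materialized."""
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z, edge_index):
        src, dst = edge_index                                  # [2, num_edges]
        score = self.att(torch.cat([z[src], z[dst]], dim=-1))  # [num_edges, 1]
        return torch.sigmoid(score).squeeze(-1)                # one weight per edge
```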

L2

To address this concern, we have supplemented our evaluation with the Rec@K metric. Since GAD is a highly imbalanced classification problem, we believe AUPRC and Rec@K are more informative than AUROC and accuracy.

  • Rec@K (Recall at K) measures the recall among the top-K highest-confidence predictions, where K equals the number of actual anomalies in the test set (a minimal implementation is sketched after the table below).

| | amazon (w/o DS) | amazon (w/ DS) | yelp (w/o DS) | yelp (w/ DS) | tfinance (w/o DS) | tfinance (w/ DS) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| GCN | 0.9018 (0.0101) | 0.6537 (0.0069) | 0.5195 (0.0024) | 0.4404 (0.0027) | 0.8957 (0.0022) | 0.575 (0.0089) |
| GAT | 0.9022 (0.0218) | 0.6699 (0.0083) | 0.5642 (0.0) | 0.4811 (0.0) | 0.8259 (0.0) | 0.5188 (0.0) |
| GraphSAGE | 0.8818 (0.0064) | 0.652 (0.0061) | 0.5892 (0.0023) | 0.5101 (0.0049) | 0.8867 (0.0169) | 0.5329 (0.0258) |
| BernNet | 0.8974 (0.0165) | 0.6553 (0.0184) | 0.5233 (0.0021) | 0.4789 (0.0025) | 0.8659 (0.0077) | 0.6231 (0.0018) |
| PCGNN | 0.8955 (0.0064) | 0.6667 (0.0046) | 0.5089 (0.0012) | 0.4314 (0.0008) | 0.8008 (0.0144) | 0.5336 (0.0069) |
| GHRN | 0.8955 (0.0) | 0.6293 (0.0) | 0.5496 (0.0054) | 0.5113 (0.0057) | 0.8753 (0.0069) | 0.6016 (0.0028) |
| BWGNN | 0.9 (0.0) | 0.639 (0.0) | 0.5498 (0.0083) | 0.5033 (0.0037) | 0.8776 (0.0084) | 0.6075 (0.0072) |
| V-REx | 0.8924 (0.0021) | 0.6276 (0.0092) | 0.5494 (0.0025) | 0.5137 (0.0054) | 0.8682 (0.0019) | 0.5972 (0.0028) |
| GroupDRO | 0.8955 (0.0) | 0.6163 (0.0061) | 0.5563 (0.0054) | 0.5109 (0.004) | 0.8863 (0.0029) | 0.6053 (0.0031) |
| SRGNN | 0.9 (0.0037) | 0.6341 (0.004) | 0.5431 (0.0058) | 0.5079 (0.0059) | 0.8792 (0.0073) | 0.592 (0.011) |
| GDN | 0.8939 (0.0021) | 0.6488 (0.0105) | 0.4934 (0.0166) | 0.4821 (0.0164) | 0.8573 (0.0022) | 0.5802 (0.0091) |
| BAT | 0.9018 (0.0121) | 0.6699 (0.0083) | 0.568 (0.0264) | 0.5137 (0.0214) | 0.8883 (0.0154) | 0.5654 (0.0072) |
| HEM | 0.9015 (0.0021) | 0.678 (0.008) | 0.561 (0.0068) | 0.5263 (0.0028) | 0.9043 (0.0022) | 0.6312 (0.001) |
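
For reference, a minimal implementation of the Rec@K metric as defined above (a hypothetical helper, not the authors' evaluation code):

```python
import numpy as np

def recall_at_k(y_true, scores):
    """Recall among the top-K highest-confidence predictions, with K equal to the
    number of actual anomalies in the test set."""
    y_true = np.asarray(y_true)
    k = int(y_true.sum())
    top_k = np.argsort(np.asarray(scores))[::-1][:k]
    return y_true[top_k].sum() / max(k, 1)
```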

L3

Although anomaly detection involves significant class imbalance, it can still be formulated as a classification task. Therefore, general GNNs like GCN and GAT can be directly applied. Unlike prior works that adopt naive implementations of these GNNs as weak baselines, we follow the insights from [2] by incorporating residual connections and Layer Normalization. These enhancements significantly improve the performance of classic GNNs in GAD tasks.
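
To make this kind of adaptation concrete, here is a minimal sketch (not the authors' exact implementation) of a GCN-style anomaly detector with residual connections and Layer Normalization, using a dense normalized adjacency for brevity; it produces one anomaly logit per node for training with a binary cross-entropy loss.

```python
import torch
import torch.nn as nn

class ResGCNLayer(nn.Module):
    """One GCN layer with a residual connection and LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adj_norm):              # adj_norm: D^{-1/2} (A + I) D^{-1/2}
        return self.norm(x + torch.relu(adj_norm @ self.lin(x)))

class GCNAnomalyDetector(nn.Module):
    """Stack of residual GCN layers followed by a linear head producing one
    anomaly logit per node (train with nn.BCEWithLogitsLoss)."""
    def __init__(self, in_dim, hid_dim, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.layers = nn.ModuleList([ResGCNLayer(hid_dim) for _ in range(num_layers)])
        self.head = nn.Linear(hid_dim, 1)

    def forward(self, x, adj_norm):
        h = self.proj(x)
        for layer in self.layers:
            h = layer(h, adj_norm)
        return self.head(h).squeeze(-1)
```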

[1] Liu Z, Qiu R, Zeng Z, et al. Class-imbalanced graph learning without class rebalancing[J]. arXiv preprint arXiv:2308.14181, 2023.

[2] Luo Y, Shi L, Wu X M. Classic gnns are strong baselines: Reassessing gnns for node classification[J]. Advances in Neural Information Processing Systems, 2024, 37: 97650-97669.

Comment

I appreciate the thoughtful rebuttal addressing most of my concerns. I decided to raise my score.

Review
Rating: 5

This paper addresses the problem of graph anomaly detection (GAD) under structural distribution shifts, where models trained on data with high homophily perform poorly on test data with low homophily. The authors argue that existing methods learn "homophilic shortcuts" that fail to generalize. To tackle this, they propose HEM (Ego-Neighborhood Disentangled Encoder with Homophily-aware Environment Mixup). Experiments on three real-world datasets show that HEM outperforms a range of baselines, especially in the presence of distribution shift.

Strengths and Weaknesses

pros:

  1. The paper provides a well-motivated approach to addressing structural distribution shifts.
  2. The proposed HEM framework is novel in its approach to this problem. The combination of a disentangled encoder to separate ego and neighborhood information with a learnable, homophily-aware data augmentation module is a clever and well-integrated solution.
  3. The experimental setup is thorough. The authors compare HEM against a comprehensive set of baselines, including general GNNs, specialized GAD methods, and both general and graph-specific out-of-distribution (OOD) methods. The results consistently demonstrate the superiority of HEM on test sets with distribution shifts, validating the effectiveness of the proposed approach.

cons:

  1. While generally well-written, the overall framework is quite complex, involving multiple modules and loss functions. The adversarial training dynamic could be explained more intuitively. The term "environment" is used to describe the result of re-weighting edges on a single graph, which might be slightly confusing for readers familiar with OOD literature where environments are often distinct datasets.
  2. The paper claims to disentangle feature and structural representations. However, this is achieved structurally by using separate encoders. The paper lacks a quantitative analysis to prove that the representations are truly disentangled (e.g., showing that the neighborhood embedding is more sensitive to structural changes while the ego embedding remains stable). The benefit is only demonstrated through downstream performance.

Questions

  1. The homophily-aware environment mixup learns to re-weight edges to create more challenging training scenarios. How does the framework ensure that the generated edge weights lead to meaningful and realistic low-homophily structures, rather than simply learning a trivial adversarial perturbation that inverts embeddings but does not reflect real-world structural shifts?
  2. The core problem is framed as generalizing from high-homophily training data to low-homophily test data. How would HEM be expected to perform in the reverse scenario (training on low-homophily data and testing on high)? Furthermore, how does it compare to methods designed specifically for static heterophilic graphs, where low homophily is the default and not the result of a distribution shift?

Limitations

Yes

Final Justification

I maintain my review.

Formatting Issues

No.

Author Response
  1. Our method directly confronts this issue by creating augmented environments with diverse homophily distributions. The key insight is that these augmented samples do not need to perfectly mirror realistic data. Their strategic purpose is to create challenging scenarios that break the spurious connection between the proportion of anomalous nodes in a node's neighborhood and the node labels in high-homophily data. By forcing the model to perform well across generated environments—from high to low homophily—we compel it to learn features that are robust and invariant to the graph's homophily level. This process prevents the model from learning a trivial adversarial perturbation (e.g., simply inverting embeddings), as such a simplistic strategy would perform poorly on the original, un-augmented training environments. Ultimately, this approach produces a model that generalizes more effectively to test sets with unknown or different structural properties.

  2. Although our primary motivation was to address the homophily distribution shift observed when moving from high-homophily to low-homophily data, our method does not explicitly minimize the homophily ratio of the augmented data during the training phase. Instead, our adversarial training framework aims to generate more challenging samples, thereby enhancing the model's generalization ability.

Comment

Thanks for your reply. I will maintain my positive score.

Review
Rating: 5

This paper tackles how Graph Anomaly Detection (GAD) models fail when there are structural distribution shifts between training and test data. It points out that models often learn homophilic shortcuts from biased training data, which hurts their performance on test data with different structures. To fix this, the paper introduces a new framework called HEM (Ego-Neighborhood Disentangled Encoder with Homophily-aware Environment Mixup). HEM's main ideas are: an Ego-Neighborhood Disentangled Encoder that processes a node's own features separately from its neighborhood's structure, and a Homophily-aware Environment Mixup module that cleverly re-weights graph edges to create diverse training examples. The whole system is trained adversarially, pushing the encoder to learn stable patterns while the mixup module creates tough, new graph structures. The experiments show that HEM works much better than other methods, especially when the graph structure shifts.

Strengths and Weaknesses

pros

Significance: The paper tackles a really important and often overlooked problem in GAD—how to make models generalize when the graph structure changes. This is a big deal for real-world uses like finding fraudsters or bots, where the data is always biased.

Originality: The Homophily-aware Environment Mixup idea is quite innovative. Instead of making multiple copies of the graph, which eats up memory, it just adjusts edge weights. This is a much more efficient and fine-grained way to create the diverse data needed for robust training.

Quality: The technical work is solid. The method makes sense, and the experiments are thorough. You compared HEM against a wide variety of relevant models and showed it consistently comes out on top, particularly in the most challenging scenarios. The ablation studies also clearly prove that each part of your framework is necessary.

Clarity: The paper is well-written and easy to understand. You do a great job explaining the problem and your solution. Figure 1 is especially helpful for getting a quick overview of the whole system.

cons

No Limitations Section: The paper would be stronger if it discussed its own limitations. For example, the model's performance seems to depend quite a bit on the ρ hyperparameter, and it'd be good to acknowledge the challenge of tuning this in practice.

Homophily-aware Link: The connection between the name "homophily-aware" and the mechanism could be clearer. Your mixup module makes the training examples "harder," but it's not fully explained how this ensures a diversity of homophily levels, rather than just always making homophily lower.

Framework Generalizability: You use a strong, specific GNN (BWGNN) as the backbone. It would be great to see if the HEM framework provides the same benefits when used with other common GNNs like GCN or GAT. This would show that the framework itself is what's providing the boost.

Questions

1 Your mixup module aims to create "diverse" environments, but the objective function focuses on making "hard" examples for the classifier. How do you ensure this creates a true diversity of structures, rather than just pushing all augmented graphs to be low-homophily, for example?

2 You used BWGNN as the neighborhood encoder. How much do your results depend on this specific choice? It would strengthen your claims if you showed that the HEM framework also improves the performance of other backbones like GCN or GAT.

Limitations

yes

Final Justification

This paper is well-written and highly inspiring. The few concerns were satisfactorily addressed in the rebuttal. I believe it fully meets the required standards. Therefore, I give an "Accept."

Formatting Issues

None.

Author Response
  1. While our motivation stemmed from the need to mitigate the homophily distribution shift between high- and low-homophily data, our method does not directly minimize the homophily ratio of the augmented data during training. Rather, our adversarial training framework is designed to create an environment with harder samples, which implicitly fosters improved model generalization.

  2. When we change the neighborhood encoder from BWGNN to GCN, HEM still improves performance:

| | amazon (w/o DS) | amazon (w/ DS) | yelp (w/o DS) | yelp (w/ DS) | tfinance (w/o DS) | tfinance (w/ DS) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| GCN | 0.9445 (0.0097) | 0.6990 (0.0213) | 0.5363 (0.0094) | 0.4079 (0.0173) | 0.9366 (0.0010) | 0.6033 (0.0041) |
| HEM (+GCN) | 0.9370 (0.0011) | 0.7059 (0.0009) | 0.5509 (0.0047) | 0.5205 (0.0100) | 0.9569 (0.0008) | 0.6497 (0.0011) |

Comment

Thanks for the authors' detailed responses. I will keep my positive score.

Review
Rating: 5

This paper addresses the problem of graph anomaly detection (GAD) under structural distribution shifts, a scenario where models trained on graphs with high homophily perform poorly on test graphs with lower homophily due to selection bias in real-world data. The authors propose a novel framework, HEM (Ego-Neighborhood Disentangled Encoder with Homophily-aware Environment Mixup), to tackle this challenge. The framework consists of two key contributions:

  1. An Ego-Neighborhood Disentangled Encoder that separates the learning of a node's own features from its neighborhood's structural information. This is designed to capture invariant patterns from node features while isolating them from the spurious correlations that arise from shifting structural patterns.
  2. A Homophily-aware Environment Mixup module that adversarially generates diverse training environments by re-weighting graph edges. This forces the encoder to learn representations that are robust across different structural distributions in a memory-efficient manner.

Through an adversarial training process, the model learns to generalize better to out-of-distribution data. Extensive experiments on three real-world datasets demonstrate that HEM significantly outperforms a wide range of baselines, achieving state-of-the-art performance for GAD under structural distribution shift.

Strengths and Weaknesses

Strengths:

  • The paper tackles a critical and practical problem in graph machine learning. Structural distribution shift is a common issue in real-world applications like financial fraud and bot detection, and the paper clearly motivates how homophily shortcuts can lead to poor generalization. This work represents a significant step towards creating more robust GAD systems.
  • The proposed HEM framework is novel and well-conceived. The combination of a disentangled encoder and an adversarial environment generator is an original approach to this problem. The environment mixup mechanism, which uses continuous edge re-weighting via an attention mechanism, is a particularly clever and efficient alternative to prior work like EERM that required manipulating multiple large adjacency matrices.
  • The paper demonstrates high quality in its methodology and evaluation. The proposed method is technically sound, and the components are well-integrated into a coherent adversarial learning framework. The experimental evaluation is comprehensive, comparing HEM against an extensive list of 13 baselines across four relevant categories (General GNNs, GAD-specific GNNs, General OOD, and Graph OOD methods)
  • The paper is well-written, clearly structured, and easy to follow. The problem is precisely formulated, and the proposed solution is motivated and explained in detail.

Weaknesses:

  • While the paper's motivation hinges on discovering "invariant patterns", the analysis does not offer a qualitative look into what these patterns are. The case study in Figure 3 shows that the mixup module successfully alters the graph's homophily ratio, but a deeper analysis (e.g., embedding visualization) to show that the ego-embeddings are indeed more "invariant" than the neighborhood embeddings across different distributions would be highly insightful.

Questions

  1. The core motivation is to learn "invariant patterns" by disentangling ego and neighborhood representations. Could you provide a more direct, qualitative analysis to support this? For instance, using an embedding visualization technique (like t-SNE) to show that the ego-embeddings ($H_{ego}$) for a given class (anomalous/normal) are clustered more tightly across the "w/o DS" and "w/ DS" domains compared to the neighborhood embeddings ($H_{ne}$)? This would provide strong evidence that the disentanglement is working as intended.
  2. Your environment mixup module is positioned as a memory-efficient alternative to methods like EERM. While EERM resulted in an "Out of Memory" (OOM) error on most datasets, it did run on the Amazon dataset. Could you provide a quantitative comparison of the training time and peak memory usage of HEM versus EERM on the Amazon dataset to concretely substantiate this advantage?
  3. You claim that the ego-neighborhood disentangled encoder is a general framework applicable to various GNNs. Have you performed any experiments by substituting the BWGNN neighborhood encoder with other GNN architectures (e.g., GCN, GAT, or GraphSAGE)? Showing that the HEM framework still provides a significant boost would greatly strengthen this claim of generality.
  4. The adversarial training framework involves an inner and outer loop. Can you comment on the stability and convergence of this training process? How sensitive is the model's final performance to the number of training steps in the inner loop (for the environment mixup) versus the outer loop (for the encoder)?

Limitations

Yes.

Final Justification

It has addressed my concerns, and I will keep my score.

Formatting Issues

No.

Author Response
  1. Our primary objective with environment mixup is to mitigate the spurious connection between the proportion of anomalous nodes in the neighborhood and the label, a connection often induced by high homophily. The overarching goal of the HEM framework is to achieve invariant ego and neighborhood distributions. Consequently, our methodology aims to simultaneously enhance the performance concerning both ego and neighborhood representations. The overall improvement in performance, while evident in the final results, is also directly substantiated by the ablation study, where the significant performance degradation observed when operating "w/o END encoder" clearly demonstrates the crucial role of disentanglement.

  2. On the Amazon dataset:

| Model | Peak GPU Memory Usage | Training Time |
| :--- | :--- | :--- |
| EERM | 21.05 GB | 489.19 s |
| HEM | 0.92 GB | 45.67 s |

  3. When we change the neighborhood encoder from BWGNN to GCN, HEM still improves performance:

| | amazon (w/o DS) | amazon (w/ DS) | yelp (w/o DS) | yelp (w/ DS) | tfinance (w/o DS) | tfinance (w/ DS) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| GCN | 0.9445 (0.0097) | 0.6990 (0.0213) | 0.5363 (0.0094) | 0.4079 (0.0173) | 0.9366 (0.0010) | 0.6033 (0.0041) |
| HEM (+GCN) | 0.9370 (0.0011) | 0.7059 (0.0009) | 0.5509 (0.0047) | 0.5205 (0.0100) | 0.9569 (0.0008) | 0.6497 (0.0011) |

  4. We conducted an extensive analysis on the Amazon dataset to evaluate the sensitivity of HEM to the number of inner-loop and outer-loop steps (a schematic of this inner/outer adversarial loop is sketched after the tables below). Our findings indicate that the model is robust to the number of outer-loop steps as long as training is not excessively short. Conversely, the model is relatively more sensitive to the number of inner-loop steps: using a smaller number of inner-loop training steps generates augmented graphs with an appropriate level of difficulty, thereby facilitating easier model convergence.

| Inner Loop | amazon (w/o DS) | amazon (w/ DS) |
| :--- | :--- | :--- |
| 1 (default) | 0.9197 (0.0074) | 0.7491 (0.0049) |
| 3 | 0.9203 (0.0007) | 0.7193 (0.0023) |
| 5 | 0.9193 (0.0023) | 0.7182 (0.0101) |

| Outer Loop | amazon (w/o DS) | amazon (w/ DS) |
| :--- | :--- | :--- |
| 500 (default) | 0.9197 (0.0074) | 0.7491 (0.0049) |
| 300 | 0.9208 (0.0002) | 0.7411 (0.0011) |
| 100 | 0.9137 (0.0001) | 0.702 (0.0003) |
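
As referenced above, here is a schematic of the inner/outer adversarial loop being discussed (purely illustrative; `encoder`, `mixup`, `gad_loss`, and the optimizers are placeholders passed in by the caller, not the authors' API):

```python
def adversarial_train(encoder, mixup, gad_loss, data, opt_enc, opt_mix,
                      outer_steps=500, inner_steps=1):
    """gad_loss(encoder, graph) -> scalar detection loss on that graph."""
    for _ in range(outer_steps):
        # Inner loop: update the mixup module to MAXIMIZE the detection loss,
        # i.e. generate harder, structurally shifted training environments.
        for _ in range(inner_steps):
            opt_mix.zero_grad()
            (-gad_loss(encoder, mixup(data))).backward()
            opt_mix.step()
        # Outer loop: update the encoder to MINIMIZE the loss on the original
        # graph and on the augmented environment.
        opt_enc.zero_grad()
        (gad_loss(encoder, data) + gad_loss(encoder, mixup(data))).backward()
        opt_enc.step()
```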
Comment

Thank you for the response. It has addressed my concerns, and I will keep my score.

Final Decision

Summary of Reviews:

The paper introduces HEM (Ego-Neighborhood Disentangled Encoder with Homophily-aware Environment Mixup), a framework for tackling graph anomaly detection (GAD) under structural distribution shifts caused by selection bias. It addresses a critical problem in real-world GAD applications, such as fraud detection and bot identification. By disentangling node features from structural information and employing a homophily-aware mixup module to generate robust training environments, HEM improves generalization to out-of-distribution graphs. Experiments demonstrate its superiority over various baselines.

Strengths:

  • Novelty: Combines disentangled encoders with an adversarial mixup module, offering an efficient and innovative solution.
  • Comprehensive evaluation: Outperforms 13 baselines across multiple datasets, with strong ablation studies validating its components.
  • Efficiency: The mixup mechanism dynamically re-weights edges instead of creating multiple graph copies, saving memory.
  • Clarity: Well-written and structured, with clear problem formulation and solution explanation.

Weaknesses raised by reviewers:

  • Lack of qualitative analysis: No visualization or deeper analysis of the "invariant patterns" learned by the disentangled embeddings.
  • Complexity: The framework involves multiple modules and loss functions, which could be explained more intuitively.
  • Generalizability: The method relies on a specific GNN (BWGNN); its benefits with other GNNs (e.g., GCN, GAT) are not tested and explained.
  • Outdated baselines: Some recent GAD and OOD detection frameworks are missing from the comparisons (e.g., energy-based models).
  • Dependence on hyperparameters: Performance depends heavily on the tuning of certain parameters (e.g., ρ).

Conclusion: HEM presents an innovative and effective approach to GAD under structural distribution shifts, with strong experimental results and a practical design. Reviewer concerns included broader baseline comparisons, qualitative insights, and validation of generalizability across different GNN architectures. After the rebuttal period, these concerns were well addressed and the reviewers are largely satisfied. Therefore, publication is recommended, provided that the authors include all the discussions and revisions in the final version.