Node-wise Filtering in Graph Neural Networks: A Mixture of Experts Approach
Abstract
Reviews and Discussion
The paper identifies that real-world graphs exhibit a complex mix of homophilic and heterophilic patterns, meaning each node may display varying levels of homophily. Consequently, applying a single global filter leads to suboptimal performance. Using real-world graphs and synthetic graphs constructed with the CSBM model, the authors empirically and theoretically demonstrate that a global filter optimized for one pattern increases the loss for other patterns. To address these limitations, they propose Node-MoE: Node-Wise Filtering via Mixture of Experts.
Node-MoE dynamically applies distinct filters to individual nodes based on their specific patterns. The gating/routing model utilizes a composite vector, applying a GIN network to assign weights to each expert model for each node. Expert models are instantiated using ChebNetII/GCNII with fixed filter initializations, such as low-pass, high-pass, or constant filters. An additional smoothing loss is incorporated to enhance training stability and improve explainability. The proposed method is validated across multiple homophilic and heterophilic datasets and benchmarked against a diverse set of baselines.
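A minimal sketch may help make this node-wise MoE design concrete. Everything below is illustrative rather than the authors' implementation: the experts are stand-ins for ChebNetII-style filters (here a raw, low-pass, and high-pass view of the features), and the gate is a single GIN-like aggregation followed by an MLP; all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeMoE(nn.Module):
    """Illustrative node-wise mixture of experts (not the authors' code).

    Assumes num_experts == 3 so each expert pairs with one filtered view
    of the input (identity, low-pass, high-pass).
    """
    def __init__(self, in_dim, hid_dim, out_dim, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        # Gating model: GIN-like sum aggregation + MLP -> per-node weights.
        self.gate = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, num_experts),
        )

    def forward(self, x, adj_norm):
        w = F.softmax(self.gate(adj_norm @ x + x), dim=-1)   # [N, K]
        views = [x, adj_norm @ x, x - adj_norm @ x]          # per-expert filters
        outs = torch.stack(
            [exp(v) for exp, v in zip(self.experts, views)], dim=1
        )                                                    # [N, K, out_dim]
        return (w.unsqueeze(-1) * outs).sum(dim=1)           # soft node-wise mix
```

With top-1 gating, the softmax weights would be replaced by a hard argmax so each node is processed by a single expert (see the dispatch sketch later in the thread).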
Strengths
- Theoretical and empirical analysis highlighting the limitations of applying a single global filter.
- A suite of experiments across homophilic and heterophilic datasets demonstrating the effectiveness of the proposed method.
- Multiple ablation studies validating specific design choices, such as different gating variants.
Weaknesses
- A comparable framework, Link-MoE [1], was previously proposed for link prediction and bears a close resemblance to the overall framework of Node-MoE. However, this work is neither cited nor discussed in the related literature. Given the presence of this recent work, the novelty of the proposed method appears limited.
- While the paper argues that applying different filters to homophilic and heterophilic nodes separately can lead to linear separability, a similar effect might also be achieved by increasing the representation capacity of global filters, for instance by increasing the order of polynomial filters, or by learning piecewise polynomial filters [2]. Moreover, [3] explores node-wise filters via polynomial graph transformers. A comparison with these methods is necessary to evaluate the effectiveness of Node-MoE.
- Table 1 compares Node-MoE with ACM-GCN [4]. However, [4] also includes additional results for ACM-GCN+ and ACMII-GCN++, which achieve 76.08% (Node-MoE: 73.64%) on the Chameleon dataset and 69.98% (Node-MoE: 62.31%) on the Squirrel dataset, with similar improvements across other datasets. The + and ++ variants incorporate MLP(A) as additional features. Moreover, [5] demonstrated that incorporating structural information (A) as additional features alongside simple graph filters can yield notable improvements, reaching 77.48% on the Chameleon dataset and 74.17% on the Squirrel dataset. In light of these results, the necessity of an MoE-based approach remains unclear.
[1] Mixture of Link Predictors, NeurIPS24 (arXiv version available since 13 Feb 2024)
[2] A Piece-Wise Polynomial Filtering Approach for Graph Neural Networks, ECML22
[3] PolyFormer: Scalable Node-wise Filters via Polynomial Graph Transformer, KDD24
[4] Revisiting Heterophily for Graph Neural Networks, NeurIPS22 Supplementary:https://proceedings.neurips.cc/paper_files/paper/2022/file/092359ce5cf60a80e882378944bf1be4-Supplemental-Conference.pdf
[5] Simple Truncated SVD based Model for Node Classification on Heterophilic Graphs, DLG-KDD21
Questions
See weaknesses.
Additional questions.
Q1. The average training time per epoch for ChebNetII and Node-MoE-3/5 are quite similar. What is the end-to-end training time comparison between these methods?
Q2. In Appendix C.2, the strategies for polynomial coefficients include both increasing/decreasing powers and uniform values. How would initializing each filter randomly impact performance?
Q3. What are some limitations of this work?
Dear Reviewer N224,
We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.
W1: A comparable framework, Link-MoE [1], was previously proposed for link prediction and bears a close resemblance to the overall framework of Node-MoE. However, this work is neither cited nor discussed in the related literature. Given the presence of this recent work, the novelty of the proposed method appears limited.
A1: Thank you for pointing out Link-MoE [1], which utilizes the MoE framework for link prediction tasks. While both Link-MoE and Node-MoE leverage the MoE concept, they differ significantly in their motivations, expert designs, gating mechanisms, and training paradigms:
- Motivation: The primary motivation of Link-MoE is to use different GNN4LP models for different node pairs, as different node pairs may require distinct heuristics for effective link prediction. In contrast, the motivation of Node-MoE is to address the diverse structural patterns exhibited by nodes within a graph.
- Expert design: Link-MoE employs different GNN4LP models as experts to specialize in various link prediction heuristics. Node-MoE, on the other hand, uses the same base model as experts, allowing them to focus on learning distinct filters tailored to different node-level patterns.
- Gating design: Link-MoE uses predefined heuristics as inputs for the gating model, while Node-MoE leverages the feature distance to distinguish nodes exhibiting different structural patterns.
- Model training: Link-MoE follows a two-step training process, where the expert models are trained first, followed by the training of the gating model. In contrast, Node-MoE adopts an end-to-end training approach, simultaneously training the gating model and experts, which ensures better integration and optimization.
We have updated our paper to include a discussion of Link-MoE in the related work section, highlighting these differences and clarifying the unique contributions of Node-MoE.
W2: While the paper argues that applying different filters to homophilic and heterophilic nodes separately can lead to linear separability, a similar effect might also be achieved by increasing the representation capacity of global filters, for instance by increasing the order of polynomial filters, or by learning piecewise polynomial filters [2]. Moreover, [3] explores node-wise filters via polynomial graph transformers. A comparison with these methods is necessary to evaluate the effectiveness of Node-MoE.
A2: Thank you for pointing out these related works. While high-order polynomial filters can address complex graph structures, the optimal order of polynomial filters is the number of nodes $n$, which is infeasible for large graphs. Moreover, the lack of sufficient labels for training such high-order filters and the need to ensure generalization further limit their practicality. PP-GNN [2] addresses the over-smoothing issue of the monomial basis by approximating with piecewise polynomial filters. However, the approximation is suboptimal due to the complexity of the optimal filter, which still requires an order of $n$. PP-GNN [2] applies the same filter to all nodes, whereas our theoretical analysis shows that a single filter optimized for one type of node may negatively impact the performance of other node types. PolyFormer [3] applies multiple filters (where each polynomial basis can be viewed as a filter) to all nodes with an attention mechanism to combine the filtered results, which is similar in spirit to ACM-GCN and AutoGCN. In contrast, the proposed Node-MoE employs a gating model to adaptively select appropriate filters for different nodes.
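For reference, a global polynomial filter in the monomial basis has the form $y = \sum_k \theta_k \hat{A}^k x$. A minimal sketch (our own illustration, not PP-GNN's or PolyFormer's code) shows why capacity grows with the order while the coefficients remain shared by all nodes:

```python
import torch

def poly_filter(x, adj_norm, theta):
    """y = sum_k theta[k] * adj_norm^k @ x  (monomial basis).

    Raising len(theta) increases capacity, but the single coefficient
    vector theta is still shared by every node in the graph.
    """
    out = theta[0] * x
    prop = x
    for t in theta[1:]:
        prop = adj_norm @ prop     # one more propagation hop
        out = out + t * prop
    return out
```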
We further conducted experiments to compare the performance of Node-MoE with PP-GNN [2] and PolyFormer [3]. The following results further highlight the effectiveness of Node-MoE, particularly on heterophilic datasets.
| | Cora | CiteSeer | PubMed | Chameleon | Squirrel | Penn94 |
|---|---|---|---|---|---|---|
| PP-GNN | 89.15 ± 1.03 | 77.63 ± 1.28 | 88.97 ± 0.45 | 69.08 ± 2.05 | 50.92 ± 2.48 | 76.43 ± 0.78 |
| PolyFormer | 87.59 ± 1.29 | 76.13 ± 1.70 | 89.48 ± 0.48 | 53.87 ± 2.78 | 34.27 ± 1.65 | 74.82 ± 0.26 |
| Node-MoE | 89.38 ± 1.26 | 77.78 ± 1.36 | 89.58 ± 0.60 | 73.64 ± 1.80 | 62.31 ± 1.98 | 85.37 ± 0.31 |
Q3: What are some limitations of this work?
A6: While the proposed Node-MoE demonstrates strong performance and flexibility, we acknowledge the following limitations and propose future directions for improvement:
- Gating Model: The gating model is a critical component of Node-MoE, as it determines the assignment of experts to different nodes. In this work, we leverage the feature distance between a node and its neighbors as the input for the gating model, and we have theoretically demonstrated that this feature distance can differentiate homophilic and heterophilic nodes under the CSBM model (as shown in the revised Theorem 1). However, real-world graphs may exhibit more complex features, where the feature distance might not effectively separate homophilic and heterophilic nodes. Although Node-MoE consistently outperforms the single-expert baseline under the noisy-feature setting, as shown in Table 6, designing more sophisticated gating models could further improve performance.
- Expert Model: In this work, we use the same expert model with different learned filters. However, exploring diverse expert architectures tailored to specific node characteristics could lead to further improvements. For instance, if a node's feature alone can determine its label, an MLP-based expert could be more efficient. Investigating hybrid expert architectures could enhance both efficiency and performance.
- Applicability to Other Tasks: The primary focus of this work is on node classification. However, extending Node-MoE to other graph-based tasks, such as node regression and clustering, is a promising direction for future research. These tasks could benefit from the flexibility and adaptability of the Node-MoE framework.
We hope that we have addressed the concerns in your comments. Please kindly let us know if there are any further concerns, and we are happy to clarify.
I appreciate the authors' detailed response and their effort in addressing my concerns. In recognition of these efforts, I have raised my score from 3 to 5.
Reasons for not increasing the score further:
- While there are notable differences between Link-MoE and Node-MoE, these differences are largely expected due to the shift in the underlying task, i.e., from link prediction to node classification. While Node-MoE is an interesting extension, its framework novelty appears limited when compared to Link-MoE.
- The claim in lines 50-53 of the updated manuscript appears incorrect:
"There are few methods, such as ACM-GNN... However, applying a uniform type of filter, tailored for just one of these patterns, across all nodes may hurt the performance of other patterns."
[1] also observed in their Fig. 4 that different nodes exhibit varying levels of homophily. They propose ACMII, which uses Node-wise Adaptive Channel Mixing to better identify the most suitable filter/channel for each node.
Furthermore, results from Table A.2 in [1]'s supplementary section suggest that ACMII-GCN++ achieves strong performance using the same fixed splits (48%/32%/20%):
- Chameleon: 74.76±2.2
- Squirrel: 67.40±2.21
Additionally, ACMII-GCN++ reports 81.76±1.25 on CiteSeer using random splits (60%/20%/20%). These results make a MoE-based framework for node classification appear less compelling.
References:
[1] Revisiting Heterophily for Graph Neural Networks, NeurIPS22 Supplementary: https://proceedings.neurips.cc/paper_files/paper/2022/file/092359ce5cf60a80e882378944bf1be4-Supplemental-Conference.pdf
C3: Furthermore, results from Table A.2 in [1]'s supplementary section suggest that ACMII-GCN++ achieves strong performance using the same fixed splits (48%/32%/20%): Chameleon: 74.76±2.2, Squirrel: 67.40±2.21. Additionally, ACMII-GCN++ reports 81.76±1.25 on CiteSeer using random splits (60%/20%/20%). These results make a MoE-based framework for node classification appear less compelling.
R3: Thank you for bringing this to our attention. To ensure a fair comparison, we used the same dataloader and performed all baseline experiments, including ACM-GCN, with the hyperparameter search space provided in their original paper. For each data split, we ran the experiments three times and reported the average performance. The hyperparameter search space for ACM-GCN was taken directly from its appendix. Below, we present the results for ACMII-GCN, ACMII-GCN+, ACMII-GCN++, Node-MoE, and Node-MoE+:
| | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| ACMII-GCN | 88.69 ± 1.01 | 76.84 ± 1.71 | 68.14 ± 2.05 | 50.47 ± 1.49 |
| ACMII-GCN+ | 88.31 ± 1.08 | 76.43 ± 1.83 | 74.43 ± 1.68 | 64.56 ± 1.81 |
| ACMII-GCN++ | 88.33 ± 1.07 | 76.63 ± 1.92 | 74.39 ± 1.84 | 64.17 ± 2.14 |
| Node-MoE | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
| Node-MoE+ | 89.41 ± 1.28 | 77.90 ± 1.48 | 75.39 ± 1.34 | 66.30 ± 1.99 |
For the Chameleon dataset, we were able to reproduce results comparable to the reported performance of ACMII-GCN++. However, for the Squirrel dataset, our reproduced results are approximately three percentage points lower than the reported performance. We are reaching out to the authors for further reproduction details to ensure no steps were overlooked. It is important to note that the use of MLP(A) to improve performance on heterophilic datasets, as seen in ACMII-GCN+, is orthogonal to the contributions of Node-MoE. Node-MoE can also incorporate this trick. Node-MoE demonstrates robust performance on both homophilic and heterophilic datasets.
Regarding the Cora and CiteSeer datasets, while both Node-MoE and ACM-GCN use random splits, the specific splits differ. Previous studies [5, 6] have shown that the choice of training data splits can significantly impact performance, which may explain some discrepancies in the reported results.
[5] Wu, Yuexin, et al. "Active learning for graph neural networks via node feature propagation." arXiv preprint arXiv:1910.07567 (2019).
[6] Ma, Jiaqi, et al. "Partition-based active learning for graph neural networks." arXiv preprint arXiv:2201.09391 (2022).
We sincerely appreciate your thoughtful feedback, which has helped us better illustrate our method. If you have any further concerns or questions, we would be more than happy to discuss them.
Dear Reviewer N224,
Thank you for your prompt reply. We are happy to discuss your further concerns.
C1: While there are notable differences between Link-MoE and Node-MoE, these differences are largely expected due to the shift in the underlying task, i.e., from link prediction to node classification. While Node-MoE is an interesting extension, its framework novelty appears limited when compared to Link-MoE.
R1: We acknowledge that both Node-MoE and Link-MoE share the same divide-and-conquer principle, leveraging different experts to process different samples. Such a divide-and-conquer principle can indeed be traced back to the early MoE papers [2, 3] and has since been widely adopted in various domains, including NLP, CV, and RecSys [4].
For the node classification task, we provide a theoretical demonstration that a single filter may not be suitable for the entire graph, and nodes with different structural properties require different filters. This insight motivates the use of the MoE framework, which naturally aligns with the need to address this disparity. The MoE framework primarily consists of two components: the gating model and the experts. To address the unique challenges of node classification, we carefully designed both components in Node-MoE.
The motivations and designs of Link-MoE and Node-MoE are fundamentally distinct. In Node-MoE, we employ different filters to process nodes exhibiting diverse structural patterns, using the neighbor feature distance as input to the gating model for expert assignment. In contrast, Link-MoE leverages node pair heuristics to assign node pairs to the most suitable link predictors for link prediction tasks. These differences highlight that Node-MoE is not an extension of Link-MoE but rather a tailored solution to the specific challenges of node classification.
[2] Jacobs, Robert A., et al. "Adaptive mixtures of local experts." Neural computation 3.1 (1991): 79-87
[3] Jordan, Michael I., and Robert A. Jacobs. "Hierarchical mixtures of experts and the EM algorithm." Neural computation 6.2 (1994): 181-214.
[4] Cai, Weilin, et al. "A survey on mixture of experts." arXiv preprint arXiv:2407.06204 1 (2024).
C2: The claim in lines 50-53 of the updated manuscript appears incorrect: "There are few methods, such as ACM-GNN... However, applying a uniform type of filter, tailored for just one of these patterns, across all nodes may hurt the performance of other patterns." [1] also observed in their Fig. 4 that different nodes exhibit varying levels of homophily. They propose ACMII, which uses Node-wise Adaptive Channel Mixing to better identify the most suitable filter/channel for each node.
R2: Thank you for your thoughtful consideration. In lines 50-53, our intent was to illustrate that previous methods apply multiple filters to all nodes, and that applying the same filters to all nodes may lead to certain limitations. Below, we provide a more detailed explanation.
Post-fusion methods, including ACM-GCN, first apply multiple filters to all nodes and then combine the filtered features for final predictions. In ACM-GCN, an attention mechanism calculates the attention weights of each filter for a node based on its filtered features. However, such filtered features may make different nodes indistinguishable. Take Figure 1(a) in our paper as an example. The three leftmost nodes have different labels, but after applying a global filter, their filtered features become identical. If these identical filtered features are used as input to the attention mechanism, the attention mechanism may assign the same weights to all three nodes, leading to suboptimal filter assignments. To address this limitation, Node-MoE takes a different approach by using the neighbor feature distance of the original features as input to the gating model.
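A minimal sketch of this gating input (our reading of the description above; names are illustrative) computes, for each node, the distance between its raw features and the mean of its neighbors' raw features:

```python
import torch

def neighbor_feature_distance(x, edge_index):
    """Per-node distance ||x_v - mean_{u in N(v)} x_u|| over raw features.

    x: [N, d] node features; edge_index: [2, E] (source, target) pairs.
    """
    src, dst = edge_index
    n = x.size(0)
    ones = torch.ones(src.numel())
    deg = torch.zeros(n).index_add_(0, dst, ones).clamp(min=1)
    neigh_sum = torch.zeros_like(x).index_add_(0, dst, x[src])
    neigh_mean = neigh_sum / deg.unsqueeze(-1)
    return (x - neigh_mean).norm(dim=-1)      # [N], gating model input
```

Because this distance is computed on the original features rather than on filtered ones, identical post-filter representations (as in the Figure 1(a) example) do not collapse the gating input.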
We add further explanations in the Introduction section to help readers clearly distinguish our method from previous approaches.
W3: Table 1 compares Node-MoE with ACM-GCN [4]. However, [4] also includes additional results for ACM-GCN+ and ACMII-GCN++, which achieve 76.08% (Node-MoE: 73.64%) on the Chameleon dataset and 69.98% (Node-MoE: 62.31%) on the Squirrel dataset, with similar improvements across other datasets. The + and ++ variants incorporate MLP(A) as additional features. Moreover, [5] demonstrated that incorporating structural information (A) as additional features alongside simple graph filters can yield notable improvements, reaching 77.48% on the Chameleon dataset and 74.17% on the Squirrel dataset. In light of these results, the necessity of an MoE-based approach remains unclear.
A3: Thank you for pointing out the variants of ACM-GCN. As noted, ACM-GCN+ incorporates MLP(A) as additional features, which provides an enhancement over the original ACM-GCN. However, since this trick was not utilized by other baselines in our experiments, we did not initially include the variants of ACM-GCN in the comparison.
We would like to point out that the reported performance of ACM-GCN+ and ACMII-GCN++ (76.08% on the Chameleon dataset and 69.98% on the Squirrel dataset) is based on a random 60%/20%/20% split. In our experiments, we use the same fixed data split as [6, 7].
To ensure a fair comparison, we extended our approach by incorporating MLP(A) into the experts of Node-MoE, resulting in a variant denoted as Node-MoE+. Using the same experimental setup and hyperparameter tuning (based on the official code of the ACM-GCN paper), we compared Node-MoE+ against ACM-GCN+ and ACM-GCN++. The results show that Node-MoE+ also benefits from the inclusion of MLP(A) on heterophilic datasets, achieving competitive performance improvements. However, on homophilic datasets, both ACM-GCN+ and Node-MoE+ exhibit minimal performance gains.
| | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| ACM-GCN | 88.01 ± 1.26 | 76.52 ± 1.72 | 69.62 ± 1.22 | 57.02 ± 0.79 |
| ACM-GCN+ | 88.16 ± 1.36 | 77.12 ± 1.69 | 74.12 ± 1.78 | 65.65 ± 1.82 |
| ACM-GCN++ | 88.85 ± 1.05 | 77.17 ± 1.86 | 73.98 ± 1.91 | 64.74 ± 1.96 |
| Node-MoE | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
| Node-MoE+ | 89.41 ± 1.28 | 77.90 ± 1.48 | 75.39 ± 1.34 | 66.30 ± 1.99 |
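For concreteness, the MLP(A) trick used by the + and ++ variants can be sketched as follows. This is our assumed form (embedding each node's adjacency row with an MLP and concatenating it as extra structural features), not code from either paper:

```python
import torch
import torch.nn as nn

n, d, hid = 1000, 64, 64
A = (torch.rand(n, n) < 0.01).float()        # toy adjacency matrix
X = torch.randn(n, d)                        # toy node features
mlp_a = nn.Sequential(nn.Linear(n, hid), nn.ReLU(), nn.Linear(hid, hid))
X_aug = torch.cat([X, mlp_a(A)], dim=-1)     # [n, d + hid] augmented input
```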
HLP [5] considers learning filter coefficients as equivalent to learning a new graph and uses SVD to generate these new graphs. However, applying SVD to large graphs is computationally expensive, which may not be efficient for practical applications. Additionally, the data splits of the Chameleon and Squirrel datasets used in HLP [5] differ from those in our work and ACM-GCN, making direct performance comparisons challenging. Since HLP [5] does not provide code, we are unable to include numerical performance comparisons with their method. Nonetheless, the proposed Node-MoE demonstrates strong performance on both homophilic and heterophilic datasets, validating its robustness and effectiveness.
[6] Pei, Hongbin, et al. "Geom-gcn: Geometric graph convolutional networks." ICLR'20.
[7] Lim, Derek, et al. "Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods." NeurIPS'21.
Q1: The average training time per epoch for ChebNetII and Node-MoE-3/5 are quite similar. What is the end-to-end training time comparison between these methods?
A4: The training time per epoch for ChebNetII and Node-MoE-3/5 is quite similar, as shown in Table 2, because we adopt a Top-1 gating mechanism in Node-MoE, which ensures that each node is processed by only one expert. We adopt end-to-end training for Node-MoE with the same maximum number of epochs and select the results with the best epoch.
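The efficiency argument can be seen in a small dispatch sketch (our own illustration, treating experts as per-node functions for simplicity): with top-1 gating, each node passes through exactly one expert, so total expert compute stays close to that of a single model regardless of the number of experts.

```python
import torch

def top1_dispatch(gate_logits, experts, x):
    """Top-1 gating: route each node to its highest-scoring expert only.

    gate_logits: [N, K]; experts: list of K callables mapping [M, d] -> [M, c].
    """
    choice = gate_logits.argmax(dim=-1)               # [N] expert index per node
    out = None
    for k, expert in enumerate(experts):
        mask = choice == k
        if mask.any():
            y = expert(x[mask])                       # expert sees only its nodes
            if out is None:
                out = y.new_zeros(x.size(0), y.size(-1))
            out[mask] = y
    return out
```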
Q2. In Appendix C.2, the strategies for polynomial coefficients include both increasing/decreasing powers and uniform values. How would initializing each filter randomly impact performance?
A5: Thank you for the suggestion. We have tested the performance of the proposed Node-MoE with random filter initialization, denoted as Node-MoE-Random. The following results demonstrate that Node-MoE with our filter initialization strategy outperforms Node-MoE-Random across all datasets. However, it is noteworthy that Node-MoE-Random still performs better than the single expert ChebNetII, highlighting the robustness of the Node-MoE framework.
| | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| ChebNetII | 88.71 ± 0.93 | 76.93 ± 1.57 | 71.14 ± 2.13 | 57.12 ± 1.13 |
| Node-MoE-Random | 89.10 ± 1.01 | 77.51 ± 1.34 | 72.28 ± 2.90 | 61.78 ± 1.86 |
| Node-MoE | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
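As a hypothetical illustration of what such initialization strategies might look like (the exact coefficients in Appendix C.2 are not reproduced here; the base of the powers is assumed), compare fixed low-pass/high-pass/uniform schemes with the random baseline tested above:

```python
import torch

def init_filter_coeffs(kind, order=10):
    """Hypothetical fixed vs. random polynomial-coefficient initializations."""
    k = torch.arange(order + 1, dtype=torch.float)
    if kind == "low":                      # decreasing powers -> low-pass-like
        return 2.0 ** (-k)
    if kind == "high":                     # alternating signs -> high-pass-like
        return (-1.0) ** k * 2.0 ** (-k)
    if kind == "uniform":                  # constant weights
        return torch.ones(order + 1) / (order + 1)
    return 0.1 * torch.randn(order + 1)    # random baseline (Node-MoE-Random)
```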
I would like to thank the authors for their proactive engagement during the discussion period and for providing additional experiments and clarifications. Based on the presented results, the ACMII-GCN+/++ variants demonstrate competitive performance compared to Node-MoE, which somewhat diminishes the appeal of an MoE-based approach in this context. As such, I will maintain my current rating of 5.
That said, the current setup utilizes identical expert models differentiated only by filter initializations. To further strengthen the contribution of this work, I suggest exploring the use of diverse expert models—such as a combination of specialized GNN architectures tailored for homophily and heterophily. Demonstrating improvements with such an approach could significantly enhance the paper’s impact and practical relevance.
Dear Reviewer N224,
Thank you for the excellent suggestion. Following your recommendation, we explored the use of diverse expert models by selecting GCN and LinkX as two representative architectures tailored for homophilic and heterophilic patterns, respectively. We integrated these two models as the experts in Node-MoE. The results are summarized in the table below:
| | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| LinkX | 82.89 ± 1.27 | 70.05 ± 1.88 | 68.42 ± 1.38 | 61.81 ± 1.80 |
| GCN | 88.60 ± 1.19 | 76.88 ± 1.78 | 67.96 ± 1.82 | 54.47 ± 1.17 |
| Node-MoE(LinkX+GCN) | 88.92 ± 2.00 | 77.02 ± 2.03 | 73.84 ± 1.25 | 64.69 ± 1.84 |
GCN performs exceptionally well on homophilic datasets, while LinkX excels on heterophilic datasets. The proposed Node-MoE, utilizing both GCN and LinkX as experts, achieves better performance across both homophilic and heterophilic datasets. This demonstrates the flexibility and effectiveness of Node-MoE in leveraging diverse expert models to adapt to varying structural patterns within graphs.
Thank you for your suggestion and for helping us improve our paper. If you have any further questions or concerns, we would be happy to discuss them.
Best regards,
Authors
Thank you to the authors for their response and for providing additional evidence on the significance of using diverse experts. Based on these new experiments, I have revised my score to 6. However, a more thorough analysis is still needed to explore the criteria for selecting diverse experts and to evaluate their impact on overall latency and memory consumption.
Dear Reviewer N224,
Thank you for your great suggestions. In this work, we have demonstrated the potential of Node-MoE in effectively addressing the complex structural patterns commonly observed in real-world graphs. We agree that there is significant potential to further explore the effectiveness and efficiency of the proposed Node-MoE, particularly with respect to expert selection criteria and its impact on overall latency and memory consumption, as you have thoughtfully suggested. This also highlights the flexibility of the proposed Node-MoE, making it adaptable to different scenarios.
Thank you again for helping us improve our paper and for pointing out potential directions for further exploration.
Best regards,
Authors
In this paper, the authors address limitations in graph neural networks (GNNs) that arise when only a single global filter is used. They observe that graphs may contain a combination of homophilic and heterophilic patterns within the same graph, with varying levels of homophily across different subgraphs or communities. Based on the patterns generated by the CSBM model, they theoretically analyze the (near) linear separability of nodes when (1) a low-pass global filter is applied to the entire graph and (2) different filters are applied separately to nodes with different patterns. These theoretical insights lead to the development of a novel GNN model that employs a mixture-of-experts approach to adaptively select filters for nodes based on their specific patterns.
Strengths
originality
The originality of the paper is marginal, as it builds on well-known observations (see "weakness").
quality
The paper appears to be technically sound, presenting a nice theoretical analysis based on the CSBM model, which motivates the development of its proposed node-wise filtering method.
clarity
The paper is well-structured and generally easy to follow. However, Figure 1 could be improved, as it is currently a bit confusing. For instance, there are no numbers or labels on the right side of Figure 1(a), making it unclear which nodes correspond to those on the left side after applying the global filter. Also, what are the key messages of the figure? The figures in Section 4 (Experiments) could be enhanced; e.g., the font sizes for the x-axis and y-axis labels in Figures 6 and 7 are too small, which makes them difficult to read.
significance
The significance of the proposed method is limited due to its marginal originality.
Weaknesses
(W1) The observation of varying structural patterns (homophilic and heterophilic patterns within the same graph) is not highly original. While the paper provides a well-illustrated demonstration in Section 2.1, analyzing these patterns from two perspectives: node homophily density and homophily across different communities, this does not significantly enhance the originality. These structural variations are commonly known and have been addressed in previous works using various approaches as discussed in Section 5 Related Works.
(W2) The proposed method assumes that a heterophilic pattern is present when a node’s features significantly differ from those of its neighboring nodes. While this assumption is reasonable, it restricts the method's applicability. Node features do not always correlate with node labels in terms of homophilic and heterophilic patterns, which may reduce the method's effectiveness when feature-label alignment is weak. Although the authors argue that the proposed method is distinct from existing works because those approaches pass all nodes through all filters, these prior methods do not make specific assumptions about heterophilic patterns, instead allowing the filters to adaptively learn from data. Thus, the distinction between the current work and these prior methods seems more like a trade-off rather than a clear improvement.
(W3) The proposed method appears to be a combination of a gating function and a mixture of experts. When multiple GNNs with different filters are used, the model grows in complexity, introducing additional learnable parameters and hyperparameters (such as the weight for adjusting the influence of the filter smoothing loss, and K for the number of experts). This added complexity typically leads to performance improvements, so the modest gains shown in the experimental results are not surprising.
Questions
See the questions in "Weaknesses".
For the statement: "... different structural patterns are not uniformly distributed across the graph, and distinct communities may exhibit varying structural characteristics. To capitalize on this phenomenon, we employ GNNs with low-pass filters, such as GIN (Xu et al., 2018), for the gating model. These networks are chosen due to their strong community detection capabilities (Shchur & Günnemann, 2019; Bruna & Li, 2017), ensuring that neighboring nodes are likely to receive similar expert selections", I also have two questions:
- Could you elaborate on why GIN has strong community detection capabilities?
- Regarding "ensuring that neighboring nodes are likely to receive similar expert selections", does this apply to both homophilic and heterophilic graphs? If so, how would the approach differ between them? In heterophilic graphs, neighboring nodes are more likely to differ in features or labels, so how would similar expert selections across neighbors work?
Q2: Regarding "ensuring that neighboring nodes are likely to receive similar expert selections", does this apply to both homophilic and heterophilic graphs? If so, how would the approach differ between them? In heterophilic graphs, neighboring nodes are more likely to differ in features or labels, so how would similar expert selections across neighbors work?
A5: As shown in Figure 3 in Section 2.1, the community phenomenon exists in both homophilic and heterophilic graphs. In the proposed Node-MoE, we do not explicitly distinguish between homophilic and heterophilic graphs, as the gating model operates uniformly across both types.
It is true that neighboring nodes are more likely to differ in features or labels in a heterophilic graph. However, the gating model's objective is to assign nodes to different experts, where each expert is capable of handling nodes from different classes. Although neighboring nodes have different features or labels, they can still be processed by the same filter. For instance, as shown in Figure 1(b) of our paper, although the heterophilic nodes have different features and labels, the same high-pass filter can still distinguish them.
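A two-node toy example makes this concrete (our own illustration, not from the paper): with one heterophilic edge, a low-pass filter swaps the two class-indicative signals, while a shared high-pass filter preserves their signs.

```python
import torch

x = torch.tensor([[1.0], [-1.0]])   # class-indicative features of two nodes
adj = torch.tensor([[0.0, 1.0],     # a single heterophilic edge
                    [1.0, 0.0]])

low_pass = adj @ x                  # [[-1.], [ 1.]]: the two nodes get confused
high_pass = x - adj @ x             # [[ 2.], [-2.]]: class signs are preserved
print(low_pass.flatten().tolist(), high_pass.flatten().tolist())
```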
We hope that we have addressed the concerns in your comments. Please kindly let us know if there are any further concerns, and we are happy to clarify.
W3: The proposed method appears to be a combination of a gating function and a mixture of experts. When multiple GNNs with different filters are used, the model grows in complexity, introducing additional learnable parameters and hyperparameters (such as the weight for adjusting the influence of the filter smoothing loss, and K for the number of experts). This added complexity typically leads to performance improvements, so the modest gains shown in the experimental results are not surprising.
A3: The number of learnable parameters of the proposed method is similar to that of several baselines. To ensure a fair comparison, we assume that all models have a feature dimension and hidden size of $d$, and the number of layers is $L$. We omit a small number of parameters, such as the learnable filter parameters in models like GPR-GNN and ChebNetII, which are typically around 10 and do not significantly impact the comparison. For MLP, GCN, APPNP, GCNII, GPR-GNN, ChebNetII, AutoGCN, WRGCN, and PC-Conv, the trainable parameters are approximately $Ld^2$. For GAT and ASGAT with $h$ attention heads, the number of learnable parameters becomes $hLd^2$. ACM-GCN uses three filters with separate transformation matrices, resulting in $3Ld^2$, and LinkX includes three MLPs, also leading to $3Ld^2$. For the MoE-based methods, GMoE has $KLd^2$ parameters, where $K$ is the number of experts, and Mowst, which combines an MLP and a GCN, has $2Ld^2$ learnable parameters. For the proposed Node-MoE, the number of learnable parameters is also $KLd^2$, with $K = 3$ for most of the results in Table 1. Thus, the proposed Node-MoE has a comparable number of learnable parameters to several baselines. Furthermore, as shown in Table 2, the training time per epoch for Node-MoE with top-1 gating is comparable to that of a single expert model, ensuring that the additional parameters do not significantly inflate training costs.
We further conducted experiments with Node-MoE using 3 experts and a significantly reduced hidden dimension size, decreasing it from 64 to 21. This adjustment ensures that the proposed Node-MoE has a comparable number of learnable parameters to models like MLP and GCN. Despite the lower hidden dimension, the results demonstrate that Node-MoE maintains similar performance and continues to outperform the baselines.
| Hidden Size | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| Node-MoE 21 | 89.36 ± 1.00 | 77.93 ± 1.52 | 73.51 ± 1.77 | 60.93 ± 1.89 |
| Node-MoE 64 | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
Compared to the base expert model, the additional hyperparameters introduced by Node-MoE include the filter smoothing loss weight, the number of experts, and the top-k gating mechanism. As shown in Figure 9, the model demonstrates robustness to the choice of this weight, achieving stable performance across a wide range of values. Similarly, Table 4 shows that the proposed method achieves competitive performance even with as few as 2 experts, and Table 5 indicates that good performance is maintained with a top-1 gating mechanism. These results suggest that the proposed method is robust to its additional hyperparameters.
Q1: Could you elaborate on why GIN has strong community detection capabilities?
A4: GNNs are widely utilized in community detection and graph clustering tasks due to their powerful ability to represent graph-structured data effectively [1, 2, 3, 4]. Due to the community properties of nodes with different patterns as shown in Section 2.1, we choose GIN as an example for the gating model. As demonstrated in A1, GCN can also achieve comparable performance to GIN and both outperform MLP, further validating the effectiveness of GNN-based models in this context.
[1] Shchur, Oleksandr, and Stephan Günnemann. "Overlapping community detection with graph neural networks." arXiv preprint arXiv:1909.12201 (2019).
[2] Souravlas, Stavros, Sofia Anastasiadou, and Stefanos Katsavounis. "A survey on the recent advances of deep community detection." Applied Sciences 11.16 (2021): 7179.
[3] Bianchi, Filippo Maria, Daniele Grattarola, and Cesare Alippi. "Spectral clustering with graph neural networks for graph pooling." International conference on machine learning. PMLR, 2020.
[4] Tsitsulin, Anton, et al. "Graph clustering with graph neural networks." Journal of Machine Learning Research 24.127 (2023): 1-21.
Dear Reviewer s73g,
We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.
W1: The observation of varying structural patterns (homophilic and heterophilic patterns within the same graph) is not highly original. While the paper provides a well-illustrated demonstration in Section 2.1, analyzing these patterns from two perspectives: node homophily density and homophily across different communities, this does not significantly enhance the originality. These structural variations are commonly known and have been addressed in previous works using various approaches as discussed in Section 5 Related Works.
A1: We acknowledge that several prior studies have highlighted that real-world graphs contain nodes exhibiting variations in homophilic behavior, as we have also cited in Section 2.1. The previous methods usually apply the same filter or multiple filters to all nodes. Through our theoretical analysis, we demonstrate that applying the same filter to all nodes can lead to suboptimal performance, as a single filter optimized for one type of node may negatively impact others. This observation inspired our node-wise filtering approach, where different filters are applied to different nodes.
In Section 2.1, we further illustrate that homophily levels vary significantly across different structural communities, and nodes with varying homophily levels are not uniformly distributed across the graph. This phenomenon, to the best of our knowledge, has not been explicitly studied in prior works. This insight motivated the choice of our gating model, which is tailored to capture community properties. We demonstrate experimentally that GNN-based gating models usually outperform MLP-based ones, as shown in the table below:
| Gating Model | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| MLP | 89.26 ± 1.34 | 77.59 ± 1.27 | 72.70 ± 1.75 | 59.88 ± 2.33 |
| GCN | 89.28 ± 0.96 | 77.89 ± 1.29 | 73.29 ± 1.71 | 62.04 ± 1.92 |
| GIN | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
W2: The proposed method assumes that a heterophilic pattern is present when a node’s features significantly differ from those of its neighboring nodes. While this assumption is reasonable, it restricts the method's applicability. Node features do not always correlate with node labels in terms of homophilic and heterophilic patterns, which may reduce the method's effectiveness when feature-label alignment is weak. Although the authors argue that the proposed method is distinct from existing works because those approaches pass all nodes through all filters, these prior methods do not make specific assumptions about heterophilic patterns, instead allowing the filters to adaptively learn from data. Thus, the distinction between the current work and these prior methods seems more like a trade-off rather than a clear improvement.
A2: Thank you for raising the issue of feature-label alignment. While it is true that the effectiveness of our method depends on some level of alignment, this is a common assumption in graph-based models. For example, the CSBM model assumes that nodes from different classes exhibit different feature distributions. To further strengthen our approach, we have added a new theorem in the revised version of our paper to address the classification of homophilic and heterophilic nodes. The theorem and proof can be found in Section 2.2 and Appendix A of our paper. For your convenience, the new theorem is as follows:
Homophilic and heterophilic nodes can be separated, with high probability, based on the feature distance between a node and the average feature vector of its neighbors.
Specifically, we demonstrate that the feature distance can be used to effectively separate homophilic and heterophilic nodes in the CSBM model.
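A small Monte-Carlo sketch (our own illustration under assumed CSBM-style parameters, not the paper's exact setting) shows the effect: nodes whose neighbors mostly share their class have a much smaller neighbor feature distance than nodes whose neighbors mostly do not.

```python
import torch

torch.manual_seed(0)
n, d, sigma = 1000, 16, 0.5
mu0, mu1 = torch.zeros(d), torch.ones(d)        # assumed class means

def mean_neighbor_dist(p_same, num_neigh=10):
    """E[||x_v - mean_u x_u||] for class-0 nodes whose neighbors share the
    class with probability p_same. Monte-Carlo estimate over n nodes."""
    x_v = mu0 + sigma * torch.randn(n, d)
    same = torch.rand(n, num_neigh) < p_same
    neigh_mu = torch.where(same.unsqueeze(-1), mu0, mu1)
    neigh = neigh_mu + sigma * torch.randn(n, num_neigh, d)
    return (x_v - neigh.mean(dim=1)).norm(dim=-1).mean().item()

print("homophilic  (p_same=0.9):", mean_neighbor_dist(0.9))
print("heterophilic(p_same=0.1):", mean_neighbor_dist(0.1))
```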
To address weaker feature-label alignment scenarios, we have conducted additional experiments by introducing varying levels of Gaussian noise to the node features. The results, shown in Table 6 of our paper, reveal that as the noise level increases, the performance of both MLP and learnable filter methods like ChebNetII drops significantly. The performance gap between Node-MoE and the single-expert ChebNetII decreases. However, Node-MoE consistently outperforms the single expert, even under high noise levels. This is because, when noise is excessively high, the gating model may randomly assign nodes to different experts, causing the learned filters to converge to a performance similar to the single-expert model. Nevertheless, Node-MoE remains robust and effective across various noise levels, demonstrating its adaptability to weaker feature-label alignment scenarios. Furthermore, the proposed Node-MoE is highly flexible and can incorporate existing methods as experts, which also differentiates it from current methods.
This paper tackles the challenges faced by Graph Neural Networks (GNNs) in real-world graphs that exhibit both homophilic and heterophilic patterns, which complicates processing with a single global filtering approach. The authors propose a Mixture of Experts (MoE) based node classification method specifically designed to address these issues. By utilizing a node-level MoE, the proposed method effectively navigates the complexities of these patterns, demonstrating promising results through extensive experimental evaluations.
Strengths
The proposal of MoE-like node-level filtering is particularly interesting, as it enhances the method's suitability for handling real-world graphs with diverse patterns.
The authors provide a comprehensive analysis throughout the paper, complemented by a thorough set of experimental studies that effectively support their proposed approach.
Weaknesses
- Figure 1 serves as a poor illustrative example, as it may lead to confusion. The introduction in Section I states that "nodes tend to connect with others that share similar labels, reflecting homophilic patterns," yet Figure 1 shows both solid- and dashed-edge nodes connecting to nodes labeled 0 and 1 on the right side, which contradicts this description.
- The message and patterns presented in Figure 2 lack clarity regarding their relevance to the key contributions of the work. The statement that “the majority of nodes in homophilic graphs predominantly exhibit homophilic patterns” raises questions about the definition of homophilic graphs. According to the authors' definition in the Homophily metrics section, this concept only applies to graphs with different labels; otherwise, h(v) will always equal 1.
- The paper suffers from overloaded notations that can lead to confusion for readers. For instance, h(v_i) is used in Section 1 to denote the label similarity between a node v_i and its neighbors, while in Section 3.1, the same notation is repurposed to refer to a node classifier. This inconsistency may hinder comprehension.
Questions
see the details in the weaknesses part
Dear Reviewer VQRg,
We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.
W1: Figure 1 serves as a poor illustrative example, as it may lead to confusion. The introduction in Section I states that "nodes tend to connect with others that share similar labels, reflecting homophilic patterns," yet Figure 1 shows both solid- and dashed-edge nodes connecting to nodes labeled 0 and 1 on the right side, which contradicts this description.
A1: Thank you for pointing out the potential confusion regarding Figure 1. In the figure, we assume that a node exhibits a homophilic pattern if the majority of its neighbors share the same label as the node itself; otherwise, it exhibits a heterophilic pattern. For the solid-edge nodes, 2 out of their 3 neighbors have the same label, indicating homophilic patterns. Conversely, for the dashed-edge nodes, 2 out of their 3 neighbors have different labels, indicating heterophilic patterns. Thus, Figure 1 is consistent with the description in Section I. To avoid further confusion, we have revised the paper to clarify the assumptions and explanation associated with Figure 1, ensuring it aligns clearly with the text.
W2: The message and patterns presented in Figure 2 lack clarity regarding their relevance to the key contributions of the work. The statement that “the majority of nodes in homophilic graphs predominantly exhibit homophilic patterns” raises questions about the definition of homophilic graphs. According to the authors' definition in the Homophily metrics section, this concept only applies to graphs with different labels; otherwise, h(v) will always equal 1.
A2: It is true that the homophily metric applies only to graphs with nodes having different labels. The homophily metric for graphs is a global property, where a high graph-level homophily indicates that nodes generally tend to connect with others that share the same label. However, at the node level, it is not guaranteed that the majority of neighbors of every individual node in a homophilic graph will share the same label. Figure 2 illustrates the density of node-level homophily across the graph, showing that while most nodes in homophilic graphs predominantly exhibit homophilic patterns, some nodes still exhibit heterophilic patterns. This observation inspires our node-wise filtering approach, which applies different filters to nodes with homophilic and heterophilic patterns. To address potential confusion, we have revised the description of Figure 2 in the paper for greater clarity.
W3: The paper suffers from overloaded notations that can lead to confusion for readers. For instance, h(v_i) is used in Section 1 to denote the label similarity between a node v_i and its neighbors, while in Section 3.1, the same notation is repurposed to refer to a node classifier. This inconsistency may hinder comprehension.
A3: Thank you for pointing out the issue with overloaded notations. To address this, we have updated the notation for the classifier in Section 3.1 to Classifier for clarity. Additionally, we have reviewed the other notations throughout the paper to ensure consistency and avoid potential confusion.
We hope that we have addressed the concerns in your comments. Please kindly let us know if there are any further concerns, and we are happy to clarify.
Thank you for your responses to the comments and for providing clarifications regarding the homophilic and heterophilic patterns in the work. I appreciate the effort you've put into revisiting the corresponding sections of the paper and enhancing the illustrative examples in the figures.
While these revisions certainly help in clarifying the concepts presented, I still have concerns regarding the lack of a clear quantitative definition of these patterns, especially when it comes to extending the proposed method to practical applications that often involve multiple categorical labels. This aspect is critical for broader applicability and understanding of your approach. Given this ongoing concern, I believe the current scores remain valid.
Dear Reviewer VQRg,
Thank you for your prompt reply and for highlighting your ongoing concerns. We are happy to address them further.
While these revisions certainly help in clarifying the concepts presented, I still have concerns regarding the lack of a clear quantitative definition of these patterns, especially when it comes to extending the proposed method to practical applications that often involve multiple categorical labels. This aspect is critical for broader applicability and understanding of your approach. Given this ongoing concern, I believe the current scores remain valid.
Although we use a two-class example in Figure 1 for simplicity, the concepts of homophily and heterophily naturally extend to graphs with multiple categorical labels. There are mainly two levels of homophily:
Node-level homophily: Node-level homophily quantifies the similarity of a node's neighbors to itself in terms of labels and is mathematically defined as $h(v_i) = \frac{|\{u \in \mathcal{N}(v_i) : y_u = y_{v_i}\}|}{|\mathcal{N}(v_i)|}$, where $\mathcal{N}(v_i)$ is the neighbor set of node $v_i$, and $y_u, y_{v_i}$ denote the labels. This definition accommodates graphs with multiple categorical labels.
Graph-level homophily: Graph-level homophily provides a global measure for the entire graph and can be defined in various ways. For example, one approach is the average node-level homophily [1]: $h(\mathcal{G}) = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} h(v_i)$, which is often referred to as node homophily. (In Section 2 of our paper, we adopt the term node homophily to refer to this graph-level homophily, which might have caused some ambiguity. We will revise the concepts to emphasize the node-level or graph-level homophily.) Another common definition of graph-level homophily is the ratio of edges connecting nodes with the same label [2]. Typically, graphs are categorized based on graph-level homophily, where graphs with $h(\mathcal{G}) > 0.5$ are considered homophilic graphs, and those with $h(\mathcal{G}) < 0.5$ are heterophilic graphs.
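A short sketch computing both quantities from an edge list (illustrative helper names, matching the definitions above) may be useful:

```python
import torch

def node_homophily(edge_index, y):
    """h(v): fraction of v's neighbors sharing v's label (any #classes)."""
    src, dst = edge_index
    same = (y[src] == y[dst]).float()
    n = y.numel()
    deg = torch.zeros(n).index_add_(0, dst, torch.ones_like(same))
    return torch.zeros(n).index_add_(0, dst, same) / deg.clamp(min=1)

def graph_homophily(edge_index, y):
    """Average node-level homophily [1]; the edge-homophily variant [2]
    would be (y[src] == y[dst]).float().mean() instead."""
    return node_homophily(edge_index, y).mean()
```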
In the literature, most prior works focus predominantly on graph-level homophily, often applying different methods for homophilic and heterophilic datasets. For homophilic datasets, it is generally assumed that nodes tend to connect with others that share the same labels, leading to the use of a global low-pass filter to process all nodes in the graph. However, our work takes a different perspective by emphasizing node-level homophily. Even in homophilic graphs, the homophily level of individual nodes, $h(v_i)$, can vary significantly. As shown in Figure 2 in our paper, in homophilic datasets such as Cora and CiteSeer, while the majority of nodes have $h(v_i) \geq 0.5$, a considerable number of nodes exhibit $h(v_i) < 0.5$. (The Cora dataset has 7 categorical labels, while the CiteSeer dataset has 6 categorical labels.)
To account for these variations, we follow the approach in [3] and categorize nodes based on their homophily levels: nodes with $h(v_i) \geq 0.5$ are considered to exhibit homophilic patterns, while those with $h(v_i) < 0.5$ are classified as having heterophilic patterns. Our theoretical results show that applying a single global filter designed for one type of pattern (e.g., homophilic) across all nodes may degrade performance for nodes that follow the other type of pattern (e.g., heterophilic) in the same graph. This insight motivates our approach of applying different filters to nodes based on their individual patterns instead of applying the same filter to all nodes. In the proposed Node-MoE, the gating model is designed to identify distinct patterns among nodes and assign them to the most appropriate filters. The experts are responsible for learning different filters tailored to these patterns. Experiments conducted on real-world datasets with multiple labels, as shown in Table 3, demonstrate that the proposed Node-MoE achieves strong performance on both homophilic and heterophilic datasets.
There are many real-world applications where graphs exhibit a mixture of patterns. For example, in fraud detection, graphs are commonly used to model user interactions. Clean users typically connect with clean users, exhibiting homophilic patterns, whereas malicious users often exhibit heterophilic patterns as they usually connect to clean users.
[1] Pei, Hongbin, et al. "Geom-gcn: Geometric graph convolutional networks." ICLR 2020.
[2] Zhu, Jiong, et al. "Beyond homophily in graph neural networks: Current limitations and effective designs." NeurIPS 2020.
[3] Mao, Haitao, et al. "Demystifying structural disparity in graph neural networks: Can one size fit all?" NeurIPS 2023.
We hope this explanation clarifies how the concepts extend to multiple categorical labels and the broader applicability of our method. If you have further questions or concerns, we would be delighted to discuss them.
Best regards,
Authors
The paper introduces a Mixture of Experts (MoE) inspired technique for GNNs aimed at improving classification performance on heterophilic and homophilic nodes in a graph. The paper first presents some analysis to show how real-world graphs contain varying levels of homophily, and a single fixed "low" or "high" pass filter may not be the best solution for learning reasonable node features. Motivated by this analysis, the authors propose a mixture of experts technique, wherein a gating model performs a "soft" selection over the representations of a set of expert GNN models with varying node-wise filter levels. Experiments show that the proposed method generally outperforms previous graph MoE based methods and other standard baselines. Detailed ablations and analyses of the hyperparameters are presented. Some considerations of the efficiency of the proposed method are also presented.
Strengths
- Good motivation and interesting use of MoE for dealing with the problem of nodes having varying levels of homophilic/heterophilic patterns in real-world graphs.
- While the use of MoE for graph data is not new, the paper presents some new insights when used for node-level tasks, e.g., the use of a filter smoothing loss and experts with node-wise filtering. Experimentally, it performs better than other graph-based MoE techniques.
- A good and reasonable set of experiments on the analysis of the proposed method, with detailed experiments on the time complexity and hyperparameter choices.
Weaknesses
- Limited novelty of the proposed problem setup (Sections 2.1 and 2.2). It is well known that real-world graphs contain nodes that exhibit variations in homophilic behavior [1, 2]. Limited technical novelty, since much of the proposed method is based on earlier work in the MoE area, like Top K sampling and the use of a gating mechanism [3].
- The performance gains of the proposed method "Node-MoE" over non-MoE baselines are not clear given the significant overlap of mean ± standard deviation. Moreover, the heterophilic datasets chosen are known to be problematic [4]. It would be a stronger argument if additional new datasets from [4] were used for the experiments.
- A clear algorithm detailing the different techniques could be presented for better clarity to the reader.

References:
[1] Peel et al. Multiscale mixing patterns in networks. PNAS (2018).
[2] Suresh et al. Breaking the limit of graph neural networks by improving the assortativity of graphs with local mixing patterns. KDD (2021).
[3] Shazeer et al. Outrageously large neural networks: The sparsely gated mixture-of-experts layer. ICLR (2017).
[4] Platonov et al. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? ICLR (2023).
Questions
- Given that the performance gains in Table 1 are not very clear (due to the overlap in std. dev. with non-MoE baselines), what were the winning hyperparameters from the search space? The reason I ask is because the paper claims that, for efficiency reasons, one can reduce the number of experts (say, to 2) and utilize gating with top-k = 1. But since that comes with a trade-off in accuracy, I am curious about the setting used for the main results presented.
Dear Reviewer qWvW,
We sincerely appreciate your recognition of our work and insightful comments. We are pleased to provide detailed responses to answer your questions.
W1: Limited novelty of the proposed problem setup (Sections 2.1 and 2.2). It is well known that real-world graphs contain nodes that exhibit variations in homophilic behavior [1, 2]. Limited technical novelty, since much of the proposed method is based on earlier work in the MoE area, like Top K sampling and the use of a gating mechanism [3].
A1: Thank you for pointing out the related works. We acknowledge that several prior studies have highlighted that real-world graphs contain nodes exhibiting variations in homophilic behavior, as we have also cited in Section 2.1. The previous methods usually apply the same filter or multiple filters to all nodes. Through our theoretical analysis, we demonstrate that applying the same filter to all nodes can lead to suboptimal performance, as a single filter optimized for one type of node may negatively impact others. This observation inspired our node-wise filtering approach, where different filters are applied to different nodes.
In Section 2.1, we further illustrate that homophily levels vary significantly across different structural communities, and nodes with varying homophily levels are not uniformly distributed across the graph. This insight motivated the choice of our gating model, which is tailored to capture community properties. We demonstrate experimentally that GNN-based gating models usually outperform MLP-based ones, as shown in the table below:
| Gating Model | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| MLP | 89.26 ± 1.34 | 77.59 ± 1.27 | 72.70 ± 1.75 | 59.88 ± 2.33 |
| GCN | 89.28 ± 0.96 | 77.89 ± 1.29 | 73.29 ± 1.71 | 62.04 ± 1.92 |
| GIN | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
While it is a common practice to leverage MoE models with top-k gating mechanisms to reduce computational complexity, the technical novelty of Node-MoE lies in the design and selection of the gating model. Specifically, we innovate in the input design for the gating model and the choice of the gating model itself. As shown in Figure 8 in our paper, the proposed gating model significantly outperforms traditional gating models, demonstrating the effectiveness of our approach.
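For readers who want to see the node-wise weighting concretely, below is a minimal PyTorch sketch of the idea (not the authors' implementation: the experts are reduced to linear maps over fixed low-/high-pass propagations, the gate is a plain MLP rather than the paper's GIN, and all names are illustrative):

```python
import torch
import torch.nn as nn

class NodeMoE(nn.Module):
    """Minimal node-wise mixture-of-experts sketch over a dense adjacency."""
    def __init__(self, in_dim, hid_dim, out_dim, num_experts=3):
        super().__init__()
        # Simplified experts: one linear map per expert. The paper instead
        # uses learnable-filter GNNs (e.g., ChebNetII) as experts.
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        # Simplified gate: a plain MLP on raw features. The paper uses a
        # GIN over a composite structural input.
        self.gate = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, num_experts))

    def forward(self, x, adj):
        outs = []
        for k, expert in enumerate(self.experts):
            # Crude fixed propagation per expert: alternate (I + A) low-pass
            # and (I - A) high-pass stand-ins for the learned filters.
            sign = 1.0 if k % 2 == 0 else -1.0
            outs.append(expert(x + sign * (adj @ x)))
        expert_out = torch.stack(outs, dim=1)          # [N, K, out_dim]
        gate_w = torch.softmax(self.gate(x), dim=-1)   # [N, K] node-wise weights
        return (gate_w.unsqueeze(-1) * expert_out).sum(dim=1)
```

With a top-1 gate, the weighted sum collapses to a single expert per node, which is the efficiency setting discussed elsewhere in this thread.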
W2: The performance gains of the proposed method "Node-MoE" over other non-MoE baselines are not clear, given the significant overlap of mean ± standard deviation. Moreover, the heterophilic datasets chosen are known to be problematic [4]. The argument would be stronger if additional new datasets from [4] were used in the experiments.
A2: To address your concern, we have updated Table 1 to include a significance test (t-test) for our results compared to the best-performing baselines. The results demonstrate that the proposed Node-MoE outperforms all baselines on 7 datasets, with 5 of these showing statistically significant improvements (p < 0.05). These findings provide stronger evidence for the effectiveness of Node-MoE.
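As a side note, a significance test of this kind can be run over the per-split accuracies; a minimal sketch with SciPy (the accuracy values are illustrative, and Welch's unequal-variance variant is an assumption, since the rebuttal does not specify which t-test was used):

```python
from scipy import stats

# Illustrative per-split accuracies (10 random splits each), not the paper's data.
node_moe  = [89.4, 89.1, 90.2, 88.7, 89.8, 89.5, 88.9, 89.6, 90.0, 89.2]
best_base = [88.7, 88.9, 89.1, 88.2, 88.8, 89.0, 88.4, 88.6, 89.3, 88.5]

# Welch's t-test (does not assume equal variances between the two models).
t_stat, p_value = stats.ttest_ind(node_moe, best_base, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```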
For the Chameleon and Squirrel datasets, [4] demonstrates that there are some nodes with the same features and neighbors. To address this, we provide results for filtered versions of these datasets in Table 8 of our paper. The proposed Node-MoE continues to outperform the baselines on these filtered datasets. Additionally, we have included results for two widely used large heterophilic datasets, Penn94 and Pokec, in our experiments, where Node-MoE also demonstrates superior performance compared to the baselines.
Following your suggestion, we have conducted further experiments on the amazon-ratings and tolokers datasets from [4]. The results show that the proposed Node-MoE not only outperforms the single expert model but also achieves better performance than the best model reported in [4]. These additional evaluations reinforce the robustness and generalizability of Node-MoE across diverse datasets.
| Model | amazon-ratings | tolokers |
|---|---|---|
| Best model in [4] | 53.63 ± 0.39 | 83.78 ± 0.43 |
| ChebNetII | 52.23 ± 0.58 | 82.52 ± 0.98 |
| Node-MoE | 54.02 ± 0.71 | 84.87 ± 0.92 |
W3: A clear algorithm detailing the different techniques can be presented for better clarity to the reader.
A3: Thank you for the great suggestion. We have added the algorithm in Appendix C.2 in the revised paper.
Q1: Given that the performance gains in Table 1 are not very clear (due to the overlap in standard deviation with non-MoE baselines), what were the winning hyperparameters from the search space? The reason I ask is that the paper claims one can, for efficiency reasons, reduce the number of experts (say, to 2) and use gating with top-k = 1. But since that comes with a trade-off in accuracy, I am curious about the setting used for the main results presented.
A4: For the results presented in Table 1, we used 3 experts without sparse gating to simplify the experimental setup. To provide a more comprehensive analysis, we present the performance with varying numbers of experts and different Top-K gating settings in Table 4 and Table 5, respectively.
The results demonstrate that accuracy is not always a trade-off with the number of experts or the Top-K gating mechanism. For instance, while using 5 experts outperforms 3 experts on the Cora dataset, it performs worse on the Chameleon dataset. Similarly, Top-1 gating on the Cora dataset yields even better results than using all experts. However, the differences in performance across these configurations are not substantial. These findings suggest that the relationship between the number of experts, the gating mechanism, and performance is dataset-dependent, and the proposed Node-MoE is robust across a range of configurations.
We hope that we have addressed the concerns in your comments. Please kindly let us know if there are any further concerns, and we are happy to clarify.
The paper introduces NODE-MOE, a GNN framework that uses a mixture of experts (MoE) model to apply node-specific filters, addressing the limitations of traditional GNNs that use a uniform global filter. NODE-MOE’s architecture includes a gating model, which assigns each node a set of specialized filters from multiple expert models, allowing for dynamic adaptation to the local structure of each node. Experimental results across diverse datasets demonstrate NODE-MOE’s superior accuracy over standard GNNs, especially in noisy environments and with limited labels.
Strengths
- The use of the Contextual Stochastic Block Model (CSBM) to validate this limitation strengthens the motivation for NODE-MOE’s node-specific filtering approach.
- NODE-MOE allows each node to dynamically access different filters through a mixture of experts.
- Extensive experiments show NODE-MOE’s effectiveness across both homophilic and heterophilic datasets.
Weaknesses
- The abstract and introduction of this paper require revision to reflect the existing literature more accurately. Currently, statements such as “Traditionally, GNNs employ a uniform global filter” in the abstract and introduction could mislead readers, including reviewers, by implying that no prior research has explored GNNs with multiple filters. However, as discussed in the related work section, numerous studies share similar motivations and have developed approaches that utilize multiple filters in GNNs. To provide a clearer foundation, the authors should acknowledge these existing contributions in both the abstract and introduction, clarifying how their work builds upon or differentiates from these prior approaches. This adjustment will ensure a more accurate representation of the field and help readers better understand the novelty and contributions of this paper.
- The paper claims that the proposed model offers better computational complexity than previous work discussed in the related section. However, no rigorous analysis is provided to substantiate this claim. Given the highly parallelizable nature of some existing “post-fusion” architectures, these prior models may, in fact, achieve lower computational costs.
- Theorem 1 only provides proof for linear models with lower expressivity than typical GNNs. Moreover, the proof in part 2 relies on a strong assumption that every homophilic and heterophilic node can be perfectly classified and assigned the appropriate filter.
- The proposed NODE-MoE seems to use a larger number of learnable parameters compared to other baselines. How many learnable parameters are used for both the baselines and NODE-MoE, as shown in Table 1? Is the same performance improvement observed, even if the number of parameters for NODE-MoE is adjusted to be on a similar scale as the baselines?
- The proposed NODE-MoE seems to have a broader hyperparameter search space than other baselines. Since its training time per epoch is longer, this broader search space would likely result in a significantly longer training time, including the validation time required to find the best hyperparameter set. Is the same performance improvement observed, even if the total training time, including hyperparameter search, is adjusted to be comparable to the baselines?
- Figures 6 and 15 seem inconsistent with the underlying motivation. While Figure 6 shows that the weights for the high-pass filter decrease as the homophily level rises, the difference between the two weights is not significant across the entire range of homophily. Moreover, even when the homophily level reaches 1, the high-pass filter’s average weight still exceeds that of the low-pass filter. Additionally, in Figure 15, even the trend observed in Figure 6 no longer exists.
- Minor: There appears to be an error in y-tick (density) values in Figure 2, as they are not within the 0-1 range.
Questions
How did the authors settle on GIN and ChebNetII as the gating and expert models, respectively? What alternatives were considered, and why does this combination perform best?
Dear Reviewer AtKX,
We appreciate your constructive feedback. We are pleased to provide detailed responses to address your concerns.
W1: The abstract and introduction of this paper require revision to reflect the existing literature more accurately.
A1: Thank you for the great suggestion; we have revised our paper accordingly. Here are some changes for your reference:
Abstract: Most GNNs employ a uniform global filter. ... While a few methods have introduced multiple global filters, they often apply these filters uniformly across all nodes, which may not effectively capture the diverse structural patterns present in real-world graphs.
Introduction: There are a few methods, such as ACM-GNN, AutoGCN, PC-Conv, and ASGAT, that leverage different filters to alleviate this issue. Specifically, they apply multiple filters to all nodes and then combine the predictions of the different filters. However, applying a uniform type of filter, tailored for just one of these patterns, across all nodes may hurt the performance on other patterns, and combining the predictions of different filters may inevitably introduce noise.
Instead of applying one or multiple filters to all nodes, we propose a node-wise filtering method that applies different filters to different nodes based on their specific structural patterns.
W2: The paper claims that the proposed model offers better computational complexity than previous work discussed in the related section. However, no rigorous analysis is provided to substantiate this claim. Given the highly parallelizable nature of some existing “post-fusion” architectures, these prior models may, in fact, achieve lower computational costs.
A2: To provide a fair comparison, we assume identical model architectures and training methods, differing only in the use of filters. For post-fusion methods, each node must be processed by all filters, resulting in higher computational demands. However, for the proposed node-wise filtering method Node-MoE, each node only needs to be processed by one filter with a top-1 gating mechanism. Consequently, the total computational cost of Node-MoE is lower than that of post-fusion methods. While it is true that some existing post-fusion architectures can be highly parallelized, enabling similar training times, their overall computational cost remains high due to the necessity of processing all filters for every node. Additionally, the proposed Node-MoE is flexible and can also benefit from existing parallelization techniques, further enhancing its computational efficiency.
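A minimal sketch of top-1 routing illustrates why the expert-side cost drops (illustrative code, not the authors' implementation; for simplicity it routes node features through per-node expert transforms, whereas a real graph filter would still propagate over each routed node's neighborhood):

```python
import torch
import torch.nn as nn

def top1_route(x, gate_logits, experts):
    """Top-1 routing sketch: every node is handled by exactly one expert,
    so expert compute totals N node evaluations instead of N * K."""
    assign = gate_logits.argmax(dim=-1)                # [N] chosen expert per node
    out = x.new_zeros(x.size(0), experts[0].out_features)
    for k, expert in enumerate(experts):
        idx = (assign == k).nonzero(as_tuple=True)[0]  # nodes routed to expert k
        if idx.numel() > 0:
            out[idx] = expert(x[idx])                  # only this subset is processed
    return out

# Toy usage: 5 nodes, 4 features, 2 linear "experts".
x = torch.randn(5, 4)
experts = nn.ModuleList([nn.Linear(4, 3), nn.Linear(4, 3)])
print(top1_route(x, torch.randn(5, 2), experts).shape)  # torch.Size([5, 3])
```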
W3: Theorem 1 only provides proof for linear models with lower expressivity than typical GNNs. Moreover, the proof in part 2 relies on a strong assumption that every homophilic and heterophilic node can be perfectly classified and assigned the appropriate filter.
A3: In our theoretical analysis, we separate the graph-filtering operator from the classifier, similar to decoupled GNNs like SGC, which is a common practice for analyzing GNNs [1, 2, 3]. Theorem 1 demonstrates that applying different filters to nodes with varying patterns can achieve linear separability, providing a foundational understanding of the advantages of node-wise filtering in such scenarios.
We appreciate your feedback on the assumption in part 2 of the proof. To address this, we have included additional proof in the revised version of our paper to handle the classification of homophilic and heterophilic nodes. The theorem and proof can be found in Section 2.2 and Appendix A of our paper. For your convenience, the new theorem is stated as follows:
Homophilic and heterophilic nodes can, with high probability, be separated based on the feature distance between a node and the average feature of its neighbors, d_i = || x_i - (1/|N(i)|) · Σ_{j ∈ N(i)} x_j ||.
Specifically, we demonstrate that the feature distance can be leveraged to effectively separate homophilic and heterophilic nodes. This refinement not only mitigates the strong assumption but also provides a rationale for the design of the gating model, further validating our approach.
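For concreteness, this separation statistic is cheap to compute; a minimal sketch with a dense adjacency matrix (variable names are illustrative):

```python
import torch

def neighbor_feature_distance(x, adj):
    """d_i = || x_i - mean_{j in N(i)} x_j ||_2 for every node i
    (dense-adjacency sketch; adj is an [N, N] 0/1 matrix)."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # guard isolated nodes
    neigh_mean = (adj @ x) / deg                     # [N, d] neighbor averages
    return (x - neigh_mean).norm(dim=1)              # [N] distances

# Small d_i: features close to the neighborhood average (homophilic-looking);
# large d_i: features far from the neighborhood average (heterophilic-looking).
```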
[1] Baranwal, Aseem, Kimon Fountoulakis, and Aukosh Jagannath. "Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization." arXiv'21
[2] Ma, Yao, et al. "Is homophily a necessity for graph neural networks?." ICLR'22.
[3] Wu, Xinyi, et al. "A non-asymptotic analysis of oversmoothing in graph neural networks." arXiv'22
W4: The proposed NODE-MoE seems to use a larger number of learnable parameters compared to other baselines. How many learnable parameters are used for both the baselines and NODE-MoE, as shown in Table 1? Is the same performance improvement observed, even if the number of parameters for NODE-MoE is adjusted to be on a similar scale as the baselines?
A4: Different baselines have varying configurations across datasets, such as the number of layers, hidden dimensions, and attention heads. To ensure a fair comparison, we assume that all models have a feature dimension and hidden size of d and that the number of layers is L. We omit a small number of parameters, such as the learnable filter parameters in models like GPR-GNN and ChebNetII, which are typically around 10 and do not significantly impact the comparison. For MLP, GCN, APPNP, GCNII, GPR-GNN, ChebNetII, AutoGCN, WRGCN, and PC-Conv, the trainable parameters are approximately L·d^2. For GAT and ASGAT with h attention heads, the number of learnable parameters becomes approximately h·L·d^2. ACM-GCN uses three filters with separate transformation matrices, resulting in about 3·L·d^2, and LinkX includes three MLPs, also leading to about 3·L·d^2. Among the MoE-based methods, GMoE has approximately K·L·d^2 parameters, where K is the number of experts, and Mowst, which combines an MLP and a GCN, has approximately 2·L·d^2 learnable parameters. For the proposed Node-MoE, the number of learnable parameters is also approximately K·L·d^2, with K = 3 for most of the results in Table 1. Thus, the proposed Node-MoE has a comparable number of learnable parameters to several baselines.
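Under these simplifying assumptions, every count reduces to one d × d transform per layer, scaled by the number of parallel branches; a tiny sketch of the arithmetic (illustrative only, biases ignored):

```python
def approx_params(d, L, filters=1, heads=1, experts=1):
    """Rough parameter count: one d x d transform per layer, scaled by
    parallel filters, attention heads, or experts (biases ignored)."""
    return L * d * d * filters * heads * experts

d, L = 64, 2
print(approx_params(d, L))             # GCN/MLP-style baselines: L * d^2
print(approx_params(d, L, heads=8))    # GAT-style with 8 heads: h * L * d^2
print(approx_params(d, L, filters=3))  # ACM-GCN / LinkX-style: 3 * L * d^2
print(approx_params(d, L, experts=3))  # MoE-style (K = 3): K * L * d^2
```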
We further conducted experiments with Node-MoE using 3 experts and a significantly reduced hidden dimension size, decreasing it from 64 to 21. This adjustment ensures that the proposed Node-MoE has a comparable number of learnable parameters to models like MLP and GCN. Despite the lower hidden dimension, the results demonstrate that Node-MoE maintains similar performance and continues to outperform the baselines.
| Hidden Size | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| Node-MoE 21 | 89.36 ± 1.00 | 77.93 ± 1.52 | 73.51 ± 1.77 | 60.93 ± 1.89 |
| Node-MoE 64 | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
W5: The proposed NODE-MoE seems to have a broader hyperparameter search space than other baselines. Since its training time per epoch is longer, this broader search space would likely result in a significantly longer training time, including the validation time required to find the best hyperparameter set. Is the same performance improvement observed, even if the total training time, including hyperparameter search, is adjusted to be comparable to the baselines?
A5: Compared to the base expert model, the additional hyperparameters introduced by Node-MoE include the filter smoothing loss weight, the number of experts, and the top-k gating mechanism. As shown in Figure 9, the model demonstrates robustness to the choice of this weight, achieving stable performance across a wide range of values. Similarly, Table 4 shows that the proposed method achieves competitive performance even with as few as 2 experts, and Table 5 indicates that good performance is maintained with a top-1 gating mechanism. These results suggest that the proposed method is robust to its additional hyperparameters. Furthermore, as shown in Table 2, the training time per epoch for Node-MoE is comparable to that of a single expert model, ensuring that the additional hyperparameters do not significantly inflate training costs.
W6: Figures 6 and 15 seem inconsistent with the underlying motivation.
A6: In Figure 5 of our paper, we show the two learned filters. It is important to note that we did not constrain the shape of these filters. Filter 0 is approximately a low-pass filter but also has a relatively high response at high frequencies; conversely, Filter 1 is primarily a high-pass filter but also responds at low frequencies. Additionally, the real homophily level of each node in the gating model is estimated from the node features, which can introduce some errors. As a result, both filters have non-zero weights in each homophily group. Despite this, there is a clear trend in Figure 6: as the node homophily level increases, the weight of the high-pass filter consistently decreases. This overall trend supports our motivation that nodes with higher homophily levels rely less on the high-pass filter. For Figure 15, the CiteSeer dataset exhibits a high homophily level, and the two learned filters shown in Figure 14 are both low-pass filters; however, Filter 1 has a higher response to high-frequency signals, so its weight decreases as the homophily level increases. This is therefore also consistent with our motivation.
W7: Minor: There appears to be an error in y-tick (density) values in Figure 2, as they are not within the 0-1 range.
A7: Thank you for pointing this out. Figure 2 illustrates the probability density function (PDF), where the area under the entire curve equals 1. As a result, the y-axis (density) values are not constrained to the 0-1 range. To clarify this, we have updated the figure description in the revised paper to explicitly state that it represents a PDF.
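As a quick numerical illustration of the point (values are illustrative, not the paper's data): a sufficiently narrow distribution has a PDF peak above 1 while still integrating to 1.

```python
from scipy.stats import norm

# A normal distribution with a small standard deviation: the PDF peak
# exceeds 1, yet the total area under the curve is still exactly 1.
dist = norm(loc=0.0, scale=0.1)
print(dist.pdf(0.0))                     # ~3.989 -- a legitimate density above 1
print(dist.cdf(10.0) - dist.cdf(-10.0))  # ~1.0, the total probability mass
```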
Q1: How did the authors settle on GIN and ChebNetII as the gating and expert models, respectively? What alternatives were considered, and why does this combination perform best?
A8: Gating model selection:
As discussed in Section 2.1, structural patterns are not uniformly distributed across the graph, and distinct communities often exhibit varying structural characteristics. To account for these differences, we aim to use a gating model capable of capturing community properties. Given the strong community detection capabilities of GNNs, we selected GIN as an example. We also experimented with different gating models, and the results show that GNN-based models, such as GCN and GIN, generally outperform MLP, particularly on heterophilic datasets.
| Gating Model | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| MLP | 89.26 ± 1.34 | 77.59 ± 1.27 | 72.70 ± 1.75 | 59.88 ± 2.33 |
| GCN | 89.28 ± 0.96 | 77.89 ± 1.29 | 73.29 ± 1.71 | 62.04 ± 1.92 |
| GIN | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
Expert model selection:
Rather than using fixed filters, we aim for each expert to learn its own filter dynamically. We therefore selected expert models from the category of learnable-filter GNNs. In approximation theory, Chebyshev polynomials achieve near-optimal error when approximating functions, making ChebNetII an excellent choice for its strong filter-learning capabilities (a sketch of the underlying Chebyshev recurrence is given after the table below). Additionally, we experimented with GPR-GNN as an expert model. The results indicate that Node-MoE with GPR-GNN significantly improves the performance of the base GPR-GNN model, further demonstrating the flexibility and effectiveness of our approach.
| Model | Cora | CiteSeer | Chameleon | Squirrel |
|---|---|---|---|---|
| GPR-GNN | 88.54 ± 0.67 | 76.44 ± 1.89 | 62.85 ± 2.90 | 54.35 ± 0.87 |
| Node-MoE (GPR-GNN) | 89.30 ± 1.44 | 77.12 ± 1.60 | 70.88 ± 2.22 | 56.92 ± 0.61 |
| ChebNetII | 88.71 ± 0.93 | 76.93 ± 1.57 | 71.14 ± 2.13 | 57.12 ± 1.13 |
| Node-MoE (ChebNetII) | 89.38 ± 1.26 | 77.78 ± 1.36 | 73.64 ± 1.80 | 62.31 ± 1.98 |
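As referenced above, here is a minimal sketch of the standard ChebNet-style Chebyshev recurrence that underlies such learnable filters (dense matrices for clarity; ChebNetII's reparameterization of the coefficients via Chebyshev interpolation is omitted, and all names are illustrative):

```python
import torch

def chebyshev_filter(x, lap, coeffs, lam_max=2.0):
    """y = sum_k w_k T_k(L_hat) x, with T_0 = I, T_1 = L_hat, and
    T_{k+1} = 2 L_hat T_k - T_{k-1}; assumes len(coeffs) >= 2."""
    n = lap.size(0)
    lap_hat = (2.0 / lam_max) * lap - torch.eye(n)  # rescale spectrum into [-1, 1]
    t_prev, t_curr = x, lap_hat @ x                 # T_0 x and T_1 x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for w in coeffs[2:]:
        # Chebyshev three-term recurrence applied directly to the features.
        t_prev, t_curr = t_curr, 2.0 * (lap_hat @ t_curr) - t_prev
        out = out + w * t_curr
    return out
```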
We hope that we have addressed the concerns in your comments. Please kindly let us know if there are any further concerns, and we are happy to clarify.
Thank you for the detailed response. While the authors address several points raised in the review, there are still notable gaps and concerns. For example, the robustness of the proposed model against different combinations of hyperparameters is unclear, given the narrow differences between the performance of different models. Several analyses provided in the manuscript are also inconsistent regarding the dataset used for the analysis. Based on the other reviews and responses, I have decided to keep my original score.
Dear Reviewer AtKX,
Thank you for your reply and for allowing us the opportunity to further address your concerns. In our analysis, we primarily selected the Cora and Chameleon datasets to represent homophilic and heterophilic datasets, respectively. For the training time analysis, we used two large datasets (i.e., ogbn-arxiv and Pokec) to demonstrate the efficiency of the proposed Node-MoE. Regarding the filter smoothing loss weight, we initially analyzed its behavior using CiteSeer and Squirrel as examples. To ensure consistency, we have now extended the analysis to include the Cora and Chameleon datasets.
| Loss weight | 0 | 0.1 | 0.5 | 1 | 5 |
|---|---|---|---|---|---|
| Cora | 88.88 ± 1.12 | 89.25 ± 1.18 | 89.38 ± 1.26 | 89.45 ± 1.13 | 89.01 ± 1.16 |
| Chameleon | 73.62 ± 2.36 | 74.28 ± 2.22 | 73.64 ± 1.80 | 73.55 ± 2.05 | 72.17 ± 1.70 |
The results show that performance is robust to this hyperparameter across the evaluated range on these datasets as well.
Additionally, we have updated Table 1 of our paper to include a significance test (t-test) comparing our results to the best-performing baselines. The findings indicate that Node-MoE outperforms all baselines on 7 datasets, with 5 of these improvements being statistically significant (p<0.05). These results provide stronger evidence for the effectiveness and robustness of Node-MoE.
Thank you for your valuable feedback in helping us improve our paper. We are happy to engage in further discussion if additional clarifications are needed.
Best regards,
Authors
This paper proposes NODE-MOE, a Graph Neural Network (GNN) framework that combines a mixture of experts (MoE) with node-specific filters to address the limitations of traditional GNNs relying on uniform global filters. The architecture features a gating model that dynamically assigns each node a set of specialized filters from multiple expert models, enabling adaptive filtering based on local node structure.
The authors argue that real-world graphs exhibit complex patterns that cannot be captured by a single global filter, citing the coexistence of homophilic and heterophilic structures. However, this perspective oversimplifies the complexity of real-world graphs, which often defy categorization as solely homophilic or heterophilic.
While the experimental results demonstrate the effectiveness of NODE-MOE, particularly in noisy environments and with limited labels, I concur with reviewers AtKX and VQRg that the paper lacks a clear theoretical foundation. The gating mechanism's selection process and potential failure conditions are not well-understood, and the provided theory, based on linear models and Gaussian vectors, is overly simplistic and fails to provide meaningful insights into the model's behavior in practice.
Furthermore, the paper's approach is essentially an integration of homophily and heterophily concepts into GNN mixture models (e.g., GraphMETRO and GMoE), which feels ad hoc, lacking a solid theoretical justification. The authors should provide a more rigorous analysis of the relationships between these concepts and the proposed model, as well as a clearer explanation of when and why NODE-MOE is expected to work and under which conditions.
Additional Comments from Reviewer Discussion
The reviewers provided thorough and insightful evaluations of the paper, identifying both its strengths and weaknesses. After carefully reading the manuscript, I concur with their assessments and share their concerns regarding the lack of a solid theoretical foundation. The reviewers' comments and critiques have helped to clarify the paper's contributions and limitations, and I believe that the authors should address these issues in order to strengthen the work. Overall, I appreciate the diligence of the reviewers and I believe that their feedback will be invaluable in helping the authors to improve the paper.
Reject