PaperHub
Overall rating: 3.8/10
Decision: Rejected (4 reviewers)
Individual ratings: 1, 2, 2, 4 (min 1, max 4, std 1.1)
ICML 2025

Wide & Deep Learning for Node Classification

OpenReview | PDF
Submitted: 2025-01-17 | Updated: 2025-06-18
TL;DR

The first Wide & Deep architecture model for node classification.

Abstract

Keywords
Wide & Deep Learning, Graph Convolutional Networks, Node Classification

Reviews and Discussion

Review (Rating: 1)

The paper identifies the over-generalization problem in the GCNII model and addresses it by proposing GCNIII, which leverages a Wide & Deep architecture. The effectiveness of this approach is validated through experiments.

Questions for Authors

  1. In the section on intersect memory, an adjacency matrix is applied before the feature transformation (Equation 12). Why not directly use a GCN layer for this purpose?

  2. In lines 200–202, you state, "The improved model is still a linear model, but whether to use this technique depends on the dataset." Does this mean the model architecture is dataset-dependent? Additionally, I did not find any experiments that explore this claim.

  3. In lines 265–266, you mention, "Equation (8) is not commonly seen in the design of GCNs, but it is indeed one of the key aspects of the GCNII model." However, dropout is widely used in GNNs—even the early GCN paper incorporates dropout.

  4. In Equation 2, the cross-product transformations $\phi(x)$ are included in the wide component. However, in GCNIII, this part is removed in Equation 7. What is the reasoning behind this change? If it is removed, why is this component still considered wide?

  5. Some descriptions of attention are unclear. For instance, in line 267, what does "both" refer to? Additionally, what is the 'attention' of GCNII?

  6. Typos: In line 259, you state, "Graph Convolution $\tilde{G}$ corresponds to the attention matrix on the left-hand side of Equation (13)." However, Equation 13 does not have a left-hand side.

Claims and Evidence

Claims are not clear; see Other Strengths and Weaknesses.

Methods and Evaluation Criteria

See Other Strengths and Weaknesses.

Theoretical Claims

Yes.

Experimental Design and Analysis

  • The implementation of baselines requires further verification. In lines 364–367, you state, "Therefore, we re-conduct the experiments in accordance with the optimal model hyperparameters reported in Luo et al. (2024b) under our experimental framework." However, the optimal hyperparameters in Luo et al. were tuned within their own framework and may not transfer well to a different setting. Conducting a separate hyperparameter search under your method's framework would be a fairer comparison. The same issue applies to lines 381–383: "We reproduce the results of APPNP (Gasteiger et al., 2019) and GCNII (Chen et al., 2020) using the hyperparameters in Chen et al. (2020)."

  • The ablation study lacks experiments on LLMs.

  • If your method incorporates LLMs as auxiliary tools, it would be beneficial to include similar methods in the baselines or enhance existing GNNs with LLMs.

  • I notice that all baselines are from papers published between 2017 and 2020. Could you compare your method against more recent baselines?

  • In the discussion of over-generalization, all experiments are conducted on the Cora dataset. Have you observed the same phenomenon on other datasets?

  • In Appendix C, I recommend increasing the training ratio in the data split rather than relying solely on training accuracy, as evaluating only training accuracy is not practical in real-world scenarios.

  • In Figure 6, there seems to be no significant improvement in the training error of GCNIII over GCNII.

Supplementary Material

Yes. The code part.

Relation to Prior Work

The paper re-examines the previous work GCNII and introduces the Wide & Deep approach into GNN design.

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths

  • The paper provides a detailed re-examination of GCNII and identifies the over-generalization problem. It explores a potential solution by incorporating the Wide & Deep architecture into GNN design.

Weaknesses

  • The core intuition is unclear, particularly due to the lack of a concrete definition of over-generalization. Key questions include:

    1. What exactly is over-generalization?
    2. How does it impact performance? Specifically, does it degrade test accuracy?
    3. Why does over-generalization occur in GCNII?
  • The proposed method, GCNIII, requires better explanation. Important clarifications include:

    1. What is the final architecture of GCNIII? While Equation 11 describes the Deep & Wide structure and Equation 12 presents the intersect memory technique, there is no clear formulation for the complete model.
    2. What are the key differences between GCNII and GCNIII?
    3. How do these differences address the over-generalization problem?
  • The writing lacks coherence. The paper introduces multiple motivations, including over-generalization, a shift in focus from structure to node features, and the inability of GCNII to incorporate a linear model (line 247). However, these points appear disconnected, without a clear overarching theme or logical progression.

Other Comments or Suggestions

I recommend revising the line colors of Figure 5. It is difficult to map lines to legends.

Author Response

Thank you very much for your review, suggestions, and questions. We have carefully addressed your concerns below.

Q1: Conducting a separate hyperparameter search under your method's framework would be a fairer comparison.

A1: Thanks for your suggestion. We re-optimized all hyperparameters, and the experimental results can be found in https://anonymous.4open.science/r/GCNIII_supplement.

Q2: The ablation study lacks experiments on LLMs. If your method incorporates LLMs as auxiliary tools, it would be beneficial to include similar methods in the baselines or enhance existing GNNs with LLMs.

A2: Your suggestion is very reasonable. Since the focus of this paper is not LLMs, we will remove the LLM-related content.

Q3: Could you compare your method against more recent baselines?

A3: Thanks for your suggestion. We conducted comparative experiments between GCNIII and recent baseline methods, with results available at https://anonymous.4open.science/r/GCNIII_supplement.

Q4: Have you observed the same phenomenon on other datasets?

A4: The same phenomenon is more likely to occur on homophilic datasets.

Q5: In Appendix C, I recommend increasing the training ratio in the data split rather than relying solely on training accuracy, as evaluating only training accuracy is not practical in real-world scenarios.

A5: We consider this limiting case in order to verify the node classification ability of the linear models rather than their generalization ability.

Q6: In Figure 6, there seems to be no significant improvement in the training error of GCNIII over GCNII.

A6: We suspect your question stems from Figure 5, which illustrates a clear trend: as γ increases, the training error decreases significantly. Our goal is to strike a balance, as we also aim to prevent the model from overfitting.

Q7: What exactly is over-generalization? How does it impact performance? Specifically, does it degrade test accuracy? Why does over-generalization occur in GCNII?

A7: The over-generalization phenomenon shown in Figure 1 refers to the situation where training error remains significantly higher than validation error during model training. This phenomenon has currently only been observed when training deep GCNs. It does not lead to test accuracy degradation, but it does mean that features are used less efficiently during model training, which is why linear models are introduced. We provide a detailed analysis of the over-generalization phenomenon in GCNII in Section 4.

Q8: What is the final architecture of GCNIII? What are the key differences between GCNII and GCNIII? How do these differences address the over-generalization problem?

A8: The structure of GCNIII is shown in Figure 2. Compared with GCNII, GCNIII introduces a linear model and uses Intersect memory, Initial residual and Identity mapping as optional modules, making GCNIII a more flexible framework. We believe that the improved training-error curve of GCNIII reflects a better balance between the model's fitting ability and generalization, demonstrating that GCNIII more effectively balances the trade-off between the over-fitting of GCN and the over-generalization of GCNII.
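
For illustration only, a minimal sketch of the joint wide-and-deep combination described above. The class and parameter names (`WideAndDeepGCN`, `gamma`, the `deep_gcn` backbone) are hypothetical placeholders rather than the paper's implementation, and the combination rule is an assumption based on the balancing hyper-parameter γ mentioned in these responses.

```python
import torch.nn as nn

class WideAndDeepGCN(nn.Module):
    """Hypothetical sketch: a wide (linear) model jointly trained with a deep GCN.

    `deep_gcn` stands in for any deep backbone such as GCNII; the optional
    modules (Intersect memory, Initial residual, Identity mapping) are omitted.
    """
    def __init__(self, deep_gcn: nn.Module, in_dim: int, num_classes: int, gamma: float = 0.5):
        super().__init__()
        self.deep = deep_gcn                        # deep component
        self.wide = nn.Linear(in_dim, num_classes)  # wide component: memorizes raw node features
        self.gamma = gamma                          # balances the two sets of logits

    def forward(self, x, adj):
        deep_logits = self.deep(x, adj)
        wide_logits = self.wide(x)
        # Joint training: a single loss on the combined logits updates both branches.
        return deep_logits + self.gamma * wide_logits
```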

Q9: In the section on intersect memory, an adjacency matrix is applied before the feature transformation (Equation 12). Why not directly use a GCN layer for this purpose?

A9: Our proposed technique applies attention mechanisms to the output of a linear model, where graph convolution is not the only possible choice.
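
For concreteness, a small sketch of the contrast being drawn (shapes and names are hypothetical, and Equation 12 itself is not reproduced in this thread, so the exact form is an assumption): the intersect-memory branch applies a prior attention matrix around a purely linear map, whereas a GCN layer wraps the same propagation in a nonlinearity. By associativity, applying `G_hat` before or after the linear transformation yields the same matrix product.

```python
import torch

n, d, c = 5, 16, 3
G_hat = torch.eye(n)       # placeholder prior attention (e.g., normalized adjacency)
X = torch.randn(n, d)      # node features
W = torch.randn(d, c)      # wide-branch linear transformation

wide_out = G_hat @ (X @ W)           # attention applied to a linear model's output: stays linear
gcn_out = torch.relu(G_hat @ X @ W)  # a GCN layer adds a nonlinearity on the same propagation
```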

Q10: Does this mean the model architecture is dataset-dependent?

A10: Yes, details are in Appendix B.

Q11: you mention, "Equation (8) is not commonly seen in the design of GCNs, but it is indeed one of the key aspects of the GCNII model." However, dropout is widely used in GNNs—even the early GCN paper incorporates dropout.

A11: We mean that it is not common to use dropout in Feature Embedding prior to the GCN layers.
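
A minimal sketch of the placement being described (hypothetical module and names, not the paper's code): dropout applied to the raw node features before the embedding layer, rather than only between graph-convolution layers.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureEmbedding(nn.Module):
    """Hypothetical feature-embedding block with dropout applied to the raw inputs."""
    def __init__(self, in_dim: int, hidden_dim: int, p: float = 0.6):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden_dim)
        self.p = p

    def forward(self, x):
        # Randomly masking input features each epoch acts like a robust
        # feature-selection step, per the discussion of Equation (8) above.
        x = F.dropout(x, p=self.p, training=self.training)
        return F.relu(self.lin(x))
```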

Q12: In Equation 2, the cross-product transformations ϕ(x) are included in the wide component. However, in GCNIII, this part is removed in Equation 7. What is the reasoning behind this change? If it is removed, why is this component still considered wide?

A12: Intersect memory creates a new 'wide' effect, where attention enables each node's output representation to incorporate information from more nodes.

Q13: Some descriptions of attention are unclear. For instance, in line 267, what does "both" refer to? Additionally, what is the 'attention' of GCNII?

A13: "Both" refers to $\hat{G}$ and $\alpha(I_{n}-(1-\alpha)\hat{G})^{-1}$, which we denote (1). Initial residual causes the "attention" of GCNII to asymptotically approach (1) as the number of layers increases indefinitely.
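
For context, the limiting matrix quoted above has the closed form of a personalized-PageRank-style propagation (the same form used in APPNP). A brief sketch of why the limit takes this form, assuming $\alpha \in (0,1)$ and that the spectral radius of $(1-\alpha)\hat{G}$ is strictly less than 1:

```latex
% Geometric-series identity behind the limiting "attention" matrix (1):
\[
  \alpha \sum_{k=0}^{\infty} (1-\alpha)^{k} \hat{G}^{k}
  \;=\; \alpha \bigl( I_{n} - (1-\alpha)\hat{G} \bigr)^{-1}.
\]
```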

Q14: In line 259, you state, "Graph Convolution $\tilde{G}$ corresponds to the attention matrix on the left-hand side of Equation (13)." However, Equation 13 does not have a left-hand side.

A14: Sorry for the confusion. We meant the left side of the multiplication sign.

Reviewer Comment

Thanks a lot for your response. For me there are still key questions unsolved:

  1. The over-generalization issue is only demonstrated on one dataset (i.e., Cora) with one deep model (i.e., GCNII). It is hard to tell whether it is a general phenomenon for deep models. Additionally, if it is true that "The same phenomenon is more likely to occur on homophilic datasets.", Table 4 shows contrary results, where GCNII ≈ GCNIII on homophily datasets while GCNII < GCNIII on heterophily graphs.
  2. The underlying mechanism of over-generalization is still unclear. I have carefully read Section 4. It seems that the (only) possible reason is the attention. However, this part is too rough, without comprehensive experimental verification or theoretical support.
  3. The significance / negative effect of over-generalization is questionable. There is a contradiction: over-generalization "reduces the utilization efficiency of features in the model training process", yet "It does not lead to test accuracy degradation". My point is, if over-generalization only affects the utilization of training data (assuming it's true) but does not degrade performance, what is the motivation for solving it?
  4. I really recommend showing a specific formulation of GCNIII, as in the case of Equation 6 for GCNII. It would be beneficial for further analysis and comparison. Additionally, the practical applicability of GCNIII is limited when the model architecture is actually dataset-dependent.
Author Comment

Thank you very much for your new reviews and suggestions. Our responses to your points are as follows.

Q1: The over-generalization issue is only demonstrated on one dataset (i.e., Cora) with one deep model (i.e., GCNII). It is hard to tell it is a general phenomenon for deep models. 

A1: You are right that this is not a general phenomenon, which is why researchers have not noticed it. The development of deep GCNs has been hindered in recent years, and we hope that the discussion of this phenomenon will provide some inspiration for further research.

Q2: Additionally, if it is true that “The same phenomenon is more likely to occur on homophilic datasets.”, Table 4 shows contrary results, where GCNII ≈ GCNIII on homophily datasets while GCNII < GCNIII on heterophily graphs.

A2: This is not a contradiction. Heterophily graphs do not provide a good enough graph structure as a priori attention to support the over-generalization of GCNII, that is, if it is not good on the training set, it is not good on the validation/test set. GCNIII performs better than GCNII because it balances the over-generalization phenomenon and makes better use of the linear model's memory for features during training.

Q3: The underlying mechanism of over-generalization is still unclear. I have carefully read Section 4. It seems that the (only) possible reason is the attention. However, this part is too rough, without comprehensive experimental verification or theoretical support.

A3: Attention is not the only explanation. In the original paper we wrote, “Dropout is the key. In fact, this can be easily inferred intuitively from Figure 1, because dropout is the only component in the entire end-to-end GCNII model that has a different structure during training and inference” (lines 248–252). We also have experimental evidence: “Taking the Cora dataset as an example, we find through experimental studies that removing all dropout from GCNII results in a drop in accuracy from over 85% to 82%” (lines 254–258). Dropout's position in the overall framework also plays a key role: “We argue that the dropout in Equation (8) is more akin to a robust feature selection process, where a subset of features is randomly selected for feature embedding at each epoch. This process enhances the model’s ability to efficiently leverage node feature information, thereby improving its generalization performance” (lines 268–274). Moreover, the mechanism of GCNII analyzed in the subsection "Ultra-deep is not necessary" also contributes to over-generalization: it is very rare for a model to start working from the part closer to the output. Common models such as GCN, CNN, and Transformer all give priority to the layers near the input, and ResNet is also designed based on this. The special structural mechanism of GCNII leads to this special phenomenon. Thank you for your advice. We will further refine this part in detail, including theoretical analysis and experimental results.

Q4: The significance/negative effect of over-generalization is questionable. There is a contradiction point, over-generalization “reduces the utilization efficiency of features in the model training process”, however, “It does not lead to test accuracy degradation”. My point is, if over-generalization only affects the utilization of training data (assuming it’s true) but does not degrade the performance, what is the motivation for solving it?

A4: “There’s no such thing as a free lunch.” Over-generalization is a systematic phenomenon and, in certain scenarios, a successful aspect of a model. When over-generalization occurs, the model can perform well on test data despite poor training performance. However, this phenomenon is fundamentally problematic because it reveals inherent design flaws: reduced feature utilization and excessive reliance on the prior graph structure, which limit the model's performance ceiling. This is precisely why we propose simple yet effective improvements like GCNIII.

Q5: I really recommend showing a specific formulation of GCNIII, as in the case of Equation 6 for GCNII. It would be beneficial for further analysis and comparison. Additionally, the practical applicability of GCNIII is limited when the model architecture is actually dataset-dependent.

A5: Thank you very much for your advice. We will give the specific formulation of GCNIII in the paper. GCNIII is a framework rather than a fixed model, and we think this makes it more usable. In fact, in different data scenarios, the initial residual technique corresponding to APPNP and the identity mapping technique introduced by GCNII may be redundant. Each layer of identity mapping in GCNII introduces parameters; if these parameters do not play a sufficient role, removing them greatly improves training speed. We hope GCNIII will draw more researchers' attention to the role of node features and to the use of linear models.

Review (Rating: 2)

This paper improves the architecture of the existing model, GCNII, mainly based on the "wide & deep" idea. An LLM is also used to encode the node features.

Questions for Authors

Please see the weaknesses mentioned above.

Claims and Evidence

  1. The authors claimed on page 5 that "The former is 0.0018, while the latter is 0.8423, which demonstrates that GCNII’s “attention” captures more information, leading to stronger generalization." However, it is not clear what is the connection between the "attention weights", "information", and "generalization".

  2. On page 4, lines 198, the paper mentioned "The attention matrix between the nodes is the adjacency matrix" which is confusing. The term "adjacency matrix" usually is only used to describe the network structure.

Methods and Evaluation Criteria

The datasets are a bit old, and more heterophilic graph datasets should be included. In addition, the major drawback is the baseline methods.

  1. The baseline methods used in this paper are pretty old. If I did not miss anything, the latest baseline was published in 2020. There are so many latest node classification baselines, and I will not name them here.

  2. The proposed method incorporates LLM into the framework. Hence, for a fair comparison, many LLM-incorporated node classification methods should be compared, such as [1-2].

[1] He, Xiaoxin, et al. "Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning." The Twelfth International Conference on Learning Representations.

[2] Chen, Runjin, et al. "LLaGA: Large Language and Graph Assistant." Forty-first International Conference on Machine Learning.

Theoretical Claims

I checked the statement of Theorem 4.1, which is intuitive and reasonable.

Experimental Design and Analysis

Mentioned in the section "Methods And Evaluation Criteria".

Supplementary Material

No supplementary material was provided. The appendix attached to the main content is reviewed.

Relation to Prior Work

It is based on the previous model GCNII [1] and wide & deep [2].

[1] Chen, Ming, et al. "Simple and deep graph convolutional networks." International conference on machine learning. PMLR, 2020.

[2] Cheng, Heng-Tze, et al. "Wide & deep learning for recommender systems." Proceedings of the 1st workshop on deep learning for recommender systems. 2016.

Missing Essential References

  1. The authors mentioned on page 5 that "We suggests that graph can be viewed as a form of static, discrete self-attention mechanism (Vaswani et al., 2017)." Actually, this is a well-known fact that "the attention matrix can be viewed as a fully-connected weighted graph", e.g., in papers [1-3]

[1] Chen, Dexiong, Leslie O’Bray, and Karsten Borgwardt. "Structure-aware transformer for graph representation learning." International conference on machine learning. PMLR, 2022.

[2] Zaheer, Manzil, et al. "Big bird: Transformers for longer sequences." Advances in neural information processing systems 33 (2020): 17283-17297.

[3] Wang, Sinong, et al. "Linformer: Self-attention with linear complexity." arXiv preprint arXiv:2006.04768 (2020).

Other Strengths and Weaknesses

  1. I think the so-called "over-generalization" is interesting, but it may only be due to the dropout, so the models in training and validation are different. Also, such a difference accumulates when the model is deeper, so the 64-layer GCNII's training error is larger than the validation error (Figure 1).

  2. If I understand correctly, the difference between the proposed GCNIII and the existing model GCNII is that this paper has an extra module "Intersect memory", which is essentially an attention module (line 195). However, incorporating the self-attention layer with the graph convolution layer is not novel, as many graph transformers have similar architecture [1], named the "Parallel architecture".

[1] Min, Erxue, et al. "Transformer for graphs: An overview from architecture perspective." arXiv preprint arXiv:2202.08455 (2022).

Other Comments or Suggestions

  1. Many notations are wrapped with {}. E.g., {\psi} in Eq. (7). It is unclear why they are presented like this.

  2. I think the notation in Eq. (12) is problematic. The notation \tilde{G} is a matrix (line 110), but it is also used as a function in Eq. (12).

Author Response

We thank you for your reviews and address your concerns as follows.

Q1: The authors claimed on page 5 that "The former is 0.0018, while the latter is 0.8423, which demonstrates that GCNII’s “attention” captures more information, leading to stronger generalization." However, it is not clear what is the connection between the "attention weights", "information", and "generalization".

A1: Thank you for your questions and suggestions; we will provide additional explanations. The attention matrix corresponds to $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})$. We formally define attention density as the proportion of non-zero elements in this matrix. Higher density indicates that a node's feature representation aggregates information from more other nodes, thereby capturing broader contextual information and enhancing generalization capacity (details in Appendix G).
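
A minimal sketch of this definition (the helper name `attention_density` is hypothetical):

```python
import torch

def attention_density(attn: torch.Tensor) -> float:
    """Proportion of non-zero entries in an attention matrix."""
    return (attn != 0).sum().item() / attn.numel()

# A sparse adjacency-style prior "attention" vs. a dense softmax attention.
A = torch.eye(4)                              # density 0.25
S = torch.softmax(torch.randn(4, 4), dim=-1)  # density 1.0 (softmax outputs are all positive)
print(attention_density(A), attention_density(S))
```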

Q2: On page 4, lines 198, the paper mentioned "The attention matrix between the nodes is the adjacency matrix" which is confusing.

A2: Sorry for the confusion caused by our inaccurate description. The expression should be changed to 'The attention relationship between nodes can be obtained from the adjacency matrix'. If a self-attention mechanism were applied to all nodes in the graph, the attention matrix would usually have to be learned from node features; the graph itself, however, already contains the relational information among nodes. A node's neighbors constitute prior attention, so the adjacency matrix tells us which other nodes a given node should attend to and can therefore be used as an attention matrix.

Q3: The baseline methods used in this paper are pretty old.

A3: Thank you very much for your suggestion. We conducted comparative experiments between GCNIII and recent baseline methods, with results available at https://anonymous.4open.science/r/GCNIII_supplement.

Q4: The proposed method incorporates LLM into the framework. Hence, for a fair comparison, many LLM-incorporated node classification methods should be compared, such as [1-2].

A4: Thank you for highlighting these key references. We state at the beginning of Section 3 that incorporating large language models (LLMs) into the framework is only a technical idea, not the focus or main contribution of this paper. We focus on the analysis and improvement of GCNII. If the LLM-related content affects the clarity of the whole paper, we will consider removing it.

Q5: The authors mentioned on page 5 that "We suggests that graph can be viewed as a form of static, discrete self-attention mechanism (Vaswani et al., 2017)." Actually, this is a well-known fact that "the attention matrix can be viewed as a fully-connected weighted graph", e.g., in papers [1-3].

A5: Thank you for pointing out this important reference; we have cited this paper. We think that treating the attention mechanism as a fully connected graph is not the same insight as using the graph as a static attention mechanism. A priori attention that does not need to be learned is an important reason for the over-generalization phenomenon in deep GCNs: although the training stage has not yet ended, the accuracy on the validation set already far exceeds the accuracy on the training set.

Q6: I think the so-called "over-generalization" is interesting, but it may only be due to the dropout, so the models in training and validation are different.

A6: Thanks for your recognition. We believe that the over-generalization phenomenon has a great impact on understanding and exploring the direction of deep GCNs, and further exploration is needed to fully understand its causes and mechanisms.

Q7: the difference between the proposed GCNIII and the existing model GCNII is that this paper has an extra module "Intersect memory", which is essentially an attention module (line 195).

A7: The most important aspect of GCNIII is the introduction of the linear model and joint training. Intersect memory, Initial residual and Identity mapping are three optional techniques (hyper-parameters), not necessities. Compared to the fixed-structure GCNII, the GCNIII framework is more flexible and can better adapt to different datasets.

Q8: Many notations are wrapped with {}. E.g., {\psi} in Eq. (7). It is unclear why they are presented like this.

A8: We state in lines 117-119 that {} represents non-essential modules that need to be adjusted for different datasets and tasks. We found that these often overlooked components can have a big impact on how models perform on different datasets, so the GCNIII framework was designed to be more flexible.
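
A purely illustrative sketch of how such dataset-dependent, non-essential modules could be exposed as hyper-parameters; the flag names and True/False values below are arbitrary placeholders, not the settings used in the paper (those are documented in its Appendix B).

```python
# Hypothetical per-dataset toggles for the {}-wrapped optional techniques.
OPTIONAL_MODULES = {
    "cora":   {"intersect_memory": True,  "initial_residual": True,  "identity_mapping": False},
    "pubmed": {"intersect_memory": False, "initial_residual": True,  "identity_mapping": True},
}

def optional_modules_for(dataset: str) -> dict:
    """Return which optional techniques to enable for a given dataset (default: none)."""
    return OPTIONAL_MODULES.get(
        dataset,
        {"intersect_memory": False, "initial_residual": False, "identity_mapping": False},
    )
```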

Q9: I think the notation in Eq. (12) is problematic. The notation $\tilde{G}$ is a matrix (line 110), but it is also used as a function in Eq. (12).

A9: The notation $\tilde{G}$ is still a matrix in Eq. (12) rather than a function. $A_{IM}$ in Eq. (12) also represents a generalized attention matrix rather than a function.

Reviewer Comment

I appreciate the authors' detailed response and apologize for my late reply. The response solves some of my concerns, and I will list the remaining ones with some suggestions.

For A3, more baselines are included, but they are on Cora, Citeseer, and Pubmed only. You used many more datasets in your paper, some of which are heterophilic, and there are many new methods for those. You might need to include the most notable baselines among them.

For A4, yes, please pay special attention to that, as the current organization is very confusing.

For A6, as I mentioned in my original review, "over-generalization" sounds interesting but definitely needs a serious and rigorous explanation. The current explanation is not convincing, in my view.

I will keep my original evaluation, and I believe the current version is below the bar for acceptance.

Review (Rating: 2)

This paper proposes a new model, GCNIII, which aims to more effectively balance the trade-off between over-fitting and over-generalization. The framework incorporates three key techniques: intersect memory, initial residual and identity mapping. Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed framework.

Questions for Authors

To achieve a more effective balance between over-fitting and over-generalization, have you conducted comparative experiments with state-of-the-art models specifically designed for heterogeneous datasets?

Claims and Evidence

This work focuses on achieving a more effective balance in the trade-off between over-fitting and over-generalization. To validate the framework's performance, the authors conduct experiments on both heterophily and homophily datasets, demonstrating the framework's effectiveness across different graph characteristics.

Methods and Evaluation Criteria

The proposed techniques demonstrate limited innovation in their conceptual design. Specifically, the intersect memory mechanism appears overly simplistic in its implementation. Furthermore, both the initial residual and identity mapping components show substantial similarity to existing methodologies, raising concerns about their novelty and originality.

Theoretical Claims

Yes. The proof of Theorem 4.1.

Experimental Design and Analysis

While the experimental methodology is scientifically sound, the results reveal certain limitations. The semi-supervised node classification experiment demonstrates only marginal improvements over baseline methods. Furthermore, the ablation study yields mixed findings: while it confirms the necessity of the initial residual component, it paradoxically shows inconsistent performance when either the intersect memory or identity mapping components are removed, which contradicts the authors' original hypotheses.

Supplementary Material

All.

Relation to Prior Work

This paper addresses a valuable and challenging research topic: achieving an effective balance between over-fitting and over-generalization. However, the proposed implementation demonstrates limited novelty, primarily building upon existing methodologies without substantial innovation. Furthermore, the experimental results show only marginal improvements, suggesting that the practical impact of the proposed approach may be limited.

Missing Essential References

In my understanding, there are none.

Other Strengths and Weaknesses

None.

Other Comments or Suggestions

None.

Author Response

Thanks a lot for your review, suggestions, and questions. We have carefully addressed your concerns below.

Q1: The proposed techniques demonstrate limited innovation in their conceptual design. Specifically, the intersect memory mechanism appears overly simplistic in its implementation. Furthermore, both the initial residual and identity mapping components show substantial similarity to existing methodologies, raising concerns about their novelty and originality.

A1: Thank you for your criticism and concern. Intersect memory is merely an optional technique (hyper-parameter) in our model, while our approach focuses on bringing linear models in and conducting joint training. Despite its apparent simplicity, this design proves remarkably effective in balancing the trade-off between over-fitting and over-generalization. The novelty of our work is primarily reflected in the following aspects. We revisit two highly influential works (with ~2000 citations) on deepening GCNs, APPNP and GCNII, and we are the first to identify the over-generalization phenomenon. We argue this phenomenon holds significant implications for understanding both deep GCNs and even the fundamental mechanism of graph convolution. The reason why GCNIII can alleviate the over-generalization phenomenon is the linear model's memory of node features during the training stage. Appendix E confirms that node features are critical to the effectiveness of GCN in node classification tasks (often overlooked in previous studies), and Appendix C confirms that linear models also have strong node classification capabilities on these datasets. The analysis in Section 4 offers a novel interpretation of the deep GCNII model and provides the insights that led to our proposal of GCNIII. The deep GCNII can be viewed as a chimera of multiple models, specifically 2-MLP, 1-GCN, 2-GCN, ..., k-GCN, where models closer to the output layer play a more significant role. Previous research on GNNs has never explored using the simplest linear model (2-MLP is considered in [1]). We argue that the most straightforward yet effective way to address GCNII's limitations is to jointly train a linear model with GCNII. This approach is empirically validated by our experimental results (Figure 5 and 'Over-Generalization of GCNIII' in Section 6.2).

Q2: The semi-supervised node classification experiment demonstrates only marginal improvements over baseline methods. Furthermore, the ablation study yields mixed findings: while it confirms the necessity of the initial residual component, it paradoxically shows inconsistent performance when either the intersect memory or identity mapping components are removed, which contradicts the authors' original hypotheses.

A2: Thanks for your criticism and questions. GCNIII demonstrates consistent improvements over the baseline, with detailed experimental results provided in Appendix F. The model achieves peak accuracy of [86.1, 74.0, 81.4] on these three datasets respectively. The results obtained in the ablation experiment are not contradictory. As detailed in Section 3.2, the three techniques - Intersect Memory, Initial Residual, and Identity Mapping - serve as optional hyper-parameters in GCNIII that do not universally improve performance. Our analysis reveals these techniques are dataset-dependent, demonstrating GCNIII's enhanced flexibility over GCNII. Details of the use of these three techniques in all experiments are documented in Appendix B.

Q3: To achieve a more effective balance between over-fitting and over-generalization, have you conducted comparative experiments with state-of-the-art models specifically designed for heterogeneous datasets?

A3: Thank you for your questions and suggestions. The over-generalization phenomenon we identified is specific to deep GCN models (such as the classic works APPNP, GCNII, etc.), which are primarily designed for homophilic and heterophilic graphs. The heterogeneous graphs you mention (containing different types of nodes and edges) are outside the scope of this study. Notably, neither the literature we cited nor the recent works discussed in Section 5 ('Other Related Work') perform comparative experiments on heterogeneous graph benchmarks. As far as we know, homogeneous graphs and heterogeneous graphs are two different research fields.

Review (Rating: 4)

This work proposes a new GNN architecture that uses the Wide & Deep neural network framework to address the problems of over-fitting and over-generalization occurring in current deep graph neural networks; it combines a linear model on the initial node features with deep graph convolution layers. In particular, the root of over-generalization in GCNII is investigated. Numerical experiments are conducted on semi-supervised and full-supervised node classification tasks to verify the effectiveness of the proposed model.

update after rebuttal

The authors have addressed my concerns. So I have updated the score.

Questions for Authors

  1. In the 'attention is all your need' part, they calculate the attention density values for both on Cora with \alpha set to 0.1, ? I guess something is missing here.
  2. In Table 5, does the ablation study on the wide component use intersect memory?

Claims and Evidence

In this work, the proposed GCNIII is designed to address over-fitting and over-generalization simultaneously. However, there is a lack of comparison between the training error and the validation error (as in Figure 1) for the proposed method; only training losses are reported in Figure 5, for three different models.

Methods and Evaluation Criteria

The proposed method and evaluation criteria make sense.

Theoretical Claims

I have checked the correctness of the proof of Theorem 4.1.

Experimental Design and Analysis

Checked.

Supplementary Material

I have reviewed the proof of Theorem 4.1 and supplementary experiments on linear models, and OOD of GCNII.

Relation to Prior Work

Long-range neighbor unreachability and the over-smoothing of deep convolution put the choice of GNN architectures in a dilemma, which drives the emergence of deep GNNs represented by GCNII. This paper uncovers the shortcoming of GCNII in generalizability, namely the so-called over-generalization. To address this issue, a wide component is introduced to optimize node representation learning.

Missing Essential References

The related works are essential for understanding the contribution of this work.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. Wide & deep architecture is introduced to GNNs.
  3. The experimental settings are clearly demonstrated.

Weaknesses:

  1. Over-generalization is the focus of this work. How the proposed model resolves this problem needs to be fully justified with theoretical analysis/empirical results, besides the discussion in Sec. 4.
  2. No results and explanation (if unavailable) about semi-supervised learning on heterophilic graphs.
  3. The limitations of the proposed method should be discussed.

Other Comments or Suggestions

  1. It is claimed in the last lines of the left column, page 2, that wide & deep can more effectively balance the trade-off between over-fitting and over-generalization, which lacks justification.
  2. Why does GCNIII achieve only marginal improvements compared to GCNII on homophilic graphs for both semi-supervised and full-supervised tasks? Is the superiority of GCNIII on heterophilic graphs attributed to the wide component, since this component is less relevant to the graph structure (which is semantically inconsistent with node attributes)? Does intersect memory participate in this scenario?
Author Response

We thank you for your reviews and address your concerns as follows. 

Q1: There is a lack of comparison between the training error and the validation error (as in Figure 1) for the proposed method; only training losses are reported in Figure 5 for three different models.

A1: Regarding Figure 5: To save space, we only plotted the training error comparison to demonstrate that our GCNIII model effectively mitigates over-generalization as the hyper-parameter λ increases, without suffering from over-fitting like shallow GCNs. The over-generalization phenomenon we found highlights a discrepancy between training and validation/test performance in deep GCNII, suggesting that the core issue lies in the training phase rather than the validation phase. We appreciate your suggestion and will include the complete comparison figures in the appendix to save space.

Q2: How the proposed model resolves this problem needs to be fully justified with theoretical analysis/empirical results, besides the discussion in Sec. 4.

A2: The reason why GCNIII can alleviate the over-generalization phenomenon is due to the linear model's memory ability of node features during the training stage. Appendix E confirms that node characteristics are critical to the effectiveness of GCN in node classification tasks (often overlooked in previous studies), and Appendix C confirms that linear models also have strong node classification capabilities on these datasets. The analysis in sec.4 offers a novel interpretation of the deep GCNII model and provides insights that led to our proposal of GCNIII. The deep GCNII can be viewed as a chimera of multiple models, specifically 2-MLP, 1-GCN, 2-GCN, ..., k-GCN, where models closer to the output layer play a more significant role. Previous research on GNNs has never explored using the simplest linear model (2-MLP is considered in [1]). We argue that the most straightforward yet effective way to address GCNII’s limitations is to jointly train a linear model with GCNII. This approach is empirically validated by our experimental results (Figure 5 and 'Over-Generalization of GCNIII' in Section 6.2).

Q3: No results and explanation (if unavailable) about semi-supervised learning on heterophilic graph.

A3: As an improvement to GCNII, we used the same datasets of semi-supervised node classification as in the GCNII paper[2] for comparative experiments. The papers we cited, including some recent related studies mentioned in sec.5, also did not carry out semi-supervised learning on heterophilic graphs.

Q4: The limitations of the proposed method should be discussed.

A4: In our proposed method, the three techniques of Intersect memory, Initial residual and Identity mapping are treated as hyper-parameters that need to be selected according to the dataset, and the hyper-parameter γ must also be chosen very carefully when balancing over-fitting and over-generalization.

Q5: It is claimed in the last lines of left column, page 2, that wide & deep can more effectively balance the trade-off between over-fitting and over-generalization, which lacks of justification.

A5: In our original paper, we said 'a Wide & Deep architecture model as shown in Figure 2', meaning GCNIII. Specific evidence for balanced trade-offs can be seen in 'Over-Generalization of GCNIII' in Section 6.2 and [Q2-A2].

Q6: Why does GCNIII achieve only marginal improvements compared to GCNII on homophilic graphs for both semi-supervised and full-supervised tasks? Is the superiority of GCNIII on heterophilic graphs attributed to the wide component, since this component is less relevant to the graph structure (which is semantically inconsistent with node attributes)? Does intersect memory participate in this scenario?

A6: GCNII remains a highly competitive model. While only marginal gains are observed on homophilic graphs, our method achieves significant improvements on heterophilic graphs. This stems from its more principled structural design, demonstrating GCNIII's superiority over GCNII in heterophily scenarios. Details about the use of intersect memory can be found in Appendix B.

Q7: In the 'attention is all your need' part, they calculate the attention density values for both on Cora with \alpha set to 0.1, ? I guess something is missing here.

A7: Sorry, we don't fully understand your question. In our study, attention density value is defined as the proportion of non-zero elements in the attention matrix.

Q8: In Table 5, does the ablation study on the wide component use intersect memory?

A8: Sorry for the confusion caused by our unclear expression. In the original paper, the term 'a basic linear classification model' refers to a linear model that does not include intersect memory.

[1]Yang, C., Wu, Q., Wang, J., and Yan, J. Graph neural networks are inherently good generalizers: Insights by bridging gnns and mlps. In ICLR, 2023.

[2]Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks. In ICML, 2020.

Final Decision

This work proposes an improved graph neural network (GNN) architecture, building on existing models such as GCNII. The new architecture, denoted GCNIII, incorporates a Wide & Deep architecture, intersect memory, initial residual, and identity mapping to balance the trade-off between overfitting and over-generalization. The model combines linear and deep graph convolution layers to effectively make use of node features. The work also investigates the connection between attention weights, information capture, and generalization, although some aspects of this relationship may require further clarification. Overall, the proposed architecture is validated through experiments on benchmark datasets.

After reviewing all submitted reviews, I found Reviewer toWM's assessment to be particularly informative. One potential area for improvement is that the work could benefit from a more distinct and overarching contribution that ties together its various components and advances the field in a more significant way. As is, the paper looks like a small contribution that needs more work to become a full paper.