Adaptive Message Passing: A General Framework to Mitigate Oversmoothing, Oversquashing, and Underreaching
We propose a message-passing framework for graph learning that adapts the number of layers during training and filters outgoing messages, in order to control oversmoothing, oversquashing, and underreaching.
Abstract
Reviews and Discussion
This work addresses the challenge of modeling long-range interactions in deep graph networks, which are often hindered by oversmoothing, oversquashing, and underreaching in message passing. The authors propose a variational inference framework that adaptively adjusts message depth and filters information to mitigate these limitations. The approach is theoretically and empirically validated, demonstrating superior performance on five node and graph prediction datasets. This method enhances the ability of deep graph networks to capture long-range dependencies without explicitly modeling interactions.
Questions For Authors
-
Could the authors clarify the rationale behind selecting these specific tasks as benchmark objectives? Given that they differ from the benchmarks used for comparison with baselines like IPR-MPNN, I have concerns about the representativeness and competitiveness of these tasks in evaluating the proposed method.
-
Have the authors considered including domain-specific models designed to handle long-range interactions as additional baselines? While I am not certain whether such methods exist for all tasks in this paper (e.g., peptide property prediction), there has been extensive research in machine learning force fields addressing similar challenges. One relevant example is:
[1] Li Y, Wang Y, Huang L, et al. Long-short-range message-passing: A physics-informed framework to capture non-local interaction for scalable molecular dynamics simulation. arXiv preprint arXiv:2304.13542, 2023.
Claims And Evidence
The content has already been provided in the subsequent subsections.
Methods And Evaluation Criteria
The content has already been provided in the subsequent subsections.
Theoretical Claims
The content has already been provided in the subsequent subsections.
Experimental Designs Or Analyses
The content has already been provided in the subsequent subsections.
Supplementary Material
The authors provide code, configurations, and data in the supplementary materials.
Relation To Broader Scientific Literature
The content has already been provided in the subsequent subsections.
Essential References Not Discussed
The content has already been provided in the subsequent subsections.
Other Strengths And Weaknesses
Advantages:
- The paper is well-written, clear, and easy to read.
- The experiments are thorough and directly correspond to the target problems introduced, making the findings highly convincing.
- The paper extends variational methods from neural architecture search to GNNs and provides detailed theoretical proofs.
- The authors have released the source code, data, and configuration files, ensuring high credibility and reproducibility.
Weaknesses:
- The introduction focuses heavily on the limitations of standard GNNs, but issues like oversmoothing, oversquashing, and underreaching have already been extensively studied. Given the comparative experiments provided, what is the paper’s unique motivation in this context?
- Based on Table 2, AMP's performance is comparable to IPR-MPNN, with the key difference being whether rewiring is used. Has the paper clearly explained the advantage of ‘not requiring rewiring’?
- Has the impact of AMP’s dynamic depth on efficiency been sufficiently evaluated in the results section? Similarly, how does its efficiency and complexity compare with other methods? For instance, if AMP tends to choose deeper networks for certain tasks, does this lead to a significant drop in overall efficiency?
Other Comments Or Suggestions
The content has already been provided in the subsequent subsections.
We thank the reviewer for recognizing that the paper is well written and the findings are convincing. We will comment below on the questions raised.
Other Strengths And Weaknesses:
- While oversmoothing, oversquashing, and underreaching are topics of great interest in the community, which are far from being fully understood or solved, our paper’s unique motivation is that, by combining the ability to learn the depth with the adaptive filtering of messages, we want to determine i) the number of required layers and ii) which messages should be propagated. In other words, our contribution is unique in that it determines, during training, how much and when to send information. We hope to have clarified this point.
- In Appendix D, we provide a critical view on how the oversquashing term is defined and understood by our community. In particular, we argue that if oversquashing refers to “informational bottlenecks”, then additive rewiring may hurt performance if the same number of layers is kept; in contrast, oversquashing defined as sensitivity is always improved by additive rewiring. For this reason, we believe that AMP and rewiring-based approaches are different but equally valid approaches to achieve long-range propagation. We will clarify these aspects in the paper, and we hope the reviewer appreciates our considerations. Thanks to the suggestion of Reviewer ZB4Z, we now have a new Theorem inspired by Di Giovanni et al., 2023 that exposes this inconsistency in the literature!
- In terms of the impact of depth on efficiency, we refer to the computational complexity analysis in our response to Reviewer XraS. We will add these considerations to the paper. Overall, the complexity of using a deeper network increases due to i) the higher parametrization, as usual; ii) the need for a layer-wise readout in AMP, as also done in previous works [JK-Net, Xu et al., 2018].
In addition, we prepared an ablation study where we try different (fixed) depths for AMP, maintaining the other architectural changes intact. In particular, we allow AMP to learn a normalized importance over the fixed number of layers. Here, we want to see the impact of dynamically learning the depth on effectiveness. The depths we tried depend on the range shown in Figure 4 (so, up to 45 different depths for a single model and dataset), while the other hyper-parameters were fixed to the best ones we found for the task. The table is shown below for the peptides datasets:
| | Func | Struct |
|---|---|---|
| AMP_GCN | 0.7076 (0.0059) | 0.2497 (0.0009) |
| AMP_GINE | 0.6999 (0.0041) | 0.2481 (0.0014) |
| AMP_GATEDGCN | 0.6750 (0.0029) | 0.2493 (0.0013) |
It appears that learning the importance of layers in a fixed-depth network does not yield better performance than the fully adaptive AMP, and it potentially requires many more configurations to be tried, which is the main disadvantage of treating the depth as a hyper-parameter rather than a learnable parameter (as we do).
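For clarity on what this ablation does, a minimal PyTorch-style sketch of the fixed-depth variant with learned, normalized layer importance is given below. It is illustrative only; class and argument names are placeholders and do not reflect our actual code.

```python
import torch
import torch.nn as nn

class FixedDepthLayerImportance(nn.Module):
    """Ablation sketch: a fixed stack of message-passing layers whose per-layer
    readouts are combined via a learned, softmax-normalized importance vector
    (the depth itself is not learned)."""

    def __init__(self, conv_layers, readouts):
        super().__init__()
        self.convs = nn.ModuleList(conv_layers)        # fixed number of MPNN layers
        self.readouts = nn.ModuleList(readouts)        # one readout per layer
        # unnormalized importance logits, one per layer, learned by backprop
        self.importance_logits = nn.Parameter(torch.zeros(len(conv_layers)))

    def forward(self, x, edge_index):
        weights = torch.softmax(self.importance_logits, dim=0)
        out, h = 0.0, x
        for w, conv, readout in zip(weights, self.convs, self.readouts):
            h = conv(h, edge_index)                    # standard message passing
            out = out + w * readout(h)                 # importance-weighted readout
        return out
```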
Questions For Authors: The specific tasks are chosen because there is reason to believe long-range information plays an important role. In other words, all our tasks require the ability to “reason” globally about the graph and have been specifically selected for this reason. In addition, we used the molecular tasks of the LRGB paper because it is possible to perform a fair evaluation on these tasks, thanks to the robust re-evaluation of Tönshoff et al. IPR-MPNN's evaluation on additional benchmarks is necessary to empirically validate their increased expressive power compared to 1-WL MPNNs, whereas we believe we already have sufficient empirical evidence that AMP can be beneficial to improve the performance of MPNNs on long-range tasks; AMP always improves the results compared to any of its base versions on all datasets.
An important clarification about AMP’s competitiveness: our goal is not to demonstrate that AMP surpasses the state of the art, as AMP is not a model but a framework that can be wrapped around most MPNNs. In our paper, we used simple MPNNs for their ease of use, showing the performance gains that AMP grants without changing a single line of code of these convolutional layers. In our view, the correct metric to assess AMP’s effectiveness is the gap in performance compared to the base versions AMP is wrapped around, showing that it helps to better capture long-range dependencies.
Finally, thank you for the reference, we will include it in the revised manuscript. We were unable to find such models for the considered datasets, but we should definitely mention architectures specific to molecular tasks in our work, as they are relevant due to the electrostatic interactions necessary to model energies and forces.
Conclusion: We will report these clarifications and additional statements in the paper. We hope the reviewer appreciates our response and hopefully will increase the score.
Thanks for the authors' reply, I will raise my recommendation to 3.
Graph Neural Networks (GNNs) often struggle to capture long-range dependencies in graphs due to challenges such as oversmoothing, oversquashing, and underreaching. In this work, the authors introduce a variational inference framework that allows GNNs to dynamically adapt their depth and selectively filter message passing. The proposed approach is supported by theoretical insights and empirically validated on multiple graph benchmarks.
Questions For Authors
- At first glance, [1,2] seem to be conceptually very similar to the approach in this paper. Could the authors emphasize the major differences/novelties compared to these approaches?
- How expensive is the whole probabilistic modeling process? The modeled distribution has to account for layers, parameters and graphs at the same time, how does the computational complexity compare to GNN variants that do not have to model these components?
- Since a distribution has to be modeled, is there a trade-off in case of scarce high-dimensional data where distribution-modeling is hard?
Claims And Evidence
The claims made in the paper are partially supported by experimental evidence. The results presented in Tables 1 and 2, as well as the synthetic tasks, demonstrate performance improvements when the authors' approach is integrated with popular GNN variants. On the other hand, the method does not surpass the state of the art (as claimed) in Table 2. It is also not clear how underreaching can be mitigated when messages are filtered, which could rather introduce underreaching, since the propagation of information is thereby prevented.
Methods And Evaluation Criteria
The proposed probabilistic message passing framework is well-motivated for the graph-processing methodology, with a clear intuitive rationale for its potential to enhance performance. The selection of benchmark datasets and evaluation metrics is appropriate and aligns with standard practices in assessing graph-based methods.
Theoretical Claims
The proof of Theorem 3.1 is intuitive and straightforward.
Experimental Designs Or Analyses
The experimental setup appears to be well-structured for evaluating the stated claims. I could not find a detailed description of the hyperparameter selection process, including dataset splits, search methodology, and search grid specification for all tasks and methods, however.
Supplementary Material
I reviewed the theory section to understand the proof of Theorem 3.1. Additionally, the other sections of the Appendix seem to be comprehensive.
Relation To Broader Scientific Literature
The primary contributions of this paper pertain to probabilistic approaches to graph rewiring [1,2] and established GNN methods, which are utilized as baselines for comparison.
[1] Qian et al. Probabilistically rewired message-passing neural networks, 2024
[2] Qian et al. Probabilistic graph rewiring via virtual nodes, 2024
Essential References Not Discussed
To enhance the comprehensiveness of the study, it would be beneficial to discuss related approaches or incorporate them as baseline methods for comparison. For instance, [3] introduced Graph-Mamba, an adapted state-space model designed to facilitate long-range information propagation in graph data. Similarly, theoretically grounded sequence-processing frameworks [4,5], leveraging randomized signatures from rough path theory, demonstrated promising potential in alleviating oversquashing effects in large graphs.
[3] Wang et al. Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces, 2024
[4] Toth et al. Capturing Graphs with Hypo-Elliptic Diffusions, 2022
[5] Gruber et al. Processing large-scale graphs with G-Signatures, 2024
Other Strengths And Weaknesses
None
Other Comments Or Suggestions
None
We thank the reviewer for the comments. Below, we clarify some of the points raised.
Claims And Evidence: We apologize for the incorrect statement in the abstract about “surpassing” the state of the art. That statement was true in the past, before we revised Table 2. Indeed, we had fixed our claims elsewhere in the paper (lines 80-83, 242-244, 377-379) by talking about “competitive” performances.
However, the goal of the analysis is not to show absolute improvements compared to recent works. Rather, it is to show, as the reviewer correctly identified, that wrapping the AMP framework around basic MPNNs grants a performance improvement in capturing long-range dependencies, due to the particular characteristics of AMP that allow MPNNs to mitigate oversmoothing and oversquashing, as measured in Figure 3, in addition to the adaptive learning of the depth to mitigate underreaching. It is our belief that our work should be judged in terms of these contributions, and not solely in terms of absolute numbers, since we, for ease of prototyping, chose to wrap AMP around classical MPNNs rather than more recent and convoluted ones.
In terms of mitigating under-reaching, it is possible that the network finds a degenerate solution during training that corresponds to an unsatisfactory depth, but this is similar to how MLPs can end up in poor local minima when trying to learn functions. The design of AMP enables learning the depth and we see that it works empirically.
Finally, in our results, it seems the method learns more layers than those tried in previous works using grid search.
Experimental Designs Or Analyses: Thanks for pointing this out, we partly referred to the hyper-parameter ranges and data splits of Gravina et al. and Tönshoff et al. without explicitly mentioning them in the paper. To improve self-containment and reproducibility, we will add this information to Appendix E. The search methodology was a standard grid search with early stopping on validation performance.
Relation To Broader Scientific Literature: We would like to clarify that our approach does not pertain to probabilistic graph rewiring approaches; it is probabilistic, but it does not perform a topological rewiring of the graph. The graph topology remains the same, and messages are partially filtered to mitigate the oversmoothing and oversquashing issues. Concretely, we never add new edges to the input graph: please also refer to our answer to question 1 below.
Essential References Not Discussed: Thank you for providing these additional references, we agree with the reviewer’s analysis and we will include and discuss them in the revised manuscript. Briefly, in terms of main differences with AMP, [3] relies on a separate sequence model to develop a node selection mechanism whereas we work on the message passing itself; [4] develops a new graph Laplacian that is better suited for long-range propagation; [5] converts a graph into a latent representation that can be passed to downstream classifiers.
Questions For Authors:
- As we also clarified earlier, our approach differs significantly from rewiring approaches, including probabilistic ones. In AMP, the probabilistic formulation is used to dynamically learn the depth (following Nazaret and Blei), something which rewiring approaches cannot do, whereas the “filtering of messages” is not to be intended as a variation of the original graph topology, but rather as a “node-based” filtering of outgoing messages. The major difference is that AMP alters the computational graph but not the graph topology, whereas typical graph rewiring methods alter the graph topology. In Appendix D, we have discussed this through the lens of oversquashing. A small illustrative sketch of this node-based filtering follows this list.
- Modeling a one-dimensional distribution leads to no particular overhead during training. We point the reviewer to our answer to Reviewer XraS for a discussion on computational complexity. In short, the main complexity is caused by the addition of one readout per layer, something which was also done in previous works such as JK-Net (Xu et al., 2018). In terms of training, optimizing the ELBO reduces to performing backpropagation w.r.t. a standard prediction loss, plus one (optional) weight regularization term and another (optional) depth regularization term. The overall asymptotic complexity is therefore not different from other GNNs that appeared in the literature in previous years.
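For concreteness, here is a minimal, PyTorch-style sketch of what such node-based filtering of outgoing messages could look like on top of an unchanged topology. It is purely illustrative and not our exact implementation; the gate network and its placement are assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn

class FilteredMessagePassing(nn.Module):
    """Illustrative sketch: each node learns a gate in [0, 1] that scales the
    information it sends out. The edge set (graph topology) is never modified."""

    def __init__(self, conv, hidden_dim):
        super().__init__()
        self.conv = conv                               # any off-the-shelf MPNN layer
        self.gate = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x, edge_index):
        g = self.gate(x)                               # one scalar gate per node
        filtered = g * x                               # damp outgoing information
        # message passing runs on the ORIGINAL edge_index: no rewiring involved
        return self.conv(filtered, edge_index)
```

Contrast this with rewiring, which would add or remove entries of `edge_index` instead of gating node states.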
Conclusion: We hope to have clarified the goal of our contribution and to have addressed the other questions. We would appreciate it if the reviewer could consider increasing the score in light of these considerations and revisions. We remain available should further information be needed.
I thank the authors for their reply and raise my score to 3.
This paper introduces Adaptive Message Passing (AMP), a novel approach to enrich GNNs with learnable depth and message filtering distributions. A variational inference framework is adopted to jointly train networks representing these distributions, with both mechanisms being applied to a range of existing GNN baselines as wrapper-style enhancements. The paper also "introduce[s] new families of distributions with specific properties to overcome previous limitations" in order to facilitate the variational framework, which are outlined in the appendix. The main aim of the work is to dynamically learn these distributions such that models overcome oversquashing, under-reaching, and oversmoothing (OUO). To demonstrate the effectiveness of this approach, AMP is tested on 2 datasets that represent tasks with long-range dependencies. An analysis of the depth and filtering distributions shows the distributions dynamically adapt for the task and various baseline GNNs.
Questions For Authors
It's unclear to me why the MLPs that output the weights over the layer depths and the sigmoid MLP for message filtering can't just be trained using conventional methods?
Claims And Evidence
The paper claims to "provide(s) a general framework for improving the ability of any message passing architecture to capture long-range dependencies", which is empirically shown on the selected datasets.
It also claims the framework "mitigates" OUO, which is somewhat supported by the analysis in Figure 3; however, I would say this could be improved with a more consistent colouring scheme for GCN vs AMP_GCN for each task. It might also be more informative to show the standard GCN DE decay for a larger number of layers. It is also unclear why the Dirichlet energy becomes negative.
There is an extensive discussion on OUO in Section App. D with some conjecture, but I think the paper is missing a theoretical analysis in the style of the cited papers (more is suggested below).
One step in this direction is Theorem 3.1, but in my opinion the property shown in Theorem 3.1, "AMP's ability to propagate a message unchanged from two connected nodes", is potentially trivial and not useful in the learning setting. It just shows the existence, by construction, of a message passing scheme for one graph sample between 2 nodes, essentially choosing the necessary weights to maintain the feature vector within the ball.
Methods And Evaluation Criteria
As the paper claims to introduce a framework to overcome OUO on tasks requiring long-distance information propagation, I would say the chosen datasets fit the task. However, this might indicate a selection bias towards datasets for which the framework has been designed to perform well. I would like to see experiments performed on classical homophilic and heterophilic node classification tasks (i.e., Cora/Texas) to show the scheme is universal.
Theoretical Claims
As above, the proof of Theorem 3.1 seems correct, but the property seems unusable in practice and relies on construction. To demonstrate mitigation of OUO it would be better to focus on a gradient-based analysis, as I suggest below.
Experimental Designs Or Analyses
Experiments and ablations are suitable and executed well, but see my comments above.
Supplementary Material
Yes I went through the proof of theorem 3.1 and discussion on OUO.
Relation To Broader Scientific Literature
Due to the extensive literature review, the paper does a good job of positioning itself within the current literature w.r.t. adjustable depth, rewiring, and adaptive architectures, pointing out that here the architecture can be adaptively learned during training due to the variational formulation.
Essential References Not Discussed
I am satisfied a comprehensive literature review of related works has been performed. I would note some benchmarks in Table 1 are not cited.
Other Strengths And Weaknesses
I am borderline accept given the novel approach to overcoming the common problems of OUO, the promising results, and the informative empirical analysis. However, there are two weaknesses that, if addressed, I believe could significantly strengthen the paper: a proper theoretical analysis of the message passing scheme (as suggested below) and experiments on classical homophilic and heterophilic node classification tasks (i.e., Cora/Texas) to show the scheme is universal. It might be impressive to show where the framework learns whether long- or short-range information is relevant or useful. One might assume short/long distance for homo/heterophily respectively.
Other Comments Or Suggestions
I think the paper is missing a proper analysis that the framework overcomes oversmoothing. One suggestion might be to look at the proof of Theorem B.1 in "On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology" (Di Giovanni '23) and split the message passing matrix, isolating the "relevant" signal, as done in proof 3.1; being able to show that the sigmoid gating sharpens the signal on relevant pathways w.r.t. the node sensitivity might be a way to proceed. In addition, it might also be beneficial to replace the standard symmetric degree normalisation of GCN with a "sum of sigmoid" based normalisation, which would further sharpen the signal.
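For concreteness, a rough PyTorch-style sketch of what such a "sum of sigmoid" normalisation could look like (purely illustrative, not taken from the paper; tensor shapes and names are my own assumptions):

```python
import torch

def sigmoid_sum_normalize(messages, gates, dst_index, num_nodes):
    """Sketch: instead of GCN's symmetric degree normalisation, each incoming
    message is weighted by its sigmoid gate and divided by the sum of the gates
    over the destination node's incoming edges.
    messages: (E, d), gates: (E, 1) in [0, 1], dst_index: (E,) long tensor."""
    weighted = gates * messages                                   # gate each message
    denom = torch.zeros(num_nodes, 1).index_add_(0, dst_index, gates)
    aggregated = torch.zeros(num_nodes, messages.size(-1)).index_add_(
        0, dst_index, weighted)
    return aggregated / denom.clamp_min(1e-12)                    # per-node normalisation
```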
We sincerely thank the reviewer for the detailed review and the valuable suggestions. We will do our best to clarify doubts and address the points of the reviewer.
Claims And Evidence:
On Figure 3: We will improve the coloring scheme as suggested, thank you. Also, the reason why we do not show GCN results for more layers is that we report the best configuration on the validation set, but we can mention in the paper that it has already been proven how GCNs oversmooth in the long run. Finally, the Dirichlet Energy (DE) becomes negative simply because the y-axis is in log scale to aid readability, as DE values and sensitivities vary a lot.
Theoretical Analysis: We cannot thank the reviewer enough for the suggestion! We indeed extended Theorem 3.2 of Di Giovanni '23 to our message filtering scheme, and (in short), denoting the Lipschitz constant of our filtering function, the maximal element-wise value attained by the filtering function, and the maximal element-wise value of a node embedding, we arrive at a similar upper bound on the node sensitivity (using the notation of Theorem 3.2), in which a graph-dependent term controls the upper bound on sensitivity due to the topology of the graph. We are very happy with the Reviewer's suggestion because this result formally supports what we argued in Appendix D: if we consider for simplicity a constant filtering function and we filter enough, meaning the filtering values are sufficiently small, then filtering will decrease the sensitivity's upper bound. However, this actually helps to reduce the “informational oversquashing” defined in Alon et al., contradicting the widely used statement that “improving sensitivity mitigates oversquashing”. As a result, we argue that the community should probably distinguish the “informational oversquashing” of Alon et al. from the “topological oversquashing” of Topping et al. We will add these results to our paper. Thank you so much for the invaluable suggestion! This will help our community reason more clearly about these issues.
Methods And Evaluation Criteria: Following the reviewer's suggestion, we perform additional experiments with GCN and AMP_GCN on homophilic and heterophilic datasets. Risk assessment is performed using 10-fold Cross Validation, with an internal hold-out hyper-parameter tuning for both models. For each of the 10 folds, the best configuration on that fold is re-trained 10 times and the test values averaged. The final test score is the average of the test scores across the 10 folds. This procedure is inspired by [*] and results are shown below.
| | Cora | Citeseer | Pubmed | Texas | Wisconsin | Chameleon | Squirrel | Actor |
|---|---|---|---|---|---|---|---|---|
| GCN | 85.1 (2.0) | 72.3 (1.9) | 87.9 (0.9) | 51.9 (8.1) | 45.9 (4.1) | 43.4 (1.5) | 27.6 (1.1) | 28.5 (0.7) |
| AMP_GCN | 86.5 (1.7) | 75.3 (2.2) | 89.8 (0.5) | 78.6 (7.0) | 81.3 (7.1) | 49.8 (2.2) | 35.2 (1.8) | 34.8 (1.2) |
It appears that AMP allows GCN to improve results especially on heterophilic datasets. Therefore results seem to align with the reviewer’s intuition. We will add other common datasets and baselines in the revised version of the paper. Thank you for suggesting these extra experiments.
[*] Errica F., Podda M., Bacciu D., Micheli A., A fair comparison of graph neural networks on graph classification. ICLR 2020
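As a side note on the protocol above, the following Python sketch summarizes the assessment procedure; the callables passed in (split_holdout, select_best, train_and_test) are placeholders, not functions from our codebase.

```python
import numpy as np

def assess(configs, folds, split_holdout, select_best, train_and_test,
           n_retrainings=10):
    """Sketch of the risk-assessment protocol: outer 10-fold CV, an internal
    hold-out for model selection, 10 re-trainings of the selected configuration,
    and test scores averaged first within and then across folds."""
    fold_scores = []
    for train_set, test_set in folds:                          # outer 10-fold CV
        inner_train, inner_val = split_holdout(train_set)      # internal hold-out
        best_cfg = select_best(configs, inner_train, inner_val)
        retrained = [train_and_test(best_cfg, train_set, test_set)
                     for _ in range(n_retrainings)]
        fold_scores.append(np.mean(retrained))
    return float(np.mean(fold_scores)), float(np.std(fold_scores))
```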
Essential References Not Discussed: Thanks for pointing out that some references in the Table are not cited, we will amend that.
Remaining Sections: We hope to have addressed the weaknesses highlighted by the reviewer, namely the theoretical analysis and the extra experiments on node classification tasks on homo/heterophilic datasets. We will provide a complete proof of the extended version of Theorem 3.2 for AMP in the revised paper.
Questions For Authors: As a matter of fact, the entire architecture is trained using conventional backprop on a prediction loss + some regularization terms. There is no MLP involved in the distribution of the layers’ importance because we simply need to learn its parameters, but note that the variational formulation merely serves to show that AMP arises from a well defined graphical model, grounding our design choices in a principled formulation. The practical implementation is very simple, thanks to the brilliant work of Nazaret and Blei. Please let us know if we can further clarify this point.
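To illustrate the point that no MLP is needed for the layer distribution, here is a minimal PyTorch-style sketch in which the depth distribution's parameter is a single learnable scalar trained by backprop; the truncated Poisson-like form is an assumption inspired by Nazaret and Blei, not our exact parameterization.

```python
import torch
import torch.nn as nn

class LearnableDepthWeights(nn.Module):
    """Sketch: the distribution over layers is parameterized directly by one
    learnable scalar (no MLP), optimized jointly with the rest of the model."""

    def __init__(self, max_layers):
        super().__init__()
        self.log_rate = nn.Parameter(torch.zeros(()))   # log-rate of the depth distribution
        self.max_layers = max_layers

    def layer_probabilities(self):
        rate = torch.exp(self.log_rate)
        ks = torch.arange(1, self.max_layers + 1, dtype=rate.dtype)
        # unnormalized Poisson-style log-pmf over depths 1..max_layers,
        # renormalized over the truncated support via softmax
        log_pmf = ks * torch.log(rate) - rate - torch.lgamma(ks + 1)
        return torch.softmax(log_pmf, dim=0)
```

These probabilities can then weight the per-layer readouts, so the effective depth is learned by gradient descent like any other parameter.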
Conclusion: Thank you again for the constructive comments, especially those related to the theoretical analysis. We hope the Reviewer appreciates our efforts and that this will lead to a score increase.
Thanks, the authors have sufficiently addressed my points and I think the additional theoretical analysis and experiments will improve the paper enough for me to raise my score to a 4. In particular, I'm happy they were able to derive a new theorem in the past days, which places them well within the current literature.
The authors propose a general framework to tackle certain long-range interaction problems in GNNs, namely (1) oversmoothing, (2) oversquashing, and (3) underreaching. Their Adaptive Message Passing (AMP) framework extends the work of Nazaret and Blei on unbounded depth networks to the GNN setting. The idea is to use a variational theory to learn the GNN structure, both in terms of GNN depth as well as message passing (message filtering). This allows for a finer control on the flow of messages across the graph.
Questions For Authors
Claims And Evidence
Yes, the claims are supported by clear and convincing evidence. For oversmoothing/oversquashing, the authors use standard tools such as Dirichlet energy to examine the quality of the embeddings. They also provide ablation studies to examine the effects of message filtering.
Methods And Evaluation Criteria
Yes, the proposed methods and criteria are overall justified. The number of real-world datasets considered could be larger, since the paper targets a very broad array of GNN limitations.
Theoretical Claims
I did not check for all theoretical details: Some of the background needed is out of my knowledge scope.
I am also not sure how Theorem 3.1 supports the overall theory: What does it even mean to propagate a message unchanged from two connected nodes in a graph? Why is that important? And does this really show that a GNN can learn "asynchronously", as alluded to by the citation [FW]? Is this case observed experimentally?
Experimental Designs Or Analyses
Yes, the experimental protocols are fairly thorough, in line with recent hyperparameter considerations for long-range benchmarks [Tönshoff et al.].
Supplementary Material
Yes.
Relation To Broader Scientific Literature
The paper forms an important contribution in the current research to mitigate oversmoothing/oversquashing/underreaching. The paper brings in novel tools from learning unbounded depth NNs to address these problems simultaneously.
Essential References Not Discussed
No, as far as I know.
Other Strengths And Weaknesses
Strengths
- The paper employs a very well-motivated theory-driven approach to tackle the target problems. The proposed solution is novel and effective, with potential for further research.
Weaknesses
- One key advantage of blanket message passing is scalability. The authors have not commented clearly on the exact computational costs of setting up AMP on top of a GNN.
- The number of real-world datasets considered is fairly small, as compared to usual GNN literature.
Other Comments Or Suggestions
We thank the reviewer for recognizing the merits of our contribution and for providing constructive criticism. Below we comment on some of the points raised by the reviewer.
Methods And Evaluation Criteria: Following Reviewer ZB4Z’s suggestion, we will include node classification tasks related to homophily and heterophily and increase the number of real-world datasets considered. Please see our response to Reviewer ZB4Z for a first batch of results.
Theoretical Claims: AMP focuses on long-range interaction problems, where far-away nodes may need to communicate effectively (for instance, without oversquashing arising). Theorem 3.1 shows that, for a proper parametrization, it is possible to realize such long-range communication between two specific nodes without oversquashing behavior, which would not be possible with classical (synchronous) message-passing neural networks. This is meant to support the design choices made for AMP, although we do not want to claim that AMP is always able to implement such behavior in practice (we do not observe it empirically). At the same time, such a behavior is reminiscent of the asynchronous updates of [FW]. We will revise the paper to reflect these considerations, and we hope that this clarifies the reviewer's doubts.
Other Strengths And Weaknesses: Thank you for the kind words about our work. We provide an answer for each potential improvement below.
- It is true that we did not mention additional costs in detail, thank you for pointing this out. The cost of filtering messages is O(n), with n nodes (lines 184-185). Therefore, the message passing operation is not altered significantly, since it has a cost of O(m), with m edges. However, the additional burden introduced by AMP, compared to classical MPNNs, is the layer-wise readout (lines 344-345) that we implemented as an MLP. Classical MPNNs employ a single readout, whose cost depends on the task nature (node- or graph-level prediction), whereas we use one per layer, so that cost is multiplied by the number of layers. Note that other popular MPNN architectures, such as JK-Net (Xu et al., 2018), employ a similar scheme by means of concatenating the node embeddings across all layers. In terms of training costs, standard backpropagation with at most two lightweight additional regularizers is employed. We will make this clear in the revised version of the paper.
- Please refer to our additional experiments in Reviewer ZB4Z’s response, as mentioned before.
Conclusion: Thank you once more for the constructive feedback! We hope our clarifications and additional experiments may constitute ground for a score improvement. We remain available for further discussions.
The reviewers unanimously recommend acceptance of the paper with varying degrees of strength. I agree with their assessment and am happy to recommend acceptance of this paper for publication at the ICML conference.
The reviews and the rebuttal have given rise to several interesting points and results that I encourage the authors to include in their revised manuscript. This includes the additional empirical results on further datasets; in my point of view, results on even more datasets would significantly help strengthen your paper. I furthermore encourage the inclusion of the analysis of the computational cost of your method, a more detailed reporting of hyperparameter ranges and data splits, as well as the additional theoretical results that you derived with Reviewer ZB4Z.