PaperHub
Rating: 5.8 / 10 (Poster · 4 reviewers)
Individual ratings: 6, 5, 5, 7 (min 5, max 7, std 0.8)
Confidence: 3.8
Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

Towards Dynamic Message Passing on Graphs

OpenReview · PDF
Submitted: 2024-05-08 · Updated: 2024-12-20

Abstract

Message passing plays a vital role in graph neural networks (GNNs) for effective feature learning. However, the over-reliance on input topology diminishes the efficacy of message passing and restricts the ability of GNNs. Despite efforts to mitigate this reliance, existing studies encounter message-passing bottlenecks or high computational expense, which calls for flexible message passing with low complexity. In this paper, we propose a novel dynamic message-passing mechanism for GNNs. It projects graph nodes and learnable pseudo nodes into a common space with measurable spatial relations between them. With nodes moving in the space, their evolving relations facilitate flexible pathway construction for a dynamic message-passing process. By associating pseudo nodes with input graphs through their measured relations, graph nodes can communicate with each other via pseudo nodes as intermediaries under linear complexity. We further develop a GNN model named $\mathtt{N^2}$ based on our dynamic message-passing mechanism. $\mathtt{N^2}$ employs a single recurrent layer to recursively generate the displacements of nodes and construct optimal dynamic pathways. Evaluation on eighteen benchmarks demonstrates the superior performance of $\mathtt{N^2}$ over popular GNNs. $\mathtt{N^2}$ successfully scales to large-scale benchmarks and requires significantly fewer parameters for graph classification with the shared recurrent layer.
Keywords
Graph Neural Networks · Graph Representation Learning · Node Classification · Graph Classification · Message Passing

Reviews and Discussion

Review (Rating: 6)

This work proposes a new dynamic message-passing method $\mathtt{N^2}$, which initializes a number of pseudo nodes and applies message passing across graph nodes and pseudo nodes. The message-passing scheme relies on the proximity of the node embeddings, and the node embeddings are updated through message passing. The method is quite dynamic and flexible. Besides, it is empirically shown to alleviate over-squashing and over-smoothing problems. Experimental results show the significance of the proposed method.

Strengths

  • Overall, the writing is good and clear, the concepts are precisely described, and the introduction of background is easily understandable.
  • The methodology is quite interesting, and the message passing design makes sense.
  • The experimental results are strong on most datasets, and the method is widely applicable to both graph- and node-level prediction. The ablations are carefully designed and the visualizations are informative.
  • The proposed method exhibits good performance on oversquashing and oversmoothing problems.

Weaknesses

  • It would be good to have theoretical explanations of why the method works well against over-squashing and over-smoothing, though this is not mandatory.
  • It would be good to compare with some graph structure learning or graph rewiring work. The baselines are basically GNNs and graph transformers.

Questions

  • It is not intuitive to me why recurrent layers are used, beyond the benefit of weight sharing. Can you explain this?
  • Why use proximity measurement instead of some distance metrics?

Limitations

As mentioned by the authors, the performance deteriorates when the number of recurrent layers grows too large.

Author Response

Thank you for your kind suggestions and insightful questions about our work. We hope the following response can address your concerns. We also kindly refer you to the PDF in the global response for the new figures/tables produced during the rebuttal, due to the character limit.

W1: Theoretical explanation of why $\mathtt{N^2}$ works against over-squashing/over-smoothing.

Response: We provide an intuitive, preliminary theoretical analysis below; we will keep working on a rigorous treatment in the future.

  • Over-smoothing: $\mathtt{N^2}$ learns the displacements of nodes, which gives rise to a linear combination of the layer input, the local message-passing output, and the global message-passing output. Specifically, the layer input represents the output of an all-pass filter, the local output comes from a low-pass filter, and the global output comes from a filter that aggregates messages from nodes of learnable similarity, and thus possesses learnable frequency characteristics. The messages from the all-pass filter and the learnable filter prevent the output from over-smoothing.

    • We must note that global message passing with the uniform connection provides the SAME global output for all the graph nodes. This output can be regarded as a shared bias term and cannot alleviate over-smoothing. Please refer to Fig R1b in the PDF response for empirical results.
  • Over-squashing: $\mathtt{N^2}$ tackles over-squashing by detouring around the original graph bottlenecks and avoiding the formation of new pseudo-node bottlenecks.

    • First, pseudo nodes provide two-hop message highways for graph nodes. Therefore, graph nodes do not have to pass messages through the original bottlenecks.
    • Second, pseudo-node bottlenecks stem from the overwhelming number of graph nodes in the receptive field of each pseudo node. However, by multiplying the messages from these nodes with edge weights, certain messages can be eliminated, which excludes the corresponding nodes from the receptive field. We can define the effective receptive field size based on the value of the edge weights, where nodes with smaller edge weights are excluded from the effective field. This elimination reduces the size of the effective receptive field and avoids forming bottlenecks (a minimal sketch follows this list).
      • Uniform connection shares the same edge weight for the messages from all nodes. Therefore, these messages are either all eliminated or all preserved, which cannot reduce the size of the effective receptive field. In Fig R1a in the PDF response, we show that $\mathtt{N^2}$ with uniform connection is less effective in tackling over-squashing.
      • Our dynamic connection measures specific edge weights for each node, learning to eliminate messages from certain nodes and thus reduce the size of the effective receptive field.
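To make the effective-receptive-field argument concrete, here is a minimal PyTorch sketch (not the actual $\mathtt{N^2}$ implementation; the tensor names and the cut-off `tau` are hypothetical) contrasting per-node dynamic edge weights with a uniform shared weight when a single pseudo node aggregates messages from all graph nodes.

```python
import torch

N, d = 1000, 64
node_feats = torch.randn(N, d)          # states of the N graph nodes

# Dynamic connection: one edge weight per graph node (random here for illustration).
dyn_w = torch.rand(N)
# Uniform connection: the same weight shared by every graph node.
uni_w = torch.full((N,), 0.6)

tau = 0.5                               # hypothetical cut-off defining "effective"
eff_dynamic = int((dyn_w > tau).sum())  # only strongly weighted nodes remain
eff_uniform = int((uni_w > tau).sum())  # all-or-nothing: either 0 or N nodes

# The pseudo node's incoming message: edge weights multiply (and can suppress) messages.
pseudo_msg = (dyn_w.unsqueeze(1) * node_feats).sum(dim=0)

print(f"effective receptive field: dynamic={eff_dynamic}, uniform={eff_uniform}")
```

With the uniform weight the pseudo node keeps either all N nodes in its effective field or none, whereas per-node weights let it retain only a subset, which is the sense in which the dynamic connection avoids re-creating a bottleneck.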

W2: Comparison with graph structure learning or graph rewiring methods.

Response: Thank you for your suggestion. Some of the graph transformer and GNN methods in our manuscript also employ graph structure learning or rewiring strategies, such as Exphormer with the expander graph, and DRew, GPRGNN, and H$_2$GCN with edge shortcuts. Besides these methods, we also compared $\mathtt{N^2}$ with DGM (L79-80), a graph structure learning method, and the results are presented in Tab. R3. $\mathtt{N^2}$ shows better performance compared to DGM.

| Table R3 | CORA | CITESEER | PUBMED |
| --- | --- | --- | --- |
| DGM | 84.6 | 74.8 | 88.6 |
| $\mathtt{N^2}$ | 85.4 | 77.6 | 90.6 |

Q1: Why use the recurrent layer?

Response:

  • Why a recurrent layer: The recurrent layer is the core of $\mathtt{N^2}$. It is an implementation of the message passing in the common space from Sec 3. The recurrent layer, including pseudo-node adaptation and dynamic message passing, empowers dynamic interactions between nodes. $\mathtt{N^2}$ employs the layer to move nodes recursively to their optimal positions.

  • Why recurrence: The main reason $\mathtt{N^2}$ uses the layer's parameters recurrently is to largely reduce the number of parameters. According to the ablation study in Fig. 8, nodes require multiple steps to move to their optimal positions. Each step would otherwise correspond to a new layer with its own set of parameters. To reduce the number of parameters, we conducted an ablation study and found that the layer can learn to move nodes based on their current positions. Therefore, a single set of parameters is sufficient, which gives rise to the recurrent layer (see the sketch below).
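The following minimal PyTorch sketch illustrates only the weight-sharing idea of the recurrence: a single parameterized step is applied for several recursive iterations, each time producing a displacement from the current states. The module `RecurrentStep`, its internals, and the step count are hypothetical placeholders, not the actual $\mathtt{N^2}$ layer (which also includes pseudo-node adaptation and dynamic message passing).

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """Hypothetical single step: predict a displacement from the current node states."""
    def __init__(self, dim):
        super().__init__()
        self.displace = nn.Linear(dim, dim)

    def forward(self, h):
        return h + torch.tanh(self.displace(h))   # move nodes from where they currently are

dim, num_steps = 64, 6
step = RecurrentStep(dim)                 # ONE set of parameters ...
h = torch.randn(100, dim)                 # graph-node and pseudo-node states
for _ in range(num_steps):                # ... reused for every recursive step
    h = step(h)

# A non-recurrent counterpart would allocate num_steps distinct layers,
# multiplying the parameter count by num_steps.
```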

Q2: Why propose proximity measurement instead of distance metrics?

Response: Compared to distance metrics, such as Manhattan and Euclidean distance, proximity measurement has lower space (memory) complexity. We tested distance and proximity with torch.cdist and torch.matmul during model design. The results show that distance-based $\mathtt{N^2}$ is on par with proximity-based $\mathtt{N^2}$, but the distance version encounters out-of-memory problems under many hyperparameter settings. Therefore, we choose proximity (inner product) for $\mathtt{N^2}$.
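As an illustration of the two calls mentioned above (the shapes are hypothetical and chosen only for this sketch), the proximity variant is a single matrix product between graph-node and pseudo-node states, while the distance variant uses torch.cdist; in the authors' experiments the distance version hit out-of-memory errors under many hyperparameter settings.

```python
import torch

n_graph, n_pseudo, d = 10_000, 64, 128     # hypothetical sizes
g = torch.randn(n_graph, d)                # graph-node states
p = torch.randn(n_pseudo, d)               # pseudo-node states

# Proximity (inner product): one (n_graph x n_pseudo) matrix, linear in n_graph.
proximity = torch.matmul(g, p.t())

# Euclidean distance between the same pairs via torch.cdist.
distance = torch.cdist(g, p, p=2)
```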

Comment

Thank you for your feedback; it answers my questions.

Comment

We sincerely appreciate your feedback and are glad that our responses addressed your questions. We will further revise our manuscript following your guidance.

Review (Rating: 5)

The authors propose a dynamic message-passing scheme where graph nodes and learnable pseudo nodes are projected into a common space. This allows non-adjacent nodes to communicate immediately whilst retaining linear complexity. The model performs well on a range of benchmarks and can reduce problems with over-squashing/over-smoothing compared to a baseline MPNN.

Strengths

  • The novel (to my knowledge) method is explained in depth and well outlined. It is clearly explained and evaluated in the context of other approaches (rewiring, virtual node, transformers).
  • The results/experiments are really strong and the method is shown to help alleviate over-squashing/over-smoothing and performs well on a wide range of benchmarks.
  • Visualising the distribution of embedded nodes helps with understanding what is happening in practice and I really like this empirical understanding.
  • It is clear how this method lowers the computational complexity over methods such as Graph Transformers and the results indicate it can still outperform these approaches.

Weaknesses

  • In line 45 the authors argue for their approach over using a single virtual node due to the fact that a VN can have a message-passing bottleneck. Firstly, I think this should be explained more rigorously rather than citing [1], as it is a key argument of the paper to not just have uniform global connections. Additionally, the extent to which your method alleviates this issue needs to be better explained. [Line 297] You use the balanced-load argument to say that this helps alleviate the bottleneck. However, increasing connections between nodes within a community and not having global connections between communities would actually "increase" the bottleneck on those edges between the communities. The virtual node itself has less of a bottleneck BUT this does not mean bottlenecks in the graph are reduced.
  • On a similar note to above, we can reduce the bottleneck of the virtual node by subsampling edges (expander) or by adding more of them (this effectively increases their width) [1]. It is not clear in the paper how your dynamic and weighted approach improves over this (in terms of bottlenecks or some other property).
  • There are other dynamic message-passing schemes such as DRew [2]. The paper seems to argue for their approach over something like this due to pseudo nodes directly enabling global message-passing [line 82]. Given that these rewiring approaches seem relevant and closely connected, I think this comparison and why your method can improve over this needs to be extended and evaluated in depth. For instance, in this case why would you need global message-passing in layer 1 when we always use more than one layer?

[1] Shirzad et al. EXPHORMER: Sparse Transformers for Graphs. ICML 2023.

[2] Gutteridge et al. DRew: Dynamically Rewired Message Passing with Delay. ICML 2023.

Questions

  • Do you use positional/structural encodings on these benchmarks? Does your baseline GIN/GCN + pseudo node also use these encodings?
  • It is not clear how the hyperparameter sweep is performed and what parameters are optimised over and on what metric. Could you provide some information on this?
  • Do you have any intuition why the method would outperform GTs on this benchmark? (not just have a lower computational complexity)

Limitations

Some limitations are outlined in the appendix.

Author Response

Thank you very much for the valuable suggestions for our paper. Under your guidance, we have discovered new advantages of $\mathtt{N^2}$ compared to uniform connections and rewiring. We also kindly refer you to the PDF in the global response for the new figures/tables produced during the rebuttal, due to the character limit.

W1&W2: More explanation of the pseudo-node bottleneck and why $\mathtt{N^2}$ can reduce the bottlenecks in both pseudo nodes and input graphs.

Response:

  1. That uniform connection leads to pseudo-node bottlenecks has been explained in L42-45 and Fig 1, before citing [1]. More rigorously, [1, R1] refer to bottlenecks as nodes with large effective receptive fields. The uniform connection forms pseudo-node bottlenecks by taking all the graph nodes into the receptive field equally. This results in large effective receptive fields on large-scale graphs and thus forms bottlenecks. We also evaluate $\mathtt{N^2}$ with the uniform connection in Fig R1 in the PDF response, which shows inferior performance in tackling over-squashing.

  2. Adding pseudo nodes with uniform connection creates more pseudo-node bottlenecks, since each pseudo node still incorporates all graph nodes into its receptive field.

  3. Subsampling edges can remove informative interactions between nodes, leading to sub-optimal message passing.

  4. $\mathtt{N^2}$ for pseudo-node bottlenecks: we have provided both intuitive and empirical analysis in L48-49 and Sec 5.3.3 to show that $\mathtt{N^2}$ DOES NOT form pseudo-node bottlenecks. $\mathtt{N^2}$ learns the edge weights between graph and pseudo nodes through spatial relations. Therefore, it can eliminate messages from certain nodes while preserving the others, reducing the size of the effective receptive fields.

    • Note that L297 refers to the pseudo-node bottlenecks specifically. We will fix this in the revision.
  5. $\mathtt{N^2}$ for input-graph bottlenecks: we have provided empirical results in Sec 5.3.2 showing that $\mathtt{N^2}$ can tackle over-squashing. This is because $\mathtt{N^2}$ optimizes the global two-hop message highways based on specific tasks and avoids forming new pseudo-node bottlenecks. The global highways detour messages around the input-graph bottlenecks. Therefore, the original bottlenecks may still exist but are well circumvented.

W3: Comparison with rewiring methods (DRew).

Response: Thank you for bringing up this insightful question. The rewiring methods and pseudo-node methods actually represent the local-to-global and the global thread, respectively. Both threads are effective in improving message passing. We choose DRew [2] to represent rewiring methods, as it has been referred to by many reviewers. The main reasons we need globality in one layer are:

  • Fewer layers: Pseudo-node methods require fewer layers than rewiring methods. For example, 6-layer $\mathtt{N^2}$ surpasses 13-layer DRew in Tab S7 in the paper. We note that DRew with positional encodings is not included in this comparison, because of the globality injected into the node features.
  • Lower memory usage: Rewiring methods either directly add edges to the input graphs or gather messages from multi-step neighbors in one layer. Both require more memory. We evaluate DRew on the same dataset as in Sec 5.3.1. DRew runs out of memory when the number of graph nodes surpasses 200,000 with 2 convolution layers and 8,000 with 3 layers.
  • Better efficiency: Rewiring edges is also time-consuming. We compare the computational complexity between $\mathtt{N^2}$ and DRew with the same number of convolution layers/recursive steps. Fig R2 in the PDF response shows that DRew requires more time when scaling to large graphs.
  • Simpler preprocessing: $\mathtt{N^2}$ only requires the pre-construction of the adjacency matrix, while DRew samples or selects multi-step neighbors. We even encountered timeout problems when trying a larger number of layers (5, 6) with DRew on large graphs such as ogbn-arxiv.

Besides DRew, we have also compared with other rewiring methods, such as GPRGNN and H$_2$GCN, in Tab 3 and 4.

Q1: Positional/structural encodings usage.

Response: No positional or structural encodings are used for $\mathtt{N^2}$. Unless specifically noted, all the baseline results are from the original papers. Among the X+pseudo-node methods, only MPNN+pseudo node uses positional encoding.

Q2: Hyperparameter settings.

Response: We perform a grid search for the hyperparameters based on the loss on the validation split, including the recursive steps in [1, 10], the hidden and state-space dimensions in {64, 128}, the number of pseudo nodes in {8, 16, 32, 64, 128, 256, 300, 320}, and dropout in [0, 0.8] with a step of 0.1. The remaining hyperparameters are fixed as reported in the Appendix and have not been specially tuned on individual datasets.
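For concreteness, the grid described above can be enumerated as follows; `validation_loss` is a hypothetical placeholder for training $\mathtt{N^2}$ under one configuration and returning its loss on the validation split, the stated selection criterion.

```python
from itertools import product

# Search space exactly as listed in the response above.
grid = {
    "recursive_steps": list(range(1, 11)),                   # [1, 10]
    "hidden_dim": [64, 128],                                  # hidden / state-space dim
    "num_pseudo_nodes": [8, 16, 32, 64, 128, 256, 300, 320],
    "dropout": [round(0.1 * i, 1) for i in range(9)],         # 0.0 ... 0.8, step 0.1
}

def validation_loss(cfg):
    """Hypothetical: train N^2 with `cfg` and return the validation-split loss."""
    return float("inf")

best_cfg = min(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=validation_loss,
)
```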

All the learnable parameters in $\mathtt{N^2}$ are optimized during training, including the weights in the linear transformations, $\lambda$ in the proximity measurement, and the pseudo/class-node states. The cross-entropy loss is adopted for classification and the L1 loss for regression.

Q3: Why $\mathtt{N^2}$ outperforms graph transformers.

Response: We attribute this to the flexible pseudo nodes:

  • Simpler objective. Instead of optimizing relations with all the other nodes, $\mathtt{N^2}$ empowers the graph nodes to focus on relations with the pseudo nodes, which are far fewer in number.
  • Information filter. The pseudo nodes can also be regarded as an information bottleneck [R2] in the graph space that filters out redundant information. This is supported by the pseudo-node ablation in Fig. 7, where the number of pseudo nodes is upper-bounded so that they serve as an effective information filter.

[R1] On the Bottleneck of Graph Neural Networks and its Practical Implications, ICLR21

[R2] Deep Variational Information Bottleneck, ICLR17

Comment

Thank you for your detailed response.

The comparison to DRew and other rewiring-based approaches is very compelling. I understand that your method can decrease pseudo-node bottlenecks, but a better theoretical understanding of overcoming input-graph bottlenecks beyond the TreeMatch experiment is still lacking. The addition of a VN for the TreeMatch experiment does improve the paper and I have raised my initial score.

Comment

Thank you very much for your supportive feedback. We acknowledge that a theoretical understanding of how pseudo-node methods can detour messages around the input-graph bottlenecks is important. Due to time limitations, we outline two primary pathways to study this problem:

  • Measuring the curvature between graph nodes with and without pseudo nodes.
  • Given the messages to be passed between nodes i and j, recovering the messages from the pseudo-node pathway and the input-graph pathway to measure the information loss.

For the first pathway, we performed an evaluation on amazon-ratings. By employing pseudo nodes, $\mathtt{N^2}$ creates message highways with positive curvature for 95% of the graph-node pairs that are negatively curved on the input graphs. Edges with high negative curvature cause the input-graph bottlenecks [R3]. This shows that $\mathtt{N^2}$ can overcome the input-graph bottlenecks by producing non-negatively curved message highways.
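As a purely illustrative sketch of this first pathway (note that [R3] uses the balanced Forman curvature, which is more involved; the proxy below is a simplified triangle-count Forman-style curvature, and the toy graph is hypothetical, not amazon-ratings):

```python
import networkx as nx

def forman_proxy(G, u, v):
    """Simplified Forman-style curvature of an unweighted edge (u, v):
    4 - deg(u) - deg(v) + 3 * (#triangles through the edge).
    A crude stand-in for the balanced Forman curvature used in [R3]."""
    triangles = len(list(nx.common_neighbors(G, u, v)))
    return 4 - G.degree(u) - G.degree(v) + 3 * triangles

# Two dense communities joined by a single bridge edge: a classic bottleneck.
G = nx.barbell_graph(10, 0)
print("bridge curvature:", forman_proxy(G, 9, 10))   # strongly negative

# Attaching a pseudo node to both endpoints adds a two-hop detour (a triangle
# through the bridge), which raises the bridge's curvature under this proxy.
G.add_node("pseudo")
G.add_edges_from([("pseudo", 9), ("pseudo", 10)])
print("with pseudo node:", forman_proxy(G, 9, 10))
```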

Your suggestions have enlightened us to further validate our methods from a theoretical perspective. We will keep studying this problem.

[R3] Understanding over-squashing and bottlenecks on graphs via curvature. ICLR22

Review (Rating: 5)

This paper proposes an adaptive message-passing scheme for Graph Neural Networks that is based on learnable "pseudonodes" which, to a certain extent, decouple the paths along which node features are propagated from the topology of the underlying graph. Both pseudonodes and regular nodes in the underlying graph are embedded in a common space, which makes it possible to use the common embeddings to generate proximity-dependent relations between nodes and pseudonodes that are used for a sparse "global" message-passing layer. The proposed method is evaluated in node and graph classification experiments on several benchmark data sets, and the authors find superior performance on several data sets.

Strengths

[S1] The authors propose a new adaptive message-passing scheme that introduces learnable pseudo-nodes and pseudo-edges, thus introducing dynamic pathways for message passing that are independent of the graph topology and that can be trained for a given learning task. The specific combination of pseudo-node message passing and the use of recurrent layers is - to the best of my knowledge - new and original.

[S2] The method is evaluated against several baseline methods for node and graph classification tasks on 18 small and large-scale benchmark data sets. The experiments show superior performance for the proposed model on several of the data sets (all six data sets for graph classification, eight out of twelve data sets for node classification).

[S3] By targeting limitations of GNNs that are due to over-squashing and over-smoothing, the authors address an important open issue in deep graph learning.

Weaknesses

[W1] I did not find the motivation to add additional learnable parameters to Graph Neural Networks that decouple message-passing pathways from the topology of the input graph convincing. The authors motivate their work based on over-smoothing and over-squashing in GNNs, but in my view the paper lacks an intuitive explanation why the proposed dynamic message-passing scheme should mitigate those problems, especially since the architecture additionally includes regular (local) message passing. To this end, I think that the different, complex local and global components of the dynamic message passing - though formally defined - are not explained well in the paper and some of the design choices appear to be rather arbitrary.

[W2] I similarly could not follow the motivation for the addition of the recurrent layer, which the authors argue is added "to parameterize the displacements of embedded nodes" and to "revise the learned distribution [] of all embedded nodes and reshape the dynamic message passing pathways". A better explanation would be helpful.

[W3] The idea to add pseudonodes for neural message passing has been previously explored, e.g. in the form of a sparse attention mechanism with so-called "global nodes" in the Exphormer architecture (https://arxiv.org/pdf/2303.06147). A better explanation of the contribution of the authors would be helpful, especially since the experiments show that the performance is very close to this architecture in many of the experiments.

[W4] There are important details missing in the description of the experimental setup, namely whether hyperparameter tuning has been performed (i) for the proposed model and (ii) for the baseline models. Moreover, in section B.2.2 the authors claim that "the detailed hyper-parameter settings on all benchmarks are reported in Tab. S6", however this table only includes hyperparameters for the proposed N2 model and not for baseline methods. I also checked the provided code, which does not cover the baseline experiments.

As it is - despite the claims made in the answer to Q4 in the checklist - I do not consider the results showing superior performance compared to the baseline models reproducible based on the information provided in the paper and in the supplementary material.

[W5] One could argue that the proposed approach to define pseudo-nodes and pseudo-edges that participate in the message passing could also be seen as a learnable graph pooling layer for GNNs, where pseudonodes take the role of supernodes. As such, I believe that the paper lacks a more detailed discussion of related works on (trainable) graph pooling operations (there is a single mention of graph pooling in the first paragraph of the related work section).

[W6] Similarly, while the paper briefly mentions hierarchical Graph Neural Networks as related work that combines local and global message passing to learn multi-scale features, no explicit comparison to such approaches is included in the experimental evaluation.

Questions

I kindly ask the authors to address the following questions, which emerged during my review of the work:

  • Please provide more insights on the motivation behind the different ingredients of the dynamic message-passing scheme, especially on (i) why the proposed local and global message scheme is supposed to address over-smoothing and over-squashing, and (ii) the motivation of the additional recurrent layer (see comments in [W1] and [W2]). For the first question, I see that there is an experimental analysis but I failed to see how this analysis supports the claims. For the latter question, it would be helpful to include an ablation study that removes the recurrent layer altogether. If I understood the ablation study correctly, currently only the influence of the number of recurrent layers is investigated. Or does the removal of the pseudo-node adaptation (Table 5) correspond to the removal of the recurrent layers?

  • Please clarify your contribution over other approaches using adaptive message passing architectures that combine local and non-local message exchanges (see comments in [W3], [W5] and [W6]).

  • Please provide details on the choice of hyperparameters and the use of hyperparameter tuning, both for the baseline methods and the proposed architecture (see detailed comments in [W4])

  • Please add details on the ablation study results shown in Table 5, especially the number of experimental runs and the standard deviation of results.

  • Please increase the font size of the results in Tables 1-4. The text is too small to be readable in a printed version. Please also increase the size of the figures, especially Figures 5 and 6, which are way too small to be readable in a printed version.

During my review, I found the following typo:

  • line 259: Complex_i_ty Analysis

Limitations

In light of the open questions outlined above, I consider the discussion of limitations in appendix E overly short and not complete.

Comment

Dear reviewer,

Take note that, as per the conference instructions, the authors forwarded me a link to an external anonymized page that shows the scripts they used & the associated hyperparameters.

Let me know if you would like to see something specific and I will forward you the information.

Kind regards, AC

Author Response

Thank you very much for your valuable suggestions that help us improve our paper. We respectfully refer you to the PDF in the global response for the new figures/tables produced during the rebuttal, due to the character limit.

W1&Q1(a): Why $\mathtt{N^2}$ mitigates over-smoothing/squashing.

Response:

  1. Over-smoothing
    • Unlike the uniform connection that shares edge weights across all the graph nodes, our dynamic connection measures specific relations for each graph node. Therefore, even if two nodes are connected on the input graph, they receive different outputs from global message passing (MP) and avoid becoming too similar to each other.
    • Learning node displacements amounts to adding the layer input, the local output, and the global output. Thus, even with local MP, the other two features maintain high-frequency signals.
  2. Over-squashing
    • Pseudo nodes produce two-hop message highways, detouring messages from the graph bottlenecks.
    • Dynamic connections avoid forming new pseudo-node bottlenecks that hinder global MP.
  3. Why local MP: previous studies show the importance of encoding graph structures [R1]. Therefore, we adopt local MP to implicitly encode graph structures.

To further evaluate our dynamic connection, we apply the uniform connection to $\mathtt{N^2}$, which shows performance degradation in tackling over-smoothing/squashing in Fig R1 in the PDF response.

W2&Q1(b): Motivation and ablation on the recurrent layer.

Response:

  • Motivation for the layer: We propose the recurrent layer, including BOTH pseudo-node adaptation and dynamic MP, to move nodes from current positions towards the optimal positions (L147-148).
    • To move towards optimal position: Given the current node positions, a recurrent layer learns the displacements the nodes should take, changing their distribution in the space.
    • To base on current positions: The recurrent layer employs pseudo-node adaptation and bases the edge weights for MP on the spatial relations between nodes, to adapt to nodes in various positions. The changes in position also reshape the spatial relations and thus reshape the message pathways.
  • Motivation for the recurrence: We adopt a single recurrent layer used recurrently to reduce the number of parameters, instead of modeling node displacement by different layers for each step.
  • Ablation. We have already evaluated layers without recurrence in Sec 5.3.4: multiple layers that do not share parameters are, by definition, not recurrent. We further remove the layer completely; with only the input and output modules employed, Tab R1(b) in the PDF response shows performance degradation.

W3&Q2(a): Contribution compared to Exphormer.

Response:

  1. Compared to Exphormer, we highlight our $\mathtt{N^2}$ as follows:

    • A new perspective to model MP, where nodes are distributed in a common space; they communicate based on their spatial relations and learn to optimize their positions.
    • Fewer parameters by a single shared recurrent layer.
    • Better scalability to large graphs (Tab 4).
  2. Technically speaking, our $\mathtt{N^2}$:

    • decouples global MP into g(raph)-nodes to p(seudo)-nodes, p to p, and p to g. Instead, Exphormer performs MP between all nodes, which cannot differentiate between messages from pseudo or graph nodes.
    • multiplies proximity with messages. Instead, Exphormer employs normalized attention, which can yield equally small weights and attenuate the messages, especially when the optimal assignment maps a large number of graph nodes to the same pseudo node.

We also evaluate the mixed messages and the attention from Exphormer on $\mathtt{N^2}$, where both show performance degradation in Tab R1(c) in the PDF response.
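For clarity, here is a heavily simplified sketch of the decoupled scheme described above (graph→pseudo, pseudo→pseudo, pseudo→graph), with proximity used directly as a multiplicative edge weight rather than softmax-normalized attention. The sigmoid, the residual updates, and all tensor names are placeholder choices for illustration only, not the actual $\mathtt{N^2}$ update (which also performs pseudo-node adaptation, displacement learning, and local message passing, and re-measures relations as nodes move).

```python
import torch

n_graph, n_pseudo, d = 500, 16, 64
g = torch.randn(n_graph, d)                 # graph-node states
p = torch.randn(n_pseudo, d)                # pseudo-node states

prox_gp = torch.sigmoid(g @ p.t())          # (n_graph, n_pseudo) graph<->pseudo weights
prox_pp = torch.sigmoid(p @ p.t())          # dense pairwise pseudo-node weights

p = p + prox_gp.t() @ g                     # 1) graph nodes  -> pseudo nodes
p = p + prox_pp @ p                         # 2) pseudo nodes -> pseudo nodes (few, so dense is cheap)
g = g + prox_gp @ p                         # 3) pseudo nodes -> graph nodes

# Each stage costs O(n_graph * n_pseudo * d), i.e. linear in the number of graph nodes.
```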

W4&Q3: Hyperparameters.

Response: We have performed a grid search for the hyperparameters of $\mathtt{N^2}$ (Tab S6) and of the reproduced models, based on validation loss. We emphasize that the superiority of our results is valid and the baseline results can be fully reproduced. Please refer to our AC for the URL of the detailed hyperparameter config files.

W5&Q2(b): Design comparison with graph pooling.

Response: First, we clarify that we have provided a discussion on graph pooling in L86-91, not just in L76. Second, the differences between pseudo nodes and graph pooling can be summarized as:

  • Different objectives. Pseudo nodes optimize communication efficiency while pooling nodes capture the hierarchy in the graph structures, ensuring better structure compression.
  • Physical connection. Pseudo nodes are physically connected to graph nodes as part of the graphs and participate in the MP. In contrast, pooling nodes are higher-level abstractions of graph nodes and are not physically connected with them.
  • Intercommunication. Pooling nodes use the coarse adjacency matrix for the MP. Pseudo nodes are free from structure-preserving constraints and learn fully connected pathways.

W6: Comparison with hierarchical GNNs.

Response: We have performed comparisons in Sec 5.1 with some representative hierarchical GNNs from L86-91. Please let us know if you are referring to other types of GNNs.

Q4: Details for Tab 5.

Response: Tab 5 follows the settings from Sec 5.1/5.2 and the hyperparameter setups in Tab S6. Tab R1(a) in the PDF response shows the standard deviations over three runs.

Q5: Font size, typo.

Response: Thank you for your suggestions. We will fix them in our revision to ensure readability.

Q6 (Limitation): More limitation discussion based on the reviewer's open questions.

Response: We will revise our paper based on your reviews. For Q1(i), we can only provide intuitive/empirical analysis, not a rigorous proof, due to the limited rebuttal time. We will add this to our limitations section and keep working on it.

[R1] Do Transformers Really Perform Bad for Graph Representation? NeurIPS21

Comment

I thank the authors for the very detailed response. While I do not consider all of my criticism to be fully addressed, I do acknowledge that some of my questions on the motivation and the relation to graph pooling have been answered, which is why I decided to raise my score.

Comment

Thank you very much for your positive feedback. We will refine our manuscript based on the rebuttal, including clarifying our motivations and contributions, delineating hyperparameter setups (based on our provided config files), and increasing the font size of tables and figures. We sincerely appreciate your suggestions, which have significantly helped us improve our work.

Review (Rating: 7)

The paper considers the problem of flexible message passing with low complexity in GNNs. To tackle this concern, the paper proposes a novel dynamic message-passing mechanism for GNNs by projecting graph nodes and learnable pseudo nodes into a common space with measurable spatial relations. Based on this dynamic message-passing mechanism, the paper constructs a GNN model named $\mathtt{N^2}$ to evaluate the effectiveness and efficiency of the proposed message-passing mechanism.

Strengths

  1. A novel dynamic message-passing mechanism for GNNs is proposed to provide flexible message passing with low complexity. The dynamic message passing is interesting.
  2. The constructed $\mathtt{N^2}$ GNN model is simple and effective. The complexity requirement is guaranteed.
  3. The proposed $\mathtt{N^2}$ achieves superior performance.

Weaknesses

See questions.

Questions

  1. The analysis of the dynamic message-passing mechanism. Can we learn an additional pseudo node to provide global guidance (like a cluster center)?
  2. Can authors provide the discussion between dynamic and adaptive message passing?
  3. Can authors show the possibility of incorporating the dynamic message passing mechanism into other networks?

Limitations

Yes.

Author Response

Thank you very much for your support and valuable suggestions that further broaden the potential application of our method to graph interpretations and other neural networks. We will keep studying these problems in the future.

Q1: Can we learn an additional pseudo node to provide global guidance?

Response: If we understand your question correctly, the additional pseudo node refers to a pseudo node of the current pseudo nodes. Please let us know if we have misunderstood.

  • Global guidance in $\mathtt{N^2}$. Given that the number of pseudo nodes is relatively small, we directly apply dense pairwise message passing between pseudo nodes in the original $\mathtt{N^2}$, which empowers the pseudo nodes to capture global information. The captured global information can be further fed back to graph nodes and serve as global guidance. The learning of global information is supported by $\mathtt{N^2}$ for graph classification, where we employ the pseudo-node states as the graph-level outputs.

  • Additional pseudo node. Following your suggestion, we further add an additional pseudo node that aggregates messages from the current pseudo nodes. The results are presented in Tab. R2. We can see that the additional pseudo node does not further benefit $\mathtt{N^2}$. However, regarding the pseudo nodes as cluster centers is interesting and may help us extend $\mathtt{N^2}$ to graph interpretations. We will keep studying it in the future.

| Table R2 | Additional Pseudo Node | Original |
| --- | --- | --- |
| amazon-ratings | 49.41 | 50.25 |
| AmazonPhoto | 95.36 | 95.75 |
| PROTEINS | 76.48 | 77.53 |

Q2: Discussion between dynamic and adaptive message passing.

Response: We kindly refer you to the introduction and related work in our manuscript, where we have discussed dynamic message passing in relation to adaptive message passing. To summarize, dynamic message passing achieves global message passing without forming new bottlenecks and requires only linear computational complexity. Dynamic message passing also enables parameter sharing, thus reducing the number of parameters required.

Q3: Broader application of the dynamic message passing.

Response: Thank you for bringing up this interesting and inspiring question. First, in the context of graph representation learning, dynamic message passing currently only employs neighbor smoothing for local message passing to keep our method as simple as possible. However, it can be further combined with other graph models to improve the effectiveness of the local message passing.

Second, our dynamic pseudo-node strategy can also be applied to networks such as Transformers to reduce their computational complexity. This is a very interesting direction and we will explore it in the future.

Comment

Thank you for the detailed response. The authors clearly address my concerns. I retain my rating.

Comment

Thank you very much for your positive feedback and support. Your suggestions have helped us improve our work. We are glad that our responses addressed your concerns.

Author Response (Global)

We appreciate all the valuable suggestions and the time the reviewers take on our work. Your reviews help us a lot to improve the manuscript.

Summary of strengths. We sincerely appreciate that you find our method:

  • novel and interesting (reviewers Qj5n, ZLZY, 4Q2o, and UBFx);
  • clearly explained and evaluated (reviewers 4Q2o and UBFx);
  • mitigates over-squashing/smoothing, an important open issue in deep graph learning (reviewers ZLZY, 4Q2o, and UBFx);
  • shows guaranteed complexity requirement (reviewers Qj5n and 4Q2o);
  • provides good visualization and analysis of the model (reviewers 4Q2o and UBFx);
  • exhibits promising experimental performance (reviewers Qj5n, ZLZY, 4Q2o, and UBFx);
  • widely applicable (reviewers 4Q2o and UBFx).

In the subsequent responses, we aim to address your concerns comprehensively, providing an item-by-item response to each of your comments.

We have provided new empirical results suggested by the reviewers. Due to the character limit, Tab R1, Fig R1, and Fig R2 are displayed in the global response PDF.

Final Decision

The paper presents a new approach to address the limitations of traditional message-passing in Graph Neural Networks (GNNs). The proposed dynamic message-passing introduces learnable pseudo nodes projected into a common space with graph nodes and provides a flexible and low-complexity alternative to existing methods. The introduced GNN model, referred to as N2, demonstrates good performance across a wide range of tasks.

The reviewers recognized the originality of the proposed method and appreciated the thoroughness of the experimental evaluation, which spans multiple benchmarks. Additionally, the authors provided convincing explanations in their rebuttal regarding the motivation behind pseudo-bottlenecks and the relationship of their work to related methods like graph pooling, graph rewiring, and graph transformers. The presentation of these connections in the paper could be improved. Nonetheless, the paper’s contributions are substantial, and the dynamic message-passing mechanism offers a promising direction for future GNN research. The authors are encouraged to revise the paper based on the reviewer feedback to better explain its context and contributions.