PaperHub
Overall rating: 4.8 / 10 · Decision: Rejected · 4 reviewers
Individual ratings: 5, 6, 3, 5 (min 3, max 6, std 1.1)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 3.3
ICLR 2025

Rethinking Message Passing for Algorithmic Alignment on Graphs

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05


Keywords: Graph Neural Networks, Algorithm Learning, Message Passing

Reviews and Discussion

Review (Rating: 5)

The authors propose a novel mechanism in which messages are propagated outwards from an origin node and then back to it. This is shown to be more expressive than 1-WL and to have lower memory complexity than standard MPNNs. The authors then demonstrate experimentally that the method can generalize better to larger graph sizes on some tasks.
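
For concreteness, here is a minimal sketch of the propagation pattern as I understand it (plain Python; the helper names and per-node update functions are placeholders, not the authors' implementation):

```python
from collections import deque

def bfs_layers(adj, origin):
    """Group nodes by hop distance from the origin via a plain BFS."""
    dist = {origin: 0}
    queue = deque([origin])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    depth = max(dist.values())
    return [[v for v, d in dist.items() if d == k] for k in range(depth + 1)]

def flood_and_echo_phase(adj, origin, h, flood_update, echo_update):
    """One phase: activate BFS layers outward (flood), then inward (echo).
    Only the nodes in the currently active layer are updated at each step."""
    layers = bfs_layers(adj, origin)
    for layer in layers[1:]:              # flood: origin -> periphery
        for v in layer:
            h[v] = flood_update(v, adj[v], h)
    for layer in reversed(layers[:-1]):   # echo: periphery -> origin
        for v in layer:
            h[v] = echo_update(v, adj[v], h)
    return h
```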

Strengths

To my knowledge, the mechanism is novel, and using an origin node relates to other methods (e.g., Subgraph GNNs) whilst the propagation mechanism is more efficient. This approach will be beneficial to the community and could impact areas outside algorithmic alignment, such as improving long-range interactions.

The paper is well-written and the memory complexity and expressivity improvements are argued with diagrams, text and theorems. It is very clear what the contributions and goals of the method are and many experiments have been run to ablate the model.

Weaknesses

[Enhanced Expressiveness]

The paper says that the enhanced expressiveness comes from the ‘unique message propagation strategy’ and the ‘structured activations of the nodes’. However, Theorem 4.3 suggests that the expressivity improvements come purely from marking a node. Therefore, the expressivity advantage of the actual propagation scheme is never actually shown. Could you clarify or provide evidence for how the propagation scheme itself contributes to enhanced expressiveness? Additionally, the choice of node to mark can affect expressivity, and randomly choosing a node can be suboptimal [1]; this may be an issue when the task doesn't align with a specific choice of node.

[Generalize to large graph sizes]

Firstly, the theoretical argument for improved generalization isn’t convincing. You argue that the new message-passing scheme is more natural ‘as the computation inherently involves the entire graph’. But you could increase the neighbourhood size of standard message passing to the whole graph (e.g., Graph Transformers), and it is not clear that this would improve generalization (it should perform worse without positional encodings). So what specific aspects of the architecture contribute to improved generalization, beyond just involving the entire graph? Secondly, the experimental section isn’t convincing. For example, your method improves on PrefixSum (a path graph, which means all nodes interact in your scheme), but [all, random] are worse than RecGNN on the other two tasks. Additionally, your method is only better than older baselines on some of the SALSA-CLRS tasks. This is in contrast to concurrent work [2] that seems to solve these tasks [I don’t expect a comparison to concurrent work, but it does suggest that the improvement of your method is not substantial].

[Minor Weaknesses]

  • The paper is limited to size generalization and does not account for other factors, such as changes in connectivity distribution.
  • Time comparisons to RecGNN would improve the paper, given that it outperforms your approach on some tasks.
  • The benchmarks chosen seem to have a natural ordering of nodes. Your method breaks symmetries through the origin node and so may be less beneficial for tasks without this ordering.

Questions

  • GIN will struggle to solve tasks that require long-range interactions due to over-squashing and under-reaching. For example, for path graphs of size 100, you may need a large number of layers so that two nodes interact. If your method improves over GIN on unseen larger graphs - does that actually imply that it is better at generalizing? (Given that GIN won’t be able to solve the task on these larger graphs even when they are in the training set). To me, it is less about generalization and more about efficient long-range interactions.
  • Is your method easily parallelizable? Although message complexity may be less it is not clear to me that this method would have a favorable runtime.
  • How would your method work with directed or disconnected graphs? Wouldn’t this mean that some nodes will never receive information from other nodes? Additionally, for Theorem 4.2, as well as being connected, do you not also need the graphs to be undirected? For example, consider a path graph where all edges point the same way. If I pick the last node as the origin node, would I then be less expressive (I guess it depends on the implementation)?

[1] Efficient Subgraph GNNs by Learning Effective Selection Policies. Bevilacqua et al. ICLR 2024.

[2] Discrete Neural Algorithmic Reasoning. Rodionov et al.

Comment

Thank you very much for your responses to my questions! I still have a couple of outstanding clarifications based on my initial points.

My initial comment: "The paper says that the enhanced expressiveness comes from the ‘unique message propagation strategy’ and the ‘structured activations of the nodes’. However, Theorem 4.3 suggests that the expressivity improvements come purely from marking a node"

I am not sure that your answer addresses this; it rather addresses my subsequent point, where I say "Additionally, the choice of node to mark can affect expressivity". My point was that I don't see how the message propagation strategy itself improves expressivity, just that the node marking does (as suggested by Theorem 4.3). Could you clarify or provide evidence for how the propagation scheme itself contributes to enhanced expressiveness?

Initial comment: "Your method breaks symmetries through the origin node and so may be less beneficial for tasks without this ordering."

Sorry, I think there may be some confusion about what I mean here. I am not saying that "this aspect of the mechanism should be helpful" for the tasks which you use, but that for tasks where there is no ordering, it may be harmful. For instance, you can cause two isomorphic graphs to be distinguished by randomly marking nodes in each. The results for "Flood and Echo random" (which seems more relevant for tasks without an ordering?) are underwhelming and don't seem to improve on RecGNN whilst being more computationally complex. This led me to wonder about the suitability of the approach for tasks without a fixed ordering.

Your response: "We believe that the FE Net should improve generalisation due to multiple factors. On one side, there is the involvement of the entire graph, without relying on external inputs such as scaling the number of rounds."

Are you saying that you can use "m" rounds of message-passing in the train set to cover the full graph but if the graphs are larger in the test set then you may not cover the full graph with standard MPNN approaches. So you have some dependence on the number of rounds for generalisation? This also relates to my long-range interactions point where I state that we may have under-reaching between two nodes in a large graph if we don't use enough layers for the small training graphs. Have I understood this point correctly? If so, are there practical scenarios where we need to generalise to graphs with vastly different diameters?

"Are there any direct applications for direct graphs that you are considering?"

Not particularly, I was just considering the generality of the approach to a wide range of different graph types. I can see how this method can be adapted in these cases now - thank you.

Comment

I am not sure that your answer addresses this; it rather addresses my subsequent point, where I say "Additionally, the choice of node to mark can affect expressivity". My point was that I don't see how the message propagation strategy itself improves expressivity, just that the node marking does (as suggested by Theorem 4.3). Could you clarify or provide evidence for how the propagation scheme itself contributes to enhanced expressiveness?

Unfortunately, the first paragraph of the initial answer was lost when reformatting in OpenReview; our apologies. In Theorem 4.3, we show equivalence to a model which has access to a marked node. Note that in the FE Net, the origin is never in any way distinguished, nor does it contain explicit distance information that would set it apart from the rest of the nodes, so there is no marking present. Only through the propagation mechanism do the nodes gain additional information that they can leverage.

Sorry, I think there may be some confusion about what I mean here. I am not saying that "this aspect of the mechanism should be helpful" for the tasks which you use, but that for tasks where there is no ordering, it may be harmful. For instance, you can cause two isomorphic graphs to be distinguished by randomly marking nodes in each. The results for "Flood and Echo random" (which seems more relevant for tasks without an ordering?) are underwhelming and don't seem to improve on RecGNN whilst being more computationally complex. This led me to wonder about the suitability of the approach for tasks without a fixed ordering.

Whether this is helpful or not probably depends on the task as well. Theoretically, random features alone already give you universal approximation power, with the downside that such models are much harder to train. On the other hand, not breaking any symmetries gives you problems with 1-WL indistinguishability: if you want to predict missing edges but have nodes with the same embedding, this can yield predicted edges that cross the entire graph, even if all present edges are very local. In that sense, our method resides somewhere in between, using “just a bit” of symmetry breaking for more information, but not so much as to destabilise training or completely lose the graph-induced priors. On a more practical note, it is likely that the graph does not have that many symmetries that need to be broken, especially if it is an attributed graph [1]. Therefore this likely has a limited impact (beyond expressivity benchmarks), and also gives you good reason to switch over to choosing a fixed node.

[1] https://arxiv.org/abs/2202.10156

Are you saying that you can use "m" rounds of message-passing in the train set to cover the full graph but if the graphs are larger in the test set then you may not cover the full graph with standard MPNN approaches. So you have some dependence on the number of rounds for generalisation? This also relates to my long-range interactions point where I state that we may have under-reaching between two nodes in a large graph if we don't use enough layers for the small training graphs. Have I understood this point correctly? If so, are there practical scenarios where we need to generalise to graphs with vastly different diameters?

If your task depends on the whole graph, then MPNNs on the original graph do require some way of adjusting the number of layers; otherwise, under-reaching will happen at some point. In that case, from the perspective of a single node, more computation is done (as the number of rounds is increased). In the FE Net, by contrast, the total computation can be adjusted according to the graph while the number of phases remains unchanged. Then, from the perspective of a single node, there is no change in the computation, even though from a graph perspective more steps are executed. We believe that this behaviour can be very helpful for generalising to larger graphs.
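
To make this concrete, a toy comparison of per-node update counts (our illustration for this response, not a result from the paper):

```python
def per_node_updates(diameter, n_phases):
    """Per-node update counts needed to cover a graph of a given diameter.

    A standard MPNN must run at least `diameter` layers to avoid
    under-reaching, and every node is updated in every layer. In one
    flood-echo phase, each node is activated a constant number of times
    (once in the flood, once in the echo), independent of the diameter."""
    return {"mpnn": diameter, "flood_echo": 2 * n_phases}

for d in (10, 100, 1000):  # train on small graphs, test on large ones
    print(d, per_node_updates(d, n_phases=1))
```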

Regarding generalisation, we are not aware of many settings where this is currently studied besides NAR. Note that proper size generalisation is very hard; it is probably the main reason to study and advance the field of NAR, as it focuses on how to learn generalisable behaviour. There will always be a limit on the sizes you can train a system on. Ideally it would suffice to train on these (maybe small/short molecular graphs?) and then still be able to apply the system to much larger instances where the same underlying rules should hold. However, this is rather a long-term vision than the current state of the field.

Comment

Thank you again for the clear answers to my questions.

The approach outlined in the paper does seem promising in terms of both expressivity and improving the generalisation abilities of the network. However, I remain concerned about the performance of "Flood and Echo random". This highlights that the method may not perform well on tasks where an origin node is not easily defined. Additionally, whilst the possible improved generalisation of the propagation scheme is argued in an "intuitive" manner, in the paper and in the rebuttal, it is not theoretically grounded in the manuscript. Given that the performance is much improved with the same propagation scheme but a different chosen node (fixed vs random), it is not clear (beyond some intuition) why/how this propagation scheme sufficiently improves generalisation.

It is for these reasons that I will maintain my original score.

Comment

We thank the reviewer for their insights and address the raised points and questions below.

The paper says that the enhanced expressiveness comes from the ‘unique message propagation strategy’ and the ‘structured activations of the nodes’. However, Theorem 4.3 suggests that the expressivity improvements come purely from marking a node...

In general, it is true that a randomised selection policy might not be optimal. One could think of other heuristics (sampling based on graph statistics) or even learn a policy, as in the paper you suggested. In our studied context, we found that many tasks yield a natural starting point or tiebreak for an appropriate selection; this might be related to the fact that for many (attributed) graphs, 1-WL expressiveness already suffices. But it would be an interesting future direction to explore whether learning the start can improve performance in general.

Firstly, the theoretical argument for improved generalization isn’t convincing. You argue that the new message-passing scheme is more natural ‘as the computation inherently involves the entire graph’...

We believe that the FE Net should improve generalisation due to multiple factors. On one side, there is the involvement of the entire graph, without relying on external inputs such as scaling the number of rounds. Note that RecGNN essentially does exactly this, yet seems to perform worse in comparison. Moreover, the FE Net only updates nodes sparsely, with few messages throughout the computation; this can allow it to find and redistribute information using fewer updates to the graph, always using O(m) messages to do so.

The mentioned variants of the FE Net do not perform as well as the fixed variant; this is likely because training them is more difficult. We have conducted an ablation to assess this in Appendix G.3, where we find that performance can differ between models, but within a single model there is not much randomness. However, for many tasks you can find an appropriate choice of starting node, which is why we put more emphasis on the fixed variant and less on stabilising the other two modes.

Regarding the reference to [2]: as far as we understand, that paper also considers the SALSA-CLRS tasks, but the actual data used is not the same. Recall that the FE Net does not use any hints during training. DNAR, on the other hand, cannot do this (as outlined in their Section 6), as it relies heavily on hints during training. Moreover, if we understood the description and code correctly, they actually use custom hints closer to their formulation instead of the hints included with CLRS/SALSA-CLRS. We, in contrast, propose an algorithmic alignment of the general message-passing architecture and train without any tailored supervision. Therefore, we don't think the results are at all comparable.

Time comparisons to RecGNN would improve the paper, given that it outperforms your approach on some tasks.

We do provide some timing results in the Appendix. However, keep in mind that the fixed variant of the FE Net consistently outperforms RecGNN.

The benchmarks chosen seem to have a natural ordering of nodes. Your method breaks symmetries through the origin node and so may be less beneficial for tasks without this ordering.

The method does introduce a component of symmetry breaking as nodes are able to figure out the distance to the origin node. As the nodes already have an order, there should be no more symmetries that can be broken. Therefore, we do not see how this aspect of the mechanism should be helpful. But it is likely that we misunderstood the question and we would be glad if the reviewer could clarify their intentions if this is the case.

GIN will struggle to solve tasks that require long-range interactions due to over-squashing and under-reaching. For example, for path graphs of size 100, you may need a large number of layers so that two nodes interact. If your method improves over GIN on unseen larger graphs - does that actually imply that it is better at generalizing? (Given that GIN won’t be able to solve the task on these larger graphs even when they are in the training set). To me, it is less about generalization and more about efficient long-range interactions.

Long-range interactions certainly are an important aspect, as we involve the entire graph instead of executing a constant number of graph convolutions. However, even when the computation is scaled appropriately for GIN (i.e., RecGNN, or on SALSA), the FE Net seems to perform better on these tasks. We would argue that it is more about generalisation, as you apply the model to larger unseen graphs, rather than about efficient long-range interactions, which you could study without size generalisation at test time. Moreover, learning general long-range interactions directly on large graphs might be its own challenge, whereas here you can learn a principle on small instances and generalise the same principle, incorporating some long-range effects, to larger ones.

Comment

Is your method easily parallelizable? Although message complexity may be less it is not clear to me that this method would have a favourable runtime.

There is a tradeoff between involving the entire graph and sequential operations. In order to facilitate information exchange across the entire graph, sequential operations are required for our approach. Note that this impacts any approach: even MPNNs require multiple sequential rounds to achieve similar reach. As a benefit, our wave-like pattern activates only the relevant nodes at each step, potentially reducing memory usage compared to simultaneous updates of all nodes. Most importantly, this structured approach with fewer messages enables performance improvements, as demonstrated by our results on PrefixSum, which aren't achievable with standard (parallel) message passing.
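
As a rough sketch of how the per-layer work can still be batched on a GPU (the tensor layout and masking below are illustrative assumptions, not an excerpt from our code base): the steps are sequential across BFS layers, but all nodes of a layer are updated in a single vectorized operation.

```python
import torch

def flood_step(h, edge_index, active_src, active_dst):
    """Update all nodes of the next BFS layer in parallel.

    h: [num_nodes, dim] node states; edge_index: [2, num_edges];
    active_src / active_dst: boolean masks for the current and next layer.
    One call per layer means O(diameter) kernel launches per phase,
    rather than one launch per node."""
    src, dst = edge_index
    keep = active_src[src] & active_dst[dst]   # edges crossing the wavefront
    msgs = torch.zeros_like(h).index_add_(0, dst[keep], h[src[keep]])
    return torch.where(active_dst.unsqueeze(-1), h + msgs, h)
```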

How would your method work with directed or disconnected graphs? Wouldn't this mean that some nodes will never receive information from other nodes? Additionally, for Theorem 4.2, as well as being connected, do you not also need the graphs to be undirected? For example, consider a path graph where all edges point the same way. If I pick the last node as the origin node, would I then be less expressive (I guess it depends on the implementation)?

Because the graphs are usually undirected and connected, we mainly focused on this scenario. The principle could be generalised to disconnected graphs by simply choosing a starting node in each component; for the directed case, it depends on what you would like to achieve. You could treat each edge as undirected and just encode the direction as a feature, or really enforce that messages only flow one way. In that case it would not be favourable to start in a sink; instead, you could start in one (or all) sources. Are there any direct applications for directed graphs that you are considering?

Review (Rating: 6)

This work proposes a framework, the Flood and Echo Network (FENet), which breaks the synchronous limitation of MPNNs. Theoretically, the authors prove that FENet has higher expressivity than 1-WL. Empirically, they show strong performance of FENet on three algorithmic alignment tasks.

Strengths

  • Overall, the idea is pretty interesting and highly novel.
  • The writing is clear, and the illustrations aid understanding of the framework.
  • The theory is sound.
  • The performance of FENet is validated in the experimental section.

Weaknesses

  • One concern with FENet is the over-squashing problem. The sensitivity of a node to the origin node is unclear: the messages from the origin node pass out to the furthest nodes and are then gathered back, so the over-squashing issue on large graphs may not be solved, and may even be aggravated.
  • Though the theoretical runtime complexity is the same as MPNN, the messages are executed in sequence, while in MPNN it is in parallel. Modern GPUs can handle large batches of data executed in parallel, but this sequential property of FENet might make it super slow on GPU, especially with a lot of phases.
  • One minor point: please make the tables more readable, for example by highlighting the best entry in each table, so it is clearer to readers.
  • Some design details are not quite clear to me. See questions below.

Questions

  • Why do you pick FCrossConv to exchange messages between nodes at the same distance?
  • Why are FConv and FCrossConv reversed in the echo phase?
  • For the fixed mode, how exactly do you decide which node should be the origin node?
  • The authors claim FENet message passing is more efficient. If I understand correctly, over a whole phase the number of node updates, as well as the number of messages conveyed, is not reduced; it's just that the origin node is able to reach farther neighbors within a phase.
  • Is there a reason why you evaluate the framework on algorithmic alignment? Certainly it performs well, but it can also be a general framework for other tasks, say molecular prediction etc.
Comment

We thank the reviewer for their insights and address the raised points and questions below.

One concern of FENet is over-squashing problem...

Our work does not explicitly aim to address the over-squashing problem, as this is primarily a property of the underlying graph topology rather than of the message-passing architecture itself. We also include a statement about over-squashing in the Appendix. Recent work (Giovanni et al., 2023) shows that graph topology is the dominant factor in over-squashing, with architecture choices having a marginal effect compared to bottlenecks in the graph structure. The FE Net's wave-like propagation pattern offers a different approach to information flow, but we acknowledge that fundamental topological bottlenecks would still affect performance. The architecture could potentially be combined with existing approaches that modify the graph structure to address over-squashing, but this was beyond the scope of our current investigation, which focused on algorithmic alignment of the message-passing mechanism itself.

Though the theoretical runtime complexity is the same as MPNN, the messages are executed in sequence, while in MPNN it is in parallel. Modern GPUs can handle large batches of data executed in parallel, but this sequential property of FENet might make it super slow on GPU, especially with a lot of phases.

There is a tradeoff between involving the entire graph and sequential operations. In order to facilitate information exchange across the entire graph, sequential operations are required for our approach. Note that this impacts any approach: even MPNNs require multiple sequential rounds to achieve similar reach. As a benefit, our wave-like pattern activates only the relevant nodes at each step, potentially reducing memory usage compared to simultaneous updates of all nodes. Most importantly, this structured approach with fewer messages enables performance improvements, as demonstrated by our results on PrefixSum, which aren't achievable with standard (parallel) message passing.

One minor point, please make the tables more readable...

We will adjust this, thank you for the input.

Why do you pick FCrossConv to exchange messages between nodes at the same distance?

We include FCrossConv to incorporate the entire graph structure and all edges. While the model could function without these cross-connections (equivalent to FCrossConv having no effect), including them provides flexibility. This allows the model to learn whether and how to utilise these additional connections based on the task requirements.

Why the FConv and FCrossConv are reversed in the echo phase?

We reverse the order in the echo phase to maintain consistency with the overall information flow pattern. Since the echo phase represents information flowing back toward the origin, we thought it more natural to first process nodes at the same distance level (ECrossConv) before passing messages inward (EConv), mirroring, but reversing, the outward flow pattern of the flooding phase.
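
Schematically, the operator order per phase looks as follows (assuming `layers[d]` holds the nodes at hop distance d from the origin; the convolutions are stand-in callables, not the exact implementation):

```python
def one_phase(layers, h, FConv, FCrossConv, EConv, ECrossConv):
    # flood: push messages outward first, then mix nodes at the same distance
    for d in range(1, len(layers)):
        h = FConv(h, src=layers[d - 1], dst=layers[d])
        h = FCrossConv(h, layer=layers[d])
    # echo: mirror the order -- mix at the same distance, then push inward
    for d in range(len(layers) - 2, -1, -1):
        h = ECrossConv(h, layer=layers[d + 1])
        h = EConv(h, src=layers[d + 1], dst=layers[d])
    return h
```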

For fixed mode, how exactly do you design which node to be the origin node?

For the fixed mode, the origin node is determined by the task. In the SALSA-CLRS experiments, we use the starting node provided by the task (e.g., source node in BFS or Dijkstra). When no special node is specified by the task, we default to using the node with id 0 as the origin.

The authors claim FENet message passing is more efficient. If I understand correctly, over a whole phase the number of node updates, as well as the number of messages conveyed, is not reduced; it's just that the origin node is able to reach farther neighbors within a phase.

At each individual computation step you have fewer updates/messages with the FE Net. If you consider an entire phase, the number of messages is the same as in one round of a standard MPNN. However, information has propagated throughout the entire graph, and not just the origin node is influenced by this, but all other nodes (through their updates) as well.
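
As a back-of-the-envelope check of this claim (assuming each inter-layer edge fires once per direction and each same-layer edge once per cross step; the exact constants depend on the implementation):

```python
def messages_per_phase(adj, dist):
    """Count messages in one flood-echo phase, given hop distances `dist`.

    Each edge between consecutive layers carries one message outward and
    one back; each same-layer edge is used once per cross step. The total
    is a small constant times m: one MPNN round's worth of messages, yet
    information traverses the entire graph."""
    down = sum(1 for u in adj for v in adj[u] if dist[v] == dist[u] + 1)
    cross = sum(1 for u in adj for v in adj[u] if dist[v] == dist[u]) // 2
    return 2 * (down + cross)
```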

Is there a reason why you evaluate the framework on algorithmic alignment? Certainly it performs well, but it can also be a general framework for other tasks, say molecular prediction etc.

As of now, we have primarily focused on the setting of algorithm learning on graphs. We thought the generalisation to larger graphs, while performing computation throughout the entire graph, could be best illustrated in that setting. However, it is true that one could use the architecture in other settings as well, in place of regular MPNNs. It is possible that for molecular predictions involving long-range interactions, reducing message complexity through alignment could be of interest. However, how such interactions can be learned directly on large systems (whereas in our setting they are always smaller during training) is still a challenging open research question, so it might not yet be directly applicable.

Comment

Thank you for your great effort on the work and the rebuttal. I agree that FENet is pretty interesting and fits some algorithmic reasoning settings. The authors' responses answer my questions. However, from reading the other reviewers' comments, some limitations of the work remain. Therefore, I would like to keep my score.

Review (Rating: 3)

This paper introduces a message-passing graph neural network called the Flood and Echo Network (FE Net). The message-passing structure mimics a breadth-first traversal (BFS) starting from a source node. The authors demonstrate that FE Net is more expressive than traditional MPNNs, which are limited by the 1-WL test, and is also provably more efficient in terms of message complexity. Applied to the SALSA-CLRS benchmark, FE Net shows improvements in generalization and correct execution, achieving higher accuracy on this benchmark.

Strengths

  1. This paper introduces a message-passing neural network that operates similarly to a breadth-first traversal, requiring O(m) messages, where m is the number of edges in the graph. For many of the tasks evaluated, FE Net requires fewer messages compared to several other MPNN models.
  2. The authors demonstrate that FE Net can be more expressive than standard MPNN models, particularly in the context of the Weisfeiler-Lehman (WL) test.
  3. The paper evaluates FE Net on a subset of tasks from the SALSA-CLRS benchmark using randomly generated ER graphs. On this benchmark, FE Net shows improved generalization compared to GIN and other MPNN models.

Weaknesses

  1. I recommend revising the introduction to clarify that the paper specifically focuses on "neural algorithmic reasoning on graphs." While the title mentions this, the introduction could be clearer, as it currently suggests a focus on general graph learning. Additionally, there is no evidence provided that FE Net outperforms MPNNs on typical supervised or semi-supervised tasks, such as node classification and link prediction.
  2. The exact aggregation and update operations used in FE Net are not clearly discussed. For instance, GIN has been shown to be more expressive with sum pooling as the message aggregator. To establish the expressiveness of FE Net, it would be beneficial to specify these operators explicitly. Furthermore, Theorem 4.2 could be strengthened by showing that FE Net does not fail in cases where the 1-WL test succeeds, in addition to distinguishing graphs where the 1-WL test fails.
  3. Some assumptions in the paper require further clarification. For instance, it seems to assume that a typical MPNN exchanges O(m) messages per layer, where m is the number of edges. However, the actual number of messages depends on the computational graph. If there is a batch of k nodes with an average degree of d, a 2-layer GCN will involve O(kd^2) messages. Additionally, models like GraphSAGE use sampling to reduce message volume, so O(m) messages per layer may not apply to most MPNNs.
  4. FE Net operates similarly to a BFS traversal, where nodes at the same distance from the source are processed together. Given that the algorithms studied also use BFS-like traversal, it is unsurprising that FE Net outperforms some other GNN models. It is unclear, however, how FE Net would perform with algorithms that don’t resemble BFS, as FE Net (and other GNNs) struggled with generalizing to tasks like DFS and MST.
  5. Most of the experiments in the paper are conducted with Erdős-Rényi (ER) graphs, which may not fully represent practical graph structures. Showing results on scale-free graphs or real-world graphs could provide more comprehensive insights into FE Net’s performance.
  6. Since FE Net begins from a source node and reaches all nodes in a connected component, it is unclear how it would handle graphs with multiple connected components. Providing a discussion on this aspect would improve understanding of FE Net's applicability to such cases.

Questions

  1. Do the authors have any insights on how FE Net performs on scale-free graphs?
  2. Can FE Net be applied to semi-supervised learning tasks, such as node classification?
  3. Is FE Net suitable for non-BFS-style algorithms, like clustering or triangle counting?
  4. What specific aggregation and update operations are employed in FE Net?
  5. How is it ensured that FE Net does not fail in cases where the 1-WL test succeeds?
  6. In the experiments with GIN, how many layers were used?
  7. Does FE Net's performance decline when applied to high-diameter graphs, such as road networks?
  8. How does FE Net handle graphs with multiple connected components?
Comment

We thank the reviewer for their insights and address the raised points and questions below.

I recommend revising the introduction ...

We will revise our introduction and put more emphasis on neural algorithmic reasoning; our intention was to outline the proposed architecture first and then dive into NAR on graphs. While our novel take on the message-passing mechanism could be relevant for general graph learning, this work specifically investigates algorithmic reasoning and size generalisation. As such, application to semi-supervised tasks remains an interesting direction for future work (possibly in the context of NAR), but is not the current focus of this paper.

The exact aggregation and update operations used in FE Net are not clearly discussed. For instance, GIN has been shown to be more expressive with sum pooling as the message aggregator. To establish the expressiveness of FE Net, it would be beneficial to specify these operators explicitly. Furthermore, Theorem 4.2 could be strengthened by showing that FE Net does not fail in cases where the 1-WL test succeeds, in addition to distinguishing graphs where the 1-WL test fails.

We provide details in Appendix F and in the attached code base; the FE Net performs a GIN-like update step including a GRU. In the experiments, the same aggregation was used consistently across the baselines. Regarding expressivity, we believe that Theorem 4.1 already provides the suggested strengthening: namely, that the FE Net can simulate any MPNN, thus preserving 1-WL expressivity. As such, if the WL test distinguishes two graphs, so does our method. When testing the same graph twice, the origin node needs to be chosen consistently in order for the test to succeed.
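
For concreteness, a minimal sketch of such an update step (the layer sizes, the epsilon-free combine and the activity masking are simplifications; Appendix F and the code base contain the actual implementation):

```python
import torch
import torch.nn as nn

class GINGRUUpdate(nn.Module):
    """GIN-style sum aggregation followed by a GRU state update."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, edge_index, active):
        src, dst = edge_index
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])  # sum over neighbours
        msg = self.mlp(h + agg)         # GIN-style combine of self and neighbour sum
        new = self.gru(msg, h)          # GRU carries the node state across steps
        return torch.where(active.unsqueeze(-1), new, h)  # only active nodes update
```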

Some assumptions in the paper require further clarification. For instance, it seems to assume that a typical MPNN exchanges O(m) messages per layer, where m is the number of edges. However, the actual number of messages depends on the computational graph. If there is a batch of k nodes with an average degree of d, a 2-layer GCN will involve O(kd^2) messages. Additionally, models like GraphSAGE use sampling to reduce message volume, so O(m) messages per layer may not apply to most MPNNs.

The reviewer raises an important point about message complexity. Our analysis assumes the standard MPNN setting where messages are passed along all edges of the entire graph, leading to O(m) complexity per round. While sampling techniques as in GraphSAGE can reduce this by restricting the graph in question to a specific subgraph, our comparison focuses on architectures operating on the full graph topology to ensure a fair evaluation of algorithmic capabilities. Note that, in principle, you could also combine the FE Net with these subsampling techniques; then there would be O(m') messages, where m' = k*d^2 is the number of edges in the subgraph.

FE Net operates similarly to a BFS traversal, where nodes at the same distance from the source are processed together. Given that the algorithms studied also use BFS-like traversal, it is unsurprising that FE Net outperforms some other GNN models. It is unclear, however, how FE Net would perform with algorithms that don’t resemble BFS, as FE Net (and other GNNs) struggled with generalizing to tasks like DFS and MST.

Extrapolation is a very hard objective and a fundamental challenge in neural algorithmic reasoning. Our results demonstrate that architectural alignment is a viable option to significantly improve generalisation. Other promising indicators of this are the results on MIS (maintaining >80% accuracy at 160 nodes) and the consistent improvement in performance when increasing the number of phases (Figure 7). These findings suggest that our approach of aligning neural architectures with algorithmic principles is a promising direction for improving generalisation, even for tasks that are not directly aligned.

Most of the experiments in the paper are conducted with Erdős-Rényi (ER) graphs, which may not fully represent practical graph structures. Showing results on scale-free graphs or real-world graphs could provide more comprehensive insights into FE Net’s performance.

While our evaluation primarily uses ER graphs following the SALSA-CLRS and CLRS benchmark setup, the full results in Appendix I also include performance across WS and Delaunay graphs. In addition, Distance and PrefixSum are evaluated on trees and line graphs, where the latter have very high diameter. As such, we think we already cover a reasonably general and challenging family of graphs. Is there a particular reason to include scale-free networks in the ablations, one that might even apply to NAR in general?

Comment

Since FE Net begins from a source node and reaches all nodes in a connected component, it is unclear how it would handle graphs with multiple connected components. Providing a discussion on this aspect would improve understanding of FE Net's applicability to such cases.

The current implementation focuses on connected graphs, but could be naturally extended by selecting an origin for each connected component. Since no information can be exchanged between disconnected components (which also applies to standard MPNNs), this extension is straightforward but was not our primary focus.
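
A minimal sketch of this extension (the smallest-id tie-break for picking each component's origin is a placeholder choice):

```python
from collections import deque

def origins_per_component(adj):
    """Return one origin per connected component (smallest node id here),
    so that a separate flood-echo phase can run in every component."""
    seen, origins = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        origins.append(start)
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return origins
```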

Do the authors have any insights on how FE Net performs on scale-free graphs?

We have not specifically evaluated FE Net on scale-free graphs. The current experiments focus on ER, WS and Delaunay graphs following the SALSA-CLRS benchmark setup, and additionally include trees and line graphs.

Can FE Net be applied to semi-supervised learning tasks, such as node classification?

In principle, FE Net can be used like any GNN for semi-supervised tasks. However, our work focuses on size generalisation for algorithmic reasoning where architectural alignment is likely to tap into unused potential.

Is FE Net suitable for non-BFS-style algorithms, like clustering or triangle counting?

For tasks like clustering and triangle counting that primarily rely on local neighborhood information, the global information exchange of FE Net may not provide significant advantages over standard MPNNs.

What specific aggregation and update operations are employed in FE Net?

We use GRU-based updates and GIN-style aggregations, with full implementation details provided in Appendix F and our code repository.

In the experiments with GIN, how many layers were used?

For the algorithmic tasks we use 5 layers for GIN as specified in the text. This setup was chosen as more layers led to instability where GIN did not learn anything useful.

Does FE Net's performance decline when applied to high-diameter graphs, such as road networks?

Performance naturally depends on whether information from distant nodes is relevant for the task. Our results suggest that, when such long-range information is important, the FE Net maintains better performance than standard GNNs, provided a correct solution is learned.

How does FE Net handle graphs with multiple connected components?

The current implementation focuses on connected graphs. Since information cannot be exchanged between disconnected components (also true for MPNNs), extending to multiple components is straightforward by selecting origins for each component.

Comment

Thank you for your detailed rebuttal. I have reviewed it carefully and have decided to maintain my original score.

The paper has several limitations that require further experimental validation. For instance, the use of ER and other small random graphs in the experiments is problematic, as it does not adequately represent a wide range of graph types. Additionally, the proposed method is limited to undirected and connected graphs, which significantly restricts its applicability.

Thank you again for your thoughtful responses.

Review (Rating: 5)

The paper proposes the Flood and Echo algorithm, which works by selecting an origin node and sending messages outward from this node. When the messages reach the periphery of the graph, as seen from the origin node, they reflect back and trace the same path back to the origin node. It is theoretically proven that the proposed model is a perfect fit for the distance, path-finding and prefix-sum tasks.

Strengths

  • The proposed approach is theoretically and empirically validated.
  • The paper is well written.
  • The experiments include algorithmic tasks beyond the ones tailored to the proposed approach.

Weaknesses

  • The proposed algorithm does not generalize to other algorithmic tasks as well as the tasks it is designed for.

Questions

  • The impact of the algorithm is not clear: which real-world use cases would the proposed approach be most beneficial for?
  • What is the dependency of the proposed method on the chosen origin node?
Comment

We thank the reviewer for their insights and address the raised points and questions below.

The proposed algorithm does not generalize to other algorithmic tasks as well as the tasks it is designed for.

The architecture seems to perform better where the algorithmic tasks are aligned with the flooding/echo patterns. However, we would like to point out that extrapolation is a very hard objective and a fundamental challenge in neural algorithmic reasoning. Our results demonstrate that architectural alignment is a viable option to significantly improve generalisation. Other promising indicators of this are the results on MIS (maintaining >80% accuracy at 160 nodes) and the consistent improvement in performance when increasing the number of phases (Figure 7). These findings suggest that our approach of aligning neural architectures with algorithmic principles is a promising direction for improving generalisation, even for tasks not directly matching the flooding/echo pattern.

The impact of the algorithm is not clear: which real-world use cases would the proposed approach be most beneficial for?

While we focus on leveraging algorithmic alignment on algorithmic tasks due to our interest in size generalisation, the concept of sparse yet parallel computation patterns that efficiently involve the entire graph could be of independent interest for other applications across graph learning. Especially for systems with long-range interactions, reducing message complexity through alignment could be of interest. However, how such interactions can be learned directly on large systems (whereas in our setting they are always smaller during training) is still a challenging open research question.

What is the dependency of the proposed method on the chosen origin node?

We have conducted an experiment in Section G.3, Table 8 to assess the effect of the origin node. Our analysis shows that random starts lead to some training instability, as not all models achieve perfect accuracy. However, individual trained models show consistent performance, with narrow deviations across 50 runs on 1000 graphs.

Comment

Thank you for the rebuttal, I acknowledge that I have read it and would like to keep my score.

AC Meta-Review

(a) Scientific Claims and Findings: The paper introduces the Flood and Echo Net, a novel architecture that aligns neural computation with distributed algorithm principles. Unlike traditional Graph Neural Networks (GNNs) that rely on simultaneous message-passing among neighboring nodes, this method employs sparse activations upon message receipt, creating a wave-like activation pattern that traverses the graph. This approach enhances expressiveness beyond the limitations of the 1-WL test and improves message-passing efficiency. The architecture's ability to generalize across graphs of varying sizes makes it suitable for algorithmic learning tasks. Empirical evaluations on synthetic tasks demonstrate that the Flood and Echo Net's algorithmic alignment improves generalization.

(b) Strengths:

  • Innovative Approach: The introduction of sparse, wave-like activations in GNNs offers a novel method for information propagation, potentially overcoming limitations of traditional message-passing mechanisms. Multiple reviewers appreciate the high novelty.
  • Enhanced Expressiveness: By aligning neural computation with distributed algorithm principles, the Flood and Echo Net demonstrates increased expressiveness, surpassing the constraints of the 1-WL test.
  • Scalability: The architecture's ability to generalize across graphs of varying sizes positions it as a practical solution for algorithmic learning tasks involving diverse graph structures.
  • Validation: The method's effectiveness is supported by empirical results on synthetic tasks and the SALSA-CLRS benchmark.
  • Presentation: The paper is well written and understandable.

(c) Weaknesses:

  • Computational Efficiency: The paper does not provide a detailed analysis of the runtime of the Flood and Echo Net, particularly concerning large-scale graphs.
  • Theoretical Justification: While the paper introduces a novel architecture, it lacks a comprehensive theoretical analysis explaining why the sparse, wave-like activation pattern leads to improved generalization performance (beyond just involving the whole graph, as one reviewer noted), which would strengthen the understanding of the method's advantages.
  • Comparative Analysis: The paper would benefit from a more detailed comparison with existing GNN architectures to clearly highlight the advantages and potential limitations of the proposed approach.
  • Real-world impact/application relevance: The real-world use cases are unclear. The authors have proposed long-range benchmarks but have not conducted related experiments. The experiments are limited to unrealistic graph structures (mostly ER graphs).
  • (Minor) Limitations: The approach is limited to undirected and connected graphs.
  • The parallelisation potential and the potential for memory reduction should be discussed and quantified/analyzed in the paper (and not only in the rebuttal).
  • Raised concern (Reviewer tssR): the method's performance is unclear on tasks where an origin node is not easily defined.

(d) Reasons for Rejection: After careful consideration, the decision to reject the paper is based on the following reasons:

  1. Insufficient Computational Analysis: The lack of a detailed examination of the Flood and Echo Net raises concerns about its practical applicability to large-scale graphs and the generalization performance on real world tasks.
  2. Need for Comparative Evaluation: A more thorough comparison with existing state-of-the-art GNN models is necessary to substantiate the claimed improvements and to position the proposed method within the current research landscape. In particular, a comparison with other approaches that are more expressive than 1-WL would support the story of the paper. Addressing these concerns would enhance the paper's contribution to the field and its potential for acceptance.

Additional Comments from Reviewer Discussion

Reviewers largely have appreciated the novelty of the proposal. Yet, concerns (as detailed under weaknesses) remain. In particular, a more thorough experimental analysis that highlights the relevance of the Flood and Echo Net to solve real world tasks was requested.

Final Decision

Reject