PaperHub
6.6 / 10
Poster · 4 reviewers
Lowest score 3 · Highest score 4 · Std dev 0.5
Individual ratings: 4, 4, 3, 3
ICML 2025

Balancing Efficiency and Expressiveness: Subgraph GNNs with Walk-Based Centrality

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

We introduce a framework which drastically reduces the computational cost of Subgraph GNNs by leveraging walk-based centralities, both as an effective subgraph sampling strategy and as a form of powerful Structural Encoding.

Keywords

Graph Neural Networks · Subgraphs · Expressivity · Graph Centrality

Reviews and Discussion

Review

Rating: 4

This paper proposes a technique to reduce the computational complexity of Subgraph GNNs, which have higher expressivity than 1-WL but are often too computationally expensive to be used in practice. In particular, the method, called HyMN, relies on sampling subgraphs based on node centralities, which can also be used as a structural encoding, akin to the RWPE. Experiments showcase the applicability of the proposed method, showing that it obtains predictive performance comparable to or better than full-bag Subgraph GNNs, while having significantly lower time complexity.
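
To make the sampling step concrete, here is a minimal sketch of centrality-based subgraph selection (our own illustration, not the authors' released code; `subgraph_centrality` and `top_t_marked_bags` are hypothetical helpers). Subgraph Centrality is read off the diagonal of the matrix exponential of the adjacency matrix, and the top-T nodes become the marking candidates:

```python
import numpy as np
from scipy.linalg import expm


def subgraph_centrality(A: np.ndarray) -> np.ndarray:
    """Walk-based Subgraph Centrality: SC(v) = [exp(A)]_vv, a factorially
    damped count of the closed walks starting and ending at each node."""
    return np.diag(expm(A))


def top_t_marked_bags(A: np.ndarray, T: int) -> list:
    """Build the sampled bag as one node marking (a one-hot indicator
    over the nodes) per top-centrality node."""
    sc = subgraph_centrality(A)
    top_nodes = np.argsort(sc)[::-1][:T]      # T highest-centrality nodes
    return [np.eye(A.shape[0])[v] for v in top_nodes]


# Toy graph: a triangle (nodes 0-2) with a pendant path (nodes 3-4).
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0

# The marks concentrate on the triangle, the walk-densest region.
for mark in top_t_marked_bags(A, T=2):
    print(mark)
```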

Questions for Authors

Most real-world datasets, especially molecular ones, don't really require expressive power higher than 1-WL; see for example Table 2 in [WL meet VC, ICML'23] or Table 2 in [Graph Neural Networks Can (Often) Count Substructures, ICLR'25]. Do you have an intuitive explanation of why Subgraph GNNs nonetheless show higher predictive performance compared to simpler baselines? It would be interesting to discuss this in the paper.


Update

I believe that during the rebuttal the paper has improved (e.g., new experiments, intuitive figures, and a stronger motivation). Therefore, I raised my score to 4 to suggest acceptance.

Claims and Evidence

The (few) theoretical results on expressivity are correct. The experimental results are backed up by publicly available code.

Methods and Evaluation Criteria

The proposed method to select the marked subgraphs for Subgraph GNNs is reasonable and effective, and is motivated by experimental results in Section 3.2. However, there is a discrepancy between these observations and the proposed method itself (see W1).

The method is benchmarked on relatively few datasets, all of which are molecular ones, except MalNet-Tiny. For a mostly experimental paper like this one, more datasets would be preferred. The results substantiate the validity of the method.

Theoretical Claims

The (few) theoretical results on expressivity (Thm 4.1 and 4.2) seem to be correct.

Experimental Design and Analysis

Apart from the lack of a diverse selection of real-world datasets, the experimental methodology is sound. The authors provide results on both synthetic and real-world datasets, they provide confidence intervals, and code for reproducibility.

Supplementary Material

I checked the proofs and the provided code (but I did not run it).

Relation to Prior Work

The proposed method seems to produce accuracies on par with Subgraph GNNs that consider all subgraphs, at a fraction of the cost. This is an important advance in a field that has been hindered by computational costs. Moreover, the proposed method is simple to implement, which makes it easier for practitioners to adopt and for further research to build upon.

Missing Essential References

None.

Other Strengths and Weaknesses

Strengths:

  1. The proposed method seems to produce accuracies on par with Subgraph GNNs that consider all subgraphs, at a fraction of the cost. This is an important advance in a field that has been hindered by computational costs. Moreover, the proposed method is simple to implement, which makes it easier for practitioners to adopt and for further research to build upon.
  2. The paper is well-written and very well motivated. The observations on which node markings impact the GNN output the most, and the subsequent exploitation of these observations, make for a compelling story that keeps the paper easy to follow.

Weaknesses:

  1. There seems to be a large discrepancy between the motivating observations and experiments of Section 3.2 and the proposed method itself. Indeed, that section observes that, given a graph G, marking the node with the highest walk-based centrality and obtaining a single graph G’ yields the largest change in the GNN’s output |f(G) - f(G’)|. This, however, does not align with Subgraph GNNs, where the graph is transformed into a bag of transformed (sub)graphs. I understand that one might want to diversify the embeddings of each of the subgraphs in the bag, and this could be the motivation for your choice, but this should be discussed explicitly in the manuscript.
  2. The subgraph centrality structural encoding proposed in the paper is simply a rescaling of the RWPE, and can hardly be considered a contribution. This should perhaps be stated more clearly.
  3. The baseline of GIN + CSE should be added to the experiments on ogb data.
  4. The method is benchmarked on relatively few datasets, all of which are molecular ones, except MalNet-Tiny. For a mostly experimental paper like this one, more datasets would be preferred.

The paper, although somewhat incremental, seems to be solid and impactful, and it can be considered for acceptance.

Other Comments or Suggestions

  1. It would be useful to have a figure to illustrate node-marking-based Subgraph GNNs, with and without a subsampling approach like HyMN.

Remark: I’d be open to raising my score (which is anyway mostly positive) if all my concerns are properly addressed.

Author Response

The reviewer found the paper “solid and impactful”, and the method “an important advance in the field”, “reasonable”, “effective”, and “motivated” by experimental results. The manuscript was also found “well-written”.

We address comments below.


“There seems to be a large discrepancy between the motivating observations and experiments of Section 3.2, and the proposed method itself. [...] this should be discussed explicitly in the manuscript.”

In Section 3.2 we look at the optimal initial subgraph selection to (i) alter the base MPNN’s graph representation sufficiently and (ii) in a way that correlates with predictive information (e.g. subgraph counts). When only one subgraph is chosen (T=1), this approach exactly aligns with the proposed method, and we disagree there is a discrepancy in this case. Also, T=1 is, by far, the most compelling setting from a complexity perspective, and our model attains very strong performance in it.

Overall, we optimized for a sampling strategy that was efficient and non-learnable, and accordingly considered the case where every subgraph selection is made w.r.t. the original graph representation. We acknowledge that further theoretical insights could be gained on partially filled bags; we mention this in line 435 and will emphasize this more.
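
As an aside for readers, the perturbation experiment referenced here is easy to emulate. The toy sketch below is ours (all names are hypothetical), with a random untrained sum-aggregation MPNN standing in for the GIN of Section 3.2: it computes |f(G) - f(G_v)| for every possible node marking and prints the walk-based centralities for qualitative comparison.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)


def mpnn_embed(A, X, W1, W2):
    """Two rounds of untrained sum-aggregation message passing plus sum
    pooling -- a crude stand-in for a randomly initialised GIN."""
    S = A + np.eye(len(A))            # include self-loops
    H = np.tanh(S @ X @ W1)
    H = np.tanh(S @ H @ W2)
    return H.sum(axis=0)              # graph-level embedding f(G)


def marking_perturbations(A, hidden=8):
    """||f(G) - f(G_v)|| for each node-marked copy G_v of G."""
    n = len(A)
    W1 = rng.normal(size=(2, hidden))
    W2 = rng.normal(size=(hidden, hidden))
    X = np.hstack([np.ones((n, 1)), np.zeros((n, 1))])  # constant feature + mark slot
    f_G = mpnn_embed(A, X, W1, W2)
    deltas = []
    for v in range(n):
        Xv = X.copy()
        Xv[v, 1] = 1.0                # mark node v
        deltas.append(np.linalg.norm(mpnn_embed(A, Xv, W1, W2) - f_G))
    return np.array(deltas)


# Triangle with a pendant path: high-centrality nodes sit in the triangle.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5)]:
    A[u, v] = A[v, u] = 1.0

print("perturbations:", marking_perturbations(A).round(3))
print("centralities: ", np.diag(expm(A)).round(3))
```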


“The subgraph centrality structural encoding proposed in the paper is simply a rescaling of the RWPE, [...] This maybe should be stated more clearly.”

We acknowledge the similarity between RWSEs and CSEs. In fact, we devote Section F.4 to discuss their relation.

Proposing CSEs as a new form of SEs is not the (main) contribution of our manuscript. CSEs emerge from the centrality computations required for subgraph sampling: we propose to retain them and show the benefit of this in terms of discriminative power. We will clarify this point further.
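
To make the relation concrete, the sketch below (ours) contrasts the two encodings, assuming CSEs take the factorially damped form diag(A^k)/k!, i.e. the truncated series behind Subgraph Centrality — this is our reading, and Section F.4 of the paper gives the precise relation. Both encodings are rescalings of closed-walk diagonals, differing only in the normalization applied.

```python
import numpy as np
from math import factorial


def walk_diagonals(A: np.ndarray, K: int) -> np.ndarray:
    """diag(A^k) for k = 1..K: closed-walk counts at every node."""
    diags, P = [], np.eye(len(A))
    for _ in range(K):
        P = P @ A
        diags.append(np.diag(P).copy())
    return np.stack(diags, axis=1)          # shape (n, K)


def rwse(A: np.ndarray, K: int) -> np.ndarray:
    """Random-walk SE: diag((D^-1 A)^k), k = 1..K (no isolated nodes assumed)."""
    M = A / A.sum(axis=1, keepdims=True)    # row-normalised random-walk matrix
    diags, P = [], np.eye(len(A))
    for _ in range(K):
        P = P @ M
        diags.append(np.diag(P).copy())
    return np.stack(diags, axis=1)


def cse(A: np.ndarray, K: int) -> np.ndarray:
    """Assumed CSE form: factorially damped closed-walk counts diag(A^k)/k!,
    the truncated series whose sum is Subgraph Centrality."""
    weights = np.array([factorial(k) for k in range(1, K + 1)], dtype=float)
    return walk_diagonals(A, K) / weights
```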


“The baseline of GIN + CSE should be added to the experiments on ogb data.”

We agree. We conducted these experiments during the rebuttal period and report the results below:

Model           | molhiv       | molbace      | moltox
GIN+CSEs        | 77.44 ± 1.87 | 76.58 ± 2.29 | 75.81 ± 0.39
HyMN (GIN, T=2) | 81.01 ± 1.17 | 81.16 ± 1.21 | 77.30 ± 0.35

These results support the effectiveness of our centrality-based strategy. We will include them in Table 2.


“The method is benchmarked on relatively few datasets, all of which are molecular ones, except MalNet-Tiny. For a mostly experimental paper like this one, more datasets would be preferred.”

We note that, although effectively molecules, peptides differ substantially from the small drug-like molecules in ogbg datasets—in size, structure, and biological role. Actually, we argue this makes for three different data domains.

That said, we agree broader benchmarking would strengthen our work. To this end, during the rebuttal period we additionally evaluated our method on the Reddit-binary (RDT-B) dataset—a fourth domain (social networks). RDT-B is significantly larger than ogbg graphs and is too large for full-bag Subgraph GNNs. Results, obtained under the setup from [1], are below:

Method             | RDT-B Accuracy
GIN                | 92.4 ± 2.5
FULL               | OOM
RANDOM (T=20)      | 92.6 ± 1.5
RANDOM (T=2)       | 92.4 ± 1.0
POLICY-LEARN (T=2) | 93.0 ± 0.9
HyMN (T=2)         | 93.2 ± 2.2

HyMN matches the performance of the learnable POLICY-LEARN, consistent with the results obtained on other datasets. We will add the above in the next paper revision.

[1] https://arxiv.org/abs/1810.00826


“It would be useful to have a figure to illustrate node-marking-based Subgraph GNNs, with and without a subsampling approach like HyMN.”

Thanks for the suggestion – we have accordingly prepared a figure currently hosted at this anonymous link. We plan to add it in the next paper revision (Section B).


“Most real-world datasets, especially molecular ones, don't really require expressive power higher than 1-WL [...] Do you have an intuitive explanation of why Subgraph GNNs nonetheless show higher predictive performance [...] ?”

This is an open question in graph learning. Higher separation power may not be needed, or may even be harmful if obtained via spurious features. Yet, expressive models like Subgraph GNNs often generalize better. A reasonable hypothesis is that they form a representation space with “more convenient” graph separation, one that makes the downstream task more amenable to learning, e.g. by “aligning” with predictive motif counts. Some of these intuitions are echoed in our approach.


We hope our responses clarify the raised concerns. We would kindly ask the reviewer to reconsider their scores in view of them.

Reviewer Comment

Thank you for your reply. I think the additional experiments further strengthen the paper.

I am however confused about the figure you linked, as the bag of subgraphs does not contain the original graph with marked nodes, but rather subgraphs of the original graph. I think the figure should be more closely aligned with your method. In fact, thinking about node markings made me realize that your method, especially with T=1, is closely linked to the random node features/node individualization literature, where nodes are marked in order to increase expressivity. You might want to consider discussing this relationship in the paper.

Finally, I don't understand how the method with T=1 aligns with the intuition given in Section 3.2. For example, you say that "subgraphs associated with low centrality nodes can be interpreted as redundant w.r.t. the original input graph". If you choose T=1 there is no redundancy whatsoever. It is also very unclear to me why one would want the representation of the marked graph (with T=1) to be as different as possible from the original graph, as this could lead to unpredictable behavior (although clearly this is not the case in practice, as the method performs really well). I think that providing a more convincing alignment between intuition and practice could make the difference between a good and a great paper.

Author Comment

We are glad the reviewer appreciated our additional set of results. Let us reply to the outstanding points here below.


“I am however confused about the figure you linked, as the bag of subgraphs does not contain the original graph with marked nodes [...]”

Thanks for pointing this out – yes, the figure we had created refers to a generic Subgraph GNN with a bag-sampling approach. In the specific example, the subgraph selection policy was ego-networks. We agree this can cause confusion, as, in our paper, we always work with the (more general) node-marking selection policy. Accordingly, we improved the figure to better align it with our approach. Please find it here (link).

Concretely, the figure represents the setting T=1, where only one marked subgraph is sampled (out of six possible ones). In particular, our approach chooses the subgraph associated with the maximum-centrality node.


“I don't understand how the method with T=1 aligns with the intuition given in Section 3.2. For example, you say that "subgraphs associated with low centrality nodes can be interpreted as redundant [...]”

In the setting T=1, the sampling method picks exactly one node-marked subgraph out of all possible ones. For T=1, the downstream GNN will then effectively process a node-perturbed version of the original graph (as illustrated in the aforementioned figure). Now, the observation we make in Section 3.2 is that if the marked node is a low-centrality one, then such a GNN will output a graph representation (necessarily) similar to the one it would have produced when processing the original graph. We agree that the term “redundant” in line 196 is not fully rigorous in referring to this, and we will remove it in the next revision. We will only keep the term “poor marking candidates” to refer to them (line 197). See more on this below.


“[...] why one would want the representation of the marked graph (with T=1) to be as different as possible to the original graph, as this could lead to unpredictable behavior (although clearly this is not the case in practice, as the method performs really well).”

Out of all possible node-marked subgraphs, our method prefers to sample those associated with higher-centrality nodes. This is because:

  • As mentioned above, low-centrality nodes perturb representations only to a limited extent, causing the method to work in a “regime” close to a vanilla 1-WL GNN;
  • We empirically observe that the perturbations induced by marking high-centrality nodes correlate much more strongly with potentially predictive features, like subgraph counts.

According to our observations, then, higher-centrality nodes are those with the potential to perturb the original graph representation the most, and in a way that captures predictive features. As the reviewer correctly observes, marking these nodes does not lead to unpredictable behaviour in practice: ultimately, the downstream GNN will learn, via training, how best to use the additionally provided information. In principle, it can learn to ignore this completely (should it not be predictive) or to leverage it to extract expressive features (in case they correlate with targets). This flexibility may not be effectively provided by marking low-centrality nodes, whose induced perturbations have a lower upper bound.

To further support the point above, we have run additional experiments on counting triangle and 4-cycle substructures, studying the impact of sampling the marked subgraphs associated with the minimum values of the Subgraph Centrality (SC). We report the results in the tables below, along with those of random and max-centrality sampling (MAE, the lower the better).

Triangles

Policy | 10 subgraphs | 20 subgraphs | 30 subgraphs
Min SC | 0.78         | 0.52         | 0.43
Random | 0.62         | 0.48         | 0.40
Max SC | 0.20         | 0.10         | 0.03

4-cycles

Policy | 10 subgraphs | 20 subgraphs | 30 subgraphs
Min SC | 0.74         | 0.63         | 0.41
Random | 0.59         | 0.45         | 0.36
Max SC | 0.38         | 0.12         | 0.08

We observe that min-centrality sampling is indeed outperformed by the other two strategies in all settings.


“[...] with T=1, is closely linked with random node features/node individualizations literature, where nodes are marked in order to increase expressivity.”

Thanks for pointing this out. For T=1, node-marking can be interpreted as the first call to a node-individualization algorithm. Also, in a way, node-marking breaks symmetries between nodes as random features would do. We will discuss these connections in the next revision.


We will clarify all the points above based on our responses, hoping they have addressed your outstanding concerns.

Review

Rating: 4

The authors propose a centrality-based scheme for selecting expressive subgraphs for Subgraph GNNs. Essentially, their idea is based on the observation that message passing schemes and walks on graphs are the same thing, which makes walk-based centralities a natural proxy for selecting nodes with high influence on the learning process of GNNs. The authors evaluate their approach on synthetic (Erdős–Rényi random graphs) and real-world graphs and demonstrate that it outperforms Subgraph-GNN-based and transformer-based baselines in terms of both performance and runtime.
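
For reference, the walk/message-passing correspondence alluded to here can be made precise through the Subgraph Centrality of Estrada and Rodríguez-Velázquez, the walk-based measure the paper builds on:

```latex
% (A^k)_{vv} counts the closed walks of length k that start and end at v;
% Subgraph Centrality aggregates all of them with factorial damping, so
% shorter (more "message-passing-reachable") walks dominate the score.
SC(v) \;=\; \bigl[e^{A}\bigr]_{vv} \;=\; \sum_{k=0}^{\infty} \frac{\bigl(A^{k}\bigr)_{vv}}{k!}
```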

Update after rebuttal

As mentioned during the rebuttal period, I am satisfied with the authors' responses and planned revisions to the manuscript. Hence, I maintain my score, which I updated during the rebuttal period.

Questions for Authors

  1. Are there any limitations regarding what kinds of graphs the proposed approach can be applied to? For example, do the graphs need to be connected or undirected? What about disconnected graphs or directed graphs that are only weakly connected? Or is the only requirement that the chosen centrality measure must be able to assign centrality scores to all nodes?
  2. Why is there a difference between degree centrality and PageRank in Figure 3? In undirected graphs, the two should select the same nodes as the most central. Is the difference only due to the randomness in the initialisation of GNN weights?
  3. What is the reason for the missing values in Table 2?
  4. The paper title mentions "balancing efficiency and expressiveness", but I am wondering to what extent it can actually be claimed that the subgraph centrality addresses the expressiveness goal. Yes, selecting the nodes with the highest centrality values works better than random selection or selecting those nodes with low centrality values. But is there another set of nodes that should be preferred over those with the highest centrality values? My intuition tells me that this should be the case (and I might be wrong about that). But given the relatively small number of chosen nodes in the empirical experiments, it should be possible to quantify how well the proposed selection scheme works in comparison to the optimal set of nodes of the same size that could be chosen.

Claims and Evidence

The claims made by the authors are supported by clear arguments and empirical evidence.

Methods and Evaluation Criteria

I believe that the chosen evaluation datasets and measures are sufficient for supporting the authors' claims. As they point out, their approach focuses on learning graph representations with the intention to provide a basis for various downstream tasks.

Theoretical Claims

I find that the authors' arguments are clearly formulated and convincing. I have read the proofs in Appendix D and did not find any obvious issues; however, I am not trained as a mathematician and may have missed more involved subtleties.

Experimental Design and Analysis

The experimental setup seems sound. However, I find that the authors should be more explicit regarding how the results of their evaluation should be interpreted. Specifically, they count substructures in Erdős–Rényi random graphs and measure the performance of their scheme in terms of Pearson correlation. But how should we interpret those values? Obviously a higher correlation is better, but what results do we expect? And what results does their proposed method return? Also, based on the datasets used in Table 2, I guess that the considered task is graph classification; I think it would be better to state the considered task explicitly in the text.

Supplementary Material

I've read the supplementary material but did not carefully check it for correctness.

Relation to Prior Work

As the authors point out, current subgraph GNNs suffer from computational challenges when they consider full bags of subgraphs and, therefore, are limited to applications involving very small graphs. While the network science community has long considered walk-based measures as an extremely useful tool, I am happy to see that they are now also finding broader adoption in the deep learning community.

Missing Essential References

I am not aware of any essential references that the authors have missed to include.

Other Strengths and Weaknesses

I find the paper extremely well written. The authors' arguments are clearly laid out. They are guiding the reader by asking and answering relevant questions on the reader's behalf. Overall, I found it enjoyable to read the paper.

Other Comments or Suggestions

I found the explanation regarding what effectiveness and efficiency are (Section 3.1, paragraph "Goal") very intuitive.

Author Response

We are very pleased to note the positive recommendation made by the reviewer. They appreciated the proposed application of Network Science results to Deep Learning, and found the paper “extremely well written”, with claims supported by “clear arguments and empirical evidence”.

The reviewer had a clarity concern pertaining to tackled tasks and result interpretation, along with some questions. We address these points in the following.


“I find that the authors should be more explicit regarding how the results of their evaluation should be interpreted. [...] they count sub-structures in Erdős–Rényi random graphs and measure the performance of their scheme in terms of Pearson correlation. But how should we interpret those values? [...] Also, based on the datasets used in Table 2, I guess that the considered task is graph classification [...]”

Substructure counts: in Section 3 we report results in terms of Pearson correlation. The aim of these experiments is to understand how much substructure counts are recapitulated by the perturbations induced by different marking strategies. We agree that absolute correlations may be difficult to interpret per se, but, in relative terms, it is very valuable to observe how the various marking strategies rank w.r.t. each other.

Eventually, these rankings give us an indication of how the strategy will perform in the actual task of counting substructures, which we study in Section 5. There, the results are quantified in terms of Mean Absolute Error, and are reported in Figures 3 and 7. We will better highlight these aspects in the next revision. As for the results reported in Table 2 – yes, we confirm all three benchmarks involve graph classification. We will make sure to state this clearly.
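
As a rough illustration of what such correlations look like, the snippet below is our simplified proxy for the paper's experiment: it correlates each node's walk-based centrality directly with its local triangle count on a single Erdős–Rényi graph, skipping the GNN-perturbation step of the actual study, and reports the kind of Pearson statistic being discussed.

```python
import networkx as nx
import numpy as np
from scipy.linalg import expm
from scipy.stats import pearsonr

# One Erdős–Rényi graph, echoing the paper's synthetic setup.
G = nx.erdos_renyi_graph(n=50, p=0.1, seed=0)
A = nx.to_numpy_array(G)

sc = np.diag(expm(A))                           # Subgraph Centrality per node
tri = np.array(list(nx.triangles(G).values()))  # triangles each node is in

r, _ = pearsonr(sc, tri)
print(f"Pearson r (centrality vs. local triangle count): {r:.2f}")
```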


“Are there any limitations regarding what kinds of graphs the proposed approach can be applied to? [...] is the only requirement that the chosen centrality measure must be able to assign centrality scores to all nodes?”

Yes, this is correct. As long as the centrality measure ranks all the nodes in a graph, the method can be run without any caveat.

Regarding directionality: our focus is mostly on undirected graphs, consistent with virtually all previous work on Subgraph GNNs. As such, our analysis does not specifically discuss the case of directed graphs, but one could clearly consider a centrality measure that is well-defined on them. An interesting avenue for future research would be to extend perturbation analyses to directed graphs and study whether known centrality measures capture these effects well, similarly to how we proceed in our manuscript.


“Why is there a difference between degree centrality and PageRank in Figure 3? [...]”

While it is true that the two centrality measures often produce similar node rankings in undirected graphs, we note that they are not generally equivalent in this setting. Degree centrality only accounts for first-order node interactions, whereas PageRank also considers the effect of higher-order neighbours. In the latter case, two nodes with the same degree would, e.g., be ranked differently if they connect to a different number of high-degree neighbours.
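
A quick check of this point (our toy example, using networkx): nodes a and b below both have degree 2, yet their PageRank scores differ because their neighbourhoods do.

```python
import networkx as nx

# 'a' bridges a hub and one of its spokes; 'b' only joins two leaves.
G = nx.Graph()
G.add_edges_from([
    ("h", 1), ("h", 2), ("h", 3), ("h", 4),   # hub with four spokes
    ("a", "h"), ("a", 1),                     # a: degree 2, well-connected
    ("l1", "h"), ("b", "l1"), ("b", "l2"),    # b: degree 2, peripheral
])

pr = nx.pagerank(G)
for node in ("a", "b"):
    print(node, "degree:", G.degree(node), "pagerank:", round(pr[node], 4))
# Equal degree, different PageRank: higher-order neighbours matter.
```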


“What is the reason for the missing values in Table 2?”

We thank the reviewer for pointing this out. We chose these ogbg molecular benchmarks to capture a good overlapping set of datasets on which previous sampling-based Subgraph GNNs were tested. The results in Table 2 are taken directly from the original papers and, unfortunately, some methods were not uniformly benchmarked on these datasets. This explains the “missing numbers”. We will be clearer about this in the next revision.


“[...] Yes, selecting the nodes with the highest centrality values works better than random selection or selecting those nodes with low centrality values. But is there another set of nodes that should be preferred over those with the highest centrality values? [...] it should be possible to quantify how well the proposed selection scheme works in comparison to the optimal set of nodes of the same size that could be chosen.”

This is an interesting point, and we agree that quantifying optimality could be particularly insightful. We also note, however, that optimality is ultimately defined with respect to a particular learning task. Finding an optimal set which also accounts for learning would quickly become computationally challenging. We respectfully believe this experiment could not be run in a thorough manner within the rebuttal period. We will, either way, point out this interesting direction in the next revision.

Reviewer Comment

Thank you for your responses. I find that my concerns will be addressed by the planned revisions and have no further questions.

Review

Rating: 3

This paper explores subgraph GNNs as a way to overcome the expressivity limitations of standard Message Passing Neural Networks (MPNNs). In a subgraph GNN, a graph is transformed into a "bag of subgraphs," where each subgraph is processed independently using an equivariant architecture, and the results are aggregated for final prediction. While these methods enhance expressivity, they suffer from high computational costs due to the large number of subgraphs that need to be processed.

To address this challenge, this paper proposes a hybrid subgraph selection and feature augmentation method. The key idea is to sample subgraphs centered around nodes with high Subgraph Centrality, which is a metric introduced in prior work that measures a node's importance based on walk-based connectivity patterns. Moreover, the paper introduces a feature augmentation technique that incorporates centrality scores as structural encodings.

The paper provides some theoretical analysis to support the use of Subgraph Centrality. In the experiments, the method overall matches or outperforms some other Subgraph GNNs and Graph Transformers on some benchmarks, and is more scalable.

Questions for Authors

Can you please address weaknesses (W3) and (W4)?

Claims and Evidence

(S1) The authors claim that their method overcomes the expressive limitations of MPNNs while remaining scalable. Section 3.2 provides interesting progress toward substantiating these claims by establishing a connection between Subgraph Centrality and perturbation analysis. (However, I think this analysis could be refined/strengthened, as I describe in W1-W3 below.)

Second, the paper claims that the proposed method outperforms other subgraph selection strategies and achieves performance comparable to or better than full-bag Subgraph GNNs while sampling only one or two subgraphs. I'm not fully convinced by this claim, as I elaborate in (W4) below.

(S2) Finally, the paper claims that the proposed approach is competitive with Graph Transformers and SOTA GNNs while being significantly more computationally efficient. I think this is reasonably substantiated by Tables 3 and 4.

Methods and Evaluation Criteria

Section 3.2 presents a combination of theoretical and empirical analyses to motivate the use of Subgraph Centrality. Observation 1 states that if a node has low centrality, training a GIN on its induced subgraph will have little impact on the GIN's prediction. The implication is that to improve expressivity, one should, at minimum, avoid selecting subgraphs centered around low-centrality nodes.

Next, Section 3.2 presents two experiments:

  1. Selecting subgraphs based on high-centrality nodes leads to large variations in the GIN's predictions.
  2. Subgraphs centered on high-centrality nodes contain a higher prevalence of structural motifs, such as triangles and 4-cycles.

Here are a few suggestions for developing this section:

(W1) The two experiments use very different types of graphs (real-world TU datasets for the first, Erdős–Rényi for the second). It's not clear why there is this discrepancy, and it would be helpful to standardize.

(W2) While the theoretical results rule out low-centrality nodes as good candidates, they don't strongly establish that choosing high-centrality nodes is necessarily optimal. This makes this analysis interesting but not entirely conclusive.

(W3) The discussion on why structural motifs are desirable could be expanded. The paper states that the “presence and number of structural ‘motifs’ are often related to graph-level tasks”, but this doesn't explain why subgraphs with more motifs are desirable for improving GNN performance.

Theoretical Claims

I have a question about observation 1: why don't node features come into play?

Experimental Design and Analysis

(W4) The paper introduces many related subgraph GNN methods in its citations, but the experiments primarily compare the proposed method against POLICY-LEARN, not including the other potentially relevant baselines. Additionally, POLICY-LEARN is omitted from Tables 3 and 4, while it is included in other results. It would be helpful to provide a more comprehensive evaluation or explain why methods were omitted.

Supplementary Material

I skimmed it and have nothing to comment on.

Relation to Prior Work

I think this paper would be of interest to the community interested in GNN expressivity, but I think the weaknesses I presented should be addressed first.

Missing Essential References

None that I'm aware of.

Other Strengths and Weaknesses

None beyond those I described above.

Other Comments or Suggestions

None.

Author Response

(W1) “The two experiments use very different types of graphs [...] it would be helpful to standardize.”

We used TU datasets for the perturbation analyses to align with [1], but we agree standardization would be preferable. As suggested, we re-ran them on the same ER graphs used in the subgraph-count correlation study. The results (link) are consistent with our prior findings and will be included in the new revision.


(W2) “[...] theoretical results rule out low-centrality nodes [...] don't strongly establish that choosing high-centrality nodes is necessarily optimal. [...] analysis interesting but not entirely conclusive.”

We agree that our approach is not necessarily optimal and, in fact, note that an optimal selection would also require access to targets and to account for learning. Our goal is to design an efficient subgraph selection that is run as preprocessing and achieves effective performance.

Although our approach lacks optimality guarantees, it:

  • is theoretically grounded via the connection to perturbation analyses (found “interesting”);
  • leads to consistent empirical gains with contained overhead.

Also, HyMN always performed at least as well as a learnt selection strategy, once more indicating its effectiveness. We think the above already provides an important contribution to the community, and believe further theoretical analyses of optimality warrant interesting future research.


(W3) “The discussion on why structural motifs are desirable could be expanded.”

Motif counts are widely recognized as important graph features: in Network Science, motif profiles have been used to compare and classify networks; the GNN community has also explored their utility, given the limitations of message passing in detecting them. A seminal paper [2] even proposed employing them as SEs, showing the significant expressivity and generalization boost they can provide.

Starting from these premises, and given that node-marked subgraphs enhance expressivity, we argue it is preferable to choose those subgraphs which, amongst others, better enable the model to capture subgraph counts. We quantify this by studying the correlations between counts and the mark-induced graph perturbations. Our results indicate they are maximized when marking based on the highest Subgraph Centrality (see Table 1). These findings – as well as our more general argument – find supporting evidence in the experimental results reported in Figure 3.


“[...] observation 1: why don't node features come into play?”

Because we have to consider the TMD not between two generic graphs, but between a graph and its node-marked version, where node features coincide.


(W4) “[...] experiments [...] not including the other potentially relevant baselines. [...] POLICY-LEARN is omitted from Tables 3 and 4 [...]”

In fact, Table 2 already includes additional efficient Subgraph GNN baselines beyond POLICY-LEARN: OSAN and MAG-GNN. These appear in a separate section of the table, which likely caused confusion. We will improve the layout in the next revision.

Either way, we agree that reporting the performance of other methods would strengthen our claim. In fact, molhiv is a reference test-bed for most full-bag methods. We will add these additional results:

Method        | molhiv
Reconstr. GNN | 76.32 ± 1.40
Nested-GNN    | 78.34 ± 1.86
DSS-GNN       | 76.78 ± 1.66
GNN-AK+       | 79.61 ± 1.19
SUN           | 80.03 ± 0.55
HyMN (T=2)    | 81.01 ± 1.17

HyMN competes with the best full-bag approaches with only two subgraphs.

As for non-full-bag Subgraph GNNs, we will additionally include results (ROC-AUC) for CS-GNN [3], which we found has been benchmarked on both molhiv and molbace. Note HyMN’s competitive performance compared to this method.

Method       | molhiv       | molbace
CS-GNN (T=2) | 77.72 ± 0.76 | 80.58 ± 1.04
HyMN (T=2)   | 81.01 ± 1.17 | 81.16 ± 1.21
CS-GNN (T=5) | 79.09 ± 0.90 | 79.64 ± 1.43
HyMN (T=5)   | 80.17 ± 1.40 | 80.64 ± 0.48

Peptides, MalNet (Tables 3, 4): POLICY-LEARN was not originally evaluated thereon. We attempted this using the authors’ code.

On MalNet-Tiny, memory usage exceeded our server’s capacity (1TB of CPU RAM) due to the full-bag materialization required by their implementation. Optimizations may be possible but were infeasible within the rebuttal window.

On Peptides, we obtained:

Method             | peptides-func (AP ↑) | peptides-struct (MAE ↓)
POLICY-LEARN (T=1) | 0.6336 ± 0.0071      | 0.2475 ± 0.0008
HyMN (T=1)         | 0.6857 ± 0.0055      | 0.2464 ± 0.0013
POLICY-LEARN (T=2) | 0.6459 ± 0.0018      | 0.2475 ± 0.0011
HyMN (T=2)         | 0.6863 ± 0.0050      | 0.2457 ± 0.0012

We finally refer to the response to Reviewer d9AX for additional results on Reddit-binary.


We hope this clarifies your concerns; we are of course open to elaborating further.

[1] https://arxiv.org/abs/2210.01906

[2] https://arxiv.org/abs/2006.09252

[3] https://arxiv.org/abs/2406.09291

Reviewer Comment

Thank you for the clarification on the experiments, this helps alleviate some of my concerns.

Review

Rating: 3

The paper proposes HyMN (Hybrid Marking Networks), a combination of walk-based structural encodings and centrality-based subgraph marking strategies. The goal is to design a Subgraph GNN which is simultaneously effective (sampled bags tend toward optimal ones) and efficient (few operations, no learnable components).

Questions for Authors

Claims and Evidence

C1: Centrality measures can be conceptualized as graph perturbations. Evidence: analysis based on the Chuang–Jegelka Tree Mover's Distance paper, plus experiments to support the hypothesis.

C2: A combination of SEs and centrality measures works well. Evidence: ablation proofs, in the sense that the two extend 1-WL in incomparable ways, plus experimental support for HyMN.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes. The arguments seem correct.

Experimental Design and Analysis

Yes. The experimental protocols are standard.

Supplementary Material

Yes, entirely.

Relation to Prior Work

The paper forms a solid contribution to the existing work on Subgraph Sampling. The nice result of Chuang and Jegelka forms the real engine of the paper, though.

Missing Essential References

Not that I know of.

Other Strengths and Weaknesses

Strengths

  1. The paper is extremely well-written. The presentation is thorough.
  2. Even though the paper is a bit light on theory, it cherry-picks the right ideas and puts them together nicely: experimental support is provided wherever necessary. It was quite comfortable to grasp the essential ideas behind the model.
  3. The idea of walk-centrality holds promise for spurring on further work in this area.

Weaknesses

  1. The use of both SEs and centrality-based subgraph sampling really makes HyMN a bit of a Swiss Army knife. This is not really a criticism, but it weakens the claims for the effectiveness of centrality marking as a standalone technique.

Other Comments or Suggestions

Author Response

We are very glad to note the reviewer found the paper “extremely well-written”, that it “picks the right ideas and puts them together nicely” and, importantly, that they believe the underpinning idea “holds promise for spurring on further work in this area”.

The reviewer expressed the view that the joint use of both SEs and centrality-based sampling may weaken the claim for the effectiveness of the latter as an approach on its own. We understand that the hybrid nature of our approach may lead to this perception, but we believe it is important to underscore the effectiveness of centrality marking alone.

Even without our positional encodings (CSEs), on the ogbg molecular benchmarks HyMN outperforms MPNN baselines and methods based on random subgraph sampling, while matching the performance of the learnable approach POLICY-LEARN (see Table 2). HyMN without CSEs also attains strong results on the larger MalNet dataset, achieving the highest average test score in the configuration T=1 when compared to other HyMN variants and transformer-based baselines (see Table 4).

We conclude by remarking, complementarily, that baseline methods like GIN and GCN, when augmented with CSEs, fail to match the full HyMN performance. We observe this in Table 3 on the peptides datasets, and on the molecular ogbg benchmarks as well. Indeed, we additionally ran GIN with CSEs on the latter, obtaining the following results, reported for reference alongside the performance attained by HyMN in the configuration T=2:

Model           | molhiv       | molbace      | moltox
GIN+CSEs        | 77.44 ± 1.87 | 76.58 ± 2.29 | 75.81 ± 0.39
HyMN (GIN, T=2) | 81.01 ± 1.17 | 81.16 ± 1.21 | 77.30 ± 0.35

We hope this clarifies the raised concern. We would kindly ask the reviewer to reconsider their scores in view of our response.

Final Decision

This paper presents a technique to accelerate Subgraph GNNs by incorporating walk-based centrality metrics to guide the subgraph sampling process. The approach includes a preprocessing step involving the computation of subgraph centrality scores, selection of the top‑T nodes, generation of structural encodings, and marking of these nodes for use in a full-graph GNN, all without explicit subgraph extraction.

During the rebuttal phase, the authors significantly strengthened their results in several ways, including additional experiments with the MolHIV dataset, GIN + CSE, the Reddit-Binary (RDT-B) dataset, and simple substructure-counting experiments to clarify the impact of sampling node-marked subgraphs with minimal Subgraph Centrality (SC) values. All these additions significantly reinforced the paper's claims and value.

There are three (in my opinion minor) limitations worth noting. First, the novelty may appear limited, given the incremental nature of the proposed approach. Second, the combination of subgraph sampling and structural features makes it difficult to isolate the key factors responsible for performance gains (although this was also partially clarified in the rebuttal). Finally, while selecting nodes with the highest centrality clearly outperforms random or low-centrality choices, the method still retains a heuristic flavor.

However, the contribution is notable, and the benefits clearly outweigh these concerns. The paper makes a solid contribution, and I thus recommend it be accepted.