Generalizing Weisfeiler-Lehman Kernels to Subgraphs
We propose WLKS, which extends the WL graph kernel to subgraphs and captures high-order structural similarities in k-hop neighborhoods, efficiently outperforming state-of-the-art GNNs.
Abstract
Reviews and Discussion
This work focuses on addressing challenges related to capturing arbitrary interactions both within and between subgraph structures. To provide a more expressive and efficient alternative, this work proposes WLKS, a Weisfeiler-Lehman (WL) kernel generalized for subgraphs by applying the WL algorithm on induced k-hop neighborhoods.
Strengths
S1. Subgraph learning is an undoubtedly valuable research direction in graph representation learning.
S2. The research in this manuscript is interesting, and the proposed framework is exciting in terms of both efficiency and effectiveness.
Weaknesses
W1. The analysis of existing challenges is unconvincing, and this manuscript lacks an analysis of existing works. The author identifies the main challenge as capturing arbitrary interactions between and within subgraph structures. However, the local-global interactive learning strategy has been well studied by GNN-AK[1].
W2. This manuscript lacks sufficient interpretability analysis of the proposed method. For instance, the right panel of Figure 1 only shows the final coloring result, leaving us without an understanding of the iterative process of color refinement.
W3. I argue this work lacks a fundamental understanding: the goal of using WL-based deep learning methods or traditional machine learning methods is to improve the distinguishability of isomorphisms that would otherwise be indistinguishable. However, I do not see this insight or explanation reflected in the work.
[1] From stars to subgraphs: Uplifting any GNN with local structure awareness, ICLR 2022.
Questions
Q1. What is the purpose of the discussion and analysis in Sec. 2.2? The main content of Sec. 2.2 is the proof of Proposition 2.2, which shows that the expressiveness of WLS^{k+1} is not strictly greater than that of WLS^k. However, in the subsequent analysis, the authors do not seem to propose a method for designing k based on this insight.
Q2. I am puzzled as to whether the design of k=0,D actually addresses the goal of capturing arbitrary interactions between and within subgraph structures, which is the stated research motivation of this work.
Thank you for your review and constructive feedback, and for acknowledging the significance and interest of our research. We provide answers to your individual questions below.
W1. The analysis of existing challenges is unconvincing, and this manuscript lacks an analysis of existing works. The author identifies the main challenge as capturing arbitrary interactions between and within subgraph structures. However, the local-global interactive learning strategy has been well studied by GNN-AK.
We appreciate the opportunity to clarify the main challenge and our contributions. In the next revision, we will describe the challenges that existing works face. We will include:
- SubGNN: While SubGNN employs message-passing within subgraphs, its reliance on ad hoc patch sampling and its separation of hand-crafted channels (e.g., position, neighborhood, structure) introduces complexity and potential suboptimality in information aggregation.
- GLASS: GLASS uses separate message-passing for node labels that distinguish internal and global structures. This can enhance expressiveness by mixing representations from local and global structures similar to WLKS. However, GLASS has a limited ability to handle multiple labels in batched subgraphs; thus, a small batch size is required for GLASS.
- S2N: S2N efficiently learns subgraph representations by compressing the global graph. However, this compression results in a loss of structural information and expressiveness in tasks where the global structure is important. In particular, since the approximation bound of S2N depends on how many subgraphs are spread out in the global graph, we cannot always expect a robust approximation.
- GNN-AK: GNN-AK generates a graph representation by aggregating the information of each node's locally induced encompassing subgraph. Although there are local and global interactions, there are fundamental differences from WLKS. First, GNN-AK is designed for graph-level tasks, so it models the interactions between the graph itself and its local subgraphs. Dealing with subgraph-level tasks is more challenging, since both the inside and the outside of the subgraph must be modeled; WLKS encodes them by using multiple k-hop kernels. Second, GNN-AK has a large complexity that depends on the total number of neighbors and the sum of edges between neighbors, so, unlike WLKS, it cannot be applied to a large global graph. In fact, the average number of nodes in the graphs covered by the GNN-AK paper is much smaller, ranging from 25 to 430. From these perspectives, our study takes a complementary approach to GNN-AK, addressing aspects not covered in their work.
We hope this description helps you understand our contributions. We will include this technical comparison in the next revision soon.
W2. This manuscript lacks sufficient interpretability analysis of the proposed method. For instance, the right panel of Figure 1 only shows the final coloring result, leaving us without an understanding of the iterative process of color refinement.
Regarding the lack of interpretability analysis, we would like to clarify that our primary focus is not on the model's interpretability. Our paper emphasizes the theoretical underpinnings of subgraph representations and efficient applications rather than the model's interpretability.
We understand the importance of clarity in presenting our method's intermediate steps. If the depiction of the iterative process in Figure 1 appears insufficient, we appreciate this perspective and are happy to improve upon it. Our revised paper now includes additional visualizations that explicitly detail the intermediate steps of the color refinement process.
W3. I argue this work lacks a fundamental understanding: the goal of using WL-based deep learning methods or traditional machine learning methods is to improve the distinguishability of isomorphisms that would otherwise be indistinguishable. However, I do not see this insight or explanation reflected in the work.
We would like to address your concern about the lack of focus on isomorphism distinguishability in our work. While we agree that it plays a foundational role in WL-based models, our work does not merely aim to improve the ability to distinguish non-isomorphic structures; rather, it explores WL histograms as subgraph similarity measures situated within the WL kernel framework.
Specifically, WL kernels (Shervashidze et al., 2009) compute graph similarity using WL algorithm outcomes. This model does not strictly rely on the distinguishability of isomorphisms but instead on histogram-based representations of subtree patterns to quantify the similarity between graphs. WL kernels allow for capturing similarities even when graphs are not strictly identical (or isomorphic) (Shervashidze et al., 2011). Aligning with this kernel perspective, our WLKS focuses on leveraging the WL histogram as a similarity measure between a pair of subgraphs, not on testing isomorphic structures.
In addition, isomorphism distinguishability is what theoretically characterizes the expressive power in Proposition 2.2. It might be overly restrictive in practical scenarios where graphs do not need to be topologically identical, but it serves as a theoretical framework for characterizing expressiveness. Our analysis implies that the WLS^k histogram cannot represent all information on a smaller k'-hop structure (k' < k) from the perspective of graph isomorphism. Here, we would like to emphasize that this provides the insight behind linearly combining kernels from multiple values of k (Equation 3).
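To make this kernel perspective concrete, the following is a minimal, self-contained sketch (not the authors' implementation; the adjacency-dict graph representation, the hash-based color relabeling, and the toy graph at the bottom are illustrative assumptions): it computes WL color histograms on induced k-hop neighborhoods, turns them into kernel matrices via histogram inner products, and linearly combines kernels across several values of k in the spirit of Equation 3.

```python
from collections import Counter
import numpy as np

def k_hop_nodes(adj, seed_nodes, k):
    """Nodes within k hops of a seed node set (BFS over an adjacency dict)."""
    frontier, visited = set(seed_nodes), set(seed_nodes)
    for _ in range(k):
        frontier = {u for v in frontier for u in adj[v]} - visited
        visited |= frontier
    return visited

def wl_histogram(adj, nodes, iterations=3):
    """Cumulative WL color histogram of the subgraph induced on `nodes`."""
    colors = {v: 0 for v in nodes}                      # uniform initial colors
    hist = Counter(colors.values())
    for _ in range(iterations):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v] if u in nodes))))
            for v in nodes
        }
        hist.update(colors.values())                    # concatenate per-iteration counts
    return hist

def kernel_matrix(adj, subgraphs, k, iterations=3):
    """K[i, j] = inner product of WL histograms over induced k-hop neighborhoods."""
    hists = [wl_histogram(adj, k_hop_nodes(adj, S, k), iterations) for S in subgraphs]
    n = len(hists)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = sum(hists[i][c] * hists[j][c] for c in hists[i])
    return K

def combined_kernel(adj, subgraphs, ks=(0, 2), weights=None):
    """Linear combination of kernels over multiple k values (spirit of Equation 3)."""
    weights = weights if weights is not None else [1.0] * len(ks)
    return sum(w * kernel_matrix(adj, subgraphs, k) for w, k in zip(weights, ks))

# Toy global graph (undirected adjacency dict) and two subgraph node sets.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(combined_kernel(adj, [{0, 1}, {4, 5}], ks=(0, 2)))
```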
We appreciate your critique and understand that further highlighting these distinctions and their implications can strengthen our presentation. We revised our manuscript to explicitly articulate the differences between WL isomorphism testing and WL kernels, and to clarify how our method builds upon them.
Q1. What is the purpose of the discussion and analysis in Sec. 2.2? The main content of Sec. 2.2 is the proof of Proposition 2.2, which shows that the expressiveness of WLS^{k+1} is not strictly greater than that of WLS^k. However, in the subsequent analysis, the authors do not seem to propose a method for designing k based on this insight.
We understand how our explanation may have been unclear and would like to clarify. Proposition 2.2 highlights the limitations of the WLS histogram in representing structural information for a smaller k′-hop structure (k′ < k) from a graph isomorphism perspective. In the subsequent paragraphs (lines 193–201), we have discussed how this insight motivates the design of WLKS-K, which combines WLS histograms across multiple values of k to encode structural information at various levels effectively.
We think that this connection may not have been sufficiently explicit, and we revised the manuscript to emphasize this more clearly.
References
- Shervashidze et al., “Fast subtree kernels on graphs”, NeurIPS 2009
- Shervashidze et al., “Weisfeiler-Lehman Graph Kernels”, JMLR 2011
Q2. I am puzzled as to whether the design of k=0,D actually addresses the goal of capturing arbitrary interactions between and within subgraph structures, which is the stated research motivation of this work.
We appreciate the opportunity to address this point and clarify our motivations and methods. This rebuttal provides a detailed explanation of how this design aligns with our stated research motivation.
While it may seem that using k=0 and D oversimplifies the structural representations, our approach is grounded in the following rationales.
- At k=0, the focus is on the internal structure of the subgraph, which allows the model to capture localized patterns and interactions within subgraphs without the additional complexity of extending neighborhoods.
- At k=D, by leveraging the stabilized coloring from the WL algorithm, the WL histogram of k=D represents tree patterns spanning the global graph. This process is analogous to message-passing across long-range interactions in a very deep GNN. As a result, our approach enables the representation of interactions between subgraphs within the global context.
- Combining these kernel matrices of k=0 and D allows us to leverage the complementary strengths of local (within) and global (between) structural information, which is consistent with our goal of capturing rich structural interactions.
- This is clearly demonstrated in experiments on the Coreness dataset. Coreness serves as a benchmark for predicting the average core number of nodes within a subgraph. Accurate prediction of coreness requires the ability to model internal structures, global structures, and subgraph positions. Our experiments show that combining the kernels of k=0 and k=D achieves superior performance compared to using either kernel independently.
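As an illustration of how such a combined kernel can be consumed downstream, here is a small sketch using a precomputed-kernel SVM from scikit-learn; the random histograms, labels, and the SVM itself are stand-ins for demonstration, not the exact classifier or data used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for WLKS kernel matrices at k = 0 and k = D, built from fake histograms.
H0 = rng.poisson(2.0, size=(200, 30)).astype(float)
HD = rng.poisson(2.0, size=(200, 30)).astype(float)
K0, KD = H0 @ H0.T, HD @ HD.T
y = rng.integers(0, 2, size=200)                       # fake binary subgraph labels

K = K0 + KD                                            # unweighted combination for simplicity
idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.2, random_state=0)

clf = SVC(kernel="precomputed")
clf.fit(K[np.ix_(idx_tr, idx_tr)], y[idx_tr])          # train-vs-train kernel block
acc = clf.score(K[np.ix_(idx_te, idx_tr)], y[idx_te])  # test-vs-train kernel block
print(f"toy accuracy: {acc:.2f}")
```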
Thanks again for pointing this out and we revised our paper to highlight it.
Dear Reviewer UnAb, thank you for your reviewing efforts. We uploaded the revised paper based on your feedback. The revised parts are highlighted in red. We summarize the modified parts below:
- We added Appendix A to provide a formal, detailed comparison of WLKS with state-of-the-art baselines and representative related work, including SubGNN, GLASS, S2N, and GNN-AK. In addition, we moved the ‘Related Work’ section earlier in the paper.
- In addition to Figure 1, Appendix C includes additional visualizations that explicitly detail the intermediate steps of the color refinement process.
- In Appendix B and Section 2, we explicitly articulated the differences between WL isomorphism testing and WL kernels and clarified how our method builds upon them.
- In Section 3.3, we revised the text to make a clear connection between the insight of Proposition 3.2 and the design of WLKS-K.
- In Section 6, we provided a detailed explanation of how the model design aligns with our stated research motivation.
We kindly ask if our revised paper has addressed your concerns as the discussion period comes to an end. If there is no remaining concern, we would greatly appreciate it if you could kindly consider updating the score accordingly.
Thanks for your efforts, I will update my score.
Thank you for reconsidering your rating of our work and for the valuable feedback provided. We would appreciate any additional suggestions or questions where you believe further improvements are necessary. We remain open to further discussion and will address any concerns to enhance our paper.
The paper addresses the challenge of effectively representing subgraphs in learning tasks by proposing a new approach called the Weisfeiler-Lehman Kernel for Subgraphs (WLKS), which adapts the classic WL algorithm for graph-level tasks to handle subgraph-level tasks by utilizing induced k-hop neighborhoods. By combining kernel matrices from different k-hop levels, WLKS captures complex subgraph interactions without relying on neighborhood sampling. The model achieves similar or better classification performance while reducing training time significantly on various datasets.
Strengths
- WLKS improves subgraph-level task performance by encoding both internal subgraph structures and their k-hop neighborhoods, offering a more expressive alternative to traditional GNNs.
- WLKS achieves superior performance with less training time, outperforming several state-of-the-art models without requiring GPUs, pre-training, complex hyperparameter tuning, or extensive pre-computation.
- WLKS can be integrated into GNN frameworks (like S2N) as an adjacency matrix, allowing users to benefit from both the structural expressiveness of kernels and the feature-rich nature of GNNs.
- The paper is generally well-written and easy to follow.
Weaknesses
- WLKS depends on the node coloring algorithm, making it less suitable for datasets where nodes or edges have continuous feature values, which are challenging to map to discrete color values. It would be beneficial for the authors to discuss potential adaptations of WLKS for handling continuous node and edge features, or to clarify if existing techniques could address this limitation.
- The selection of k in WLKS is somewhat limited, focusing solely on the original subgraph and the entire global graph, which may be insufficient for larger, more complex datasets where intermediate k values could capture essential substructures. This is especially relevant given that most datasets in the study are not large (as shown in Table 1), raising concerns about how well this approach will generalize to larger graphs. It would be beneficial for the authors to provide theoretical or empirical justification for why using only k=0 and k=D is sufficient. Additionally, the authors could include experiments on larger graphs to demonstrate the scalability of this approach.
- WLKS leverages a basic WL color histogram on subgraphs and a similar histogram within the global graph. This approach, while efficient, may result in a loss of structural information (e.g., datasets requiring detailed interactions between subgraphs).
- The paper lacks a comprehensive discussion comparing WLKS to baseline models. For instance, although WLKS outperforms baselines such as SubGNN and GLASS on several benchmarks (Table 2), the paper does not explain clearly why WLKS might be more expressive in specific contexts (e.g., powerful in terms of distinguishing isomorphic graphs). It would be beneficial for the authors to provide a detailed analysis of key examples where WLKS outperforms the baselines, highlighting the structural properties that WLKS captures more effectively in these cases.
Questions
- Could you provide further discussion on the selection of k? Specifically, does the optimal k value depend on the size or sparsity of the graph/subgraph?
- The paper does not employ labeling tricks, such as those used in GLASS. Could you clarify the reasoning behind this choice, particularly since distance-based labeling could be effectively integrated with WL kernels?
Thank you for your review and constructive feedback. We appreciate that you acknowledge our paper’s performance improvement, efficiency, adaptability to GNNs, and clarity. In this response, we provide answers to each point in the review.
Integration with continuous features
WLKS depends on the node coloring algorithm, making it less suitable for datasets where nodes or edges have continuous feature values, which are challenging to map to discrete color values.
While WLKS leverages a node coloring algorithm to encode structural information, continuous features can be seamlessly incorporated into our method by encoding feature similarities into the kernel matrix.
- We can linearly combine multiple kernel matrices (as in Equation 3), which can encode both structural (via node coloring) and continuous features.
- We can use the continuous Weisfeiler-Lehman operator (Togninalli et al., 2019) on node and edge features to compute the kernel matrix for WLKS.
- As in Subsection 2.5, WLKS for S2N can integrate both the expressiveness of WLKS for structures and that of GNNs for features.
It seems that our paper did not sufficiently discuss WLKS with continuous features. To address this concern, we included a section on methods for incorporating continuous features, which clarifies that WLKS can effectively leverage both structural and feature-based information.
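Purely as an illustration of the first bullet above (the mean-pooled feature matrix, the RBF feature kernel, and the mixing weight are our assumptions, not the paper's exact design), a structural WLKS kernel could be combined with a feature-based kernel as follows:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)

# Stand-ins: K_struct for a WLKS kernel matrix; X for continuous node features
# mean-pooled over each subgraph (shape: num_subgraphs x feature_dim).
H = rng.poisson(2.0, size=(100, 20)).astype(float)
K_struct = H @ H.T
X = rng.normal(size=(100, 16))

K_feat = rbf_kernel(X)                       # or a simple linear kernel: X @ X.T
alpha = 0.5                                  # mixing weight, a tunable hyperparameter
K = alpha * K_struct + (1 - alpha) * K_feat  # linear combination as in Equation 3
```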
More discussion on the selection of k
The selection of k in WLKS is somewhat limited, focusing solely on the original subgraph and the entire global graph, which may be insufficient for larger, more complex datasets where intermediate k values could capture essential substructures. This is especially relevant given that most datasets in the study are not large (as shown in Table 1), raising concerns about how well this approach will generalize to larger graphs.
Could you provide further discussion on the selection of k? Specifically, does the optimal k value depend on the size or sparsity of the graph/subgraph?
We sincerely thank the reviewer for the insightful questions. Below, we address the concerns and clarify the selection of k in WLKS and its implications for larger, more complex graphs and subgraphs.
- Similar to the number of layers in GNNs, the selection of k can be treated as a hyperparameter that is empirically optimized for a particular task. While we chose 0 and D for efficiency, we agree that intermediate k values might capture significant substructures in some scenarios. Specifically, for global graphs with small diameters, small k can yield outcomes similar to k=D (the diameter as an upper bound). Hence, we recommend k = 0 and D as a starting point for such data. For graphs with large diameters, incorporating domain knowledge about the data and task may guide the exploration of intermediate k values.
- We respectfully disagree with the conjecture that structural properties (size, sparsity, or density) determine the optimal k. Instead, we argue that the optimal k is better associated with the task-specific level of structure. In our empirical evaluation, size, global density, and average subgraph density (reported in the table below) did not show meaningful correlations with per-k task performance. This suggests that size or density is not a key factor in choosing the optimal k.
| | PPI-BP | HPO-Neuro | HPO-Metab | EM-User | Density | Cut-Ratio | Coreness | Component |
|---|---|---|---|---|---|---|---|---|
| Density of G | 0.0022 | 0.0304 | 0.0304 | 0.0028 | 0.0024 | 0.0067 | 0.0095 | 0.0002 |
| Average density of S | 0.216 | 0.757 | 0.767 | 0.010 | 0.232 | 0.945 | 0.219 | 0.150 |
We will include the above discussions and results in the revised paper.
References
- Togninalli et al., “Wasserstein Weisfeiler-Lehman Graph Kernels”, NeurIPS 2019
A loss of structural information
WLKS leverages a basic WL color histogram on subgraphs and a similar histogram within the global graph. This approach, while efficient, may result in a loss of structural information (e.g., datasets requiring detailed interactions between subgraphs).
We appreciate your recognition of the efficiency of our method. Here, we would like to address the concern about the loss of structural information in our method.
We agree that some structural information may be lost in WLKS. However, we argue that the information retained by our method is highly task-relevant. Our empirical results are clear evidence that WLKS encodes the structural information required for real-world and synthetic tasks better than state-of-the-art deep GNNs. Moreover, we would like to note that structural information loss is a fundamental challenge shared by all graph representation learning methods that encode structures into a fixed-length vector. Our approach generates representations that outperform existing deep neural methods, even with the simplicity of a WL color histogram.
We value your critique and discussed it in our revised paper.
Comprehensive discussion comparing with baselines
We appreciate the opportunity to discuss our contributions in comparison with existing models.
The paper lacks a comprehensive discussion comparing WLKS to baseline models. For instance, although WLKS outperforms baselines such as SubGNN and GLASS on several benchmarks (Table 2), the paper does not explain clearly why WLKS might be more expressive in specific contexts (e.g., powerful in terms of distinguishing isomorphic graphs).
First, we agree that a comprehensive discussion comparing WLKS to baseline models can strengthen our paper. We will add paragraphs on this in the next revision, including technical comparisons and reasoned explanations of the performance improvements.
- Performance improvement over SubGNN: While SubGNN employs message-passing within subgraphs, its reliance on ad hoc patch sampling and its separation of hand-crafted channels (e.g., position, neighborhood, structure) introduces complexity and potential suboptimality in information aggregation. In contrast, WLKS enhances expressiveness by leveraging a simple yet effective structural representation based on the theoretical rationale that structures at multiple levels are important.
- Performance improvement over GLASS: GLASS uses separate message-passing for node labels that distinguish internal and global structures. This can enhance expressiveness by mixing representations from local and global structures similar to WLKS. However, GLASS has a limited ability to handle multiple labels in batched subgraphs; thus, a small batch size is required for GLASS. WLKS provides a generalized framework to represent fine-grained levels of structures around subgraphs, which can process multiple subgraphs efficiently by leveraging kernel methods.
- Performance improvement over S2N: S2N efficiently learns subgraph representations by compressing the global graph. However, this compression (including sparsification) results in a loss of structural information and expressiveness in tasks where the global structure is important. In particular, since the approximation bound of S2N depends on how many subgraphs are spread out in the global graph, we cannot always expect a robust approximation. In contrast, WLKS does not rely on lossy compression and can yield informative representations.
Thanks again for this feedback, and we included these explanations in Appendix A of the revised paper.
The paper does not employ labeling tricks, such as those used in GLASS. Could you clarify the reasoning behind this choice, particularly since distance-based labeling could be effectively integrated with WL kernels?
We understand that node labeling has been effective in other contexts and it can be integrated into the kernel methods (i.e., aggregating node labels to a histogram). However, we believe that our current design choice may not benefit from existing node labeling techniques.
Node labeling encodes information about a node's membership within a subgraph (GLASS) or its distance to the target structure (distance-based). When aggregated into histograms, GLASS's label-histograms capture distributions of sizes of the subgraph and its neighborhoods, and distance-based label-histograms capture the distribution of distances to the target within an enclosing subgraph. While such distributions are somewhat informative, size and distances represent coarser structural summaries compared to the fine-grained subtree patterns encoded by WL color histograms. The expressive power of WLKS arises from its ability to capture these detailed structural features across multiple k-hop neighborhoods, obviating the need for additional node-level labeling.
We included this discussion in Appendix A of the revised paper.
Regarding your comment: "While such distributions are somewhat informative, size and distances represent coarser structural summaries compared to the fine-grained subtree patterns encoded by WL color histograms," I would argue otherwise. The message-passing mechanism, when augmented with additional distance information, is provably more expressive than the basic message-passing approach [4], which is as expressive as the WL color histograms under certain conditions [1]. Therefore, I believe that incorporating distance information should indeed enhance the model's expressive power.
We agree with you that additional distance information can enhance the expressiveness of message-passing mechanisms under certain conditions. Our previous rebuttal does not mean that this distance information is not useful, and we acknowledge that our earlier response may have led to some misinterpretations.
After revisiting the literature on labeling techniques for graph neural networks, we want to argue that existing methods can have different effectiveness for subgraph-level tasks and kernel-based methods:
- GLASS’s zero-one labeling: This labeling (assigning 0 to internal nodes and 1 to external nodes) shows limited expressiveness when aggregating labels to histograms as kernel inputs. Its histogram is represented as a length-2 vector (0 or 1), which only counts the number of nodes inside and outside the subgraph, thereby omitting finer structural details.
- SEAL’s double-radius node labeling [1]: SEAL computes distances with respect to target structures (e.g., links) and can be applicable to k-hop neighborhoods of subgraphs but computationally challenging. While efficient for link prediction tasks due to the smaller size of enclosing subgraphs, extending this approach to general subgraphs becomes infeasible due to the computational overhead of calculating all pairwise distances.
- Distance Encoding (DE) [2] and Random Walk Structural Encoding (RWSE) [3]: DE uses landing probabilities of random walks from nodes in the node set to a given node. To reduce computational costs, we use RWSE as the DE, which uses the diagonal elements of random walk matrices (walk lengths of 1 to 64). We linearly combine WLKS-{0, D} with an inner-product kernel of the DE sum-aggregated per subgraph, yielding a significant performance boost on the Cut-Ratio dataset (from 60.0 to 96.0). However, this improvement was not observed across the other seven benchmarks we tested.
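For concreteness, a minimal sketch of the RWSE variant described above (diagonals of random-walk matrix powers, sum-aggregated per subgraph, then linearly combined with a WLKS kernel); the toy graph, subgraph node sets, shortened maximum walk length, and the identity placeholder for the WLKS kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected global graph and a few subgraph node sets.
n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
subgraphs = [rng.choice(n, size=6, replace=False) for _ in range(8)]

# Row-normalized random-walk matrix and its diagonal entries for each walk length.
deg = np.maximum(A.sum(1), 1.0)
P = A / deg[:, None]
max_len = 8                                  # the rebuttal uses walk lengths up to 64
rwse = np.empty((n, max_len))
Pt = np.eye(n)
for t in range(max_len):
    Pt = Pt @ P
    rwse[:, t] = np.diag(Pt)                 # return probability after t + 1 steps

# Sum-aggregate node encodings per subgraph, then take an inner-product kernel.
Z = np.stack([rwse[S].sum(axis=0) for S in subgraphs])
K_de = Z @ Z.T

# Linearly combine with a (placeholder) WLKS kernel matrix.
K_wlks = np.eye(len(subgraphs))
K = K_wlks + K_de
```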
Based on your feedback, we now recognize that node labelings can empirically enhance the performance of WLKS. However, the open questions of which specific node labeling methods are effective, which aggregations of node labels are effective, and which kernels (e.g., linear, polynomial, RBF) best complement node labels remain unresolved. While these questions fall outside the immediate scope of our current work, we agree that their investigation could contribute valuable insights to the broader field.
We incorporated these discussions into our paper. Thank you again for your valuable input, which has significantly enriched our work.
[1] Link Prediction Based on Graph Neural Networks
[2] Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning
[3] Graph Neural Networks with Learnable Structural and Positional Representations
We hope these comments resolve any remaining concerns. If you have additional questions or concerns, we would be happy to discuss them.
I appreciate the authors' detailed responses to my earlier comments. However, I respectfully disagree with some points raised in their replies. Below, I outline my concerns in detail:
- In most empirical studies investigating the theoretical capabilities of models—such as their ability to differentiate non-isomorphic graphs [1], detect bi-connectivity [3], or count substructures [2]—the models’ performance on synthetic datasets designed to evaluate their expressive power typically aligns with their theoretical results. While these models might not perform optimally on real-world benchmarks (potentially due to issues like suboptimal implementation, as you mentioned in the "Comprehensive discussion comparing with baselines"), their results on synthetic benchmarks generally reflect their theoretical strengths as these specially designed benchmarks. However, in your synthetic benchmarks, the baseline models fail to perform as expected. Could you please clarify this discrepancy?
- Regarding your comment: "While such distributions are somewhat informative, size and distances represent coarser structural summaries compared to the fine-grained subtree patterns encoded by WL color histograms," I would argue otherwise. The message-passing mechanism, when augmented with additional distance information, is provably more expressive than the basic message-passing approach [4], which is as expressive as the WL color histograms under certain conditions [1]. Therefore, I believe that incorporating distance information should indeed enhance the model's expressive power.
[1] Xu et al. How powerful are graph neural networks? ICLR 2019.
[2] Chen et al. Can graph neural networks count substructures? NeurIPS 2020.
[3] Zhang et al. Rethinking the Expressive Power of GNNs via Graph Biconnectivity. ICLR 2023.
[4] Zhang et al. Labeling trick: A theory of using graph neural networks for multi-node representation learning. NeurIPS 2021.
In most empirical studies investigating the theoretical capabilities of models—such as their ability to differentiate non-isomorphic graphs [1], detect bi-connectivity [3], or count substructures [2]—the models’ performance on synthetic datasets designed to evaluate their expressive power typically aligns with their theoretical results. While these models might not perform optimally on real-world benchmarks (potentially due to issues like suboptimal implementation, as you mentioned in the "Comprehensive discussion comparing with baselines"), their results on synthetic benchmarks generally reflect their theoretical strengths as these specially designed benchmarks. However, in your synthetic benchmarks, the baseline models fail to perform as expected. Could you please clarify this discrepancy?
The synthetic benchmarks proposed in the SubGNN paper require models to predict target subgraphs’ structural properties, such as the density, cut ratio, average core number, and number of components. When creating datasets, these properties are transformed into labels by binning them into 2 or 3 classes to form a classification task. These classes are divided by quantiles, and there are no clear boundaries. As such, models that can cover these benchmarks must demonstrate not only the expressive power to predict these graph properties but also the ability to generalize from imperfect signals.
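To illustrate the labeling scheme described above, a tiny sketch of quantile-based binning of a continuous subgraph property into classes; the property values and the three-way split are synthetic stand-ins, not the actual SubGNN dataset construction.

```python
import numpy as np

rng = np.random.default_rng(0)
density = rng.random(1000)               # stand-in for a per-subgraph property value

# Bin into 3 classes by quantiles: boundaries at the 1/3 and 2/3 quantiles.
edges = np.quantile(density, [1 / 3, 2 / 3])
labels = np.digitize(density, edges)     # class indices in {0, 1, 2}
```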
While GLASS has been theoretically proven to capture these structural properties (Proposition 2 in their paper), it does not achieve 100% accuracy due to the inherent challenges of generalizing from noisy information. It is also worth noting that baselines such as SubGNN and S2N do not have strict theoretical guarantees for learning these properties. Consequently, their performance on these synthetic benchmarks reflects the absence of such guarantees.
We hope this explanation addresses your concerns. Thank you for bringing up this important point.
Dear Reviewer iDGm.
Thank you again for your constructive feedback and engagement on our paper. Based on your comments, we were able to improve our work. We uploaded the revised paper, and the revised parts are highlighted in red. We summarize the modified parts based on your review below:
- We added Appendix A to provide a formal, detailed comparison of WLKS with state-of-the-art baselines and representative related work. In addition, we moved the 'Related Work' section earlier in the paper. Following your comment, we also discussed node labeling (in GLASS and distance-based methods) for WLKS in Appendix A.
- In Sections 4 and 6 (Table 5), we introduced WLKS models incorporating continuous features and their results on all benchmarks.
- In Section 6, we discussed the relation between the selection of k and the graph properties, including the size and the sparsity (or density).
- In Section 6, we discussed the potential loss of structural information in our method.
- In Section 6 and Appendix D, we discussed the combination with labeling tricks (distance or structural encoding) and presented the improved performance.
We would like to ask whether our response and revised paper address all your concerns and questions. If they have been resolved, could you raise the score to reflect your decision?
I thank the authors for their detailed feedback and will raise my score accordingly.
We appreciate your careful consideration of our rebuttal and your updated evaluation of our work. Your insights and feedback were valuable in refining our paper. Please let us know by comment if you have any further questions or thoughts. We are happy to discuss more.
The authors tackle the problem of subgraph representation learning and propose a simple yet effective method based on Weisfeiler-Leman kernels. The idea is quite straightforward: run the WL algorithm over k-hop neighbourhoods around subgraphs of interest, collect the output colour histograms, and define kernels based on those. The authors show that choosing different values of k together is beneficial from a discriminative point of view, and propose a specific choice that guarantees a good trade-off between complexity and performance. Experiments validate the effectiveness of this trade-off and discuss additional analyses on the choice of k values and additional hyper-parameters. The authors additionally discuss hybrid approaches that combine their kernel with Graph Neural Network architectures, showing these can be beneficial depending on the relative importance of feature information w.r.t. structure.
Strengths
- The paper is well written.
- The method is simple: it is easy to understand while offering a compelling trade-off between complexity and performance.
- Different aspects of the method are empirically analysed, while a couple of theoretical insights are offered.
- The combination of the approach with GNN models feels slightly arbitrary, but nonetheless it can already provide initial insights on the relative importance of feature information, and shows how the approach is compatible with other methods.
Weaknesses
- The manuscript would benefit from a preliminary section that would more formally describe the problem setting of subgraph representation learning, and the shortcomings of state-of-the-art approaches which the authors seek to overcome.
- The paper lacks a more detailed, formal comparison with other state-of-the-art approaches. From a methodological standpoint, beyond experimental results, how is the method different than other approaches such as SubGNN? Is it capturing some information that this last cannot? What are, reasonably, the methodological components and technical features that lead WLKS to outperform them? A better contextualisation in terms of technical components and whether they are captured or not by other methods would also help readers chart a landscape of subgraph representation methods. Expanding Section 5 would go in this direction.
- The manuscript lacks a more direct discussion on the kind of information WLKS is employing. Is the method completely structural? How can features be included? Is the approach in subsection 2.5 the only way to do so?
- The proof for Proposition 2.2 does not seem complete. Essentially, it builds upon assumptions that are, themselves, proving most of the claim already (see “Assume that we have […]”, line 181 and similarly in line 187).
- I believe that part of the proof should consist in showing these assumption effectively hold. The examples exhibited in Figure 2 already make a step in this direction, but the authors should better and fully formalise this. For example, technically, the subgraphs should be in the same graph.
- Related to the above, I believe it would also help if the authors were more explicit in how quantification is made over k. Is the claim for any possible k? In that case, for example, one should show the existence of counterexamples for any possible k, perhaps with an inductive construction. Incidentally, exemplary pairs beyond k=0 could be more informative for readers.
- Line 189 requires checks: the equation seems to be specific for k=0? Is this wanted?
Questions
- What does the “Total training time” refer to exactly for WLKS? What does this term include?
- What about inference run-time? Can the authors provide some figures on that and discuss this in comparison to other approaches?
- Line “418”: what does “[…], which are universal for subgraphs […]” mean exactly?
Also, see “Weaknesses”. I am willing to improve my score after insightful clarifications / additions by the authors.
Thank you for your quick response. We revised the paper reflecting your comment, and it would be nice if the reviewer could check it. We are happy to discuss any points that could improve our paper.
(Specifically for the new proof of Equation 1) This is not necessarily the case, because 1-WL is not complete. I would rather point out the existence of (k+1)-hop neighborhoods such that their T>L colouring is distinct, e.g., possibly referring to the example in the figure.
Thank you for your thoughtful feedback. We appreciate your careful attention to the nuances of the argument. You are correct that the current statement could lead to misunderstandings, and we apologize for any confusion caused.
To address your concern, we revised the paper to describe the existence of two subgraphs with strictly identical (k)-hop structures (identical colorings) and 1-WL distinguishable (k+1)-hop structures (distinct colorings).
Specifically, the updated sentences are: "Consider two subgraphs S_1 and S_2 of a global graph G such that their k-hop neighborhoods are isomorphic, but their (k+1)-hop neighborhoods have non-identical subtree patterns of height T rooted at the subgraphs. That is, within the k-hop radius, S_1 and S_2 have identical structures, but beyond that, their structures are distinguishable by the 1-WL algorithm (i.e., distinct colorings)."
We have also uploaded the revised version of the paper reflecting these clarifications. We hope this resolves your concerns, but please do let us know if there are any remaining ambiguities or further suggestions.
(As for the proof of Equation 2) Shouldn't the equation refer to k (and not k+1) and convey that the colouring is instead different? Is that a typo?
You are correct that the equation should refer to k rather than k+1 and convey the difference. I appreciate your careful reading. It was indeed a typo, and it is likely due to an oversight when copying and pasting the phrase during the proof-writing process. I revisited the proof and corrected it.
(As for the proof of Equation 2) I would check for typos in the subscripts (a,b vs. 1,2).
I appreciate your opinion, and I understand how the use of different subscripts (a, b vs. 1, 2) might be confusing. However, the distinction was intentional to emphasize their different roles: subscripts 1 and 2 represent any pair of indices in the Proposition, while subscripts a and b denote particular instances in an example. This differentiation was made to avoid ambiguity in the proof, but I see how this choice could potentially lead to misunderstanding. To address this, I changed all subscripts (a, b) to (1, 2).
Restructuring the Proposition 2.2 (Now 3.2 in the revised paper)
The proof for Proposition 2.2 does not seem complete. Essentially, it builds upon assumptions that are, themselves, proving most of the claim already
I believe that part of the proof should consist in showing these assumption effectively hold. … technically, the subgraphs should be in the same graph.
I believe it would also help if the authors were more explicit in how quantification is made over k. Is the claim for any possible k? In that case, for example, one should show the existence of counterexamples for any possible k, perhaps with an inductive construction. Incidentally, exemplary pairs beyond k=0 could be more informative for readers.
Line 189 requires checks: the equation seems to be specific for k=0? Is this wanted?
We appreciate your careful reading and valuable feedback on the Proposition. The rewritten proof based on your feedback is now in the revised paper, and here we explain the changes to address your concerns.
Considering your feedback, we formalized our argument as a proof by contradiction in the revised paper. The core of the argument is that the (non-)equivalence of histograms in the (k+1)-hop neighborhood does not guarantee (non-)equivalence in the k-hop neighborhood. We believe this is more appropriately framed as a proof by contradiction rather than an inductive construction.
In addition, you are correct in noting that the proof could benefit from a clearer discussion of quantification over k. Our claim holds for any k, and we reflected this in the new proof by rewriting it in a general formalization.
Please note that the Figure chooses a simple case for clear visualization, and counterexamples for any value of k can be constructed. Reflecting your comment on the Figure (i.e., that the subgraphs should be in the same graph), we replaced the examples and illustrated a more general case.
More efficiency analysis
What does the “Total training time” refer to exactly for WLKS? what does this term include?
Total training time is the time taken for the entire training process including data and model loading, training steps, validation steps, and logging. Note that this metric does not include the pre-computation or embedding pretraining required in state-of-the-art GNN-based models, so the actual training of WLKS is more efficient.
Thank you for your question, and we added this description in the revised paper.
What about inference run-time? Can the authors provide some figures on that and discuss this in comparison to other approaches?
We ran experiments to measure the 1-epoch inference time (in seconds) on the validation set for WLKS and the baselines. This metric measures the time it takes to load the data and model and run inference for one epoch on the validation set. The table below shows the results; WLKS is efficient at inference as well.
| Model | PPI-BP | HPO-Neuro | HPO-Metab | EM-User |
|---|---|---|---|---|
| SubGNN | N/A | 432.9 | 257.1 | 35.8 |
| GLASS | 8.2 | 27.0 | 26.4 | 39.0 |
| S2N+0 (GCNII) | 9.9 | 9.3 | 8.3 | 14.6 |
| S2N+A (GCNII) | 8.4 | 11.1 | 9.6 | 13.4 |
| WLKS-{0,D} | 1.0 | 11.7 | 2.7 | 1.8 |
We added this result in Table 3.
Questions
Line “418”: what does “[…], which are universal for subgraphs […]” mean exactly?
It means that recently proposed deep GNNs can universally operate on any subgraphs without specific assumptions or domain-specific constraints, unlike early methods (lines 414 – 417). In the revised paper, we changed this sentence to “Recent deep graph neural networks designed for subgraph-level tasks can apply to any subgraph type without specific assumptions.”
I thank the authors for their response.
Specifically for the new proof of Equation 1
However, because their (k + 1)-hop neighborhoods differ, the subtree patterns of height-T rooted at the subgraphs captured by WLS^{k+1} will differ
This is not necessarily the case, because 1-WL is not complete. I would rather point out the existence of (k+1)-hop neighborhoods such that their T>L colouring is distinct, e.g., possibly referring to the example in the figure.
As for the proof of Equation 2
the local structures can differ such that the rooted subtree patterns [...] are not identical,
Shouldn't the equation refer to k (and not k+1) and convey that the colouring is instead different? Is that a typo? Also, I would check for typos in the subscripts (a, b vs. 1, 2).
Thank you for your insightful comments and positive feedback on our paper. We appreciate that you acknowledge our paper’s writing, simplicity, analysis, and insights. In this response, we provide answers to each point in the review.
Formal setting and comparison with state-of-the-art approaches
We appreciate the reviewer’s thoughtful comments in comparison with existing studies.
The manuscript would benefit from a preliminary section that would more formally describe the problem setting …
Our revised paper now introduces a formal description of subgraph representation learning (Section 3.1).
… formally describe … the shortcomings of state-of-the-art approaches which the authors seek to overcome. … The paper lacks a more detailed, formal comparison with other state-of-the-art approaches.
From a methodological standpoint, beyond experimental results, how is the method different than other approaches such as SubGNN? Is it capturing some information that this last cannot? What are, reasonably, the methodological components and technical features that lead WLKS to outperform them? A better contextualisation in terms of technical components and whether they are captured or not by other methods would also help readers chart a landscape of subgraph representation methods. Expanding Section 5 would go in this direction.
We agree that an in-depth discussion about the limitations of current methods would enhance the paper. To address this, we will include a dedicated section early in the paper, which includes:
- Comparison with SubGNN: While SubGNN employs message-passing within subgraphs, its reliance on ad hoc patch sampling and its separation of hand-crafted channels (e.g., position, neighborhood, structure) introduces complexity and potential suboptimality in information aggregation. In contrast, WLKS captures a unified and expressive structural representation without requiring handcrafted patch designs or sampling strategies.
- Comparison with GLASS: GLASS uses separate message-passing for node labels that distinguish internal and global structures. This can enhance expressiveness by mixing representations from local and global structures similar to WLKS. However, GLASS has a limited ability to handle multiple labels in batched subgraphs; thus, a small batch size is required for GLASS. WLKS provides a generalized framework to represent fine-grained levels of structures around subgraphs, which can process multiple subgraphs efficiently by leveraging kernel methods.
- Comparison with S2N: S2N efficiently learns subgraph representations by compressing the global graph. However, this compression results in a loss of structural information and expressiveness in tasks where the global structure is important. In particular, since the approximation bound of S2N depends on how many subgraphs are spread out in the global graph, we cannot always expect a robust approximation. In contrast, the efficiency of WLKS comes from the efficiency of the algorithm itself, not from lossy compression.
Thanks again for this feedback and we included this technical comparison in Appendix A of the revised paper.
Integration with continuous features
The manuscript lacks a more direct discussion on the kind of information WLKS is employing. Is the method completely structural? How can features be included? Is the approach in subsection 2.5 the only way to do so?
WLKS is primarily designed to capture structural information through the WL algorithm. However, in addition to subsection 2.5, the method can be generalized to include continuous features using simple modifications.
- We can linearly combine multiple kernel matrices (as in Equation 3), which can encode both structural (via node coloring) and continuous features.
- We can use the continuous Weisfeiler Lehman operator (Togninalli et al. 2019) on node and edge features, to compute the kernel matrix for WLKS.
It seems that our paper did not sufficiently discuss the WLKS with continuous features. To address the reviewer's concern, we included a section on methods for incorporating continuous features. This can clarify that WLKS can effectively leverage both structural and feature-based information.
References
- Togninalli et al., “Wasserstein Weisfeiler-Lehman Graph Kernels”, NeurIPS 2019
Dear Reviewer ZP6Y.
Thank you for reviewing our work. We especially appreciate your detailed feedback and active engagement with us. We uploaded the revised paper, and the revised parts are highlighted in red. We summarize the modified parts based on your feedback below:
- We added Appendix A to provide a formal, detailed comparison of WLKS with state-of-the-art baselines and representative related work. In addition, we moved the ‘Related Work’ section earlier in the paper.
- In Section 3, we included a formal description of subgraph representation learning.
- In Sections 4 and 6 (Table 5), we introduced WLKS models incorporating continuous features and their results on all benchmarks.
- In Section 3.3, we reconstructed Proposition 2.2 (now 3.2 in the revised paper) in a more formal and rigorous form. We corrected the typos in the proof through one round of discussion (please let us know if there are any remaining mistakes). In addition, we supplemented Figure 2 with more general examples.
- In Section 6, we added an analysis of inference time and described how we measure training and inference runtime in detail.
- In Section 2 (Related Work), we revised our ambiguous writing.
We kindly ask if our revised paper has addressed your concerns as the discussion period comes to an end. If there is no remaining concern, we would greatly appreciate it if you could kindly consider updating the score accordingly.
I appreciate the authors' effort in improving the manuscript based on the reviewers' suggestions. In general, I believe that the method could provide an interesting contribution to the community, leading it to more focussed analyses and understanding on the efficacy of simpler and more scalable methods w.r.t. more sophisticated ones.
I updated my score accordingly.
We appreciate your thorough review and constructive feedback, which really helped us improve our work. Your detailed comments and engagement have been meaningful experiences for us during this submission. Thank you for recognizing the contributions of our paper and for your support.
Dear Reviewers,
As the discussion period is coming to a close, please take a moment to review the authors’ responses if you haven’t done so already. Even if you decide not to update your evaluation, kindly confirm that you have reviewed the responses and that they do not change your assessment.
Thank you for your time and effort!
Best regards, AC
The paper addresses the challenge of subgraph representation learning by extending the WL kernel to subgraph-level tasks. The proposed method applies the WL algorithm to k-hop neighborhoods around subgraphs, aggregates structural information, and combines kernel matrices across multiple k-hop levels to achieve a balance between discriminative power and computational efficiency. Experimental results show that the approach achieves competitive or superior performance compared to existing methods.
A key strength of this work is its principled extension of WL kernels for measuring subgraph similarity, supported by corresponding theoretical analysis. This provides an important alternative to widely used deep learning techniques for subgraph tasks, such as SEAL, distance encoding, labeling tricks, and GLASS. The paper is generally well-written and clear, particularly in its methodological explanations. However, some weaknesses include the limited selection of hop numbers and challenges in incorporating node and edge features effectively.
Despite some limitations, the strengths of the work outweigh its weaknesses. Therefore, I lean towards an acceptance. The authors are encouraged to address the reviewers' concerns in the manuscript, as discussed during the rebuttal period.
Additional Comments from the Reviewer Discussion
One of the concerns raised was the novelty of this work, given prior GNN approaches for subgraph-level tasks. This was effectively addressed by the authors, who provided an extensive discussion in the manuscript. Additionally, the proposed method is a kernel-based approach (with accompanying theoretical analysis) rather than a neural network, which further distinguishes it. Therefore, I do not have any concerns regarding its novelty.
Other concerns pertain to the choice of baselines, algorithmic complexity, and the handling of continuous-valued node/edge attributes. These, in my opinion, are more engineering or implementation-related issues that do not diminish the scientific value of the work. The authors also provided empirical evidence to mitigate these concerns.
Accept (Poster)