PaperHub
7.0 / 10
Poster · 3 reviewers
Scores: 7, 7, 7 (min 7, max 7, std 0.0)
Confidence: 4.0
Correctness: 3.3
Contribution: 3.0
Presentation: 2.7
NeurIPS 2024

Deep Homomorphism Networks

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We study parameterized multi-layer graph homomorphism networks in which homomorphism mappings act as convolutional kernels. The proposed multi-layer graph homomorphism can be understood via rooted graph product homomorphisms.

Abstract

Many real-world graphs are large and have some characteristic subgraph patterns, such as triangles in social networks, cliques in web graphs, and cycles in molecular networks. Detecting such subgraph patterns is important in many applications; therefore, establishing graph neural networks (GNNs) that can detect such patterns and run fast on large graphs is in high demand. In this study, we propose a new GNN layer, named graph homomorphism layer. It enumerates local subgraph patterns that match the predefined set of patterns $\mathcal{P}^\bullet$, applies non-linear transformations to node features, and aggregates them along with the patterns. By stacking these layers, we obtain a deep GNN model called deep homomorphism network (DHN). The expressive power of the DHN is completely characterised by the set of patterns generated from $\mathcal{P}^\bullet$ by graph-theoretic operations; hence, it serves as a useful theoretical tool to analyse the expressive power of many GNN models. Furthermore, the model runs in the same time complexity as the graph homomorphisms, which is fast on many real-world graphs. Thus, it serves as a practical and lightweight model that solves difficult problems using domain knowledge.
Keywords

graph homomorphism · subgraph counting · graph neural network expressivity

Reviews and Discussion

Review
Rating: 7

The authors propose a deep graph learning architecture where each layer consists of two key components: (i) the computation of homomorphism count vectors relative to a set of graph patterns, and (ii) a subsequent non-linear transformation. The homomorphisms computed in a layer incorporate features from the previous layer. Within this framework, the authors examine its expressive power, providing a characterization in terms of a variant of Weisfeiler-Leman and homomorphism indistinguishability related to an expanded set of graph patterns. Additionally, a brief section discusses the architecture's continuity properties and universality. Preliminary experimental results are also presented.

Strengths

S1 Architecture The proposed architecture generalizes standard MPNNs while enhancing their expressive power. Furthermore, by incorporating different graph patterns in each layer, it extends the approach taken by Barceló et al. [4], resulting in a significant increase in expressive power. The elegant idea involves combining the computation of weighted homomorphism counts of patterns in each layer, treating the host graph as being augmented with features computed from previous layers.

S2 Versatility By appropriately selecting graph patterns, the proposed architecture demonstrates exceptional flexibility in capturing various kinds of subgraph information.

S3 Analysis Theorem 5.3 is a notable result, clearly illustrating the expressive power of the proposed method.

S4 Theoretical Connections To provide insights into the properties of the proposed architecture, the authors leverage a wide range of theoretical results and connect their work to recent developments in the field.

Weaknesses

W1 Incorrect statement Unless there is a misunderstanding, Theorem 3.3 is known not to be true. That is, the classical result relating isomorphisms and hom counts is known not to generalize to weighted graphs. A recent reference for this is https://arxiv.org/pdf/2308.15283 (but this was observed before).

W1 has been dispelled by the authors; it was due to a misunderstanding on my part.

W2 Imprecise formalizations

  • When defining the notion of generalized homomorphism, it is worth pointing out that this is not a new concept. Weighted homomorphisms have been studied before. Now, the definition on page 4 is not very clear: you mention “continuous”, but with respect to which topology? The input and output of the functions $\mu_p$ only become clear later on. Please define this more precisely. I also understand that $\mu_p$ maps into a $d$-dimensional space. What is the product in this space? You use this in equation (3) without explaining how to multiply two vectors. I assume you use the Hadamard (pointwise) product?

  • In Section 4.1, and in particular in the crucial definition of a homomorphism layer in Equation (5), you use $h$ without saying what it is. Please make this clearer.

  • Example 4.1 is hard to decipher. What is $\circ$ in $x_{\pi(\circ)}$ in Equation (6)? I was expecting a product in Equation (7), since the graph pattern has two vertices. I am also not sure what is going on in line 193. What are these sets $\mu$? Since this example is supposed to illustrate your main architecture, it should be clear and consistent with the introduced notation and definitions.

  • In the definition of your version of WL, in Equation (11), shouldn’t the homomorphisms $\pi$ also take colors from previous iterations into account?

W3 Universality The results about universality in Section 5.3 are of less interest, since no quantitative bounds are given about the size of networks needed to approximate any function to arbitrary precision. Furthermore, the topology used needs more motivation and explanation.

W4 Heavy reliance on results in theoretical CS Many of the theoretical results and observations are heavily influenced by (and are oftentimes restated versions of) results known in the literature. An example is Section 4.2, but many proofs are also close to existing ones. Similarly, the results in Section 5.2 all rather easily follow from recent advances in TCS, especially from the work of Roberson et al., and from previous work in the ML community.

W5 Experiments The experiments appear very preliminary and do not give much insight into how much the flexibility of the proposed architecture is needed to obtain good accuracy. Indeed, do a very limited number of patterns and two layers suffice? It is also unclear why the number of parameters varies when changing graph patterns; wouldn't fixing the number of parameters make for a fairer comparison? Also, there is no comparison with the work by Barceló et al. and NT et al., which served as the motivation for the current paper. What is the reason for this?

W6 Unclear sentences

  • L38 “for a common choice”: what do you mean by this?
  • L87 “expressive power is ... exponential in the number of GNN layers”: it is unclear what it means for the expressive power to be exponential in something.
  • Figure 1 should be improved. It is unclear what is going on in this figure at this point in the paper.
  • L125 “although sacrificing invariance ...” Why would you want to consider functions that can distinguish isomorphic graphs? This would indicate that you want to learn something depending on the specific encoding of the graphs?
  • L196-L198: It is unclear what these sentences intend to convey (even after consulting the appendix).
  • L202 and onwards, you mention that your model is an algebra. Please detail what you mean by this and why this is important (e.g., universality).

W7 Minor comments

  • L141 The result in Dell et al. was previously shown in “On recognizing graphs by numbers of homomorphisms” by Z. Dvořák.
  • L146 Why does $\phi^{(\star)}$ need to be permutation invariant? It is defined as a function that takes a multiset as input, so it is permutation invariant by definition.
  • L149 “We can prove”-> “One can prove” (since it is already shown by others)
  • L152 “where $\mathcal{T}$ is the set of all trees” -> should be removed here, I assume?

Questions

I would appreciate it if the authors could respond to the weak points W1, W2 and W5.

Limitations

This has been addressed in a satisfactory way by the authors.

Author Response

Thank you very much for reading our manuscript. In particular, we are pleased to have a review from a specialist in homomorphisms. Below, we address your questions and comments. In particular, we claim that our Theorem 3.3 is correct ([W1]) because of our definition of the generalised homomorphism. Since all your major concerns (W1, W2, and W5) relate to this definition, we hope our answer clarifies your concern and that you will reconsider your evaluation of our work.

Questions

[W1] As you mentioned, weighted graph homomorphisms can only distinguish weighted graphs without twins (https://arxiv.org/abs/1909.03693). However, our generalised homomorphism circumvents this issue by introducing non-linear transformations $\mu$ on node features. For instance, Example 4 in your reference is distinguished by setting $\mu_p(x) = 1$ for all $p$ and $x$. This transformation gives the homomorphism number of the underlying unweighted graphs, so it detects isomorphism of the underlying graphs.

The proof of this theorem is in the Appendix. The key point is that our generalised homomorphism can count homomorphisms of graphs-with-features (i.e., mappings that preserve both adjacency and feature values) by suitably choosing the non-linear transformations (say, $\mu_p(x) = \mathbb{1}[x = a_p]$ for the relation that node $p$ has feature $a_p$). This observation, combined with Lovász's classical theorem for general relational structures (https://www.cs.elte.hu/~lovasz/scans/opstruct.pdf), proves the theorem.
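To make this concrete, here is a minimal brute-force sketch (our illustration, not the authors' code; the toy graph and all names are hypothetical) showing that indicator transforms $\mu_p(x) = \mathbb{1}[x = a_p]$ make the generalised homomorphism number count exactly the feature-preserving homomorphisms:

```python
from itertools import product

def gen_hom_number(pattern_edges, pattern_nodes, host_edges, host_feat, mu):
    # Generalised homomorphism number: sum over all edge-preserving vertex
    # maps pi of the product of transformed features prod_p mu[p](x_{pi(p)}).
    total = 0
    for pi in product(sorted(host_feat), repeat=len(pattern_nodes)):
        assign = dict(zip(pattern_nodes, pi))
        if all((assign[u], assign[v]) in host_edges for (u, v) in pattern_edges):
            w = 1
            for p in pattern_nodes:
                w *= mu[p](host_feat[assign[p]])
            total += w
    return total

# Host: a path 0-1-2 with features (0, 1, 0); pattern: a single edge (p, q).
host_edges = {(0, 1), (1, 0), (1, 2), (2, 1)}
host_feat = {0: 0, 1: 1, 2: 0}
# Indicator transforms: endpoint p must carry feature 0, endpoint q feature 1.
mu = {'p': lambda x: int(x == 0), 'q': lambda x: int(x == 1)}
print(gen_hom_number([('p', 'q')], ['p', 'q'], host_edges, host_feat, mu))
# -> 2: exactly the two feature-preserving homomorphisms, p->0,q->1 and p->2,q->1
```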

[W2]

When defining the notion of generalized homomorphism, it is worth pointing out that this is not a new concept.

Weighted homomorphisms are a classical concept, but we believe our version, which applies non-linear transformations, is new and best-suited for GNNs. One significant difference between weighted homomorphisms and our generalised version appears in Theorem 3.3, as discussed above.


In section 4.1., and in particular in the crucial definition Equation (5) of a homomorphism layer, you use h without saying what it is.

It is the node feature of $G$, as written in $(G^u, h)$. We will add this clarification in the updated manuscript.


Example 4.1 is hard to decipher.

Sorry for the imprecise presentation. The right-hand side of Eq (7) is by definition $\sum_{v \in N(u)} \mu_\bullet(x_u)\, \mu_\circ(x_v)$. In line 193, what we wanted to convey was:

  • For a single-node pattern (node: $\bullet$), $\mu_\bullet(x)$ is arbitrary.
  • For a two-node pattern (nodes: $\bullet$ and $\circ$), $\mu_\bullet(x) = 1$ and $\mu_\circ$ is arbitrary.

This yields Eqs. (6) and (7). We will update this part accordingly.
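As a minimal illustration of this computation (ours, not the authors' implementation; it assumes the Hadamard product and toy $\mu$ functions), the single-edge layer $z_u = \sum_{v \in N(u)} \mu_\bullet(x_u)\,\mu_\circ(x_v)$ can be written as:

```python
import numpy as np

def edge_hom_layer(adj, x, mu_bullet, mu_circ):
    """adj: (n, n) 0/1 adjacency; x: (n, d) node features.
    Returns z with z_u = sum_{v in N(u)} mu_bullet(x_u) * mu_circ(x_v)."""
    a = np.apply_along_axis(mu_bullet, 1, x)  # transformed root features
    b = np.apply_along_axis(mu_circ, 1, x)    # transformed neighbour features
    return a * (adj @ b)  # neighbour sums, then Hadamard product with a

# Toy usage: a triangle with scalar features; mu_bullet = 1 recovers Eq (7).
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
x = np.array([[1.0], [2.0], [3.0]])
z = edge_hom_layer(adj, x, lambda t: np.ones_like(t), lambda t: t)
print(z)  # [[5.], [4.], [3.]]: each node sums its neighbours' features
```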


shouldn’t the homomorphisms $\pi$ used also take colors from previous iterations into account?

It takes colours from previous iterations into account, since it aggregates the colour of node $\pi(p)$ at the $k$-th step. We might have misunderstood your question, so please elaborate.

[W5] Please refer to the global rebuttal for a general remark about the real-world dataset experiments. About the number of parameters: our model has a transformation $\mu = \{ \mu_p : p \in V(F) \}$ on each node in each pattern. We implement them as neural networks. Therefore, the number of parameters is basically the number of patterns times the number of parameters in each neural network.
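A toy sketch of this parameter count (ours; the pattern bank and layer sizes are hypothetical), assuming PyTorch:

```python
import torch.nn as nn

def make_mu(d_in, d_hidden, d_out):
    # One small MLP per pattern node; these are the learnable parameters.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

patterns = {"edge": ["bullet", "circ"], "triangle": ["r", "a", "b"]}  # hypothetical bank
mus = {(f, p): make_mu(16, 32, 16) for f, nodes in patterns.items() for p in nodes}
n_params = sum(p.numel() for m in mus.values() for p in m.parameters())
print(len(mus), n_params)  # 5 MLPs, 5 * 1072 = 5360 parameters in total
```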

Weaknesses

[W3] In principle, we agree that universality/BS-continuity doesn't add much value without quantitative bounds. But we think it is an important property, especially in large-graph applications. Also, it gives a brief comparison with Zhang 2023's seminal model that captures bi-connectivity (ICLR'23 outstanding paper): "ours is continuous, theirs is not". See also 2WsS [Q5].

[W4] We also agree, in principle, that our theorems follow easily from the TCS results. However, our goal here is not to prove new TCS results, but to state the relationship between the GNN architecture (layer-wise aggregation) and the expressive power of the model (pattern graphs for homomorphisms), thereby extending the study of GNN expressive power, and to suggest a generic framework for building a GNN architecture suitable for a given task.

[W6] We will revise our manuscript to clarify all the points raised here.

L38

This means "Several existing GNN papers used such architectures."

L87

We identify the expressive power with the set of captured patterns. In this sense, "exponential in $k$" means that a $k$-layer GNN can count patterns in $F^0 \cup \cdots \cup F^k$ (having $k$-th powers in terms of rooted products). The number of such patterns is also exponential in $k$ (assuming $|F| = O(1)$).

Figure 1

We have updated Figure 1 to clarify its meaning.

L125

We don't want to consider functions that can distinguish isomorphic graphs; as clearly written in the same paragraph, "we stay with invariant functions as it is a natural requirement to represent graph properties". This is just a note that some existing studies claiming higher expressive power sacrifice invariance. So ours is incomparable with them, as we and they are studying different classes of functions.

L196-L198

We wanted to say that DHN is a very wide class of GNNs that aggregate local information, i.e., (1) we think Eq (9) is the most generic GNN that aggregates information locally, and (2) it is expressible in DHN.

L202

Thanks. As you pointed out, it is important in the proof of the universality. We will update our manuscript accordingly.

L141

Thank you very much for pointing out the right reference.

L146

You are mathematically correct. This is just to emphasize the permutation invariance of the function, to be sure. Note that several GNN papers (e.g., Xu et al. 2019, https://arxiv.org/abs/1810.00826) also mention that a function is permutation-invariant even though it takes a multiset as an argument.

L149 L152

We will update them accordingly.

Comment

I would like to express my gratitude to the authors for their clarifications, particularly concerning the application of non-linearity in the hom counts. This was a point I misunderstood in my initial reading, which may have led to an unduly harsh evaluation of the paper. After reviewing the authors' explanations and re-examining the paper, I now see its potential. If the authors can refine certain parts of the presentation and address all the reviewers' comments, I believe this paper could be a valuable addition to the conference program. Accordingly, I will raise my score to accept.

Review
Rating: 7

The authors propose a new neural network architecture for graph data, namely “Deep Homomorphism Network” (DHN). This is constructed as a stacking of “Graph Homomorphism layers”, which compute learnable, generalised homomorphism counts for a set of predefined input patterns.

DHNs extend the following previous works: (i) the approach by NT and Maehara, which precomputes extended homomorphism counts on attributed graphs and uses them to define hand-crafted features for downstream models such as kernel methods; (ii) the approach by Barceló et al., which proposes to add rooted homomorphism counts for a bank of patterns as initial features for downstream message-passing neural networks. With respect to (i), DHNs are deep and can distinguish graphs beyond the homomorphism-distinguishability of the predefined set of input patterns. In relation to (ii), DHNs can account for features and fully subsume the approach of Barceló et al., which can be considered a special DHN whereby deeper layers only consider a bank made up of the single-node and single-edge patterns.

After expanding on preliminaries and model definitions, the authors interestingly show how the model fully captures message-passing neural networks with a simple bank design, and then turn to discussing computational complexity, showing that, in some noteworthy settings, it grows linearly or polynomially.

A section on expressive power follows. The authors characterise the expressive power of their model in terms of homomorphism indistinguishability w.r.t. a bank of patterns obtained by the "closure" of a "kernel bank" w.r.t. rooted graph product. Next, a series of precise results are obtained in terms of typical bank choices (cycles, cliques, connected graphs of a certain size), and in contrastive terms w.r.t. known expressive models.

Finally, the authors practically experiment with their architecture on synthetic benchmarks for node disambiguation, in an attempt to empirically verify the discriminative power of their model. Results validate how this depends on the choice of the pattern bank and, importantly, the depth of the architecture.

Strengths

  • (S1) The present manuscript proposes a new, valid, interesting approach for expressive graph neural models. It builds on top of known works, but extends them in meaningful and concrete ways.

  • (S2) The proposed architecture is comprehensively studied in terms of its expressiveness, and its complexity is additionally discussed. The results on expressive power are precise and informative.

  • (S3) DHNs meaningfully subsume some known architectures, offering a new, interesting perspective on how a deep graph neural model can be structured beyond message-passing and combinatorial approaches.

Weaknesses

  • (W1) The presentation of the paper could be improved in some parts. For example, the authors extensively refer to the graph rooted product around Lemma 3.1, but this is only introduced in Section 5.1 (almost two pages later). Some other parts could be improved as well, by either giving better intuitions or providing additional visual support. This would help the reader especially in technical and theoretical parts. This is true, for example, around Lemma 3.1: why does it hold, is it trivial that one can decompose a homomorphism $\pi$ into $\pi_0$ and $\pi_p$, and how would the two look in a visual example? As another example, a figure explaining Example 5.1 or the hierarchies/relations in Corollary 5.6 would help. An illustrative aid could also support the comprehension of the formula in Eq. 3.

  • (W2) The authors do not experiment at all on real-world benchmarks. Why is that the case? What would the reader expect on those? One interesting aspect of DHNs is that they can (easily) account for node features; this kind of real-world experimentation would help gather a signal on how important this architectural feature is.

  • (W3) The authors do not position their contribution well enough in the broader landscape of GNN research beyond expressive power. What is the impact that we should expect from these new architectures? Will they find a strong real-world application in a certain setting? Should they rather be considered an approach of purely theoretical interest?

Questions

  • (Q1 - minor) Line 116: what are the domain and codomain of $f$? Can the authors specify? I think it would help toward a better comprehension of the following lines.

  • (Q2 - minor) Line 149: Why do the authors use the first-person plural? Would it be more correct to use an impersonal “it is possible to show”?

  • (Q3) Theorem 3.3: Is it essentially a consequence of Lovász’s theorem? It would help if the authors referred (already in the main corpus) to previous works and their proof in the appendix.

  • (Q4) What is $\mu$ in Eqs. 6 and 7?

  • (Q5) In Example 5.1, is the single vertex not needed? What is the intuition behind?

  • (Q6) The authors refer to the k-WL test as the “Folklore” type, right? If so, shouldn’t 2-GNNs and 2-IGNs instead be 3-GNNs and 3-IGNs (page 8)?

  • (Q7) What is, concretely, the task solved in the experimental setting? Is discrimination cast as multi-way classification? Can the authors provide more details on that?

Limitations

The authors mention limitations in the last paragraph of their manuscript (main paper). They could expand as hinted in (W3), should other limitations emerge.

Author Response

Thank you very much for reading our manuscript carefully. We will revise our manuscript to improve its clarity and readability by addressing your concerns. Below, we reply to each comment/question.

Weaknesses

(W1) The presentation of the paper could be improved

graph rooted product around Lemma 3.1

We'll re-organise Section 3 as follows: we move Lemma 3.1 to Section 5. We also elaborate on Theorem 3.3 by adding a remark to clarify [W1] of Reviewer AtgF.

Some other parts could be improved as well by either giving better intuitions or by providing additional visual support.

We will add illustrations for Lemma 3.1 and the hierarchy of corollaries.

The decomposition in Lemma 3.1 is straightforward. For example, let's take the spoon graph $F$. The nodes are $\{r, a, b, c\}$, where $r$ is the root, $\{r, a, b\}$ forms a triangle, and $\{b, c\}$ forms an edge. Let $\pi$ be an arbitrary homomorphism from $F$. Then, $\pi$ is decomposed into $\pi_0$, which maps the triangle $\{r, a, b\}$ rooted at $r$, and $\pi_b$, which maps the single-edge graph $\{b, c\}$ rooted at $b$. They are the restrictions of $\pi$ to these subsets. As $F$ is obtained by attaching the edge to the triangle, this construction is one-to-one. We will add an illustration to show this construction, which may help to understand the decomposition at a glance.
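The decomposition can also be checked mechanically. The following brute-force sketch (ours, for illustration only; the host graph $K_4$ is an arbitrary choice) verifies that the rooted homomorphism count of the spoon graph equals the sum, over rooted triangle homomorphisms $\pi_0$, of the rooted edge counts anchored at $\pi_0(b)$:

```python
from itertools import product

def rooted_homs(edges, nodes, root, host_edges, host_nodes, root_image):
    # Count maps assign: nodes -> host_nodes with assign[root] == root_image
    # that send every pattern edge to a host edge.
    count = 0
    for pi in product(host_nodes, repeat=len(nodes)):
        assign = dict(zip(nodes, pi))
        if assign[root] != root_image:
            continue
        if all((assign[u], assign[v]) in host_edges for (u, v) in edges):
            count += 1
    return count

# Host: the complete graph K4 (both edge directions stored).
host_nodes = list(range(4))
host_edges = {(i, j) for i in host_nodes for j in host_nodes if i != j}

spoon = [('r', 'a'), ('a', 'b'), ('b', 'r'), ('b', 'c')]  # triangle {r,a,b} + edge {b,c}
tri = [('r', 'a'), ('a', 'b'), ('b', 'r')]
edge = [('b', 'c')]

lhs = rooted_homs(spoon, ['r', 'a', 'b', 'c'], 'r', host_edges, host_nodes, 0)

rhs = 0
for pi in product(host_nodes, repeat=3):
    assign = dict(zip(['r', 'a', 'b'], pi))
    if assign['r'] != 0:
        continue
    if all((assign[u], assign[v]) in host_edges for (u, v) in tri):
        # attach an edge homomorphism rooted at b -> assign['b']
        rhs += rooted_homs(edge, ['b', 'c'], 'b', host_edges, host_nodes, assign['b'])

print(lhs, rhs)  # 18 18: the decomposition is one-to-one
```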

(W2) We understand your concern; based on our expressivity results, the reader should expect our model to perform better than less expressive baselines but not better than highly engineered SOTA models on real-world datasets. We report such results in the global rebuttal.

(W3) This paper is in the line of studies of GNN expressive power, although we are motivated by practical large-graph applications that involve subgraph feature engineering. In this sense, this paper itself doesn't immediately provide a SOTA-level GNN architecture. The important missing part is how to design patterns and optimize the architecture accordingly. Although our model with basic patterns worked well on the benchmark datasets in the manuscript and the real-world datasets in the global rebuttal, we think the practical DHN architecture requires further work. We will mention this limitation in the updated manuscript.

Questions

(Q1 - minor) This is a great point. For an invariant function $f$ (line 116), the domain of $f$ is the set of all graphs with features. The codomain is arbitrary, but usually $\mathbb{R}^d$. On the other hand, for an equivariant function $f$ (line 119), the situation is very complicated. As in the invariant case, the domain is all graphs. However, the codomain depends on the input, since for each $G$, $f(G)$ is indexed by $V(G)$ as $f(G)_u$. So, mathematically speaking, the codomain of an equivariant function is $\bigcup_G \mathbb{R}^{d \times V(G)}$. This "irregular codomain issue" makes analysis difficult. Hence, we consistently work on rooted graphs in this paper (this is the idea in https://arxiv.org/abs/1910.03802). For rooted graphs, we have $f(G^r) \in \mathbb{R}^d$, so the codomain is $\mathbb{R}^d$ regardless of the input graph. This is explained in lines 120--122, but we will elaborate to clarify the reasoning.

Note that most existing GNN studies didn't have this issue because they worked on graphs with a fixed number of nodes. However, in many applications we often compare the function values on two graphs $G$ and $G'$ of different sizes (see also our response to [Q5] of Reviewer 2WwS). Hence, we decided to employ the rooted-graph formulation.

(Q2 - minor) We'll fix this.

(Q3) Yes, this is an immediate consequence of Lovász's classical 1967 theorem (https://www.cs.elte.hu/~lovasz/scans/opstruct.pdf), which says that any relational structure (not limited to graphs) is uniquely (up to isomorphism) identified by its homomorphism numbers. We just need to confirm that our generalised homomorphism can count homomorphisms of graphs-with-features, which is also straightforward. This is written in the proof in the Appendix, but we will add the idea to the main body of the paper. See also our response to [W1] of Reviewer AtgF.

(Q4) Thank you for pointing out the imprecise presentation. $\mu$ is a set of functions defined on nodes, $\mu = \{ \mu_p : \mathbb{R}^{d_i} \to \mathbb{R}^{d_o} : p \in V(F) \}$, which transform node features; these are the parameters of the model. In this specific case, the precise definition is in our response to [W2] of Reviewer AtgF. We will put these precise definitions in the manuscript to avoid confusion.

(Q5) The singleton is included by the definition of the closure (0-th power is the singleton); see line 237.

(Q6) No, we meant k-WL original version.

(Q7) Indeed, in our experiments, distinguishing isomorphism classes is cast as multi-class graph classification, and the experimental settings for each synthetic dataset followed the literature. Each input graph belongs to an isomorphism class, and the task is to detect these classes in the test set. For example, SR25 is a set of strongly regular and co-spectral graphs belonging to 15 different isomorphism classes; the task is to correctly detect these isomorphism classes.

Comment

I would like to thank the authors for their response, which I have positively acknowledged.

I personally find the results in the general response interesting, and the comments and discussions wrt SOTA methods worth sharing with the rest of the community – I think the manuscript would benefit from their inclusion.

No, we meant k-WL original version.

I apologise, but I am afraid I am not fully aligned on the k-WL issue. In the paper you have a footnote stating:

We follow [11]’s definition, which is also called the folklore Weisfeiler–Lehman test.

So, it seems like k-WL refers to the folklore version?

Comment

We apologise for the previous response; we misread the question. You are correct. We double-checked the k-GNN and k-IGN papers and confirmed that they refer to the original k-WL rather than the folklore variant, which led to our off-by-one error.

k-GNN: https://arxiv.org/pdf/1810.02244

We note here that the above variant is not equal to the folklore variant ...

k-IGN: https://arxiv.org/pdf/2201.10129

2-IGN = GIN

We'll update them to 3-GNN and 3-IGN accordingly. Thank you very much.

Review
Rating: 7

The authors introduce Deep Homomorphism Networks (DHNs), which generalize Message Passing Neural Networks (MPNNs) and previous Homomorphism Networks. DHNs explicitly count the number of homomorphisms from pattern graphs to the input graph, calculate representations based on these counts, and update node features accordingly. MPNNs are a subset of DHNs when patterns are limited to single nodes and edges. Previous Homomorphism Networks are equivalent to one-layer DHNs. The authors extend known results about the homomorphism expressivity of these networks, demonstrating that DHNs can count a broader range of homomorphisms. This lays the foundation for quantitative expressivity results.

Strengths

  • The construction of DHNs is elegant, effectively generalizing standard MPNNs and other homomorphism networks.
  • The theoretical contributions are interesting, intuitive, and non-trivial, providing a solid foundation for further research on the expressivity of GNNs.

Weaknesses

  • The paper lacks evidence of improved predictive performance on real-world tasks. While the theoretical contributions are valuable, it is unclear in which real-world tasks DHNs would be advantageous. Clarification/Ablations on whether domain knowledge is necessary to predefine patterns and whether including irrelevant patterns can harm predictive performance would be beneficial.
  • The computational complexity of DHNs is not convincingly addressed (as a limitation). For example, 3-WL has cubic complexity and can count homomorphisms of all graphs with tree-width 2 at this complexity. DHNs are less expressive unless patterns include graphs with tree-width 3 or higher. Therefore, to surpass 3-WL, DHNs must match or exceed its complexity. The advantages of DHNs over 3-WL or its local variants, like subgraph GNN or local 2-FWL, need to be clearly stated.
  • The motivation for using labeled patterns is unclear, as this adds complexity and increases the need for domain knowledge. The benefits of using labeled patterns over blank patterns without features must be better explained.

Minor:

  • Some proofs could be more detailed. For instance, in Theorem 5.3, the step from item 2 to item 3 is not clearly justified. The proof shows that there exists a function $h$ that satisfies $h(u) = \mathrm{hom}((F, \mu), (G^u, x))$ for every $u$ and a fixed graph $G$. If the construction of $h$ depends on the graph, this is not sufficient for the proof. More clarity and detail in this and other proofs would strengthen the paper.

Questions

  • What is the predictive performance of DHNs on real-world tasks? Are explicit subgraph/homomorphism counts optimal for these tasks?
  • There is one related work [Paolino, Raffaele, et al. "Weisfeiler and Leman Go Loopy: A New Hierarchy for Graph Representational Learning." arXiv preprint arXiv:2403.13749 (2024).], and I believe the relation could be really interesting. Paolino et al. only consider injective homomorphisms, while their color refinement algorithm is more complex in the sense that representations for the homomorphisms are calculated. If I understand correctly, DHNs can homomorphism-count exactly cactus graphs (with limited cycle length) if the patterns contain cycles. Could you discuss the relationship?
  • For which classes of patterns is the homomorphism-expressivity maximal, in the sense that the set of patterns that DHNs can count is maximal? See also
  • Why focus on feature homomorphisms? This approach seems more restrictive and domain-dependent, requiring prior knowledge of the existence of these featured patterns. When is it better to focus on feature homomorphisms versus general homomorphisms? Why are the graphs $G_1$ and $G_2$ in Theorem 5.3 rooted?
  • The authors mention very briefly the continuity w.r.t. the BS distance. However, it is not clear why this is desirable. While it leads to learnability, one can derive learnability also without it. At the same time, it strictly limits expressivity, as discussed in Remark 5.14. Is continuity desirable? If yes, can you show that DHNs can approximate non-continuous graph invariants like biconnectivity?

Limitations

It would be beneficial to add that the predictive performance is unclear. Also, domain knowledge might be needed to avoid worsening generalization/predictive performance.

Author Response

Thank you very much for the thoughtful review and comments.

Questions

  • [Q1] Since comments on real-world datasets are common among all reviewers, please refer to our global rebuttal.

  • [Q2] Thank you very much for giving us a pointer to (Paolino et al., 2024); we will cite it and discuss the relationship in the updated related work. Their model is a special case of ours with patterns $\{C_k : k = 1, \ldots, r\}$. Hence, ours can also count cactus graphs of bounded cycle length using these patterns. One important difference is that they use a permutation-invariant aggregation on cycles (their Eq (3)); thus, they cannot distinguish two cycles carrying the same multiset of features ordered differently (e.g., $[a,a,b,b]$ vs $[a,b,a,b]$), while ours can.

  • [Q3] [Note: please let us know if we misunderstood your question, since we don't see anything after "See also".] In general, it is difficult to answer. A theoretically correct answer (but not very related to DHN) is that the set of all graphs of size at most n is maximally expressive, since it determines the isomorphism class of graphs of size at most n. Finding smaller but isomorphism-characterising pattern sets is a central research question in the field of homomorphism distinguishability; see (Dvořák, 2010) and (Roberson, 2022).

  • [Q4] Could you elaborate on "general homomorphisms" and "feature homomorphisms" in this question? We might have misunderstood it. It is true that the method requires some domain knowledge about which patterns are useful. However, we believe it is better than the classical graph-learning approach with hand-crafted subgraph patterns, as we only need to take a larger but tractable set and ask DHN to learn the important patterns (this might be similar to classical hand-crafted features in computer vision versus CNNs). Our model is more restrictive than those with higher expressive power (e.g., k-WL-equivalent models). However, the less restrictive models are expensive (say, O(n^k)) and not applicable to large graphs. In this sense, we restrict our model to satisfy the needs of computational complexity.

    • Graphs $G_1$ and $G_2$ in Theorem 5.3 are rooted because we consistently work on rooted graphs for technical reasons; please refer to our response [Q1 - minor] to Reviewer X5N6. Note that it is easy to derive the non-rooted version from the rooted one, because the non-rooted version can be stated as "there is a bijection $\pi$ from $V(G_1)$ to $V(G_2)$ such that each pair of graphs $G_1^u$ and $G_2^{\pi(u)}$, rooted at $u$ and $\pi(u)$, are equivalent".
  • [Q5] First, DHN (or any BS-continuous GNN) cannot approximate the biconnectivity of all graphs, because there exists a pair of graphs $G, G'$ that are arbitrarily close in the BS topology but have different biconnectivity, so the model fails to predict the biconnectivity of $G$ or $G'$. Models that can approximate biconnectivity must be non-continuous in the BS topology. Intuitively, if a function $f$ is BS-continuous, $G$ is large, and $G'$ is a snowball sample of $G$, then $f(G)$ and $f(G')$ are close (see Czumaj 2019). This is useful in large-graph applications: we want to estimate a property of a large graph (e.g., the Twitter network) but only have a small sample $G'$. Here, BS continuity guarantees that $f(G')$ approximates $f(G)$. If we use a non-continuous model, we have to discuss the validity of estimating $f(G)$ from $f(G')$, which is usually non-trivial. In this sense, ours is useful in large-graph applications but loses the approximability of non-continuous properties; non-continuous models are useful on small graphs but lose this generic applicability to large graphs. We will add the above intuition for BS continuity and the reasoning why it is useful. This discussion is not specific to our DHN model, but it clarifies why we consider BS continuity, which improves the readability of the manuscript.
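For readers unfamiliar with Benjamini-Schramm (BS) continuity, one standard (roughly stated) formalisation of the sampling guarantee sketched above is the following; this is our illustration, not a quotation from the paper:

$$\forall \varepsilon > 0 \;\; \exists\, r \in \mathbb{N},\ \delta > 0: \quad d_{\mathrm{TV}}\big(\nu_r(G), \nu_r(G')\big) < \delta \;\Longrightarrow\; |f(G) - f(G')| < \varepsilon,$$

where $\nu_r(G)$ is the distribution of the rooted $r$-hop neighbourhood of a uniformly random node of $G$. A snowball sample $G'$ of a large sparse graph $G$ satisfies $\nu_r(G') \approx \nu_r(G)$ for small $r$, which is why $f(G') \approx f(G)$ for BS-continuous $f$.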

Weaknesses

  • [W1] Thank you for pointing out the necessity of some intuitions on real-world datasets. We quickly added some experiments in the global rebuttal and will update our manuscript following your comment.

  • [W2] In terms of patterns, DHN captures "locally complex but globally tree-like (i.e., treewidth = 1)" patterns, while k-FWL (or (k+1)-WL)-equivalent models capture "globally tree-like (i.e., treewidth = k)" patterns. Therefore, they are essentially incomparable (one cannot surpass the other). We will expand line 35 and add a remark after the corollaries stating when the proposed method is better than k-WL-type methods in such applications.

  • [W3] Here, we interpret the term "labelled pattern" as patterns with functions. Please let us know if our understanding is wrong. First, how to aggregate node features over patterns without functions needs more exploration. A simple idea is to multiply the feature values in each coordinate and sum them up. This gives the weighted homomorphisms in each coordinate. However, adding/multiplying raw features does not necessarily make sense to us (although many papers do so), and it has a mathematical issue, as mentioned by Reviewer AtgF. Hence, we decided to apply functions to features. Second, in practical applications, the importance of nodes in patterns might differ (e.g., nodes closer to the root could be more important). Thus, we decided to use node-dependent functions.

  • [W4] We will improve our presentation of the proofs.

Comment

Thank you for the rebuttal. I have one final question: when calculating homomorphisms, do you consider a map that preserves edges as a homomorphism, or, if node features are available, do you require the map to preserve both edges and node features? If it's the former, there may be issues with the proofs. If it's the latter, how likely is it to actually find a homomorphism in the second layer? Specifically, you would need to select a pattern $F$ with node features $x$ that appear in the second layer, i.e., in the original graph with updated node features.

Could you also please comment on situations where you would recommend not using deep homomorphism layers? It seems that the complexity becomes as high as $k$-WL when cliques are used as patterns. Even for cycles, the complexity appears to be quite high, given that the homomorphism counts cannot be precomputed, since $\mu_P$ is learned during the process.

Comment

When calculating homomorphisms, do you consider a map that preserves edges as homomorphisms, or if node features are available, do you require the map to preserve both edges and node features?

Our algorithm enumerates all feature-less homomorphisms but aggregates features with non-linear transformations. So it is something in between.

To select the pattern $(F, x)$ in the 2nd layer, where $x$ is the original feature, we set the 1st-layer transformation by stacking the transformation needed in the 1st layer with the identity function for the 2nd layer. The same argument works for any finite number of layers.

Could you also please comment on situations where you would recommend not using deep homomorphism layers?

Thanks. It is valuable to emphasise this in the limitations. In short, there are two cases in which we DON'T recommend using DHN:

  1. Graphs are small so that O(n^k) is tractable.
  2. Graphs are dense so that there are Theta(n^k) patterns.

As you mentioned, clique enumeration (and many small-pattern enumerations) takes Theta(number-of-patterns) time, and the number of patterns can be Theta(n^k). Therefore, its worst-case complexity is comparable to k-WL type models. If this complexity is acceptable, there's no need to use the DHN model. This is the above Case 1.

Our main target is large real-world graphs, which often exhibit the property that there are only O(n) patterns. This is a consequence of the sparsity of the graphs, which is usually analysed via degeneracy; see Case 3 in Section 4.2. In other words, if the given graph is dense, the enumeration takes too much time, making it impossible to apply. This is the above Case 2.

Author Response

First, we thank all three reviewers for their time and detailed comments on our work. Given the submission volume in our community, we are truly grateful that all three reviewers have clearly given our manuscript deep consideration.

In this global rebuttal, we would like to address a common concern from all reviewers regarding real-world experimental results.

Experimental results on ENZYMES and PROTEINS. We ran some representative DHN models on TUDataset's ENZYMES and PROTEINS datasets. These bioinformatics datasets are cast as graph classification problems. ENZYMES consists of 600 graphs and 6 classes; each node has 3 tags and 18 features. PROTEINS consists of 1113 graphs belonging to 2 classes; each node has 3 tags and 1 feature. We report the classification accuracy using stratified 10-fold cross-validation, similar to the literature. We will update Table 1 in the manuscript as follows.

| Model | #params | CSL | EXP | SR25 | ENZYMES | PROTEINS |
|---|---|---|---|---|---|---|
| MPNN (4 layers) | 27k | 0 | 0 | 0 | 54.6 ± 4.5 | 72.0 ± 4.0 |
| PPGN (4 layers) | 96k | 100 | 100 | 100 | 58.2 ± 5.7 | 77.2 ± 3.7 |
| DHN (C2:4) | 5k | 100 | 50 | 0 | 64.3 ± 5.5 | 76.5 ± 3.0 |
| DHN (C2:10) | 27k | 100 | 98 | 0 | 58.0 ± 5.3 | 78.5 ± 2.5 |
| DHN (C2:4, C2) | 8k | 100 | 50 | 53 | 64.4 ± 5.9 | 77.1 ± 2.8 |
| DHN (C2K3:5, C2K3:5) | 36k | 100 | 100 | 100 | 57.5 ± 6.6 | 77.4 ± 3.4 |

Note that the SOTA result for ENZYMES is around 78%, and for PROTEINS, it is around 84%. However, we do not intend to compete with the SOTA results in this work.

The results above emphasize the necessity of dataset-specific considerations for each real-world dataset. The expressivity experiments (CSL, EXP, SR25) clearly distinguish the architectures; however, real-world dataset results are often comparable among different architectures. Achieving SOTA might be attributed to engineering efforts rather than to a novel model architecture.

The lack of real-world benchmarks is indeed a shortcoming of our current manuscript. Our work aims to focus on the method's theoretical foundation and novelty, so we did not put much effort into experimentation beyond the synthetic expressivity datasets. While our model performed well compared to basic baselines, it is still quite far from SOTA results due to the lack of engineering. For example, the SOTA model for ENZYMES (DSGCN-allfeat) computes the spectrum of each graph, which is highly engineered toward this particular dataset. There is clearly a gap between our theoretical proof-of-concept model and a truly applicable one, and we think there is potential; still, more engineering work is needed before we can bridge the theory-practice gap. That being said, based on the reviewers' comments, we understand that readers have some expectations for real-world benchmarks, so we will include results for a subset of the TUDataset graph classification benchmarks in the final version of the manuscript.

Final Decision

The paper proposes a new graph neural network (GNN) layer based on local subgraph patterns using homomorphisms. This yields a GNN that can be used to study the expressive power of GNNs that locally aggregate information. The relationships to existing works are well explained.

The paper received three very detailed reviews. All reviewers participated in the discussion and recommend to accept the paper.

They mentioned the missing consideration of real-world datasets as one main weakness of this work. The authors provided such results in the rebuttal but obtained only decent performance. I consider closing these gaps a challenge for future research.

Overall, I recommend to accept the paper.