Laplace-Transform-Filters render spectral Graph Neural Networks transferable
We develop a novel approach to GNN transferability based on information diffusion on graphs. We find that spectral graph neural networks are transferable from this new point of view if their filters arise as Laplace transforms of certain functions.
Abstract
Reviews and Discussion
This paper studies the transferability of spectral GNNs based on diffusion over graphs. The transferability is discussed between graphs which are similar from the diffusion perspective. The approach is to analyze the transferability of each single filter, giving rise to the findings that spectral GNNs are transferable if their filters are Laplace transforms of certain generalized functions.
Strengths
The research question concerns the interesting topic of transferability of GNNs. The paper is well-written and the experiments are well-conducted.
Weaknesses
- The definition of similarity in the diffusion sense is rather limited and discussed in the context of graph coarsening. However, the important works on similarity of graphs in the context of graph coarsening have not been considered, i.e., the following two:
- Andreas Loukas, Graph reduction with spectral and cut guarantees, JMLR 2019.
- Andreas Loukas, Pierre Vandergheynst, Spectrally approximating large graphs with smaller graphs, ICLR 2018.
- An obvious limit is that the class of filtering functions is limited to the class of low-pass filters, as shown by examples 4.2 and 4.3, which are both typical examples of low-pass graph filters. This is in fact connected to the heat diffusion equation underlying the distance, as heat kernels act as low-pass filters. In this regard, the transferability may not hold in the general setting, as spectral GNNs may also learn high-pass filters.
- It seems the transferability is discussed also in the context of graph coarsening. Is this general enough to discuss the transferability of GNNs? Please elaborate on this.
- Two important works on transferability of GNNs are not well-discussed:
  - Convergence and Stability of Graph Convolutional Networks on Large Random Graphs
  - Graphon Neural Networks and the Transferability of Graph Neural Networks
For the second work, the authors only mentioned that it is limited to the (very) large graph setting. This is not true. The work in fact discussed a general setting that goes beyond graph coarsening.
Questions
- The fundamental reason why $\|L - \tilde{L}\|$ has pitfalls in measuring the similarity between two Laplacians is that the spectral norm is a poor measure. Sometimes the Frobenius norm, as the sum of the eigenvalues, might be a better measure, but it is still limited. Fundamentally, one could really compare how similar two graphs are by checking the full difference matrix $L - \tilde{L}$.
- Different graphs in this paper have only been investigated in terms of one graph being considered as the coarsened version of another. This is effectively the problem of graph coarsening. However, a similarity notion in the context of graph coarsening has already been rigorously studied by
- Andreas Loukas, Graph reduction with spectral and cut guarantees, JMLR 2019.
- Andreas Loukas, Pierre Vandergheynst, Spectrally approximating large graphs with smaller graphs, ICLR 2018.
The authors there defined the notion of spectrally restricted similarity between graphs, so as to ensure spectral similarity when performing graph coarsening.
I do not see that the authors have considered this in their work. Please note that in the above two works, rigorous definitions of spectral similarity are provided, as well as extensive theoretical results.
- The difference between two diffusion operators in fact utilizes an extra dimension to measure the spectral difference, compared to simply using the spectral norm. However, it is still a limited measure of similarity. As $t$ increases, the diffusion flow approaches the harmonic eigenvector (space) of the Laplacian. Consider a setting with two completely different graphs (but both connected) with the same number of nodes. When $t$ is large enough, the proposed similarity measure is essentially zero, showing the two graphs to be similar. This is clearly not true. Of course, this is under the assumption that $t$ is large. I wonder if this is in fact the case in Figure 9, where different molecules are similar as $t$ increases.
Related to this, I feel in 3.1 the maximum eigenvalue of L should be below 1 to ensure the diffusion will converge, right? Typically a parameter is needed there to control it. If so, then what is dominating the diffusion is the maximum eigenvalue. If the latter is small, graphs will be similar in this perspective, which again is not necessarily the case.
Having said that, the authors could consider checking the more general measure from the above two works.
- In Section 4, it is briefly discussed that typical polynomial filters would diverge, due to the fact that the norm of $L$ on the graph tends to infinity. Could the authors elaborate on this? The spectral norm of a graph Laplacian should be bounded, so I do not directly see this. Moreover, this seems to contradict the claim by Transferability of Spectral Graph Convolutional Neural Networks, JMLR 2019.
In Section 4, it is briefly discussed that typical polynomial filters would diverge, due to the fact that the norm of $L$ on the graph tends to infinity. Could the authors elaborate on this? The spectral norm of a graph Laplacian should be bounded, so I do not directly see this.
We are very glad to oblige and elaborate on this! While each fixed (finite dimensional) Laplacian of course has bounded spectral norm, the spectral norm of the un-normalized Laplacian (sometimes also referred to as the combinatorial Laplacian) is not uniformly bounded: Consider the graph of two nodes connected by an edge with weight $w$. It is not hard to see that the spectral norm of the corresponding Laplacian $L_w$ is given as $\|L_w\| = 2w$. Let $v_w$ be the normalized eigenvector associated to this eigenvalue $2w$. Then we have
$$\|p(L_w)\, v_w\| = |p(2w)| \longrightarrow \infty$$
for any (non-trivial) polynomial $p$, as $w$ tends to infinity.
It should also be noted that the lack of uniform boundedness of the graph Laplacian is not an issue if the correct class of filters is being employed: One way of achieving stability in a spectral convolutional network is certainly to demand the Laplacian be sufficiently bounded. However, what ultimately needs to be bounded to achieve stable networks is the norm of the applied filters $g(L)$. If polynomial filters are employed, this indeed directly translates to the requirement that the norm of $L$ needs to be sufficiently bounded. For non-polynomial filters such as resolvents or exponentials, this however is not necessary. We e.g. have
$$\|e^{-tL}\| \leq 1 \qquad \text{and} \qquad \|(\mathrm{Id} + L)^{-1}\| \leq 1,$$
irrespective of the norm of $L$, which might be arbitrarily large.
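For concreteness, this dichotomy is easy to verify numerically. A minimal sketch (our illustration, not taken from the paper; $p(\lambda) = \lambda^2$ is an arbitrary choice of non-trivial polynomial):

```python
import numpy as np
from scipy.linalg import expm

for w in [1.0, 10.0, 100.0, 1000.0]:
    # graph of two nodes joined by a single edge of weight w
    L = np.array([[w, -w],
                  [-w, w]])
    v = np.array([1.0, -1.0]) / np.sqrt(2.0)        # eigenvector for eigenvalue 2w
    norm_L = np.linalg.norm(L, 2)                   # = 2w: grows without bound
    poly = np.linalg.norm(L @ (L @ v))              # ||p(L) v|| for p(x) = x^2: diverges
    heat = np.linalg.norm(expm(-L), 2)              # exponential filter: always <= 1
    resolvent = np.linalg.norm(np.linalg.inv(np.eye(2) + L), 2)  # resolvent: always <= 1
    print(f"w={w:6.0f}  ||L||={norm_L:8.1f}  ||L^2 v||={poly:12.1f}  "
          f"||e^-L||={heat:.3f}  ||(I+L)^-1||={resolvent:.3f}")
```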
Moreover, this seems to contradict the claim by Transferability of Spectral Graph Convolutional Neural Networks, JMLR 2019.
We understand how initially this might indeed seem to be the case. However our results coexist in perfect harmony with those of Levie:
In Levie's work, filter functions are assumed to be bounded and Lipschitz continuous with finite Lipschitz constant on the spectrally relevant part of the spectrum (c.f. e.g. Theorem 17; page 17; ibid.). For e.g. the unnormalized graph Laplacian approximating an (unbounded) Laplace-Beltrami operator on a manifold, this spectrally relevant part is the non-negative real line. Polynomials are neither globally Lipschitz nor bounded on the non-negative real line.
Thus Levie's (nonetheless very fundamental) work does not make statements about networks based on polynomial filters in the setting of unbounded graph shift operators (such as e.g. a sequence of Laplacians approximating the (necessarily unbounded) Laplace-Beltrami operator in the setting of Levie's paper).
Hence, as Levie does not make statements about polynomial filters in the setting of unbounded sequences of graph shift operators, our claim that polynomial filters are not transferable in this setting does not contradict any of Levie's results.
We thank the reviewer for raising this important question. Following this feedback, we have included a more detailed discussion of Levie's work in Appendix B of our submission. The reviewer might also be interested in our response to Reviewer DCXz, where we further discuss the relation of our results to those of other existing works.
I wish to thank the authors for their detailed response. Overall, I stand to my original judgement.
The reasons are that this work studies a particular stability (generalizability) problem by ignoring the high spectral variations, which are the most sensitive part. As discussed with the other reviewer above, using a diffusion distance is hiding this problem. Establishing transference between graphs that are close in a diffusion sense remains a particular case of the more general GNN stability. What would be interesting here is to show what theoretical insights and advantages this particular case analysis provides over the more general case, and how the latter analysis ties to the more general ones as well as to the ideas of graph coarsening that have used similar concepts. Unfortunately, insights in this regard are limited or nonexistent in the paper.
Since the general stability of GNNs without a low pass filtration is discussed thoroughly in the literature, I struggle to see significance in this work to merit publication. Moreover, the tools used for such an analysis have also been thoroughly discussed in graph coarsening, and here they are adapted to GNNs. Thus, overall I judge the results to have limited significance and to lack originality in some respects.
Related to the frequency analysis for directed graphs, do consider that it is quite well investigated in the graph signal processing literature. But this is less relevant for the judgement of this paper and more FYI.
- Sandryhaila, A., & Moura, J. M. (2014). Discrete signal processing on graphs: Frequency analysis. IEEE Transactions on signal processing, 62(12), 3042-3054.
- Sardellitti, S., Barbarossa, S., & Di Lorenzo, P. (2017). On the graph Fourier transform for directed graphs. IEEE Journal of Selected Topics in Signal Processing, 11(6), 796-811.
- Shafipour, R., Khodabakhsh, A., Mateos, G., & Nikolova, E. (2018). A directed graph Fourier transform with spread frequency components. IEEE Transactions on signal processing, 67(4), 946-960.
Thank you very much for your thorough review of our paper!
We were very happy to read that
The paper is well-written
and that
the experiments are well-conducted.
Let us discuss the raised points individually:
An obvious limit is that the class of filtering functions is limited to the class of low-pass filters, [...] the transferability may not hold in general setting, as spectral GNNs may learn also high-pass filters.
The reviewer is spot on in recognizing that transferability is not established for all filter functions! This is in fact a main finding of our paper:
Not all filter functions will lead to architectures that are able to transfer in all settings.
As remarked by the reviewer, on undirected graphs a frequency interpretation is available. Here one might interpret our findings as saying that only (certain) low pass filters facilitate bi-directional transferability. Additionally, (certain) all-pass filters (or more precisely filters that are constant on large frequency components) will lead to mono-directional transferability (c.f. the discussion in Section 4.2; esp. Line 289ff.).
Heuristically we might understand this result from a spectral graph theory perspective: It is well known (Chung 1997) that the low lying spectral information of the graph Laplacian corresponds to coarse structures within a graph. Large eigenvalues instead correspond to fine structure details. We want networks to be transferable between graphs that are approximately the same. Such graphs may differ greatly in their fine structure details, but generically will have similar rough structures within them. That is to say, the low lying spectral information within these graphs will be similar, while the high lying spectral information may be very different. Low pass filters will suppress the variations in the large eigenvalues while letting the approximately similar low lying spectral components of the respective Laplacians pass. This heuristically motivates why it is these filters that can be shown to be transferable. A similar line of argument motivates why all-pass filters, but not high pass filters, can also be transferable in this undirected setting.
For directed graphs, a frequency perspective is no longer straightforwardly available. Nevertheless, also here we may characterize the class of filters yielding transferable networks.
In both settings, it is the class of Laplace transform filters that renders spectral graph convolutional networks transferable. This fact informed the choice of title of our paper: Laplace transform filters render spectral graph neural networks transferable.
The definition of similarity in the diffusion sense is rather limited and discussed in the context of graph coarsening.
Having a rigorous, mathematically sound, and well adapted measure of similarity is indeed of fundamental importance when discussing transferability. However – as we hope to convince you – all of these properties are fundamentally satisfied by the notion of diffusion similarity. Detailed arguments and substantial discussions may e.g. be found in the original papers by Hammond et al. 2013 and Scott (2021) (see also Scott & Mjolsness (2021)). Here we note the following:
In the original paper (Hammond et al. 2013) that first introduced this notion of graph similarity, the authors showed the diffusion distance $d(L, \tilde{L}) = \sup_{t \geq 0} \|e^{-tL} - e^{-t\tilde{L}}\|$ to be a well defined metric on the space of graphs. Here, 'metric' is used in the strictly mathematical sense (i.e. satisfying the defining properties of positivity, symmetry and the triangle inequality). Hence the notion of diffusion similarity equips the space of graphs with a well defined (metric-)topology. This topology respects the one induced by Euclidean norms:
If $\|L_n - L\| \to 0$ in one (and hence every) Euclidean norm, then also $d(L_n, L) \to 0$.
At the same time, the metric arising from diffusion similarity is able to capture more general settings of graph similarity: One example is a sequence of graphs where the connectivity in certain subgraphs increases (c.f. Section 3.2, line 177ff.). Such a sequence does not converge in any Euclidean norm. However, in the diffusion-distance metric it is Cauchy and hence also convergent. The limit is a coarse grained graph, where strongly connected clusters are collapsed to single nodes. Thus this diffusion based metric is e.g. naturally able to capture convergence to graphs of reduced size.
Additionally, the notion of diffusion similarity is not limited to the setting of coarse-graining graphs. Other example settings captured by this notion of diffusion similarity are rewiring operations in graphs (c.f. Section 3, Fig.s 1&2 and line 136 ff.), the inclusion of subgraphs (c.f. Section 3.2, Fig. 5 and line 173 ff. as well as Appendix D) or graphs discretizing the same ambient space (c.f. the discussion in Section 5.3, line 493 ff. and Appendix F.2). Moreover the notion of diffusion similarity naturally extends to directed graphs (c.f. e.g. Appendix F.3).
Hence it is indeed fair to conclude that diffusion similarity is a well-adapted and widely applicable notion of graph similarity.
Following your feedback, we have now included an additional section in our Appendix C summarizing the rationale behind utilizing diffusion similarity along with an overview of its diverse beneficial properties.
However, the important works on similarity of graphs in the context of graph coarsening have not been considered, i.e., the following two:
- Andreas Loukas, Graph reduction with spectral and cut guarantees, JMLR 2019.
- Andreas Loukas, Pierre Vandergheynst, Spectrally approximating large graphs with smaller graphs, ICLR 2018.
Thank you for bringing these interesting works on graph coarsification to our attention! We have included them in our paper when introducing transferability in the setting of graph coarsening (c.f. lines 187 - 189). As the focus of our paper is the study of transferability properties of graph neural networks, we do not introduce any novel approaches to coarse graining or graph reduction in our manuscript. Instead our work provides a general framework of transferability that extends beyond the setting of graph coarsification. Nevertheless this example is indeed a prominent one. Thus, we appreciate your help in making the discussion of this example more rigorous.
It seems the transferability is discussed also in the context of graph coarsening. Is this general enough to discuss the transferability of GNNs? Please elaborate on this.
Thank you for your important question. This gives us the opportunity to emphasize the main contributions of this work. While graph coarsening indeed provides a motivating and intuitive example of a transferability setting in our work, our theoretical analysis extends far beyond this setting, providing much more general results:
We establish transferability between any two graphs that are similar in the diffusion sense. Example settings captured by this notion of diffusion similarity include rewiring operations in graphs (c.f. Section 3, Fig.s 1&2 and line 136 ff.), the inclusion of subgraphs (c.f. Section 3.2, Fig. 5 and line 173 ff. as well as Appendix D) or graphs discretizing the same ambient space (c.f. the discussion in Section 5.3, line 493 ff. and Appendix F.2). Additionally the notion of diffusion similarity naturally extends to directed graphs (c.f. e.g. Appendix F.3).
Two important works on transferability of GNNs are not well-discussed:
1. Convergence and Stability of Graph Convolutional Networks on Large Random Graphs
2. Graphon Neural Networks and the Transferability of Graph Neural Networks
This is an important point; thank you for this valuable feedback. Following this advice, we have now created a dedicated Appendix (c.f. Appendix B in our updated manuscript), where we discuss in detail other approaches to transferability as well as the relation of our results to them.
For the second work, the author only mentioned it is limited to the (very) large graph setting. This is not true. The work in fact discussed a general setting that goes beyond the graph coarsening.
Analyzing GNN transferability with the help of auxiliary graphons is of course a fundamental and well established approach to transferability. Nevertheless, the convergence rate in Theorem 1 of the paper mentioned by the reviewer comes with a potentially large constant (e.g. dependent on spectral separation properties of the graph at hand). This result, while fundamental, does not make meaningful predictions in the setting of smaller graphs such as the molecular graphs of QM7. Our approach instead is not dependent on any type of limit object being approximated sufficiently well. Hence it also applies here, as well as in other settings outside the coarse graining setting, without restrictions.
The similarity of graphs from the diffusion perspective. Is this really a novel perspective?
The notion of diffusion similarity was indeed established already in 2013 by Hammond et al., as we discuss in our paper.
To the best of our knowledge, however, our work is indeed the first one to investigate the transferability of graph neural networks between graphs that are similar from the perspective of information diffusion. Hence, we do indeed provide a novel perspective on transferability.
To incorporate this feedback and avoid any potential ambiguity, we have changed the sentence
"Transferability of graph neural networks is then considered between graphs that are similar from this novel perspective of information diffusion."
in the abstract (to which we assume you are referring) to read
'Transferability of graph neural networks is then considered between graphs that are similar from this novel perspective on transferability.'.
The fundamental reason why $\|L - \tilde{L}\|$ has pitfalls in measuring the similarity between two Laplacians is that the spectral norm is a poor measure. Sometimes the Frobenius norm, as the sum of the eigenvalues, might be a better measure, but it is still limited. Fundamentally, one could really compare how similar two graphs are by checking the full difference matrix $L - \tilde{L}$.
We fully agree with the reviewer that certain distance measures are more adapted to certain task-settings than others.
However, as we argue below, neither the spectral norm, nor the Frobenius norm, nor the full difference matrix are well adapted to capture the type of similarity that the notion of diffusion similarity captures. To exemplify this, let us revisit the rewiring setting of clique vs. dumbbell graphs in Fig.s 1 & 2. Irrespective of where an edge is deleted, the difference matrix (w.l.o.g. after relabelling nodes) reads:
$$L - \tilde{L} \;\sim\; \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix}$$

Here, the Frobenius norm (i.e. Schatten-2 norm) as well as the spectral norm are thus both given as $\|L - \tilde{L}\|_2 = \|L - \tilde{L}\| = 2$. Both of these quantities are insensitive to where the edge manipulation is performed. Also the entire difference matrix $(L - \tilde{L})$ does not provide any information as to whether the two respective graphs should be considered similar or not. This is different for the diffusion distance, as we discuss in detail in Section 3.1. As is desirable (c.f. the discussion in Section 3.1), this distance metric considers the graphs in Fig. 1 to be similar, while it considers the graphs in Fig. 2 to be dissimilar. Similar considerations are also true for graphs with diverging graph Laplacians which nonetheless converge, in a rigorous mathematical sense (i.e. in the sense of degenerate evolution semigroups; https://link.springer.com/article/10.1007/BF02573493), to a graph defined on a different node set, as e.g. discussed for an example setting in Section 3.2.

[...] a similarity notion in the context of graph coarsening has already been rigorously studied by
- Andreas Loukas, Graph reduction with spectral and cut guarantees, JMLR 2019.
- Andreas Loukas, Pierre Vandergheynst, Spectrally approximating large graphs with smaller graphs, ICLR 2018.
The authors there defined the notion of spectrally restricted similarity between graphs, so as to ensure spectral similarity when performing graph coarsening.
I do not see that the authors have considered this in their work. Please note that in the above two works, rigorous definitions of spectral similarity are provided, as well as extensive theoretical results.
We thank the reviewer for these suggested works measuring similarity between original and coarse grained graphs. As graph coarsening serves as a prominent example setting in our work, we gladly include an extended discussion (c.f. Appendix D) on restricted spectral similarity as defined in the suggested works, which are relevant to our example transferability setting of graph coarsification.
Before sketching this discussion here, we would like to point out that our paper studies transferability properties of GNNs. As such it investigates filter functions and architectures and derives conditions on them that will guarantee transferability. It does not focus on studying or introducing any new coarse graining or reduction procedures.
Instead, our paper focuses on establishing a general transferability framework and deriving rigorous transferability guarantees for graph neural networks. One setting to which our theory applies is that of coarse graining graphs. Other examples include rewiring operations in graphs (c.f. Section 3, Fig.s 1&2 and line 136 ff.), the inclusion of subgraphs (c.f. Section 3.2, Fig. 5 and line 173 ff. as well as Appendix D) or graphs discretizing the same ambient space (c.f. the discussion in Section 5.3, line 493 ff. and Appendix F.2). Additionally our results on GNN transferability naturally extend to directed graphs (c.f. e.g. Appendix F.3).
The procedures laid out in the two references most certainly can produce graphs between which networks will be able to transfer. Additionally, the notion of restricted spectral similarity is a very well adapted measure to quantify spectral similarity between a graph and its coarse grained version.
However, to the best of our understanding, the notion of restricted spectral similarity is almost, but not completely, sufficient to guarantee transferability of filters between an undirected graph and its coarse grained version:
Consider two graphs $G$ and $G_c$, with associated Laplacians $L$ and $L_c$. Using the notation of 'Andreas Loukas, Graph reduction with spectral and cut guarantees', we are interested in bounding the difference between the filter outputs generated on $G$ and on its coarse grained version $G_c$. Let us exemplarily consider a single fixed filter.
Denote by $P_{\leq k}$ and $P^c_{\leq k}$ the spectral projections onto the first $k$ eigenvectors of $L$ and $L_c$ respectively. Denote by $P_{>k}$ and $P^c_{>k}$ the respective spectral projections onto the remaining eigenvectors of the respective two operators.
We may first observe that we may reduce the problem to considering only the first $k$ eigenvectors of the respective operators: We may decompose $P_{\leq k}$ into a sum over one dimensional eigenspaces as
$$P_{\leq k} = \sum_{i \leq k} \psi_i \psi_i^\top,$$
with eigenvectors $\psi_i$.
Similar considerations also hold for the coarse grained graph. Using this, we find that the difference in filter outputs decomposes into a contribution determined by the first $k$ eigenpairs and a remainder term.
The first term is then bounded by a small quantity, as Theorem 13 of 'Andreas Loukas, Graph reduction with spectral and cut guarantees' guarantees that the eigenvalues $\lambda_i$ and $\lambda^c_i$ remain close for $i \leq k$.
For the second term we note that we may bound it in terms of the alignment between the first $k$ eigenspaces of $L$ and the lifted first $k$ eigenspaces of $L_c$.
If we could bound this term by a small quantity, we would be done.
In 'Andreas Loukas, Graph reduction with spectral and cut guarantees' such an alignment between the eigenspaces of $L$ and the lifted eigenspaces of $L_c$ is attacked from the direction of canonical angles. This uses machinery introduced in
C. Davis and W.M. Kahan: The Rotation of Eigenvectors by a Perturbation. III.
The canonical angle operator introduced there (and utilized in 'Graph reduction with spectral and cut guarantees') is defined via eq. (1.16) of Davis & Kahan. There it is then established (c.f. ibid. page 10) that bounds on this operator control the desired eigenspace alignment. Hence, had we bounds on the entirety of this operator, we would be done. In 'Graph reduction with spectral and cut guarantees', a bound on part of it is provided (c.f. ibid. Theorem 14). However, without an additional bound on the remaining part (c.f. Davis & Kahan eq. (1.16)), we unfortunately cannot achieve our desired bound above.
We have now included this discussion into our paper (c.f. Appendix D).
Should the reviewer have further insights or comments here, we would be more than happy to incorporate this additional feedback into our paper as well!
The difference between two diffusion operators in fact utilizes an extra dimension to measure the spectral difference, compared to simply using the spectral norm.
If we understand correctly, the reviewer is referring to measuring distance between graphs via the diffusion distance $\sup_{t \geq 0} \|e^{-tL} - e^{-t\tilde{L}}\|$ as opposed to measuring distance via $\|L - \tilde{L}\|$.
The key idea here is that including the exponential into the distance metric leads to an (exponential) suppression of large eigenvalues of $L$ and $\tilde{L}$. Information encoded into these eigenvalues and eigenspaces corresponds to fine structure details of the graphs $G$ and $\tilde{G}$ (c.f. e.g. Chung 1997). Suppressing this fine-structure information before taking a distance measurement effectively leads to a comparison that is predominantly determined by the coarse structures within the graphs. If the rough structures within the two graphs are similar, the distance between the two graphs will be relatively small. Thus this metric is adapted to considering graphs that are similar up to fine-structure variations to be close to each other.
From a purely topological perspective, the appearance of the extra temporal dimension over which the supremum is taken is more of an artifact stemming from the connection of this metric to diffusion flow: The topology generated by this metric is exactly the same as the one generated by $\|f(L) - f(\tilde{L})\|$, where $f$ is any injective function which decays to zero at infinity. That is to say, the notion of convergence (i.e. which sequences of graphs are convergent and which ones are not) is not influenced by the appearance of this superfluous temporal dimension.
Nevertheless, as the subsequent analysis in our paper shows, having this additional temporal dimension available is greatly beneficial when deriving explicit transferability bounds. This is especially apparent in Theorem 4.4, which expresses single filter transferability in terms of (an integral over) this latent temporal dimension.
In contrast to the diffusion distance, the distance measure $\|L - \tilde{L}\|$ is dominated by variations in the (eigenspaces corresponding to) large eigenvalues. This metric is thus instead adapted to picking out fine-structure variations.
As $t$ increases, the diffusion flow approaches the harmonic eigenvector (space) of the Laplacian. Consider a setting with two completely different graphs (but both connected) with the same number of nodes. When $t$ is large enough, the proposed similarity measure is essentially zero, showing the two graphs to be similar. This is clearly not true.
The reviewer is correct that we have $\|e^{-tL} - e^{-t\tilde{L}}\| \to 0$ as $t \to \infty$. However, the definition of the diffusion distance does not allow one to arbitrarily pick the value of $t$! Instead the value of $t$ that maximizes the difference
$$\|e^{-tL} - e^{-t\tilde{L}}\|$$
has to be picked (c.f. e.g. Section 3.1, line 150, or Hammond (2013)):
As proved in Hammond (2013), with the distance $d(L, \tilde{L}) = \sup_{t \geq 0} \|e^{-tL} - e^{-t\tilde{L}}\|$ defined in this way, we have $d(L, \tilde{L}) = 0$ if and only if $L = \tilde{L}$. This is true since $d$ is a genuine metric (in the rigorous mathematical sense) on the space of graphs.
Following this feedback, we have added an additional appendix discussing the distance metric as introduced by Hammond (2013) in more detail.
Of course, this is under the assumption that $t$ is large. I wonder if this is in fact the case in Figure 9, where different molecules are similar as $t$ increases.
This is an excellent question! Fig. 5 showcases why LTF based models in Table 1 are able to transfer. It is indeed true that $\|e^{-tL} - e^{-t\tilde{L}}\| \to 0$ as $t \to \infty$, as the reviewer correctly observed. However, the key take-away here is not that this function decays to zero, but rather that it decays to zero sufficiently fast: already for moderate values of $t$, the plotted differences are negligibly small.
Let us exemplarily examine the implications of this sufficiently fast decay for the transferability of the filter $g(\lambda) = e^{-\lambda}$, which constitutes a basis element in our investigated LTF-Exp architecture. The generalized function associated to this filter is the Dirac distribution $\mu = \delta_1$, since $g(\lambda) = \int_0^\infty e^{-t\lambda}\, \mathrm{d}\delta_1(t)$.
As discussed in Theorem 4.4 (line 274 ff.), the single filter transferability error is bounded as
$$\|g(L) - g(\tilde{L})\| \;\leq\; \int_0^\infty \|e^{-tL} - e^{-t\tilde{L}}\| \,\mathrm{d}|\mu|(t).$$
Since $\|e^{-tL} - e^{-t\tilde{L}}\|$ is small where $\mu$ is concentrated, the transferability error of the corresponding filter is small. Together with Theorem 4.9 this then explains the transferability observed in Table 1.
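To illustrate the structure of such a bound (this is our reconstruction above, not a verbatim restatement of Theorem 4.4), a small numerical check can be done with the resolvent filter $g(\lambda) = (1+\lambda)^{-1}$, which is the Laplace transform of $\mu(t) = e^{-t}\,\mathrm{d}t$ (an illustrative choice of filter and perturbation):

```python
import numpy as np
from scipy.linalg import expm

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, (6, 6)); A = np.triu(A, 1); A = A + A.T
A_mod = A.copy(); A_mod[0, 1] = A_mod[1, 0] = 0.0      # perturb: delete one edge
L, Lp = laplacian(A), laplacian(A_mod)

# left-hand side: difference of resolvent-filter outputs g(L) = (I + L)^{-1}
I = np.eye(6)
lhs = np.linalg.norm(np.linalg.inv(I + L) - np.linalg.inv(I + Lp), 2)

# right-hand side: int_0^inf e^{-t} ||e^{-tL} - e^{-tLp}|| dt, by quadrature
ts = np.linspace(0.0, 40.0, 4001)
vals = [np.exp(-t) * np.linalg.norm(expm(-t * L) - expm(-t * Lp), 2) for t in ts]
rhs = np.trapz(vals, ts)
print(f"||g(L) - g(L~)|| = {lhs:.4f}  <=  integral bound = {rhs:.4f}")
```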
Following this feedback, we have now included further discussions of Figure 9 in Appendix J.2 which we of course reference in the main body of our paper.
Related to this, I feel in 3.1 the maximum eigenvalue of L should be below 1 to ensure the diffusion will converge, right?
The issue of convergence is indeed an important one. Here convergence is in fact guaranteed without further conditions on the maximal eigenvalue; it can be arbitrarily large:
In Section 3.1, the solution of the diffusion equation is given as $x(t) = e^{-tL} x_0$. For $t \to \infty$,
we have, with $P_0$ the spectral projection onto the (harmonic) eigenspace corresponding to the eigenvalue $0$, that
$$e^{-tL} \longrightarrow P_0,$$
irrespective of the magnitude of the largest eigenvalue. This can be seen as follows:
We have $e^{-tL} = \sum_i e^{-t\lambda_i} P_i$, with $P_i$ the spectral projections onto the eigenspaces corresponding to the eigenvalues $\lambda_i \geq 0$. Thus
$$\|e^{-tL} - P_0\| \leq \sum_{\lambda_i > 0} e^{-t\lambda_i}.$$
The sum on the right hand side is finite and each term decays to zero individually. Hence also the entire sum decays to zero.
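This convergence is also easy to verify numerically. A minimal sketch (our illustration), using a random weighted graph with deliberately large edge weights (hence a large maximal eigenvalue):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 6
A = rng.uniform(0.0, 50.0, (n, n))     # deliberately large edge weights
A = np.triu(A, 1); A = A + A.T         # symmetric weighted (complete, hence connected) graph
L = np.diag(A.sum(axis=1)) - A
print("largest eigenvalue:", np.linalg.eigvalsh(L)[-1])

P0 = np.ones((n, n)) / n               # projection onto the constant (harmonic) eigenspace
for t in [0.01, 0.1, 1.0, 10.0]:
    print(f"t = {t:5.2f}   ||exp(-tL) - P0|| = {np.linalg.norm(expm(-t * L) - P0, 2):.2e}")
```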
Typically a parameter is needed there to control it. If so, then what is dominating the diffusion is the maximum eigenvalue. If the latter is small, graphs will be similar in this perspective, which again is not necessarily the case.
Some eigenvalues do indeed influence the diffusion process more strongly than others. Here however, the maximal eigenvalue is actually the one that is suppressed the strongest (indeed exponentially, as $e^{-t\lambda_{\max}}$). Its influence is thus the least pronounced.
Regarding the diffusion distance between graphs with small maximum eigenvalue we note the following. We have by definition
$$d(L, \tilde{L}) = \sup_{t \geq 0}\, \|e^{-tL} - e^{-t\tilde{L}}\|.$$
As $t$ varies over all non-negative values, the joint overall scale of the Laplacians (and hence also their largest eigenvalues) does not matter for determining their diffusion distance.
What would be interesting here is to show what theoretical insights and advantages this particular case analysis provides over the more general case. Unfortunately, insights in this regard are limited or inexistent in the paper.
We are happy to list some examples of the insights and advantages of our transferability analysis:
- Our work applies to transferability settings not covered by existing approaches (c.f. also the discussion above)
- We provide a quantitative similarity notion (i.e. diffusion similarity) adapted to the setting of graphs that are approximately similar, but different in fine structure articulation. We also discuss how existing approaches fail in this setting and discuss implications for transferability.
- We discuss that polynomial filters will not be able to transfer in the considered settings; a fact sometimes not appreciated in the community. Indeed, we believe that the reviewer too was curious about this in their initial review and asked whether this finding does not contradict results of Levie et al. (2019) (c.f. also the discussion above).
- We characterize the class of filters that is able to transfer between graphs that are approximately similar, but different in fine structure articulation. Such a characterization had not been done before in the literature.
- We provide (non asymptotic) quantitative bounds for these novel transferability settings and more generally for the setting of graphs that are approximately similar, but different in fine structure articulation.
- We distinguish between mono- and bidirectional similarity and characterize the class of filters that will facilitate transferability in the respective senses. We find that not all filters are bidirectionally transferable, which had not been appreciated in the community before.
- We consider transferability of undirected and directed graphs on an equal footing, while the vast majority of existing approaches focuses solely on the undirected setting.
- We relate transferability to continuity of models: Models conforming to our developed theory are continuous from the space of graphs to Euclidean latent spaces; models not conforming to our theory are shown to be not continuous: If for example a sequence of graphs converges in the diffusion sense to a coarse grained graph defined on fewer nodes, latent embeddings generated by models conforming to our developed theory will converge to the latent embedding of the coarser graph. This is not true for other models (c.f. also Figure 10).
- Our analysis is not contingent on an underlying limit object that is sufficiently well discretized. Hence it also applies to small to medium sized graphs. This is e.g. exemplified in Section 5.1 (c.f. also line 404ff).
- Our analysis allows one in practice to build models that are experimentally able to transfer where existing models are unable to transfer: In Table 1, only models conforming to our theory showcase transferability. Thus our results have tangible practical relevance when designing new architectures.
And how the latter analysis ties to the more general ones as well as to the ideas of graph coarsening [...]
In addition to discussions in the main body of our paper (c.f. e.g. line 38ff., line 122 ff., line 187 ff., line 231 ff., line 404 ff.), we discuss in Appendix D how our work relates to the notion of restricted spectral similarity that is used in graph coarsening, and we discuss how this notion -- while valid -- is not quite strong enough to guarantee transferability.
In Appendices B and C we further discuss existing approaches to transferability and how our approach differs from and relates to them.
Can we improve our discussion of these points in any form? We would be glad to act on any further feedback!
Moreover, the tools used for such an analysis have also been thoroughly discussed in graph coarsening and here they are adapted to GNNs.
Might we kindly ask which tools the reviewer is referring to? We are unaware of any tools from the coarse graining literature being used in our work, aside of course from coarse graining providing an example setting to which our theory applies.
As we discuss above and in great detail in Appendix D the notion of restricted spectral similarity mentioned by the reviewer in an earlier comment (while definitely providing valuable information) is not sufficient to guarantee GNN transferability in the example setting of transferability between a graph and its coarsified version.
Additionally, our work considers transferability between a graph and its coarsified version as one example, but extends fundamentally beyond it.
Instead, our work and analysis is based on the fundamental notion of heat diffusion and the related notion of the Laplace transform. Mathematically, our tools thus originate in the well established field of analysis of semi-groups.
Related to the frequency analysis for directed graphs, do consider that it is quite investigated in the graph signal processing literature. But this is less relevant for the judgement of this paper and more fyi.
We are very happy to have a tangential discussion on this! We are well aware of the GSP literature (papers of which we of course refer to within our own paper) and the various approaches to constructing a Fourier-like transform with an associated frequency interpretation for directed graphs.
Here we discuss related complications in the setting of transferability, as considered in our paper:
In the (for simplicity one dimensional) Euclidean setting, a convolutional filter acts as a function of the (self-adjoint) derivative operator:
$$g\!\left(\tfrac{1}{i}\tfrac{d}{dx}\right) f = \mathcal{F}^{-1}\big( g \cdot \mathcal{F}f \big),$$
where the operator $g\!\left(\tfrac{1}{i}\tfrac{d}{dx}\right)$ is defined via the Borel functional calculus.
We may write this as
$$g(D) = \int_{\sigma(D)} g(\lambda)\, \mathrm{d}P(\lambda),$$
with $P$ the projection valued measure associated to the self-adjoint operator $D = \tfrac{1}{i}\tfrac{d}{dx}$ and the integration being performed over the spectrum $\sigma(D)$ of the operator $D$.
If $T$ is instead a finite dimensional self-adjoint operator, the projection valued measure associated to $T$ is a discrete measure, so that for $T$ (instead of the derivative operator) being a matrix (such as a graph Laplacian) we may write
$$g(T) = \sum_{\lambda \in \sigma(T)} g(\lambda)\, P_\lambda.$$
Here $\{P_\lambda\}$ is the family of spectral projections associated to the self adjoint matrix $T$ and we have $T = \sum_{\lambda \in \sigma(T)} \lambda\, P_\lambda$.
As is well known, the spectrum thus plays the role analogous to the frequency domain in the continuous setting. In the graph frequency domain, a graph filter then simply acts by multiplication with $g(\lambda)$ in the eigenspace corresponding to the eigenvalue $\lambda$.
However, if $T$ is a (non-symmetric) matrix associated to a directed graph (such as the corresponding adjacency matrix or graph Laplacian), the above equation no longer holds: In the Jordan–Chevalley decomposition -- in addition to the spectral projections $P_\lambda$ -- there now also appear nilpotent radicals $N_\lambda$ (which are identically zero if $T$ is diagonalizable). Instead, this equation now reads
$$T = \sum_{\lambda \in \sigma(T)} \big( \lambda\, P_\lambda + N_\lambda \big),$$
with the radicals satisfying $N_\lambda^{m_\lambda} = 0$ for $m_\lambda$ the algebraic multiplicity of the eigenvalue $\lambda$.
For a filter $g(T)$ (defined via the Dunford-Taylor integral; i.e. the holomorphic functional calculus), we then have (as e.g. proved in Kato)
$$g(T) = \sum_{\lambda \in \sigma(T)} \left( g(\lambda)\, P_\lambda + \sum_{k=1}^{m_\lambda - 1} \frac{g^{(k)}(\lambda)}{k!}\, N_\lambda^k \right).$$
Here $g^{(k)}(\lambda)$ is the $k$-th derivative of $g$ evaluated at the eigenvalue $\lambda$.
The appearance of these derivative terms complicates the interpretation of the spectrum as playing the role of a frequency domain. Additionally, the appearance of the radicals complicates the study of transferability between graphs, as now not only variations in spectral projections but also variations in radicals need to be considered and bounded.
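A minimal numerical illustration of these derivative terms (our example; a single 2x2 Jordan block with $g(x) = e^{-x}$):

```python
import numpy as np
from scipy.linalg import expm

lam = 0.5
A = np.array([[lam, 1.0],
              [0.0, lam]])             # a single 2x2 Jordan block: not diagonalizable
N = np.array([[0.0, 1.0],
              [0.0, 0.0]])             # the nilpotent radical, N^2 = 0

# holomorphic functional calculus for g(x) = exp(-x):
# g(A) = g(lam) * P + g'(lam) * N, with P = identity here and g'(lam) = -exp(-lam)
direct = expm(-A)
via_formula = np.exp(-lam) * np.eye(2) - np.exp(-lam) * N
print(np.allclose(direct, via_formula))   # True: the derivative term is essential
```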
We would be happy to discuss this more, should you be interested!
Thank you for your engagement! We happily address the raised points and elucidate significance and novelty of our work!
[...] this work [...] ignores the high spectral variations which are the most sensitive part.
Our work considers transferability of GNNs between graphs that are approximately the same, but differ in their fine-print articulations. This is exactly the setting relevant for transferability, as transferability is concerned with ensuring that GNNs generate similar latent embeddings for similar graphs.
Spectrally expressed (c.f. Chung 1997), graphs that are approximately the same, but differ in fine structure articulations have associated Laplacians where the low lying spectral structure is similar, but the high lying spectral structure may differ greatly.
Example settings of graphs having the same low lying spectral structure, but different high lying spectral structure are:
- Graphs that differ by an edge deletion in a high connectivity area
- Graphs that describe the same object at different resolutions (coarse vs. fine)
- Graphs discretising a manifold
Our work then considers transferability between such graphs that are approximately the same, but that vary in fine-print articulations, which exactly means that the high lying spectral information may vary significantly.
We then classify the class of filters that is able to transfer in this setting.
[The] diffusion distance is hiding this problem.
The notion of diffusion distance introduced by Hammond et al. (2013) is not meant to hide any problem, as there fundamentally is no problem to be hidden!
Our work studies transferability between graphs that are approximately the same, but vary in fine print articulation. That means, that associated Laplacians are similar in their low lying spectral structure, but are allowed to vary in their high lying structure (as is e.g. the case for edge deletions, coarse graining, and discretisation of a manifold).
The distance notion $\|L - \tilde{L}\|$ is not able to pick up on the similarity in such settings, as it weighs all spectral components similarly (c.f. also the discussion in Section 3, line 125 ff.). In contrast the diffusion distance is able to pick up on this similarity; it is better adapted to this setting.
For the setting of coarse graining, the reviewer suggested to compare with the notion of restricted spectral similarity. As the name suggests, restricted spectral similarity only considers similarity between the spectral components below a cutoff. Similarity of higher lying spectral components is not considered. This approach of focusing on the low lying spectral components is clearly necessary, as coarse-grained and original graphs differ in their fine-structure articulation (as well as in the number of nodes).
Unfortunately, as we discuss in detail in Appendix D, the notion of restricted spectral similarity, while valid, is not quite enough to ensure transferability between a graph and its coarse grained version. The diffusion distance is able to guarantee transferability in this setting.
Additionally, it should be noted that our work considers transferability between graphs that are approximately the same, but differ in their fine-print articulations in general. The setting of coarse graining graphs -- while a motivating one -- is just one example.
Establishing transference between graphs that are close in a diffusion sense remains a particular case of the more general GNN stability.
Since the general stability of GNNs without a low pass filtration is discussed thoroughly, I struggle to see significance in this work to merit publication.
Established approaches to GNN stability do not cover settings that are covered under our transferability theory!
The underlying reason is that existing approaches to stability measure the distance between graphs in spectral norm as $\|L - \tilde{L}\|$. This distance measure weighs all parts of the spectrum with similar strength. Thus graphs that are similar in the low lying parts of the spectrum but differ potentially strongly in the high lying part of the spectrum will not be considered similar ($\|L - \tilde{L}\| \gg 0$). As transferability is however only established for $\|L - \tilde{L}\| \approx 0$, this setting is thus not covered by existing stability approaches, while our theory applies to it.
A straightforward example is the setting of edge deletions discussed at the beginning of Section 3. Here we always have $\|L - \tilde{L}\| = 2$. In contrast, the diffusion metric is more sensitive, and depending on where the edge is deleted the diffusion distance may be vanishingly small. Thus existing approaches for transferability will never allow for statements on transferability here, while using the diffusion distance will allow for such statements.
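For concreteness, a minimal numerical sketch of this contrast (our illustration; the diffusion distance is approximated as a supremum over a finite $t$-grid, and the clique/dumbbell sizes are arbitrary choices):

```python
import numpy as np
from scipy.linalg import expm

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def diffusion_distance(L1, L2, ts=np.linspace(0.0, 20.0, 401)):
    # sup_t ||exp(-t L1) - exp(-t L2)||, approximated on a finite t-grid
    return max(np.linalg.norm(expm(-t * L1) - expm(-t * L2), 2) for t in ts)

# (i) clique K8: delete one edge inside the highly connected graph
A = np.ones((8, 8)) - np.eye(8)
A_mod = A.copy(); A_mod[0, 1] = A_mod[1, 0] = 0.0

# (ii) dumbbell: two K4 cliques joined by one bridge; delete the bridge
B = np.zeros((8, 8)); B[:4, :4] = 1.0; B[4:, 4:] = 1.0
np.fill_diagonal(B, 0.0); B[3, 4] = B[4, 3] = 1.0
B_mod = B.copy(); B_mod[3, 4] = B_mod[4, 3] = 0.0

for name, M, M_mod in [("clique", A, A_mod), ("dumbbell", B, B_mod)]:
    L, L_mod = laplacian(M), laplacian(M_mod)
    print(name,
          " ||L - L~|| =", np.linalg.norm(L - L_mod, 2),        # 2 in both cases
          " diffusion distance ~", round(diffusion_distance(L, L_mod), 3))
```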
A different example of such a setting where our transferability analysis applies is the example setting of coarse graining graphs.
Also in the setting of coarse and fine structure, the suppression of large frequency components is of crucial importance! Let us consider a graph $G_w$ in which we choose two nodes and modify the edge weight $w$ between them.
As we increase the edge weight between them, it is more and more justified to treat these two nodes as a single entity, as they are so strongly connected. However, since we have for the associated ($w$-dependent) graph Laplacian $L_w$ that $\|L_w\| \to \infty$ as $w \to \infty$, a triangle inequality argument then proves that for any fixed matrix $A$, we will have $\|L_w - A\| \to \infty$.
Hence the Laplacian of this graph will eventually be infinitely far away from any other graph Laplacian. In particular, $L_w$ will never converge to the Laplacian of any other graph when distance is measured with the linear norm difference $\|L_w - \tilde{L}\|$.
However, we can show that if we use the diffusion distance instead, we will have convergence! The sequence of Laplacians $L_w$ converges in the diffusion sense to the Laplacian $\underline{L}$ of the graph $\underline{G}$ where the two nodes connected by the edge of weight $w$ are fused together into a single node:
We have
$$e^{-tL_w} \longrightarrow e^{-t\underline{L}}$$
as $w \to \infty$ (suppressing embedding operators in notation).
Our results then show that this convergence also holds for networks that employ Laplace transform filters: For such networks, the graph-level embeddings generated for the graphs $G_w$ will converge to the embedding of the limit graph $\underline{G}$. For other models, the embeddings generated for the graphs $G_w$ will not become more and more similar to the embedding of the graph $\underline{G}$ (c.f. also Figure 10 on page 9).
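As an aside, the semigroup convergence underlying this statement is easy to check numerically. A minimal sketch (our illustration, not the paper's construction): a three-node path with one strong edge of weight $w$, a fixed diffusion time $t$, and a candidate limit computed via the orthogonal projection onto signals constant on the strongly coupled pair:

```python
import numpy as np
from scipy.linalg import expm

t = 1.0

def L_of(w):
    # path graph a-b-c: unit edge a-b, strong edge b-c of weight w
    A = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, w],
                  [0.0, w, 0.0]])
    return np.diag(A.sum(axis=1)) - A

# predicted limit: project onto signals constant on the fused pair {b, c},
# then evolve with the reduced generator Pi L0 Pi (L0 = graph without the strong edge)
Pi = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.5, 0.5]])
limit = Pi @ expm(-t * Pi @ L_of(0.0) @ Pi) @ Pi

for w in [1.0, 10.0, 100.0, 1000.0]:
    err = np.linalg.norm(expm(-t * L_of(w)) - limit, 2)
    print(f"w = {w:7.1f}   ||exp(-t L_w) - limit|| = {err:.2e}")
```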
In the example of coarse and fine structure graphs, a well established and well cited notion of similarity is the notion of restricted spectral similarity. This is a very valid and useful one when coarsifying graphs, as it guarantees spectral similarity up to a frequency cutoff and allows to derive conclusions on alignment of eigenspaces of the respective Laplacians.
As we detail in Appendix D we however need slightly stronger results on alignment of eigenspaces than the ones that restricted spectral similarity is able to provide. We discuss this in detail in our response above.
Hence the notion of restricted spectral similarity from coarse graining -- while very valid -- is almost but not quite able to guarantee transferability in this context of GNN transferability between graphs and their coarsified versions.
Hence we unfortunately have to use a different measure of similarity when investigating transferability. The notion of restricted spectral similarity nonetheless remains a very valid, well established, useful and fundamental notion. We make this point very clear in our manuscript (c.f. e.g. Line 1043 ff.).
Dear Reviewer Dvw3,
Thank you again for your valued participation in the discussion phase!
We had a friendly discussion with the other reviewers above (https://openreview.net/forum?id=cTDooc2J9S&noteId=ujeYLl2Uj7) on the role of high frequency components in our analysis.
In two sentences, we would summarize the consensus achieved above as:
Our paper provides novel and valuable insights for transferability in settings where the overall rough structure of graphs is paramount. To be ready for publication, it also needs to be clearly stated that this is our scope and that we do not consider settings where the exact articulation of graphs is fundamental.
We have now updated our paper to reflect this. Following this discussion, we believe that we now also better understand your concern and -- as we hope -- are able to alleviate it!
Settings where the overall structure of a graph is more fundamental than its (high frequency) fine print structure occur whenever it is not the exact articulation of the graph at hand that is fundamental, but rather many graphs may be used to describe the same situation.
Some examples are:
- Graphs discretizing the same manifold (e.g. sensor networks measuring temperature in a geographical region)
- Graphs that describe the same object at different resolutions (e.g. social networks expressed at the level of individual users vs. communities)
- Graphs with edge presence determined by cutoffs (c.f. also the discussion in Gasteiger et al. (2019), 'Diffusion Improves Graph Learning', e.g. first paragraph of the introduction)
In these situations, we do not want models to overfit on the exact articulation of a graph, because the graph at hand is just one of many ways to describe an underlying system. That is to say, models should be insensitive to variations in the fine structure (i.e. to high frequency variations).
Existing approaches to stability measured distances between graphs using the linear distance $\|L - \tilde{L}\|$. They then established transferability between graphs for which $\|L - \tilde{L}\| \approx 0$. Thus they do not apply to the setting considered above: Graphs whose overall structures align but which differ in fine print articulation will generically differ greatly in the high-lying spectral components. In this setting we will have $\|L - \tilde{L}\| \gg 0$, so that existing stability theories will not apply.
As the linear distance is unsuitable for this setting, we now want to find a distance metric that is adapted to it. When we have such a distance metric $d$ that is small whenever the overall structures within graphs are similar, we can build a transferability theory which provides guarantees for model transferability whenever $d(G, \tilde{G})$ is small; i.e. when graphs are similar in overall structure but may vary in fine-print articulation.
We then contribute this within our paper:
- We establish and discuss a distance notion adept at capturing the setting of graphs that possess similar overall structure but vary in fine print articulation.
- We build a transferability theory for this setting, where existing transferability theories do not apply.
- We establish that not all models will be able to transfer between graphs that are similar and characterize the class of models that are able to transfer.
- We validate these findings numerically.
To make this point clearer, we have now made significant modifications to our paper. In particular, we make clear the following points:
1. Diffusion distance as relaxation: The diffusion distance can be viewed as a relaxation of the typical linear distance $\|L - \tilde{L}\|$, which allows us to consider graphs to be similar if rough overall structures align, while fine-print articulations may vary. This setting captures fundamental examples such as graphs discretizing the same manifold, graphs describing the same object at different resolutions, or graphs differing by edge deletions. We now present this exact characterization in the Introduction (c.f. line 49ff) of our updated manuscript.
2. Weaker similarity implies weaker assumptions: Since the diffusion distance is a relaxed notion of similarity compared to the linear distance $\|L - \tilde{L}\|$, only correspondingly weaker (spectral) conditions on filters are needed in order to guarantee transferability in this setting (c.f. line 67ff. of our Introduction).
3. Not applicable when fine-structure details are critical: While the notion of diffusion similarity is well adapted to situations where the overall structure of graphs is more important than their exact fine-print articulations, there also exist settings where precisely the fine-structure of graphs is important to consider. For this setting, we do not provide results; it is outside the scope of our work.
The paper looks at a spectral approach for graph learning and suggests to look at diffusion patterns to compare graph similarity. The main claim is that whenever the graph convolutions have a certain spectral form, the results automatically generalize (at least in the given theoretical setting). Then, these diffusion-based filters are implemented and transferability across data modalities (QM7 with and without H atoms) is demonstrated. The same is then done for different discretizations of a torus where again the spectral method (and none of the tested message-passing networks) achieves transferability.
Strengths
The paper is well-written and easy to read, especially for a theory-heavy paper. The problem of transferability is relevant and the suggested diffusion-based distance and its use for graph learning seem novel and interesting (while the distance is prior work, it is not widely used).
In general, the suggested LTF filters naturally handle local expansion and local transformation which is great for applications that rely on the global structure (even though most datasets are more about the local structure).
Weaknesses
My main criticism has to do with the setting in which "transferability" is tested. See the questions below.
Further, I have a number of small points I would like to mention:
- 35ff: small changes may have a big effect on the desired label, e.g. when computing the parity function (but also in chemistry applications). It is (to me) not clear how this transferability should be happening.
- 166: it would be nice to have a formal definition of J here.
- 202: how to find this mapping? It would be nice to elaborate a little here.
- 286: why 10^0 instead of 1?
- 307f: this sentence is hard to parse, especially the end after "importantly"
- 344: J^up J^down (not down, down)
- 376: QM7 contains small molecules only, this may have an effect on how much diversity is in there.
- 400ff: what are the features of the "coarse" nodes? What kind of information is still there?
- 412: is this result surprising in any way?
Questions
Why would you expect a network trained on molecules without H to generalize to graphs containing H atoms? Which directly leads to the question: why would such a behavior be desirable? (its different for the torus, there the setting is clear)
The experiment where hydrogens are deflected out of their equilibrium position is also interesting: I would expect unstable atoms (i.e. the ones with deflected hydrogen) to behave quite differently from their "normal" counterparts, so why is the convergence of the suggested LTF models a good thing? 448: Does this mean that the distances are simply ignored in this experiment? And if yes, why would that be a good idea?
How was the QM7 target chosen and why only this single target? Do the results generalize to QM9 too? (e.g. between the two datasets) And how about other targets?
How is "transferability" between dissimilar graphs avoided? And is this shown somewhere? (as its part of the conclusion)
so why is the convergence of the suggested LTF models a good thing?
This is because the convergence here is to a different graph than the graph representing the original equilibrium molecule! As the limit graph is different from the original equilibrium graph, there is no contradiction to unstable intermediate molecules behaving differently from the original equilibrium graph.
More precisely: When deflecting hydrogen atoms out of their equilibrium positions towards the nearest heavy atoms, the final state is the one where these hydrogen atoms have arrived exactly at the position of the respective nearest heavy atom. It is thus a graph where, instead of the original charge $Z$ at the position of a heavy atom and the charge $1$ at the original positions of the hydrogen atoms, the charge $Z + k$ is now to be found at the original position of the heavy atom in question. Here $k$ is the number of hydrogen atoms being collapsed onto said heavy atom.
The resulting limit graph is thus a coarse grained version of the original graph. These heuristic arguments are made precise using the notion of diffusion similarity: a modified molecular graph, with hydrogen moved out of equilibrium towards the nearest heavy atom converges in the diffusion sense to this coarser limit graph.
This convergence is respected by LTF based models: Whenever a graph converges to another graph in the diffusion sense, the latent embeddings generated by LTF-based models will also converge to those of the limit graph.
Thus LTF based models are continuous from the space of graphs to the Euclidean latent space. The same is not true for other models, as Fig. 10 shows.
448: Does this mean that the distances are simply ignored in this experiment? And if yes, why would that be a good idea?
Ignoring distances would indeed not be reasonable. In this experiment, positions of hydrogen atoms are varied (individual (original equilibrium) positions are available within QM7). Modified edge weights are then calculated using the same formula as before (c.f. line 443), i.e. the Coulomb prescription
$$w_{ij} = \frac{Z_i Z_j}{\|x_i - x_j\|},$$
with updated positions $x_i$ for the hydrogen atoms.
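As a toy illustration (ours; the Coulomb-matrix formula above is our reconstruction of the prescription referenced at line 443, and the geometry is hypothetical), the relevant edge weight blows up as a hydrogen atom is deflected towards the nearest heavy atom:

```python
import numpy as np

def coulomb_weight(Zi, Zj, xi, xj):
    # off-diagonal Coulomb-matrix entry w_ij = Z_i Z_j / ||x_i - x_j||
    return Zi * Zj / np.linalg.norm(xi - xj)

Z_C, Z_H = 6.0, 1.0                         # carbon and hydrogen nuclear charges
x_C = np.array([0.0, 0.0, 0.0])
x_H_eq = np.array([1.1, 0.0, 0.0])          # hypothetical equilibrium C-H geometry

for s in [0.0, 0.5, 0.9, 0.99]:
    x_H = (1.0 - s) * x_H_eq + s * x_C      # deflect H towards the heavy atom
    print(f"deflection s = {s:4.2f}   w_CH = {coulomb_weight(Z_C, Z_H, x_C, x_H):8.2f}")
```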
How was the QM7 target chosen and why only this single target?
One of the contributions of our paper is to provide transferability guarantees in the setting of small to medium sized graphs. Here, existing approaches (which rely on auxiliary ambient objects to establish transferability) are not applicable. To showcase the applicability of our theory to this realm of smaller graphs in practice, we need to make use of a dataset consisting of smaller graphs. This is true for QM7.
Additionally, QM7 is a molecular dataset. As argued above (c.f. the discussion in lines 378 - 387), the notion of (chosen or varying) resolution scales is natural to the description of physical systems.
Finally, QM7 (as opposed to e.g. QM7b or QM9) directly provides adjacency matrices (in the form of Coulomb matrices). Hence spectral methods – as investigated in our paper – are directly applicable here without further design choices or pre-computations.
QM7 is a single-target dataset, with the sole target being atomization energy. This sole target is thus the target for which we report results in Table 1.
Thank you for your thorough review as well as your insightful questions! We were very glad to read that
The paper is well-written and easy to read, especially for a theory-heavy paper.
And that
[...] the suggested diffusion-based distance and its use for graph learning seem novel and interesting
Let us address the raised points individually:
35ff: small changes may have a big effect on the desired label, e.g. when computing the parity function (but also in chemistry applications). It is (to me) not clear how this transferability should be happening.
There certainly exist situations where transferability between different graphs is not a desirable quality. This is especially the case when the exact articulation of a graph is important; e.g. when the label is given as the parity of the set of edges or nodes as the reviewer observed.
However, there also exist situations where the exact articulation of a graph is not important. This is for example the case when graphs are used as a means to describe a given more fundamental object. Examples include different graphs discretizing the same manifold, graphs describing the same entity at different resolutions (e.g. a social network expressed at the level of individual users vs. the level of interacting communities) or graphs drawn from a common statistical distribution. In these settings, a model should not overfit on the exact graph articulation at hand, but rather learn to represent information about the more fundamental underlying entity. In other words, the model should be transferable between the graphs that describe this same underlying entity.
166: it would be nice to have a formal definition of J here.
We are glad to follow this advice. We have added a formal definition and a surrounding discussion of $J$ in our updated manuscript.
202: how to find this mapping? It would be nice to elaborate a little here.
We checked and are unfortunately uncertain which mapping you are referring to, as line 202 does not contain (the description of) any mapping. Could you maybe clarify? We would be happy to provide further details. The closest linear mappings to line 202 in our manuscript are the projection and interpolation operators $J^{\mathrm{down}}$ and $J^{\mathrm{up}}$. These are rigorously introduced from line 192 onwards. Heuristically, $J^{\mathrm{down}}$ acts as an averaging operator. Its natural emergence in this setting may be understood from the perspective of heat flow: In the physical setting of heat conductance over a graph, the temperature in a strongly connected subgraph will equalize very fast. This subgraph will behave as a single entity, and the temperature of this single entity (i.e. super-node) will be given as the average temperature of the original nodes. This assignment of the respective average value to the super-nodes in question is facilitated by $J^{\mathrm{down}}$. The interpolation $J^{\mathrm{up}}$ is simply the adjoint mapping of $J^{\mathrm{down}}$.
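As a concrete illustration of this heuristic, the following minimal NumPy sketch (ours; the rigorous definitions live in line 192 ff. of the paper) constructs the averaging operator and the corresponding interpolation for a given partition of nodes into super-nodes. The names and normalization are assumptions made for illustration.

```python
import numpy as np

def coarsening_operators(partition, n_nodes):
    """partition: list of index lists; partition[k] contains the original
    nodes that are collapsed into super-node k."""
    n_super = len(partition)
    J_down = np.zeros((n_super, n_nodes))  # averages node values per block
    J_up = np.zeros((n_nodes, n_super))    # copies super-node values back
    for k, block in enumerate(partition):
        J_down[k, block] = 1.0 / len(block)
        J_up[block, k] = 1.0
    return J_down, J_up

# Usage: super-node features are the averages of the collapsed nodes' features.
X = np.random.rand(5, 3)                      # features on 5 original nodes
J_down, J_up = coarsening_operators([[0, 1], [2, 3, 4]], n_nodes=5)
X_coarse = J_down @ X                         # shape (2, 3)
assert np.allclose(J_down @ J_up, np.eye(2))  # averaging after copying is the identity
```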
286: why 10^0 instead of 1?
We used this notation to foster notational consistency with the discussion in the experimental section. There we use the notation '$10^0$' when discussing transferability errors (c.f. lines 413, 416, 417). However, we are of course very happy to write '$1$' in place of '$10^0$' in line 286 and have now done so in the updated version of our manuscript.
307f: this sentence is hard to parse, especially the end after "importantly"
Thank you for this feedback! Following this comment, we have now broken up this sentence into smaller ones, in order to foster readability.
344: J^up J^down (not down, down)
Thank you for spotting this typo! We have now corrected it in the updated version of our pdf.
376: QM7 contains small molecules only, this may have an effect on how much diversity is in there.
There certainly exist much larger and more diverse molecular datasets (such as GDB-13, of which QM7 is a subset). QM7 as a dataset contains all organic molecules of up to 23 atoms (c.f. http://quantum-machine.org/datasets/). Since our objective in using this dataset is not to contribute to the state of the art in molecular property prediction, but rather to showcase our theoretical transferability results in an example setting, we believe that the scope of QM7 is perfectly adequate for our purposes here. Additionally, one of the contributions of our work is to provide an approach to transferability in the setting of small to medium sized graphs, where existing approaches (reliant on auxiliary ambient objects to establish transferability) are not applicable. To showcase the applicability of our theory to this realm in practice, we actually need to make use of a dataset such as QM7, consisting of smaller graphs.
Do the results generalize to QM9 too? (e.g. between the two datasets) And how about other targets?
Following this feedback, we have now additionally also conducted experiments on QM9. We currently provide preliminary results in Appendix J.3 of our updated manuscript and will include full results (on all targets and for all architectures) in the camera-ready version of our paper.
We find that here too, Laplace transform filters minimize the transferability error; most often by a substantial margin.
Regarding the question of generalization between QM7 and QM9 we note: As QM7 might be considered a subset of QM9, we would expect a certain amount of generalization from the former to the latter. While this is indeed an interesting research question, generalization in this sense is however not the focus of our present work, as it does not fall into the realm of transferability. As such we did not consider this setup in our experimental section.
How is "transferability" between dissimilar graphs avoided? And is this shown somewhere? (as its part of the conclusion)
In Section 5.2, we conduct an experiment where we replace individual nodes in a graph with fully connected cliques. As we see in Fig. 12, this severely negatively affects the performance of typical models: Their corresponding node classification accuracy deteriorates as clique size is increased. As we argue (c.f. line 481 ff. and in more detail in Appendix H), this occurs because these standard models are more and more transferable (as the clique size increases) between the given graph and a version of this graph where the connections between individual cliques are severed. Information propagation within these models thus occurs as if the underlying graph were disconnected. This precludes an information exchange between the respective cliques and explains the deterioration in classification accuracy.
In contrast, LTF-based networks retain a high classification accuracy. These networks retain the ability to transfer information also between individual communities. In other words, they are not transferable between the given graph and a disconnected version of it. This is the case because, from a diffusion perspective, these different graph structures are not considered similar, as we discuss in Section 3.1 (c.f. especially Fig. 4).
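For readers wishing to reproduce the setup, the following is a hedged sketch of one way to realize the node-to-clique replacement; the even splitting of the replaced node's edge weights is our assumption, and the exact construction used in Section 5.2 / Appendix H may differ.

```python
import numpy as np

def blow_up_node(A, v, k):
    """Replace node v of the weighted graph with adjacency A by a fully
    connected k-clique; v's former edge weights are split evenly."""
    n = len(A)
    keep = [i for i in range(n) if i != v]
    B = np.zeros((n - 1 + k, n - 1 + k))
    B[:n - 1, :n - 1] = A[np.ix_(keep, keep)]  # untouched part of the graph
    B[n - 1:, n - 1:] = 1.0 - np.eye(k)        # internal clique edges
    for new_i, old_i in enumerate(keep):
        w = A[old_i, v] / k                    # distribute v's old edges
        B[new_i, n - 1:] = w
        B[n - 1:, new_i] = w
    return B

# Example: replace node 0 of a triangle graph by a 4-clique (6 nodes total).
A = np.ones((3, 3)) - np.eye(3)
B = blow_up_node(A, v=0, k=4)
```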
I thank the authors for the detailed and helpful answers! While I might not be fully in agreement about what exactly are the kinds of representation differences for which the model should work (and those for which it shouldn't), I think the overall picture is very clear and helpful. I also appreciate the additional experiments and further changes in the paper, I think this does clearly improve it further.
400ff: what are the features of the "coarse" nodes? What kind of information is still there?
The features associated with the coarse-grained nodes are the average features of the original nodes that are being collapsed onto the respective super-node. We have included a detailed discussion of this in Appendix G.1 (c.f. line 1948 ff.).
412: is this result surprising in any way?
As we are conducting experiments considering graphs expressed at different resolutions, we wanted to check whether models that naturally already employ an information propagation scheme utilizing different resolution scales are able to transfer between such graphs at different resolutions. As such models utilize not only the graph at its original scale but also at a coarser scale when generating latent embeddings, one might initially be tempted to think that such models would better handle transferability between different resolutions. As our experiment shows (c.f. Table 1), this is however not the case.
Questions: Why would you expect a network trained on molecules without H to generalize to graphs containing H atoms? Which directly leads to the question: why would such a behavior be desirable? (it's different for the torus, there the setting is clear)
The same physical system may be expressed and described at different resolution scales. In theoretical physics, this is famously formalized into the celebrated notion of the renormalization group.
In the (molecular) setting at hand, there are various levels of resolutions at which a molecule as a physical system may be expressed.
We may express it at the level of interacting atomic nuclei, as is done in the QM7 dataset. Fundamentally, it may however also be described at a finer scale, at the level of the interacting protons and neutrons making up the individual nuclei.
At an even more fundamental level, we may vary the resolution scale of our physical description further and split up protons and neutrons into individual quarks.
All of these descriptions encode the same underlying physical system. As such, each description should yield (approximately) the same result when computing aggregate quantities such as e.g. the molecular atomization energy.
In our experiment in Section 5.1, we now do not train on graphs without hydrogen atoms.
Instead, we train on a coarsified version of QM7, where in comparison to the articulation in the original QM7 data we have lowered the resolution scale in the description of the individual molecules (c.f. also line 381 ff.): Instead of only aggregating the individual protons and neutrons inside each atom into a single node (as is done in QM7), we now additionally aggregate each heavy atom together with all its surrounding (single-proton) hydrogen atoms into a single node.
The resulting graph still describes the same underlying molecule, just at a different resolution scale. Since fundamentally, the same molecule is still being described, we would want a model (that is transferable between resolutions) to provide the same (or very similar) predictions for this very same molecule.
More generally, a model that is transferable from a coarse- to a fine-resolution scale may be trained on a lower resolution scale while still performing very well when confronted with high-resolution data during inference. This might find applications in settings where only noisy or low-quality data is available for training. Additionally, it may open up avenues to train models much more economically and less resource-intensively on smaller graphs, while then successfully being able to deploy them on larger graphs.
In the opposite direction, transferability from fine to coarse would allow models to be trained on high-quality training data (such as e.g. synthetic data obtained from DFT calculations in the chemistry setting), benefiting from all the high-resolution information available in this data, while not losing performance when eventually being deployed on experimentally obtained lower-quality test data.
The experiment where hydrogens are deflected out of their equilibrium position is also interesting: I would expect unstable atoms (i.e. the ones with deflected hydrogen) to behave quite differently from their "normal" counterparts,
We are very glad that you found our experiment to be interesting! We might think of molecules where hydrogen atoms have been deflected out of their equilibrium position to be energetically excited. Hence they would indeed have (at least slightly) different properties than their equilibrium counterparts.
Thank you for your appreciation of our efforts! We are glad to read that we were able to further improve our work by incorporating your feedback!
With the modifications provided:
Do you feel like our work is a good paper that should be accepted at the conference?
We understand that modifying scores is completely at the discretion of reviewers and we have no right to any other score than the one you assign!
If you should however feel that our work is good and should be accepted, ICLR reserves a particular score for that sentiment. In that case, we would be extremely grateful if you could reflect this sentiment in the numerical quantity of 'score' that the fate of a paper so often comes down to.
Should you choose to do so, that would of course be absolutely voluntary on your part; we completely understand that!
The paper introduces a methodology to show the transferability of GNNs based on the Laplace-Transform. The paper is relevant and the topic is important. The paper is very well written and easy to follow. The main issue with the paper is the lack of intuition that it derives given that it does not use a limit object, and therefore it has to introduce the directional similarity. Also, the bounds are asymptotic, and therefore cannot be compared with any existing method (graphons, graphops, etc). Finally, the experiment section is vague when stating high and low resolution instead of varying a numerical quantity.
Strengths
- The paper is relevant and the topic is important.
- The paper is well written.
Weaknesses
I believe there are 3 issues with the paper: the asymptotic limit model, the bound, and the implications.
Regarding the asymptotic model, the authors consider a graph that grows in the number of nodes (L) with no particular structure. This immediately removes any form of intuition or interpretation from the graph limit. It can essentially be anything. I believe this is a limitation in the results, given that the lack of structure makes the results less interpretable.
With respect to the bound, the authors do not provide a rate and simply show that it converges. This limitation makes the author's results difficult to compare with existing art, and therefore the question of how this result distinguishes itself from the existing literature becomes prominent. I believe this paper is too general and therefore there are no implications for the practitioner.
Finally, I believe that the numerical experiments show that the results presented are vague. The plots are divided into high and low resolutions, but this is not a mathematical quantity. The authors should modify a continuous quantity such as the number of nodes.
Minor issue: Theorem 4.4 should have a statement, it currently reads "As proved in ...".
Questions
N/A.
Thank you very kindly for your review! We were very glad to read that
The paper is very well written and easy to follow.
and that
The paper is relevant and the topic is important.
We also would like to thank you sincerely for candidly expressing your honest concerns so that we may directly address them here. We fundamentally believe that the concerns expressed may easily be resolved and aim to do so below.
Let us address each of the three raised points in detail:
Regarding the asymptotic model, the authors consider a graph that grows in the number of nodes (L) with no particular structure. This immediately removes any form of intuition or interpretation from the graph limit.
We first provide clarifications about the asymptotic model to elaborate on the interpretation of our approach. First, note that the letter $L$ referenced above does not denote the number of nodes but rather the graph Laplacian throughout our paper. We introduce this notation at the beginning of our paper (i.e., in the background section (Section 2.1), line 85) and also reiterate it in our table on notational conventions (Appendix A, line 777). The number of nodes is denoted by a separate variable, introduced in line 75 of our background section.
Second, we would like to clarify the structural implications of our notion of transferability in this setting. We point the reviewer to line 488ff in Section 5.3, which states that
"The concept of operators capturing the geometry of underlying spaces also applies to manifolds. [...] This is hence is a prime setting for studying transferability."
That is, we consider transferability between graphs approximating the same underlying manifold. Thus, a convergent sequence of graphs will possess increasing structural similarity with the underlying manifold. This provides the intuitive notion of structure that your concern referred to. Indeed, not only is this an intuitive notion of a graph limit, but we also provide a mathematically completely rigorous discussion of how graphs are taken to approximate manifolds in Appendix F.2. We refer the reader to this detailed mathematical discussion in line 500f of Section 5.1. For further conceptual clarity, we also provide in Fig. 13 a visualization of the (regular grid) graph discretizations considered in our experiments.
We thank the reviewer for this comment as the setting of graphs discretizing the same manifold is an important example setting for our theoretical analysis of transferability between structurally similar graphs. We emphasize this meaning further in the updated manuscript, where we replace occurrences of the wording "ambient space" with "manifold" to avoid any ambiguity.
Minor issue: Theorem 4.4 should have a statement, it currently reads "As proved in ...".
Following the reviewer's comment, we have now changed the formulation of Theorem 4.4 in our updated manuscript
from
'As proved in Appendix E.3, we have: [statement that was proved]' .
to:
'As we prove in Appendix F.3, we find for the transferability of a single filter that: [statement that was proved]'.
Dear Reviewer i3Zf, as the discussion period is ending today and we have so far not heard back from you, we wanted to kindly follow up and ask whether any concerns remain. We would be happy to address any such remaining points, should they exist!
All the best, The authors of submission 279
The plots are divided into high and low resolutions, but this is not a mathematical quantity. The authors should modify a continuous quantity such as the number of nodes.
We gladly provide further discussion of the meaning of the terms 'high and low resolution' in the experimental section: Regarding the results of Table 1: Here we collect results for models trained on the original high-resolution QM7 dataset and evaluated on a (coarsified) low-resolution version of this dataset. We also report results for the opposite pairing of models trained on low-resolution QM7 and evaluated on the original high-resolution QM7 dataset. We rigorously discuss how the coarsified low-resolution QM7 dataset is obtained (c.f. line 381 ff. of the relevant Section 5.1 as well as Appendix G.1, referenced in Section 5.1). We also included both datasets (the original high-resolution QM7 dataset as well as the coarsified low-resolution QM7 dataset) in our submitted supplementary material.
We agree with the reviewer that varying a numerical quantity (like the number of nodes, as suggested) can indeed provide valuable insights into transferability properties of networks. For this reason our original manuscript provides the following three experiments which vary various numerical quantities (e.g. the number of nodes):
In Fig. 14 we vary the number of nodes in graphs approximating the same manifold and discuss corresponding effects on transferability.
In Fig. 12 we vary the number of nodes inside cliques. Here we show that standard architectures cannot deal with such a situation, while our theory allows us to design models that handle this setting well.
In Fig. 10 we continuously vary edge-weights inside subgraphs. We use this to show that standard models are not continuous as maps from the space of graphs to Euclidean latent spaces, while models conforming to our theory (i.e. LTF-based models) are continuous.
In our updated manuscript, we have now also included the axis labelling of Fig. 14, which was previously still missing.
We sincerely hope that we were able to clear up any misunderstandings or doubts. Should this not be the case, please do not hesitate to get back to us; we would be more than happy to provide additional details or incorporate any further feedback!
With respect to the bound, the authors do not provide a rate and simply show that it converges. [...] I believe this paper is too general and therefore there are no implications for the practitioner.
The reviewer expressed concerns regarding rates of convergence for transferability. We agree with the reviewer that quantitative bounds can be critical for practical relevance. However, we wish to direct the reviewer to our results in Theorems 4.4, 4.6, and 4.7. Indeed, as a main contribution of our work, we do in fact provide explicit quantitative bounds on transferability. Theorems 4.6 & 4.7 reduce network transferability to single filter transferability. Theorem 4.4 then provides explicit quantitative (even non-asymptotic) bounds on this single filter transferability.
Additionally, as the reviewer observed, we also provide results on convergence of latent embeddings: We show that if a sequence of graphs converges to a limit graph in the sense of diffusion similarity, embeddings generated by networks conforming to our theory will converge to the latent embedding of the limit graph. This is in fact predicted by our quantitative results in Theorems 4.4, 4.6 & 4.7.
We might interpret this result as LTF-based models constituting continuous maps from the space of graphs to Euclidean latent spaces. In contrast, as we show in Section 5.1, latent embeddings generated by other models will not converge. Such models are not continuous.
In our work, we indeed introduce a general and widely applicable framework for the study of transferability properties of graph neural networks. This framework is distinct from previously introduced frameworks and also applies to situations where previous works are not applicable. We fundamentally believe this to be an important and valuable contribution in its own right.
At the same time, our work is also of great practical relevance:
First and foremost, we establish (for the first time) that not all types of filter-functions will lead to transferable networks. We then completely characterize the class of filters that allow for transferability. This enables the practitioner to actually design networks that will be able to transfer in practice.
We also provide a discussion of the reasons why architectures not conforming to the developed theory are unable to transfer in the considered settings (c.f. Appendix H).
Counter to other approaches, the transferability framework we introduce is not dependent on any underlying limit object. This hence allows us to provide transferability guarantees also in the setting of small to medium size graphs.
Our transferability framework is applicable in important settings not covered by previous approaches, such as e.g. the inclusion of subgraphs (c.f. e.g. Fig. 5 or Appendix D) and the setting of graphs on different resolution scales (c.f. e.g. Section 5.1).
The developed transferability framework also directly extends to directed graphs (c.f. e.g. Appendix F.3), a setting previously hardly acknowledged in the literature, but of great practical relevance (c.f. e.g. https://arxiv.org/abs/2305.10498).
Following this feedback, we now reference practical implications in the 'Contributions' section of our updated manuscript.
This paper discusses graph transferability.
First, GNN transferability is defined based on graph similarity. The authors use a motivating comparison involving Dumbbell graphs to introduce a diffusion-based similarity measure, which they extend to scenarios where graphs have different numbers of nodes. In such cases, graphs with different node sets are aligned using coarsening and interpolation.
Next, based on the similarity definition above, the authors discuss the transferability of filters on similar graphs. They find that general polynomial filters do not satisfy transferability, whereas Laplace Transform Filters (based on exponential basis functions or resolvent basis functions) do.
Furthermore, they explore the transferability of LTF-based networks on both node-level and graph-level tasks.
In the experimental section, the authors validate their findings using graphs of different resolutions and graphs sampled from the same manifold.
Strengths
- The authors' writing is logically rigorous.
- The authors propose a diffusion-based similarity measure and extend it to scenarios where graphs have different numbers of nodes.
- The authors discuss the transferability of different filters on similar graphs, finding that LTFs and LTF-based networks exhibit transferability.
Weaknesses
Please check questions.
Questions
Regarding the experiments on different graphs sampled from the same manifold:
Considering the definition of Bidirectional Transferability and the motivating example, it seems natural that the experiments on graphs of different resolutions would be successful. Therefore, I am more interested in the experiments on different graphs sampled from the same manifold. However, it seems that the authors only conducted experiments on a pair of graphs on the Torus manifold, is that correct?
Additionally, I would like to ask why the final comparison in this experiment is made using transferability error, rather than demonstrating the difference between network outputs before and after transfer?
Thank you for your rigorous review of our paper. We were happy to read that our writing
is logically rigorous, and that we were able to clearly convey one of our main messages, namely that polynomial filters do not satisfy transferability, whereas Laplace Transform Filters [...] do.
As we understand it, only questions 'Regarding the experiments on different graphs sampled from the same manifold' remained open. We are very happy to answer these:
'Considering the definition of Bidirectional Transferability and the motivating example, it seems natural that the experiments on graphs of different resolutions would be successful. Therefore, I am more interested in the experiments on different graphs sampled from the same manifold. However, it seems that the authors only conducted experiments on a pair of graphs on the Torus manifold, is that correct? '
That is correct. We derived and presented a general theoretical framework for generic graphs discretizing the same ambient space in Appendix F.2 ('Further discussion of graphs discretizing ambient spaces').
As an example, we there derived theoretical convergence guarantees for graphs discretizing the torus manifold. To experimentally verify this convergence in this example, we then conducted the experiments provided in Section 5.
Additionally, I would like to ask why the final comparison in this experiment is made using transferability error, rather than demonstrating the difference between network outputs before and after transfer?
We are not exactly certain if we understand the question correctly, as the transferability error exactly measures the difference in outputs generated by networks deployed on the respective graphs.
What the experiment at hand demonstrates is that deploying an LTF-based network on two graphs that discretize the same ambient space leads to similar outputs; with outputs becoming more similar the closer the graphs approximate the common object.
The transferability error now measures the difference between deploying the networks on the respective graphs. As such it is the metric of choice when comparing the respective outputs.
Did this answer your question? If not, please do not hesitate to get back to us; we will be glad to provide further details!
Dear Reviewer BeHs,
as the discussion period is ending today, we wanted to make sure that you are happy with our responses. If there are any other questions, please feel absolutely free to let us know and we will be happy to address them!
All the best, The authors of submission 279
This paper investigates transferability of graph filters and graph neural networks (GNNs) from a spectral perspective. The main conclusion of the paper is that graph filters and GNNs transfer if the frequency response of the filter can be written as a Laplace transform. I want to advocate in favor of this paper but to do so the authors must provide explanations of the following points:
(1) The use of Laplace transforms as frequency representations of graph filters appear in the work of Wang et al (https://arxiv.org/abs/2106.03725 and https://arxiv.org/abs/2305.18467). These papers conclude that transferability of graph filters and GNNs require conditions on a filter's spectral representation. I.e., not all filters transfer. Only those that have some specific spectral properties that affect their ability to discriminate some frequency components. This stands in contradiction to the claim in this paper that any filter that has a Laplace transform can transfer. The authors need to explain why this is not a contradiction.
(2) The work of Wang el al is related to the work of Levie et al and Ruiz et al that the authors cite. It is also related to the work of Zhou and Lerman (https://arxiv.org/abs/1804.00099) and the work of Gama et al (https://arxiv.org/abs/1806.08829 and https://arxiv.org/pdf/1905.04497) on stability properties of graph filters, GNNs, and graph scattering transforms. In all of these papers transferability and stability properties require restrictions on the spectral properties of graph filters. The statements in this paper seem to contradict this extensive literature. The authors need to explain why it doesn't.
I have recommended that the paper be rejected (3). However, if the authors provide satisfactory responses to the questions above, I will change my recommendation to accept (8).
Strengths
Problem formulation and results are interesting.
Weaknesses
Results contradict existing literature.
Questions
Please clarify points (1) and (2).
Thank you very much for your careful review of our paper!
We were very happy to read that
the problem formulation and results are interesting
and were delighted to see that you
want to advocate in favor of this paper.
Let us answer your raised questions and explain how – in contrast to other approaches – our work is able to provide transferability guarantees without having to impose spectral restrictions on the studied graph filters.
At a very high level, the underlying reason for this is a difference in how we quantify the magnitude of graph perturbations:
Previous works unanimously used a linear spectral-norm based approach, quantifying the size of graph perturbations as $\|L - \tilde{L}\|$. As we detail below, deriving results in terms of this measure of perturbation size is challenging and generically only possible when further restrictions on filter functions are imposed.
In contrast, we quantify differences between graphs using the notion of diffusion distance as originally introduced by Hammond (2013). Bounding variations in filter outputs in terms of this distance measure is much easier, as discussed in detail below.
For conceptual clarity, let us here discuss in detail the setting of a fixed number of nodes. The setting of varying numbers of nodes proceeds in complete conceptual analogy, but with added notational complexity due to the need to map between the different node sets.
When studying single filter transferability, we are then interested in bounding the quantity $\|g(L) - g(\tilde{L})\|$, corresponding to the maximal difference in outputs of a single fixed filter $g$ deployed on two different graphs with associated Laplacians $L$ and $\tilde{L}$.
Both our work and existing approaches then aim to bound this difference in terms of a distance measure $d(L, \tilde{L})$ that quantifies how different the respective Laplacians are: $\|g(L) - g(\tilde{L})\| \le C \cdot d(L, \tilde{L})$. Here $\|\cdot\|$ denotes the spectral norm and the stability constant $C$ depends on $g$ (and for some existing approaches may also depend on the spectral properties of $L$ and $\tilde{L}$).
The fundamental difference between our approach and existing works is then in which distance measure $d(L, \tilde{L})$ the difference between the two Laplacians is measured:
Existing approaches unanimously make use of the operator norm difference $d(L, \tilde{L}) = \|L - \tilde{L}\|$ between the two Laplacians to provide a measure of similarity.
The fundamental problem in deriving a bound of the form $\|g(L) - g(\tilde{L})\| \le C \cdot \|L - \tilde{L}\|$ for some constant $C$ now is that scalar Lipschitzness does not imply operator Lipschitzness: In other words, even if for scalars $a, b$ we have $|g(a) - g(b)| \le C \cdot |a - b|$, this still does not imply that we would also have $\|g(L) - g(\tilde{L})\| \le C \cdot \|L - \tilde{L}\|$.
In order to find a constant $C$ so that $\|g(L) - g(\tilde{L})\| \le C \cdot \|L - \tilde{L}\|$, it is thus not sufficient that $g$ simply be Lipschitz. Instead, further restrictions on $g$ (and potentially on the spectra of $L$ and $\tilde{L}$) need to be made, and the constant $C$ will depend on these restrictions.
In contrast to previous works, we instead do not use the norm difference $\|L - \tilde{L}\|$ to measure graph similarity. Instead, the distance measure we are considering is the diffusion distance $d(L, \tilde{L}) = \sup_{t \ge 0} \| e^{-tL} - e^{-t\tilde{L}} \|$ introduced by Hammond (2013).
The key idea here is that including the exponential into the distance metric leads to an (exponential) suppression of large eigenvalues of $L$ and $\tilde{L}$. Information encoded into these large eigenvalues (and corresponding eigenspaces) corresponds to fine-structure details of the graphs (c.f. e.g. Chung 1997).
Suppressing this fine-structure information before taking a distance measurement effectively leads to a comparison that is predominantly determined by the coarse structures within the graphs. If the rough structures within the two graphs are similar, the distance between the two graphs will then be relatively small. This metric thus considers graphs that are similar up to fine-structure variations to be close to each other; precisely the setting we are interested in when transferring filters between approximately similar graphs (see also the discussion in Section 3).
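The following self-contained sketch (ours) illustrates the point numerically: the diffusion distance is approximated by maximizing over a finite grid of diffusion times (the grid and the use of the spectral norm are our assumptions), and a perturbation that only shifts the top of the spectrum is seen to move the diffusion distance far less than it moves $\|L - \tilde{L}\|$.

```python
import numpy as np
from scipy.linalg import expm

def path_laplacian(n):
    """Laplacian of an unweighted path graph on n nodes."""
    A = np.zeros((n, n))
    i = np.arange(n - 1)
    A[i, i + 1] = A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def diffusion_distance(L1, L2, ts=np.linspace(0.01, 10.0, 200)):
    """max_t ||exp(-t L1) - exp(-t L2)|| over a finite grid of diffusion times."""
    return max(np.linalg.norm(expm(-t * L1) - expm(-t * L2), 2) for t in ts)

L = path_laplacian(20)
eigvals, V = np.linalg.eigh(L)
L_tilde = L + 0.5 * np.outer(V[:, -1], V[:, -1])  # shift the top eigenvalue by 0.5

print(np.linalg.norm(L - L_tilde, 2))   # 0.5: the full perturbation size
print(diffusion_distance(L, L_tilde))   # roughly an order of magnitude smaller
```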
Suppose now that we are given a Laplace transform filter $g(L) = \int_0^\infty f(t)\, e^{-tL}\, dt$ and want to determine whether it is transferable between two graphs. That is, we want to determine whether the quantity $\|g(L) - g(\tilde{L})\|$ is small. To this end, we may note:

$\|g(L) - g(\tilde{L})\| \le \int_0^\infty |f(t)| \cdot \|e^{-tL} - e^{-t\tilde{L}}\| \, dt \le \|f\|_{L^1} \cdot \sup_{t \ge 0} \|e^{-tL} - e^{-t\tilde{L}}\|.$
Thus in our approach, we may express the desired bound on filter outputs simply as $\|g(L) - g(\tilde{L})\| \le \|f\|_{L^1} \cdot d(L, \tilde{L})$.
As the calculation above shows, we do not need to introduce any further restrictions on $g$ or on the spectral properties of $L$ or $\tilde{L}$.
Instead, we can straightforwardly guarantee that if $\|f\|_{L^1}$ is not too large and the diffusion distance $d(L, \tilde{L})$ is small, then the quantity $\|g(L) - g(\tilde{L})\|$ is also small and the filter is thus transferable.
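This bound is easy to check numerically. The sketch below (ours, with illustrative choices throughout) uses $f(t) = e^{-st}$, for which $\|f\|_{L^1} = 1/s$ and the induced filter is the resolvent $g(\lambda) = 1/(s + \lambda)$:

```python
import numpy as np
from scipy.linalg import expm

def random_laplacian(n, p, seed):
    """Laplacian of an Erdos-Renyi-style random graph."""
    rng = np.random.default_rng(seed)
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    A = A + A.T
    return np.diag(A.sum(axis=1)) - A

def diffusion_distance(L1, L2, ts=np.linspace(0.01, 10.0, 200)):
    return max(np.linalg.norm(expm(-t * L1) - expm(-t * L2), 2) for t in ts)

s = 1.0                                              # f(t) = exp(-s t), ||f||_L1 = 1/s
g = lambda L: np.linalg.inv(s * np.eye(len(L)) + L)  # induced resolvent filter g(L)

L1, L2 = random_laplacian(10, 0.4, 0), random_laplacian(10, 0.4, 1)
lhs = np.linalg.norm(g(L1) - g(L2), 2)
rhs = (1.0 / s) * diffusion_distance(L1, L2)
print(lhs, rhs)  # lhs <= rhs: the single-filter bound holds (up to the finite grid)
```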
We believe that this is indeed an important quality of our work and hope that the tools and approaches developed in the present submission may be picked up by other researchers in the community for further research. To facilitate this, we have now included an appendix further highlighting the advantages of diffusion similarities over the standard spectral difference (Appendix C). We have also included an appendix discussing in more detail existing works on transferability and -- where applicable -- the origin of the respective spectral assumptions made in these works (c.f. Appendix B).
Below we discuss the spectral assumptions in the papers mentioned by the reviewer individually:
Stability to Deformations of Manifold Filters and Manifold Neural Networks:
Here filter functions are required to satisfy boundedness and Lipschitz-type conditions. In Theorem 2 absolute perturbations of the operator are considered, while in Theorem 3 relative perturbations are considered. In both cases the conditions on spectrum and filter functions stem from the fact that (scalar) Lipschitzness does not directly translate to operator Lipschitzness.
Geometric Graph Filters and Neural Networks: Limit Properties and Discriminability Trade-offs:
Here, instead of measuring the linear norm difference between a graph Laplacian and a manifold Laplacian (which would generically be infinite, as the manifold Laplacian is an unbounded operator), the difference of the actions of these operators on eigenfunctions is measured. After a triangle inequality argument, one term that has to be bounded in order to bound the difference in filter outputs is the difference between an eigenfunction and the respective eigenvector. The fidelity of this approximation depends on spectral separation properties (c.f. Theorem 4 ibid.), which hence leads to the requirement that the spectrum be sufficiently separated. This requirement can thus be considered an artifact of approximating each individual eigenfunction. In contrast, in our approach (c.f. Appendix F.2) the notion of approximation of the Laplacian on the underlying manifold is different: we bound the difference of the associated (bounded) diffusion operators directly. Hence we do not need to bound differences between individual eigenfunctions and eigenvectors, and hence avoid dependencies on spectral separations.
Transferability of Spectral Graph Convolutional Neural Networks
Levie's work only requires filters to be bounded and Lipschitz continuous (c.f. Theorem 17 ibid.). This is also true for Laplace-Transform Filters as considered in our work. We avoid the growth of Levie's stability constant with the number of considered eigenvalues (c.f. the discussion towards the end of page 12 ibid.) by avoiding approximations of individual eigenfunctions and instead approximating the bounded (diffusion) operator directly.
Graphon Neural Networks and the Transferability of Graph Neural Networks
Here the assumptions are that filter functions are bounded and Lipschitz continuous (c.f. AS2 on page 6; ibid.). These assumptions are satisfied by Laplace Transform Filters as well.
Graph Convolutional Neural Networks via Scattering
Here the conditions on the spectrum also arise from a Lipschitz-type approach to bounding differences. The difference in outputs is then, via a triangle inequality argument, reduced to bounding each term individually. This is done in eqs. (64) and (68) respectively, which are contingent on the spectral restrictions stated there.
Diffusion Scattering Transforms on Graphs
Here the dependence in Theorem 5.3 on the 'spectral gap' as defined before Proposition 4.1 comes from the Lipschitz type argument used in eq. (48).
Stability Properties of Graph Neural Networks
Here as well, Lipschitz-type arguments are being used (see e.g. the assumptions of Theorem 1), and hence additional restrictions on spectrum and filter functions need to be imposed.
We sincerely thank you for your comments, which are really important questions to answer to put our work in context of existing literature. We greatly appreciate that you gave us the opportunity to demonstrate our novelty with respect to these past works.
Thanks for your explanations. They are clear.
The fundamental point that I think is missing from this paper is that the analysis of high frequency components in the aforementioned references is not spurious. Convolutional graph filters that process large spectral components are unstable to small deformations of graphs. This was first pointed out in Gama et al (https://arxiv.org/abs/1806.08829 and https://arxiv.org/pdf/1905.04497) but several subsequent analyses have corroborated that this is true. Furthermore, there are simple examples that illustrate this problem, such as that of graphs with slightly perturbed edge weights.
I understand that using the diffusion distance to compare graphs renders the comparison of large spectral components moot, but this is because the diffusion distance is designed to ignore this problem. That the choice of distance ignores the problem does not mean the problem does not exist.
This difference has to be clearly spelled out in the paper. I.e., the authors must clarify that their results hold for a specific way of comparing graphs and that other ways of comparing graphs yield different conclusions. Please propose a concrete modification to address this concern and I will be happy to advocate for your work.
Thank you very much for your quick response, continued engagement and valuable feedback!
We have now implemented concrete changes to our paper to make clear that our results hold for a specific way of comparing graphs and that other ways of comparing graphs yield different conclusions:
Concretely we have made the following changes:
- At the beginning of Section 3, before introducing the notion of diffusion similarity, we now discuss how the usual spectral norm difference $\|L - \tilde{L}\|$ is adapted to capturing variations in edge weights by stating (c.f. line 120f):
This measure is especially well adapted to capture similarity under small edge variations ($\tilde{L} = L + \Delta$ with $\|\Delta\|$ small).
- After first introducing the notion of diffusion similarity, we now discuss the spectral and structural implications of using diffusion similarity as a measure of distance between graphs (c.f. line 160 f.):
As further discussed in Appendix C, the exponential suppression of high-lying spectral information renders this metric especially adept at capturing variations preserving coarse structures.
- When discussing transferability of filters and networks, we now make clear that all transferability results in the given section (i.e. Section 4) apply to graphs that are similar in the diffusion sense, and that different results hold when graphs are instead compared using the linear distance measure $\|L - \tilde{L}\|$, by stating (c.f. line 231 ff.):
A discussion of the alternative setting where instead $\|L - \tilde{L}\|$ is small is provided in Appendix E. There, different conditions on filter functions are generically necessary to guarantee transferability (Gama et al., 2019; 2020).
- Finally, in Appendix E we then discuss transferability in the setting where $\|L - \tilde{L}\|$ is small in greater detail. In Theorem E.1, we reduce network transferability to single filter transferability. We then refer to existing works on single filter transferability in line 1188f. Finally, the reviewer might also be interested in Theorem E.2. Here we provide a bound of the form $\|g(L) - g(\tilde{L})\|_F \le C \cdot \|L - \tilde{L}\|_F$, with the constant $C$ (at least in the undirected setting) completely independent of the graphs under consideration. Hence, uniform bounds on the difference in filter outputs in terms of the size of the direct difference $L - \tilde{L}$ are possible if this distance is measured in Frobenius norm as opposed to spectral norm.
We hope that we have sufficiently implemented the desired changes. If we can implement additional changes to further improve our paper, please do not hesitate to let us know; we would be happy to follow further advice as well!
Dear Reviewer DCXz,
Thank you again for your valuable and constructive feedback!
We wanted to double-check whether we have sufficiently incorporated your feedback into our work with the concrete changes (outlined in our comment above) that we have made to our PDF.
Should you want us to make further or different modifications, we would be happy to do that! Unfortunately, we can only do so until today (AoE), as afterwards OpenReview will no longer let us modify our submitted pdf.
Should you however already be content with the changes that we have made, we would be more than happy!
We do not want to rush you in any way, as we understand that you are likely busy with your own rebuttal on top of an already busy research job and it is of course absolutely at the discretion of the reviewers when to respond to authors.
If however despite this you could find the time to briefly let us know if you would prefer additional changes or are happy with the ones we have provided in order to incorporate your feedback, we would be very grateful to you.
Dear colleagues,
I am not sure you are understanding my point. Your responses are predicated on the assumption that what you are calling the "fine" structure of the graph is irrelevant. This is true in some applications but not in all applications.
For instance, recall that GNNs do not process graphs directly. Rather, they process signals supported on graphs. In this case we know that large frequency components are associated to signals with high variability as measured with respect to the topology of the graph. Your results seem to imply that filters that process this information are transferable. But this is only because the diffusion distance discounts high frequency components.
I am in favor of your paper being published. I think that the analysis is interesting. But I cannot endorse publication if the paper does not state clearly the inherent limitations of the analysis and how it differs from other existing publications. I am simply asking for a remark that states that: (1) Your analysis produces a conclusion that is counter to the conclusions of other existing papers. (2) That this is because you are measuring Laplacian distances in a different way. (3) That your results do not apply to problems where we want to process high frequency components.
Just to be clear, this is not about claiming that some analyses are better than others. Just that they are different. In the end, that the analyses are different is the only scientific statement. That one analysis is better than another is just a matter of opinion.
I would like to emphasize this last point - in order to properly understand the paper, its preconditions (namely the assumption that the coarse structure is more important than fine details, which is certainly the case in a lot of scenarios) should be very clearly stated in the introduction. I know that it is implicitly there through mentioning of the diffusion distance, but the implications are not obvious to every reader.
Dear Reviewer DCXz and dear Reviewer LEkq, thank you both very much for your continued engagement with our work, your patience and your high quality comments. We really appreciate your efforts!
Dear Reviewer DCXz, thank you very much for explaining your point to us again; which -- as we believe -- we have now understood!
We agree with what you are saying and have made corresponding modifications to our paper, which we detail below.
To be absolutely certain that we have indeed understood you correctly this time, let us first summarize the point that you are making in our own words:
- Our paper focuses on transferability between graphs that have the same overall rough structure, but that differ in fine-print articulations.
I.e. in the application settings within the scope of our theory, the overall rough structure of graphs is more important than their exact fine-print articulations.
Spectrally speaking, this is the setting where the lower-lying spectral information is more important than the high-lying spectral information.
To capture this setting, we use a certain measure of distance between graphs (i.e. the diffusion distance of Hammond et al. (2013)). This measure of distance is adapted to considering graphs to be similar when their coarse structures align. To achieve this, the diffusion distance discounts high-lying spectral information (which corresponds to the fine-print articulation of graphs), while focusing on the low-lying spectral structure.
- However, there fundamentally also exist settings where the high-lying spectral information (i.e. the exact fine-print articulation of a graph) is of crucial importance. This is the case whenever we actually do want to process high-lying frequency components. As high-lying spectral information is discounted in the diffusion distance, this similarity measure is thus ill-adapted to these settings where the fine-print structure of a graph does matter.
- This limitation needs to be clearly stated:
The results we derive in our paper apply to the setting where we do not want models to overfit on the exact articulation of a graph (e.g. because the graph at hand is just one of many ways of describing an underlying system).
The results in our paper however do not allow to draw conclusions about transferability and model performance in settings where the exact underlying articulation of a graph is crucial; or -- more fundamentally and generally -- when the processing of high-lying spectral information is important.
Following this feedback, we have now updated our manuscript to emphasize the three following points:
1. Diffusion distance as relaxation: The diffusion distance can be viewed as a relaxation of the typical linear distance $\|L - \tilde{L}\|$. Such a relaxation allows us to consider two graphs to be similar if the rough overall structures within them align, while fine-print articulations are allowed to vary. This setting captures fundamental examples such as graphs discretizing the same manifold, graphs describing the same object at different resolutions or graphs differing by edge deletions. We now present this exact characterization in the Introduction (c.f. line 49ff) of our updated manuscript.
2. Weaker similarity implies weaker assumptions: Since the diffusion distance is a weaker notion of similarity than the linear distance $\|L - \tilde{L}\|$, only weaker (spectral) conditions on filters are needed in order to guarantee transferability with respect to this (relaxed) notion of similarity. We have included this point in line 67ff. of the Introduction section of our updated manuscript.
3. Unsuitable when fine-structure details are critical: While the notion of diffusion similarity is well adapted to situations where the overall rough structure of graphs is more important than their exact fine-print articulations, there also exist settings where precisely the fine-structure of graphs is important to consider. Here our results do not apply. We have now made this point clear by including the following caveat in our Introduction section:
Caveat: The notion of diffusion-similarity central to our analysis below is adapted to the setting where the rough overall structure within graphs is more important than fine structure details. Utilizing such a relaxation of the standard linear distance $\|L - \tilde{L}\|$ allows to consider more relaxed conditions on filter functions than previous works (Gama et al., 2019; Wang et al., 2021) in this setting. It is however important to note that since our analysis is based on a distance notion that discounts fine-structure details within graphs, the results in our paper do not allow to draw conclusions about transferability and model performance in settings where the exact articulation of a graph is important.
Additionally, to further incorporate your feedback, we have:
1) modified our Abstract. We now write:
We introduce a new point of view on transferability of graph neural networks based on the intrinsic notion of information diffusion within graphs. This notion is adapted to considering graphs to be similar if their overall rough structures are similar, while their fine-print articulation may differ.
as well as
Spectral convolutional networks are transferable between graphs **whose overall rough structures align**, if [...]
2) made clear when introducing the notion of diffusion distance in Section 3, that it is
adept at capturing variations preserving coarse structures [...] but ill-suited for fine-structure variations [...]
(c.f. line 161 f.).
3) modified the wording of our conclusion. We now write:
We developed a novel approach to transferability based on the intrinsic notion of diffusion on graphs, which considers graphs to be similar if their rough overall structures align.
as well as
A rigorous analysis established that when the rough overall information within graphs is paramount, networks are transferable if filters arise as Laplace transforms.
We hope that we have now understood your point correctly and sufficiently acted upon your feedback! Is this indeed the case? Please do not hesitate to let us know if any point should have remained unaddressed!
Dear Reviewer LEkq,
thank you for emphasizing the importance of stating the preconditions of our paper more explicitly in our Introduction section. This is a good and valid point! As you very correctly observed, our paper is indeed concerned with transferability between graphs that have similar coarse structures, but differ in fine print articulation. It is indeed fundamentally important for the reader to understand that this is the scope of our paper.
We have now made changes to make this explicitly clear in the Introduction section of the updated version of our manuscript.
In particular, we have introduced a caveat into our Introduction section, which begins with almost the exact wording you suggested:
Caveat: The notion of diffusion-similarity central to our analysis below is adapted to the setting where the rough overall structure within graphs is more important than fine structure details. [...]
We hope that we could incorporate your feedback with these modifications. Is this indeed true? We sincerely hope so, but would be happy to act on any further input!
Dear Reviewer DCXz,
We hope you are having a great weekend and -- should you celebrate it -- had a happy Thanksgiving!
As the end of the rebuttal period is drawing close and Monday will be our last day to receive comments, we wanted to make sure we have now understood and incorporated your feedback exactly as intended.
In your response above you were
asking for a remark that states that: (1) Your analysis produces a conclusion that is counter to the conclusions of other existing papers. (2) That this is because you are measuring Laplacian distances in a different way. (3) That your results do not apply to problems where we want to process high frequency components.
To incorporate this, we have introduced a Caveat paragraph in our Introduction section (in line 65 ff.) which reads:
Caveat: The notion of diffusion-similarity central to our analysis below is adapted to the setting where the rough overall structure within graphs is more important than fine structure details. Utilizing such a relaxation of the standard linear distance $\|L - \tilde{L}\|$ allows to consider more relaxed conditions on filter functions than previous works (Gama et al., 2019; Wang et al., 2021) in this setting. It is however important to note that since our analysis is based on a distance notion that discounts fine-structure details within graphs, the results in our paper do not allow to draw conclusions about transferability and model performance in settings where the exact articulation of a graph is important.
When introducing the diffusion distance, we then also write (c.f. line 160 ff.)
The exponential suppression of high-lying spectral information renders this metric adept at capturing variations preserving coarse structures (but ill-suited for fine-structure variations; c.f. Appendix C).
We do believe we have thus incorporated your feedback. In the Introduction, we talk about coarse vs. fine structures as opposed to low- vs. high-lying spectral information, as either terminology encodes the same information and we want to avoid being too technical in the Introduction section. Should you however want us to change
the results in our paper do not allow to draw conclusions [...] in settings where the exact articulation of a graph is important.
to
the results in our paper do not allow to draw conclusions [... ] for filters that process high frequency information.
we would also happily do that!
Thank you again for your continued engagement and valuable feedback, which we believe has greatly enhanced the clarity of our paper.
As the discussion phase draws to a close, we would like to thank all reviewers again for the time and energy spent reviewing our paper!
Reviewer DCXz "is in favour of our work being published". They needed us to introduce a limitations paragraph before raising to score 8. We have introduced said paragraph in our updated pdf.
Reviewer BeHs finds our writing to be logically rigorous and judges our work to be above the acceptance threshold.
Reviewer i3Zf
- wanted us to clarify which limit object is considered in Section 5.3,
- wanted us to provide exact mathematical definitions of the notion of coarse and fine graphs,
- suggested we conduct experiments varying numerical quantities such as the number of nodes
- and provide quantitative results.
In our rebuttal we
- clarified that we consider manifolds as graph limits in Section 5.3
- highlighted the precise mathematical definition of coarse and fine graphs provided within our paper
- reiterated that we conduct a lot of experiments varying numerical quantities, including the number of nodes in various settings
- pointed the reviewer to the quantitative and non-asymptotic results in our paper
Unfortunately, Reviewer i3Zf did not engage with us during the discussion phase.
Reviewer LEkq finds our paper to be well-written and easy to read, comments that the suggested diffusion-based distance and its use for graph learning seem novel and interesting and judges our paper to be above the acceptance threshold.
Reviewer Dvw3 finds our paper to be well-written and considers our experiments to be well-conducted. They were concerned that
- we might be making use of existing tools from graph coarsening
- and suggested that other transferability approaches could also apply in the settings we consider, without limiting to certain filter classes.
As we point out in our response,
- we do not make use of tools from graph coarsening, although we believe adapting prior research to new settings would be fair, even if that were the case. Furthermore, we explain how existing notions of similarity from graph coarsening are insufficient to guarantee transferability.
- existing transferability approaches do not apply to the settings we consider. In the setting at hand, only the filters we consider and characterize will guarantee transferability, while other filters will not. Hence, as one of its results, our paper tells the practitioner that they need to restrict to the considered Laplace transform filters to guarantee transferability in the considered settings; other filter functions will not lead to transferability.
The paper proposes a diffusion-based measure for graph similarity and demonstrates that spectral GNNs with LTF achieve transferability between graphs with aligned coarse structures. Theoretical guarantees are provided, and experiments validate the approach using manifold-discretizing graphs and molecular datasets.
Several reviewers liked the novel and rigorous mathematical framework, along with practical insights for filter design in GNNs to ensure transferability. At the same time, the dependence on diffusion similarity limits applicability to settings requiring fine structural detail, as acknowledged by the authors during rebuttal. Reviewers also mentioned limited empirical exploration, in particular with respect to transferability tasks.
While the theoretical contributions are sound, the limitations in empirical validation, practical benchmarking, and narrow applicability to coarse structures led me to recommend a rejection for this paper.
Additional Comments on Reviewer Discussion
The discussion was quite long and intense between referees and authors. While some reviewers (e.g., DCXz and LEkq) acknowledged the rebuttal, others (e.g., i3Zf) did not engage further, leaving concerns unresolved. I believe the authors did a good job during the rebuttal to offer their perspective, but it is hard to champion the paper in its current form.
Reject