A Manifold Perspective on the Statistical Generalization of Graph Neural Networks
We leverage manifold theory to analyze, in the spectral domain, the generalization gap of GNNs on graphs sampled from manifolds.
Abstract
Reviews and Discussion
This paper derives generalization bounds for spectral graph neural networks, based on the assumption that the graphs are obtained from a sampling procedure on an underlying manifold. The obtained generalization bound decreases with the number of nodes and increases with the number of features and the network depth. Experimental results corroborate the dependence of the generalization gap on these factors.
Strengths
The paper is overall well written, and as far as I have checked the technical details seem to be correct. The experimental evidence that the generalization gap decreases as the number of nodes increases is interesting.
Weaknesses
I don't like the assumptions made in this paper so much. It feels to me like they make the proof rather straightforward, but are rather far from reality. In particular, how common are the GNNs defined in equation (2)? How related are the GNNs in equation (2) to GCN, for which the experiments were carried out in practice? How likely is the assumption that the graphs are sampled from a manifold, for applications of interest (e.g., for the Cora, citeSeer datasets), and why is this assumption more reasonable than the assumption that the graphs were sampled from graphons as in previous work?
Overall, in comparison with the related work of Levie and Maskey cited in the paper, I don't feel there is a significant upside. My impression is that Maskey and Levie also showed generalization gaps which decrease with the number of nodes and increase with model complexity (which in this paper can be equated with the number of features and the depth). True, these results were obtained for different GNNs and with a different limit model, but it is not clear to me why the manifold model, and spectral GNNs, are any "better"?
Questions
The main issue is the need for a more detailed discussion of Maskey and Levie. If you would like to convince me of what your contribution is over those papers, I will be happy to listen.
In addition, here are some minor points which you don't have to discuss in the rebuttal, but could address in your next version of the paper:
- Line 155: instead of "with" should be "from".
- In Definition 4, is the low-pass filter defined with the same d as the manifold? If so, why?
- Line 280: you say Assumption I makes sense; you could also explain why Assumption II makes sense (e.g., I believe it is satisfied by norm-based losses such as MAE and RMSE).
- Line 370: I am not sure I understood your point about a single graph. As far as I recall, an alternative in which several graphs are sampled from each manifold was not discussed.
- Mention in the main text which GNN you use. I understand from the appendix that it is GCN.
We thank the reviewer for their review. We will address the Weaknesses in what follows.
I don't like the assumptions made in this paper so much. It feels to me like they make the proof rather straightforward, but are rather far from reality. In particular, how common are the GNNs defined in equation (2)? How related are the GNNs in equation (2) to GCN, for which the experiments were carried out in practice?
The reviewer asks about the connection between equation (2) and common GNN architectures. Let us point out that equation (2) expresses the diffusion over the graph in matrix form; it is very general and includes GIN and GCN, among other architectures. To show that GCN and equation (2) are indeed equivalent, equation (2) can be rewritten as $X_{l+1} = \sigma\big(\sum_{k=0}^{K-1} S^k X_l H_{l,k}\big)$, where $H_{l,k}$ is the coefficient matrix associated with layer $l$ and $X_l$ is the feature matrix of layer $l$. Analogously, the GCNConv layer from PyTorch Geometric, which is based on the paper Semi-Supervised Classification with Graph Convolutional Networks, can be written as $X_{l+1} = \sigma\big(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} X_l \Theta_l\big)$ with $\hat{A} = A + I$.
Note that when $K = 2$, the two equations are equivalent. When the GCNConv layer is applied successively, the composition corresponds to a larger value of $K$ in our equation (2).
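For concreteness, here is a minimal dense sketch of this equivalence (illustrative only, not the code used in the paper); it assumes equation (2) takes the polynomial-filter form written above, with S standing in for the symmetrically normalized adjacency used by GCN and with placeholder names S, X, W, H:

```python
import torch

torch.manual_seed(0)
N, F_in, F_out = 6, 3, 4
A = (torch.rand(N, N) > 0.5).float()
A = torch.triu(A, diagonal=1)
A = A + A.t()                                             # random undirected adjacency
A_hat = A + torch.eye(N)                                  # add self-loops
d = A_hat.sum(dim=1)
S = A_hat / torch.sqrt(d.unsqueeze(1) * d.unsqueeze(0))   # D^{-1/2} (A + I) D^{-1/2}

X = torch.rand(N, F_in)                                   # input features of the layer
W = torch.rand(F_in, F_out)                               # GCN weight matrix

# One GCNConv-style propagation (nonlinearity omitted): S X W
gcn_out = S @ X @ W

# The same operation as a K = 2 polynomial filter sum_k S^k X H_k with H_0 = 0, H_1 = W
H = [torch.zeros(F_in, F_out), W]
filter_out = sum(torch.matrix_power(S, k) @ X @ H[k] for k in range(2))
assert torch.allclose(gcn_out, filter_out)

# Stacking L such propagations produces powers of S up to S^L,
# i.e. the composition behaves like a filter with a larger K.
```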
How likely is the assumption that the graphs are sampled from a manifold, for applications of interest (e.g., for the Cora, citeSeer datasets), and why is this assumption more reasonable than the assumption that the graphs were sampled from graphons as in previous work?
The assumption that graphs are sampled from manifolds is always valid in practice because it is well-known that any graph can be modeled as the discretization of a manifold. More to the point, any graph can be modeled as a sampling of a manifold embedded in a high-dimensional Euclidean space. Thus, when our experiments consider the subsampling of a graph, this is equivalent to considering the sampling of a manifold at different resolutions. This is, incidentally, an advantage of manifold models compared with graphon models.
A separate question concerns the complexity of the manifold. To explore this question, we have added a new section in the Appendix (Appendix I - Manifold Assumption) in which we plot the largest eigenvalues of each graph that we use in our node classification experiments. We see a quick decay in the magnitude of eigenvalues, which showcases the validity of our low-dimensional assumption.
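For reference, a check along these lines can be sketched as follows (illustrative only, not the code of Appendix I); the dataset loading is replaced by a random sparse adjacency matrix standing in for a citation graph:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def leading_laplacian_eigenvalues(A: sp.csr_matrix, k: int = 50) -> np.ndarray:
    """Return the k largest eigenvalues of the combinatorial Laplacian L = D - A."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(deg) - A
    vals = eigsh(L.asfptype(), k=k, which="LM", return_eigenvectors=False)
    return np.sort(vals)[::-1]   # a fast decay is consistent with low-dimensional structure

# Random sparse graph standing in for a dataset such as Cora or CiteSeer
A = sp.random(2000, 2000, density=0.005, format="csr", random_state=0)
A = A + A.T
A.data = np.ones_like(A.data)    # binarize edge weights
print(leading_laplacian_eigenvalues(A, k=20))
```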
We further note that the differences between graphon and manifold models are well known. A graphon models the limits of graphs that share structural connectivity properties as described by homomorphism densities. A manifold describes the limits of graphs that share geometric properties as described by the geometric properties of the underlying manifold. In general, manifold limits are considered more general because all graphs are samples of manifolds, as we have already stated.
Overall, in comparison with the related work of Levie and Maskey cited in the paper, I don't feel there is a significant upside. My impression is that Maskey and Levie also showed generalization gaps which decrease with the number of nodes and increase with model complexity (which in this paper can be equated with the number of features and the depth). True, these results were obtained for different GNNs and with a different limit model, but it is not clear to me why the manifold model, and spectral GNNs, are any "better"?
We agree with the reviewer that our work is not the first one to address the generalization gap of GNNs. However, we believe our work presents certain benefits over the aforementioned work by Maskey and Levie. That said, we regard the contributions of Maskey and Levie's work as invaluable and seminal research. Our work is by no means intended to challenge their work, but rather to build upon and complement their significant contributions to the field of GNN theory.
First, manifolds are more interpretable and more akin to physical systems than graphons. This is due to the fact that the manifold hypothesis is widely accepted in physical dynamics [1] and for images (the so-called image manifold) [2,3]. Second, Maskey and Levie's work only addresses graph classification, which is not the only problem that GNNs are used for. Our results work for both graph classification and node prediction, making them more general and applicable. Finally, we uniquely observe the behavior of generalization in the spectral domain, which leads to showing the trade-off between discriminability and generalization. This is a novel contribution to the field. It follows that one can improve generalization by using filters with smaller continuity constants. That is to say, one can give up discriminability to improve generalization, but this mainly sacrifices the discriminability of high-frequency features.
Regarding the Questions, we thank the reviewer for their typo corrections and overall comments. We will adopt them for the most part.
[1] Talmon, Ronen, et al. "Manifold learning for latent variable inference in dynamical systems." IEEE Transactions on Signal Processing 63.15 (2015): 3843-3856.
[2] Gabriel Peyre, “Manifold models for signals and images,” Computer vision and image understanding, vol. 113, no. 2, pp. 249–260, 2009.
[3] Stanley Osher, Zuoqiang Shi, and Wei Zhu, “Low dimensional manifold model for image processing,” SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1669–1690, 2017.
Thanks for your answer. I have some follow up questions and remarks:
- At a high level, I understand your argument that standard GNNs correspond to graph convolutions with K=2. As a suggestion for the next version of the paper (whether camera-ready or not), I think the paper would benefit from a discussion of which GNNs fall into this framework, and what the benefit is of considering K>2 (no need to answer this point).
- Could you provide a reference for the statement in the rebuttal that any graph can be realized as sampled from a manifold?
- Following up on weakness 1 by reviewer mBN7: it would seem to me that a graph filter, which is a polynomial function of the eigenvalues according to your definition, will never satisfy Definition 4, because a polynomial goes to infinity with lambda, and lambda^{-d} goes to zero. Could you explain why this is not the case?
- Regarding the empirical observation that generalization bounds decrease as the number of nodes increases: could you comment on the history of this observation? This paper provides nice numerical verification of it on common datasets (in addition to the theoretical explanation). Were similar experiments carried out beforehand?
At high level I understand your argument that standard GNNs correspond to graph convolutions with K=2. As a suggestion for the next version of the paper (whether camera ready or not) I think the paper would benefit from a discussion of which GNNs fall into this framework, and what is the benefit of considering K>2 (no need to answer this point).
As a final comment, note that the GCNConv layer can be applied more than once (and in practice it often is). That is to say, GCNConv layers are applied successively, which corresponds to a larger value of K. The GCNConv layer admits a matrix multiplication form, which is the one that we write. We agree with the reviewer that a discussion of this would be beneficial, and we plan to add it to the camera-ready version of the paper.
Could you provide a reference for the statement in the rebuttal that any graph can be realized as sampled from a manifold?
Coifman, R. R., & Lafon, S. (2006). "Diffusion maps." Applied and Computational Harmonic Analysis, 21(1), 5-30. This paper shows how to recover a manifold model for data points in a Euclidean space equipped with a graph structure.
Following up on weakness 1 by reviewer mBN7: it would seem to me that a graph filter, which is a polynomial function of the eigenvalues according to your definition, will never satisfy Definition 4, because a polynomial goes to infinity with lambda, and lambda^{-d} goes to zero. Could you explain why this is not the case?
We would first like to point out that we introduce the filter frequency response function in a general form; the polynomial function and the exponential function are examples. If one employs the polynomial function, as in our experiments, we take the restrictions in Definition 4 into account by adding a penalty term that scales with the eigenvalues, and the eigenvalues of graph Laplacians are always finite.
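To illustrate this point, the following sketch (with placeholder filter taps, not values from the paper) evaluates a polynomial frequency response only on a bounded spectral interval, which is all that is needed for a finite graph:

```python
import torch

def poly_response(h: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """h(lambda) = sum_k h[k] * lambda**k evaluated at the points in lam."""
    powers = torch.stack([lam ** k for k in range(len(h))])   # shape (K, len(lam))
    return (h[:, None] * powers).sum(dim=0)

h = torch.tensor([1.0, -0.5, 0.1])         # K = 3 polynomial taps (placeholders)
lam = torch.linspace(0.0, 2.0, steps=200)  # illustrative bounded range; e.g. a normalized Laplacian has spectrum in [0, 2]
resp = poly_response(h, lam)
print(float(resp.abs().max()))             # finite: the polynomial only blows up outside the graph spectrum
```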
Regarding the empirical observation that generalization bounds decrease as the number of nodes increases: could you comment on the history of this observation? This paper provides nice numerical verification of this on common datasets (in addition to the theoretical explanation). Were similar experiments carried out beforehand?
To the best of our knowledge, we have not found any papers explicitly showing this empirical observation of the generalization gap versus the number of nodes on real-world datasets.
Sorry for the late answer, and thanks for the response. I appreciate the empirical verification that generalization improves with the number of nodes.
There are several elements in your response which I do not find satisfying:
- I could not find in the "diffusion maps" reference you provided any justification for the claim "The assumption that graphs are sampled from manifolds is always valid in practice because it is well-known that any graph can be modeled as the discretization of a manifold." I suggest you provide a more specific pointer for the claim, or retract it.
- Regarding "we introduce the filter frequency response function as a general form while the polynomial function and exponential function are examples." This claim does not seem to me consistent with the wording in the paper. The wording can be changed, but ultimately the case of polynomial graph filters is the common case, certainly for GCN, used for the experiments in the paper, and for other standard MPNNs where the response function is linear in the eigenvalues.
- The claim that graphs have a finite number of eigenvalues is of course correct, but I don't see how it helps your case, where your proofs aim towards approximating a manifold with unbounded eigenvalues, so it seems like the graph size, and the graph eigenvalues, are going to infinity. Overall, I feel that there is a mathematical issue here that needs to be fixed and not swept under the rug: "is the theoretical analysis applicable to standard MPNN"? In my opinion, you should fix this and submit to the next conference... I will retain my rating.
It is unfortunate that we can't find common ground. We also find it unfortunate that our answers are not satisfying.
Our results indeed show that polynomial filters do not generalize for manifolds that have unbounded eigenvalues. Filters that generalize require frequency responses that are flat at high frequencies. This can only be achieved by filters of infinite length. Further, Definition 4 is proposed for general filter functions, and if we want polynomial filters to satisfy it, the filter parameters that scale the polynomial terms need to adapt to the eigenvalues, which is what the implemented penalty term achieves. This technical conclusion is not wrong. There is nothing swept under the carpet.
The reviewer then asks whether our results apply to "standard MPGNNs." They do so to the extent that arbitrarily large eigenvalues do not matter in practice and are not necessary to discriminate. Again, nothing is swept under the carpet here. The reviewer disagrees about the applicability of an asymptotic result to a finite regime.
Finally, the reviewer is asking us to retract the claim that all graphs can be modeled as manifold samples or to provide a reference. We did not provide an explicit reference for this statement. The paper we provided, 'Diffusion Maps,' shows how to recover a manifold model for data points in a Euclidean space equipped with a graph structure, which conversely suggests that for any graph one can recover a manifold model from which the graph can be viewed as sampled.
In any case, in our paper we do not require this, as we explicitly explain how to construct the graph (Definitions 2 and 3), ensuring no details are hidden. We would like to leave this discussion out of the realm of the paper, as it is a tangential discussion of a fact that we do not use in the paper. Let us go back to the discussion of why the generalization analysis with the manifold model is a meaningful contribution compared with the generalization analysis with the graphon model, as we stated before:
First, manifolds are more interpretable and more akin to physical systems than graphons. This is due to the fact that the manifold hypothesis is widely accepted in physical dynamics and images (the so-called image manifold). Second, Maskey and Levie's work only addresses graph classification, which is not the only problem that GNNs are used for. Our results work for both graph classification and node prediction, making them more general and applicable. Finally, we uniquely observe the behavior of generalization in the spectral domain, which leads to showing the trade-off between discriminability and generalization. This is a novel contribution to the field. It follows that one can improve generalization by using filters with smaller continuity constants. That is to say, one can give up discriminability to improve generalization, but this mainly sacrifices the discriminability of high-frequency features.
We would like to thank the reviewer for their comment. As the reviewer pointed out, we make novel contributions both theoretically and empirically.
Thanks for your response. To reiterate, I think this submission could be stronger with an exact explanation of which GNN models the argument applies to, and rigorous arguments showing that "large eigenvalues don't matter in practice" (e.g. by assuming f depends only on small eigenvalues, which you already did) or "regularization helps". As I understand it, the current results do not apply to GCN, since it is not a low-pass filter, and it is not clear to which GNN model they do apply. Some interesting related questions are:
- is there a regularization which can be applied to polynomial models like GCN, so that you do get your generalization bounds?
- Are there better GNN models in terms of generalization, e.g., ones whose frequency response is not a polynomial in lambda?
To reiterate, it is unfortunate that we can't reach an agreement. We have answered your questions, even if you do not like our answers.
(i) A GNN generalizes only when the filters have frequency responses that are flat at the high end of the spectrum. This does imply that the filters must have an infinite number of taps.
(ii) In particular, our results do apply to GCNs consisting of filters with a finite number of taps. They imply that these networks do not generalize.
(iii) In practice, we apply our results to networks with a finite number of taps. The insights are valid even if the formal analysis does not hold. This is, by the way, almost always true of theoretical analysis. The theory depends on stylized assumptions not intended to replicate reality, but rather to abstract its fundamental aspects.
This paper investigates the statistical generalization capabilities of Graph Neural Networks (GNNs) from a manifold perspective. The authors propose a theoretical framework proving that when graphs are sampled from manifolds, the GNN's generalization error bound decreases logarithmically with the number of nodes and increases linearly with the spectral continuity constants of filter functions. The theory applies to both node-level and graph-level tasks, providing guidance for practical GNN design. The theoretical results are validated through eight real-world datasets.
Strengths
1. First to analyze GNN generalization from a manifold perspective, providing a new theoretical viewpoint that is more relevant to practical applications compared to previous work.
2. Not only presents a theoretical framework and generalization bounds, but also validates the theoretical results through extensive experiments and provides practical guidelines for GNN design.
Weaknesses
- The paper requires input signals to be bandlimited (Definition 1) and filters to be low-pass (Definition 4). Please clarify that these assumptions are not difficult to satisfy in practical applications.
- The proposed Gaussian kernel-based graph construction method (Definition 2) requires computing the full adjacency matrix, which is computationally expensive for large-scale graphs with O(N^2) time complexity.
- The theoretical results heavily depend on the choice of spectral continuity constant C_L, but the paper lacks concrete guidance on how to select appropriate C_L values in practice.
Questions
Please refer to weaknesses.
We thank the reviewer for their positive review of our work.
Regarding Weakness 1, these assumptions are indeed not difficult to satisfy in practice. Definition 1 is practical to satisfy because, in practice, the graph size is finite and the input signal possesses finite energy, so the spectrum of the input graph signal is also bounded. Furthermore, considering that high-frequency components are often associated with noise or fluctuations without useful information, preprocessed data in many practical applications are commonly bandlimited. For Definition 4, we would like to clarify that in the experiments we add a regularizer to ensure that the filters in the GNNs satisfy the conditions in Definition 4. To do so, we compute equation (108) in the appendix as a penalty term that we add to the loss function during training as a continuity regularization, with the regularization weight corresponding to the "regularizer" values shown in Figure 7.
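As a rough illustration of this training objective (the exact penalty is equation (108) in the appendix, which we do not reproduce here), the structure is as follows; continuity_penalty is an illustrative stand-in based on the slope of each layer's frequency response, and coeffs, lam, and gamma are placeholder names:

```python
import torch

def continuity_penalty(coeffs: list, lam: torch.Tensor) -> torch.Tensor:
    """Illustrative proxy: the largest slope of each layer's polynomial response over lam."""
    penalty = torch.zeros(())
    for h in coeffs:                                          # h: (K,) filter taps of one layer
        powers = torch.stack([lam ** k for k in range(len(h))])
        resp = (h[:, None] * powers).sum(dim=0)
        slopes = (resp[1:] - resp[:-1]).abs() / (lam[1:] - lam[:-1]).clamp_min(1e-8)
        penalty = penalty + slopes.max()
    return penalty

def regularized_loss(task_loss: torch.Tensor, coeffs: list,
                     lam: torch.Tensor, gamma: float) -> torch.Tensor:
    # gamma plays the role of the "regularizer" value varied in Figures 7 and 8
    return task_loss + gamma * continuity_penalty(coeffs, lam)
```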
Regarding Weakness 2, we construct the graph in a theoretical way, but in practice, we work with existing graphs. That is to say, we propose a way of constructing graphs from manifolds which is indeed expensive, but in practice, we assume that the graph is already sampled from an underlying model. Furthermore, leveraging the generalization result of GNNs, a finite-sized graph can effectively generalize to any points or graphs sampled from the underlying manifold. This eliminates the need to directly compute over a very large graph representing the manifold, thereby avoiding the significant computational complexity associated with such operations. This is also an insight into our generalization results. We added a section to the appendix (Appendix I) where we show that the assumption that graphs come from an underlying low-dimensional manifold holds in practice.
Regarding Weakness 3, the result indeed depends on the spectral continuity constant C_L. Note that in practice, a regularizer can be added to the loss such that this constant is reduced. We have included an experiment (Figures 7c and 8d) where we vary the value of the regularizer (a larger regularizer implies a smaller spectral continuity constant) and show that our theoretical results indeed manifest in practice. The focus of this work is to highlight this observation of the trade-off between discriminability and generalization, uniquely analyzed in the spectral domain. This is a novel contribution to the field. Additionally, we recognize that it would be an interesting extension to investigate methods for selecting an appropriate continuity constant to achieve an optimal balance, ensuring both satisfactory discriminability and generalization.
This paper considers the generalization of GNNs when the data lies on a manifold. The problem is to predict a target value from a one-dimensional graph signal on a compact Riemannian manifold. Two types of models are considered: the Manifold Neural Network (MNN) induced by the Laplace-Beltrami operator on the manifold, and the GNN constructed on a finite number of i.i.d. sampled data points using a graph kernel. The paper evaluates the generalization performance of GNNs for the two similarity-graph constructions considered (the Gaussian kernel-based graph and the alternative kernel-based construction), respectively. Here, the generalization performance is defined as the difference between the statistical risk, in the sense of the average risk over the manifold when applying the MNN, and the empirical risk of the GNN on a finite number of sampled points. In particular, the rate of the generalization gap with respect to the number of sampled data points, and its dependence on the dimension of the underlying manifold, are derived. Furthermore, the generalization gap for node classification tasks is derived when data points with different labels are assumed to lie on different manifolds. In numerical experiments, GNNs are applied to node prediction tasks on synthetic and real networks, and to node classification tasks on point clouds. The dependence of generalization on the number of data points and the spectral continuity constant is also evaluated.
Strengths
- A single methodology can be applied to the analyses of both regression and classification tasks. Furthermore, for classification, the analysis can be extended to the case where data points with different labels are on different manifolds.
- This paper provides detailed background knowledge on spectral theory on compact Riemannian manifolds, in particular the Laplace-Beltrami operator, and MNNs based on it. It is intended for those who are not familiar with the field, making it accessible to them.
Weaknesses
- I have a question about the significance of Theorem 1, especially about the dependence of the upper bound on the spectral continuity constant.
- If I understand correctly, the definition of generalization performance in theoretical analysis differs from that in the usual statistical learning theory.
- I have a question about whether the generalization gap in numerical experiments appropriately corresponds to that in the theoretical analyses.
Questions
- L.261: Is it correct to assume that the feature vector (the input to the first layer of the GNN or MNN) at each point is 1-dimensional because the input signal is a scalar-valued signal on the manifold?
- l.293: I think this quantity is different from the generalization gap in the usual statistical machine learning theory. Usually, the generalization gap is the difference between the empirical and statistical risks of the same model (i.e., hypothesis). In contrast, this paper uses GNNs for the empirical risk and MNNs for the statistical risk: when a new data point is sampled, the prediction value for it is computed using the MNN, not the GNN. This setting may be justified if we interpret the class of MNNs as the hypothesis space. However, I do not think it is a very popular setting.
- l.308: "Theorem 1 shows that the generalization gap decreases approximately linearly with the number of nodes in the logarithmic scale." I want to clarify what this means mathematically. Does it mean that the main term of the bound decays as a negative power of the number of nodes, up to logarithmic factors (i.e., in the O-notation that ignores logarithmic orders)?
- l.311: "More importantly, we note the bound increases linearly with the spectral continuity constant [...]." I have a question about this claim. In Theorem 1, the left-hand side is uniform with respect to the hypothesis space, that is, independent of the specific hypothesis. On the other hand, the upper bound on the right-hand side depends on the specific hypothesis (or specific MNN/GNN) through the spectral continuity constant C_L. In particular, let C_L^* be the infimum of the spectral continuity constant over the hypothesis space. The inequality obtained by replacing C_L with C_L^* on the right-hand side must also hold and is a tighter uniform bound independent of the specific hypothesis.
- l.378: In Figure 4, the linear fitting is applied only on a region with a training accuracy of over 95%. However, I wonder whether this is justifiable. In fact, since the previous argument for l.311 provides a tighter uniform bound on the hypothesis space, the inequality holds regardless of the training accuracy.
- l.378: Related to the previous, as the theoretical analysis applies to untrained GNNs, I suggest conducting numerical experiments using untrained GNNs to see if the implication from the theory is valid numerically for them.
- l.474: "In all cases, we vary the number of nodes in the training set by partitioning it in partitions when possible." Does this mean that when the whole dataset is divided into partitions, the training dataset is one of those partitions? If so, how is the test dataset chosen if the number of partitions is one? I also want to check whether this experiment is a transductive problem, i.e., the test dataset is available during training, and the training data is available during the test.
- l.478: In the theoretical analysis, the generalization gap is defined as the difference in risk between MNN and GNN. However, interpreting the generalization gap in numerical experiments in that way is questionable for the following two reasons: first, numerical experiments use GNN for both training and testing. Second, assuming my understanding of the previous question about l.474 is correct, it is a transductive problem, which is different from the assumption in the theory. I want to ask the authors why it is justifiable to interpret the generalization gap in the numerical experiments as in the theoretical analysis.
- l.497: If we want to make a statistical claim that there is a linear correlation, it is more appropriate to apply hypothesis testing such as a t-test or ANOVA rather than using an arbitrary threshold of 0.95.
- l.506: In the experiment in Figure 7, it is claimed that the relationship between the spectral continuity constant and the generalization gap is verified. However, the experiment does not measure the value of C_L directly and assumes that increasing the regularizer reduces C_L, which should be verified.
- l.506: Even if we do not take the tight uniform bound as discussed in the question about l.311, I still have a question about the validity of Theorem 1 in the numerical experiments. In Theorem 1, the dominant term in the upper bound when the number of nodes is sufficiently large is the last term. However, this term is independent of the value of C_L. It implies that the generalization gap should be almost independent of C_L when the number of nodes is large; this is not the case in Figure 7.
Minor Comments
- l.161: a symbol used here is undefined.
- l.165: a symbol used here is undefined.
- l.184: I think it is not appropriate for the main text to refer to Equation (23), which appears only in the Appendix.
- l.187: The fact that the Laplacian spectrum is discrete and at most countable should be proved. We can derive it from the fact that the manifold is compact.
- l.329: Since this symbol indicates the density of the manifold in the previous sections, it is not recommended to reuse it in the notation for the Laplace operator. The same applies to the graph Laplacian in l.333.
- l.334: Figure 3 is not referenced from the main text.
- l.359: a symbol used here is undefined in this section. Also, its dependence on the relevant quantities should be made explicit.
Details of Ethics Concerns
N.A.
We thank the reviewer for their questions and suggestions. We have corrected the minor comments in the updated version. In what follows we will address the reviewer's questions.
Questions
Question L.261 Yes, it is safe to assume it.
Question L.293 The reviewer points out that we are using GNNs and MNNs for the ERM and SRM, respectively. This is part of our contribution: we see the distribution of graph nodes as the manifold and, in turn, the GNN can generalize to unseen data over the manifold. In practice, it is unrealistic to operate in the continuous manifold space, and this generalization bound provides a guarantee that a GNN trained on a finite sampled graph can approximate the performance of the MNN.
Question L.308 Yes, that is true.
Question L.311 This is a good point by the reviewer. What we assume here is that the same continuity constant C_L holds over the function class, and therefore the bound is valid.
Question L.378 The reviewer makes a good point. Note that we have plotted one curve, which is the one that better approximates the points after the training accuracy is below 95%. Alternatively, we can fit the points before 95%. In both cases, the bound will be linear (in log scale) and will be an upper bound on the GA, as our theory predicts. As the reviewer points out, the plots show that the GA presents a switching behavior (which is very interesting!) that our theory does not model. This would be an interesting future direction.
Question L.378 Related to the previous We have run experiments on untrained GNN for the toy examples, and the same properties hold.
Question L.474 Yes it is the so-called transductive learning. The test set is given, we do not modify it.
Question L.478 The reviewer raises two points. The first one is that we used GNNs for both training and testing. We cannot use an MNN given that, in practice, the datasets are finite and we do not have full access to the underlying continuous model. We added a new section in the appendix (Appendix I) which shows that these graphs are low-dimensional and therefore the manifold assumption is valid. That is why we believe that running GNNs on unseen data is a valid way to estimate the performance of the MNN. The second point the reviewer makes is regarding transductive learning. Please note that our theory does accommodate transductive learning. To see this, note that the dimension of x (the number of points in the graph) and the dimension of the labeled points (the dimension of y) need not be the same. Plus, the expression in equation (11) is pointwise and the loss is over the dimension of y. In the main body of the paper, we considered these two to be the same for simplicity, but they do not need to be. By using different dimensions for x and y, the transductive learning framework is accommodated. We thank the reviewer for this point.
Question L.497 We agree with the reviewer that there might be better ways of computing the linear correlation. We would like to point out that our bound is an upper bound, which holds in practice, and the trends (increasing with the continuity constant and decreasing with the number of nodes) are supported by this correlation test.
Question L.506 The reviewer raises the point that it should be verified that increasing the regularizer decreases the value of C_L. Note that measuring the proper value of C_L is very computationally costly in practice given that we are using large graphs whose spectral decomposition requires large amounts of memory and can only be done partially. The idea behind the regularizer is that it will reduce the norm of the weights in the GNN (penalizing different layers differently), and this will translate into a smaller Lipschitz constant. Having said that, for regular MLPs it is shown that increasing the value of weight decay decreases the Lipschitz constant. The precise computation of the Lipschitz constant is itself a research topic. We do not believe that we are making a big claim here, and these experiments are a way to showcase the effect of spectral continuity on the generalization gap.
Question L.506 Related to the previous We note that there is a probability term in Theorem 1. If we take this probability to be of an order that scales with the number of nodes, the term that depends on it also takes the lead compared with the other terms. In the logarithmic scale, since C_L is a multiplicative factor, it should not change the rate, but it should change the position of the curve. The fact that a curve with a larger C_L lies above one with a smaller C_L is an indication that the theory is consistent with what is seen in practice.
Question L.261 Yes, it is safe to assume it.
OK. Then, can we extend the theory to high-dimensional feature vectors?
Question L.293 The reviewer points out that we are using GNNs and MNNs for ERM and SRM respectively. This is part of our contribution, that is, we see the distribution of graph nodes as the manifold and in turn, the GNN can generalize to unseen data over the manifold. As in practice, it is unrealistic to operate in the continuous manifold space, while this generalization bound provides a guarantee that a GNN trained on a finite sampled graph can approximate the performance of the MNN.
I understand Theorem 1 guarantees the value of MNN on a general point on the manifold using GNNs. Although the setup is reasonable, since this is not the usual usage of the term generalization gap, I suggest using other terms to represent this quantity.
Question L.308 Yes, that is true.
OK. I suggest adding more explanations about it.
Question L.311 This is a good point by the reviewer. What we assume here is that the same continuity constant C_L holds over the function class, and therefore the bound is valid.
I think the bound is valid even if C_L depends on the specific hypothesis, at the cost that the upper bound is hypothesis-dependent. However, it is also mathematically valid to assume that there exists a C_L such that Definition 4 holds for all hypotheses. I suggest making this assumption explicit.
I realize that the discussion I gave in the previous comment was inappropriate. So, I want to withdraw the original question. I am sorry for that. More specifically, the inequality replacing C_L with C_L^* does not hold in general (or at least it does not follow freely by taking the infimum of C_L over the hypothesis space). If the hypothesis space is a finite set, we can apply this discussion by using the union bound (at the cost of inflating the failure probability accordingly). However, we cannot do this when the hypothesis space is infinite. We may have to use the theory of chaining somehow. Otherwise, we need to check the proof of the Theorem carefully to see if the discussion holds even if we replace C_L with C_L^*.
Question L.378 The reviewer makes a good point. Note that we have plotted one curve, which is the one that better approximates the points after the training accuracy is below 95%. Alternatively, we can fit the points before 95%. In both cases, the bound will be linear (in log scale) and will be an upper bound on the GA, as our theory predicts. As the reviewer points out, the plots show that the GA presents a switching behavior (which is very interesting!) that our theory does not model. This would be an interesting future direction.
I thank the authors for the response. However, it does not fully resolve my question about the relationship between the theoretical upper bound and the fitted line in the experiment. I first thought that the line was an approximation of the theoretical upper bound. This approximation implicitly assumes that the gap between the GA and the upper bound (i.e., the gap between the left- and right-hand sides of Theorem 1) is small. However, when we fit the linear line to the curve, the line deviates from the curve at some points. This contradicts the assumption.
Question L.378 Related to the previous We have run experiments on untrained GNN for the toy examples, and the same properties hold.
OK. I am afraid I could not find the results of untrained GNNs in the current version of the draft. I wonder if it is available there.
Question L.474 Yes it is the so-called transductive learning. The test set is given, we do not modify it.
I thank the authors for the response. I want to clarify further that the original dataset is divided into three (training, validation, and test) datasets, and the authors further divided the training dataset into 1 to 1024 partitions. That is, are the validation and test datasets independent of the partitioned dataset?
Question L.478 [...] The first one is that we used GNNs for both training and testing. [...] We added a new section in the appendix (Appendix I) which shows that these graphs are low dimensional and therefore the manifold assumption is valid. That is why we believe that running GNNs on unseen data is a valid way to estimate the performance of the MNN.
I think the manifold assumption is the problem setting we consider in the theoretical analyses. Therefore, the fact that the manifold assumption holds only justifies applying the theoretical results to the numerical experiments and does not justify using the value of GNN as a proxy of MNN.
The second point the reviewer makes is regarding transductive learning. Please note that our theory does accommodate transductive learning. [...]
OK. I understand that the pairs of x and y represent the test dataset, and that we can add other nodes, such as the training and validation datasets, by increasing the dimensionality of x.
Question L.497 We agree with the reviewer that there might be better ways of computing the linear correlation. We would like to point out that our bound is an upper bound, which holds in practice, and the trends (increasing with continuity constant and decreasing with the number of nodes) are correct with the support of this correlation test.
I thank the authors for the response. However, I am afraid I could not understand it. The authors' response implicitly assumes that the methodology of the correlation test is correct, and this is what I questioned. I intended that the validity of the correlation test should be justified with more statistical methodology. With that being said, I think it is not a big problem because I agree with the authors that a correlation coefficient over 0.95 is empirically sufficiently large.
Question L.506 [...] Note that measuring the proper value of C_L is very computationally costly in practice given that we are using large graphs whose spectral decomposition requires large amounts of memory, and can only be done partially.
I do not think we need the spectral decomposition of the underlying graph because we can compute C_L from the polynomial whose coefficients are the filter parameters. Note that Definition 4 should hold for arbitrary lambda, not just for graph spectra (i.e., eigenvalues of the graph Laplacian).
The idea behind the regularizer is that it will reduce the norm of the weights in the GNN (penalizing different layers differently), and this will translate into a smaller Lipschitz constant. Having said that, for regular MLPs, it is shown that increasing the value of weight decay decreases the Lipschitz constant. The precise computation of the Lipschitz constant is itself a research topic.
I agree that this explanation is appropriate as an intuitive explanation for the effect of the regularizer. However, I would say it is not suitable to claim empirical justification of the role of the spectral continuity (or the Lipschitz norm) unless the relationship between the spectral continuity or the Lipschitz norm and the regularizer is shown explicitly.
We do not believe that we are making a big claim here, and these experiments are a way to showcase the spectral continuity effect on the generalization gap.
I still question whether changing the regularizer really changes the spectral continuity, in particular whether a stronger regularizer leads to a smaller spectral continuity constant. Therefore, I am not sure this experiment is appropriate to demonstrate the effect of spectral continuity.
Question L.506 Related to the previous We note that there is a probability term in Theorem 1. If we take this probability to be of an order that scales with the number of nodes, the term that depends on it also takes the lead compared with the other terms.
I thank the authors for the response. However, I have two questions about it. First, I am not sure how we can justify setting the probability parameter to such an order when applying the theory to the numerical experiment. Second, even if this operation is appropriate, I think the authors overlook the remaining term: with this substitution, the order of the term that depends on the probability parameter is significantly smaller than the last term.
We thank the reviewer for their responses. In what follows we will address the reviewer's questions.
Questions
Question L.261 Yes, we can extend the theory to high-dimensional feature vectors by employing a connection Laplacian, although that would be another work and is beyond our current research scope. We thank the reviewer for pointing out this interesting direction.
Question L.293 We will rename the term defined in equations (13) and (17) as a generalization gap via manifolds.
Question L.308 We have added more explanations on this in the comments following Theorem 1.
Question L.311 In Theorems 1 and 2 we show that there exists a C_L for any function class. We would add a restriction to the function class to make it depend on C_L, that is, C_L is the continuity constant for the function class.
Question L.378 We thank the reviewer for the responses. We note that the line was plotted to show the linear dependence of the generalization gap on the number of nodes in the logarithmic domain. This holds before and after the changing point. We did not use these fitted linear plots to show that our bound is tight.
Question L.378 Related to the previous The results for the toy example on untrained GNNs are shown in Figure 1 and we pointed this out at the beginning of Section 5.
Question L.474 Yes, the validation and test datasets are independent of the partitioned datasets.
Question L.478 Using GNN as a proxy of MNN is justified by the non-asymptotic convergence result of GNNs to MNNs in Appendix B.
Question L.497 We agree with the reviewer that there might be better ways of assessing the linear correlation, such as a t-test or ANOVA, to make our claim more rigorous. We thank the reviewer for this suggestion.
Question L.506
I do not think we need the spectral decomposition of the underlying graph because we can compute C_L from the polynomial whose coefficients are the filter parameters. Note that Definition 4 should hold for arbitrary lambda, not just for graph spectra (i.e., eigenvalues of the graph Laplacian).
As we need Definition 4 to hold for any lambda, we still need to calculate the range of lambda and cannot set it to infinity.
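For instance, assuming polynomial filters, a numerical estimate of the continuity constant over a bounded range of lambda could be sketched as follows (the coefficients and the range are placeholders, not values from our experiments):

```python
import numpy as np

def estimate_continuity_constant(h_coeffs: np.ndarray, lam_max: float,
                                 n_grid: int = 10_000) -> float:
    """Largest finite-difference slope of h(lambda) = sum_k h_coeffs[k] * lambda**k on [0, lam_max]."""
    lam = np.linspace(0.0, lam_max, n_grid)
    resp = np.polyval(h_coeffs[::-1], lam)     # np.polyval expects the highest-degree coefficient first
    return float(np.max(np.abs(np.diff(resp)) / np.diff(lam)))

# Placeholder taps and a spectral range typical of a normalized Laplacian
print(estimate_continuity_constant(np.array([1.0, -0.5, 0.1]), lam_max=2.0))
```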
I agree that this explanation is appropriate as an intuitive explanation for the effect of regularizer. However, I would say it is not suitable to claim the empirical justification of the spectral continuity and the Lipschitz norm unless the relationship between the spectral continuity of the Lipschitz norm and the regularizer is shown explicitly.
We thank the reviewer for pointing this out. We think this is the most efficient way to relate the continuity restriction theoretically and empirically.
I still question whether changing the regularizer changes the spectral continuity, in particular, the strong regularizer leads to the small spectral continuity. Therefore, I am not sure this experiment is appropriate to demonstrate the effect of spectral continuity.
The regularizer is set to control the weight of the continuity penalty term in the training loss, similar to weight decay. A larger regularizer would put more weight on the continuity constant (decided by the filter parameters), which would lead to a smaller continuity constant during the training process.
Question L.506 Related to the previous We note that in Figure 7 we are fixing the number of nodes to study the effect of the spectral continuity constant; the generalization gap would still reflect this change even if the last term is larger.
I thank the authors for the responses to my questions, and I am sorry for the late response at the very last of the discussion period. Here is a point-by-point comment on the authors' responses.
Question L.261 OK
Question L.293 OK
Question L.308 OK
Question L.311
This part looks good to me:
We would add a restriction to the function class to make it depend on C_L, that is, C_L is the continuity constant for the function class.
However, I am afraid I could not figure out what it meant:
In Theorems 1 and 2 we show that there exists a C_L for any function class.
Do Theorems 1 and 2 show the existence of C_L rather than assuming it?
Question L.378 I thank the authors for the explanation. Based on the response, I understand that the linear fitting is the approximation of GA and that there could be a gap between GA and its upper bound. On the other hand, it poses a question about the following discussion in l.485:
As predicted by our theory, the generalization gap increases with the number of features and layers in the GNN.
Theorem 1 claims that the upper bound, not the GA, scales linearly on the logarithmic scale. Therefore, we cannot infer that the GA behaves similarly if the bound is not tight (or is assumed not to be tight).
Question L.378 Related to the previous OK. I thank the authors for their kind guidance.
Question L.474 OK
Question L.478 OK. I suggest adding this justification to the main text.
Question L.497 OK
Question L.506 I thank the authors for the explanation.
The regularizer is set to control the weight of the continuity penalty term in the training loss, similar to weight decay. A larger regularizer would put more weight on the continuity constant (decided by the filter parameters), leading to a smaller continuity constant during the training process.
We often explain the effect of the regularizer in that way, and this is a good rationale for expecting that the spectral continuity becomes small as we increase the regularizer. Therefore, if we treat the above as a hypothesis, it looks OK to me. However, this is also an intuitive explanation, and it is not sufficiently precise to prove it to be true. I think there may be a discrepancy in how we justify the claim from numerical experiment results.
This work studies the generalization bound of GNNs by comparison with the bound of the corresponding manifold neural networks. The results show that the generalization gap decreases as the number of nodes increases or the spectral continuity constant decreases. The experiments justify the theory, in particular through the discussion of the regularizer used in training.
Strengths
- The paper is clear and well-written.
- The theoretical analysis is solid and correct based on my understanding.
Weaknesses
- Some definitions are not clear enough. For example, it is better to relate Eqn 16 to the expectation of Eqn 15 so that the generalization gap definition in Eqn 17 is aligned with the classical definition. Please let me know if my understanding is incorrect.
- The contribution of the trade-off between discriminability and generalization gap is weak. I find that better discriminability is interpreted as a smaller regularizer, which is indicated by a larger continuity constant. This explanation is quite indirect. A stronger analysis of discriminability could be a discussion based on some quantity to measure the discriminability.
- The theoretical contribution and the novelty of this work are not very clear. It is better to provide a table or a list to show the comparison between this work and existing theoretical works.
Questions
In Eqns 7 and 8, the same kernel symbol appears in both first steps. Is it a typo? Otherwise, isn't it an abuse of notation for two different kernel functions?
We thank the reviewer for their comments.
Regarding Weakness 1, the generalization definition we employed aligns with the classical generalization definition. Typically, generalization gaps state that for all functions in the hypothesis class, the difference between empirical and statistical errors is bounded. See the following references:
- Foundations of Machine Learning - Mohri et al. Chapter 2, Theorem 2.13
- The nature of statistical learning theory - Vapnik, Chapter 3, Theorem 3.1
The point raised by the reviewer is valid, though. Since we can view the manifold as the underlying distribution of the graph nodes, the expectation of the loss value in equation (15) over all the nodes sampled from the manifold can indeed be written as equation (16). This is because the expectation over the sampled point is taken with respect to the measure over the manifold.
Regarding Weakness 2, the discriminability of the filter can be practically measured through the training loss. The relationship between the continuity constant and the training loss can be understood as follows: filter functions with larger continuity constants encompass a larger hypothesis function class, allowing the GNNs to better approximate the target functions during training. This leads to a smaller training loss. We have added the explanation of discriminability in the updated version in Section 4.1.
Regarding Weakness 3, we thank the reviewer for the opportunity to showcase our novelty. Our result is novel in a variety of ways:
- The analysis is based on manifolds and allows graphs constructed with geometric information. This improves upon graphon results with better physical interpretations in practice as data often does lie in some manifold. Our results show that under this assumption there are interesting consequences for GNN generalization that do not follow from other existing analyses.
- Our work provides a unifying theory that allows us to explain both node classification and graph classification, which is unique and novel compared with other works.
- We explicitly show that the statistical generalization of a GNN depends on the spectral properties of the graph filters that compose the individual layers. The practical implication is that there is a tradeoff between generalization and discriminability of the GNN. This tradeoff is frequency-dependent (frequencies correspond to eigenvalues of the graph Laplacian operator). Features that are associated with large eigenvalues of the graph Laplacian (high-frequency components) are inherently more difficult to discriminate. It follows that one can improve generalization by using filters with smaller continuity constants. That is to say, one can give up discriminability to improve generalization, but this mainly sacrifices the discriminability of high-frequency features.
Regarding the questions, the reviewer makes a good point. We have changed the kernel function representation in Definitions 2 and 3 by adding a subscript to avoid the abuse of notation. We thank the reviewer for pointing this out.
Thank you for the response. I have some further questions.
- In the response to Weakness 3, you say "Our results show that under this assumption there are interesting consequences for GNN generalization that do not follow from other existing analyses." What are these interesting consequences? Do they refer to some results in point 3 about the tradeoff between discriminability and generalization?
- Point 3 looks novel and interesting. However, how is it related to practice? What kind of real-world graphs have more high-frequency components?
- At the end of point 3, the authors say "one can give up discriminability to improve generalization." Is it meaningful? The testing error equals the sum of the training error and the generalization gap. If the discriminability is bad, which from my understanding means a high training error, then improving the generalization gap is meaningless since the testing loss is still high.
In the response to Weakness 3, you say "Our results show that under this assumption there are interesting consequences for GNN generalization that do not follow from other existing analyses." What are these interesting consequences? Do they refer to some results in point 3 about the tradeoff between discriminability and generalization?
Yes, our paper offers a spectral perspective on the generalization of GNNs via a manifold model, which is novel compared with existing analyses.
A very interesting aspect of our bounds is that the generalization of GNNs depends on the spectral shape of the filter, not on its number of coefficients. This is unique in the literature. Another interesting aspect of our bounds is the tradeoff between the discriminability and generalization of GNNs, and the fact that generalization is easier for filters acting on low-frequency components. This is also unique in the literature.
Point 3 looks novel and interesting. However, how is it related to practice? What kind of real-world graphs have more high-frequency components?
Thank you for your comment. We have added a new section in the appendix (Appendix I), which shows that the graphs from the datasets are low-dimensional and therefore the manifold assumption holds for these real-world graphs. The kinds of real-world graphs that have more high-frequency components are typically the ones whose edges change significantly at a local scale. High frequencies are associated with rapid signal variations across edges, thus resulting in a large variability in the graph structure. For example, in the context of community graphs, high frequencies are often associated with the presence of multiple communities, and having more communities translates into more high-frequency components. In general, there is a finite and small number of communities, and therefore the number of high-frequency components is limited. For the input graph signals, high-frequency components are often associated with noise or fluctuations without useful information; therefore, we can sacrifice the discriminability of these components.
At the end of point 3, the authors say "one can give up discriminability to improve generalization." Is it meaningful? The testing error equals the addition of a training error and the generalization gap. If the discriminability is bad, which means a bad discriminability from my understanding, then improving the generalization gap is meaningless since the testing loss is still high.
The statement is correct, but it applies to all of statistical learning theory. Any generalization result is a tradeoff between training and generalization gaps. What matters about generalization results is what they say about this tradeoff. We agree with the reviewer that very bad discriminability is not what we want, while perfect discriminability may lead to overfitting and is also unwanted. Our point is that it can be beneficial to trade off some level of discriminability to enhance generalization. We believe there exists a balance where the best of the two worlds is attained. As the reviewer points out, it is more advantageous to introduce a modest amount of continuity regularization to the filters. We thank the reviewer for this comment. As we say above, our results are unique in the literature in that they relate generalization to spectral properties of filters. This sheds light on problems where generalization is easier, e.g. when low-frequency components are more important to discriminate.
Thank you for the reply. Figure 9 shows that some datasets have more high-frequency components, while others do not. I would like to double-check the insights on learning graphs with more high-frequency components. You previously stated that these graphs are more difficult to learn. Then, should we try a larger regularizer so that the continuity constant is smaller to improve the generalization gap? Or should we use a smaller regularizer so that we can improve the discriminability? Did you have an experiment on this?
This is a great point by the reviewer. Indeed, some graphs have more high-frequency components, which translates into a larger intrinsic dimension of the manifold. Our theory predicts that a larger intrinsic dimension of the manifold increases the generalization gap. To maintain generalization, we then need a relatively larger regularizer (i.e., a smaller Lipschitz constant), while still keeping it modest to preserve discriminability. The reviewer asks how to choose the optimal regularizer for different graphs. Let us note that even for a single given graph, choosing the right amount of regularization is an open problem; this is a great question, but it exceeds the scope of our current work.
We would like to further explain our statement:
"Features that are associated with large eigenvalues of the graph Laplacian (high-frequency components) are inherently more difficult to discriminate."
Large eigenvalues on graphs are a matter of normalization. What happens in practice is that graphs with large numbers of nodes have a large number of eigenvalues. If the Laplacian is normalized, its largest eigenvalue is not large, but this implies that the eigenvalues are tightly packed: there are a large number of them in a relatively small interval. The interpretation of our bounds is the same. In particular, generalization depends on the spectral properties of the filters, and high-frequency components are more difficult to generalize. It is just that, instead of explaining the effect of filters on large eigenvalues, we have to explain the effect of filters on tightly packed eigenvalues.
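To illustrate the normalization point (again a sketch with an arbitrarily chosen random-graph model, not the paper's code): the normalized Laplacian keeps its spectrum inside [0, 2] regardless of graph size, so larger graphs pack more eigenvalues into the same interval rather than producing larger ones.

```python
# Sketch: the normalized Laplacian spectrum stays in [0, 2], so larger graphs
# pack more eigenvalues into the same interval ("tightly packed").
import numpy as np
import networkx as nx

for n in (100, 400, 1600):
    G = nx.random_geometric_graph(n, radius=2.0 / np.sqrt(n), seed=0)
    eigvals = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    width = eigvals.max() - eigvals.min()
    print(f"n={n:5d}  lambda_max={eigvals.max():.2f}  mean gap={width / (n - 1):.4f}")
```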
I have increased my rating to 6. The insight that a different regularizer can be used to improve the generalization gap makes sense. However, I do think adding experiments to show this would be better, especially experiments showing how the optimal regularizer differs between graphs with different amounts of high-frequency components.
This paper discusses the generalization of GNNs from a manifold perspective. In this setup, the graph is considered as sampled from an underlying manifold equipped with a Hausdorff probability measure.
The authors aim to investigate the generalization gap between the following: A GNN trained on a sampled instance of the graph, and an MNN (Manifold Neural Network) that has the same structure and parameters as the GNN mentioned above. Here, GNN and MNN specifically refer to low-pass filters.
The core of the paper lies in two theorems, which provide generalization gaps for both node-level and graph-level tasks. The main findings of the theorems are:
- The GNN trained on the sampled graph can generalize to the MNN on the original manifold.
- The trained GNN can generalize to other graphs sampled from the same manifold.
- The generalization error of the GNN is primarily influenced by several key factors: the number of nodes, the manifold dimension, and the spectral continuity constants. Practically, this implies:
- As the manifold dimension increases, more nodes are required to obtain a model with good generalization.
- (Important) A larger continuity constant represents stronger discriminative power but results in weaker generalization. This is a critical trade-off to consider when choosing a model.
The paper uses experiments to demonstrate the impact of these key factors.
Strengths
- The paper is well-structured, and the experiments respond well to each part of the theorem.
- The paper presents two theorems, which contribute to the study of GNN generalization.
- The two theorems provide practical guidance for the selection of filters.
Weaknesses
- The relationship between the experimental and theoretical settings is not clearly explained. In the appendix, the authors mention that the experiments use a graph convolutional layer (GCN). Does GCN fall under Definition 4 as required by the theorems? Moreover, does the input used in the experiments meet the requirements of Definition 1? Also, GCN does not seem to match the filter structure that is analyzed in the paper.
- The explanation of MNNs is insufficient for me. Are MNNs implemented by sampling graphs and running GNNs on them?
- If the answer to (2) is yes, then shouldn't the paper provide a generalization bound between two separate runs of MNNs (effectively GNNs on two different sampled graphs)?
Questions
Please refer to the Weaknesses section.
If my questions can be addressed, I will consider increasing the score :)
We thank the reviewer for their time and effort.
Regarding Weakness 1, we would like to clarify that in the experiments we add a regularizer to ensure that the filters in the GNNs satisfy Definition 4. To do so, we compute equation (108) in the appendix and add it to the loss function during training as a continuity-regularization penalty:
with as the regularizer term shown in Figure 7.
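Since equation (108) is only referenced and not reproduced in this reply, the following is a hypothetical sketch of how a continuity penalty of this kind could be added to the training loss; the polynomial-filter parameterization, the finite-difference form of the penalty, and the names `continuity_penalty` and `reg_weight` are our own placeholders, not the paper's exact regularizer.

```python
# Hypothetical sketch of a spectral-continuity penalty (NOT equation (108)).
import torch

def continuity_penalty(filter_coeffs: torch.Tensor, lambdas: torch.Tensor) -> torch.Tensor:
    """Penalize rapid variation of the filter's frequency response h(lambda).

    filter_coeffs: (K,) polynomial filter taps h_0, ..., h_{K-1}
    lambdas:       (M,) increasing grid of eigenvalues (e.g. in [0, 2])
    """
    # Frequency response h(lambda) = sum_k h_k * lambda^k evaluated on the grid.
    powers = torch.stack([lambdas ** k for k in range(filter_coeffs.numel())], dim=0)  # (K, M)
    response = filter_coeffs @ powers                                                  # (M,)
    # Squared finite-difference slopes approximate the Lipschitz (continuity) seminorm.
    slopes = (response[1:] - response[:-1]) / (lambdas[1:] - lambdas[:-1])
    return (slopes ** 2).mean()

# Hypothetical usage inside a training step:
coeffs = torch.randn(3, requires_grad=True)      # polynomial filter taps
eig_grid = torch.linspace(0.0, 2.0, steps=64)    # eigenvalue grid of a normalized Laplacian
# loss = task_loss + reg_weight * continuity_penalty(coeffs, eig_grid)
print(continuity_penalty(coeffs, eig_grid))
```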
Regarding the graph signals considered in the paper, note that in practice most signals possess finite energy and the graphs used in our experiments have a finite number of nodes; therefore, there exists a bandwidth value such that Definition 1 holds. Furthermore, in many practical applications, since high-frequency components are often associated with noise or fluctuations that carry no useful information, preprocessed data are commonly bandlimited. In all, both Definitions 1 and 4 are fulfilled in practice given the finite size of the graphs, and the GNNs in our experiments are trained with continuity penalty terms.
Regarding GCN not matching the filter structure analyzed in the paper, we believe it does. When the GCN layer is applied successively, it is equivalent to multiplying by powers of the graph shift operator matrix. Let us point out that equation (2) expresses the diffusion over the graph in matrix form; it is very general and includes GIN and GCN, among other architectures. To show that GCN and equation (2) are indeed equivalent, here is another way of expressing equation (2):
where is the coefficient matrix associated with layer , and is the feature matrix of layer . Analogously, the GCNConv layer from PyTorch Geometric, which is based on the paper Semi-Supervised Classification with Graph Convolutional Networks, can be written as follows:
Note that when , the two equations are equivalent. When the previous equation is applied times, it is equivalent to in our equation (2).
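To make the correspondence concrete, a possible side-by-side rendering is given below; the symbols S, H_{lk}, W_l and the order K are placeholders of ours, since the exact notation of equation (2) is not reproduced in this reply.

```latex
% Placeholder notation (S, H_{lk}, W_l, K are ours, not the paper's exact symbols),
% with \hat{A} = A + I the adjacency matrix with self-loops.
\begin{align}
  \mathbf{X}_{l+1} &= \sigma\!\left(\sum_{k=0}^{K} \mathbf{S}^{k}\,\mathbf{X}_{l}\,\mathbf{H}_{lk}\right)
    && \text{(generic spectral/diffusion layer, cf. equation (2))} \\
  \mathbf{X}_{l+1} &= \sigma\!\left(\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}\,\mathbf{X}_{l}\,\mathbf{W}_{l}\right)
    && \text{(GCNConv, Kipf and Welling, 2017)}
\end{align}
```

Under this placeholder notation, GCNConv is recovered by taking K = 1, choosing the shift operator S as the symmetrically normalized adjacency with self-loops, and setting H_{l0} = 0 and H_{l1} = W_l; stacking such layers (ignoring the nonlinearity) multiplies the features by powers of S.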
Regarding Weakness 2, the reviewer is correct. Given that manifold signals are not available in most practical scenarios, we sample the manifold at a very fine resolution and run GNNs on the resulting large graphs.
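As an illustration of this procedure (our sketch with an arbitrary choice of manifold and kernel bandwidth, not the paper's experimental code), one can sample the unit sphere densely and build a Gaussian-kernel graph whose Laplacian approximates the Laplace-Beltrami operator:

```python
# Sketch: approximate an MNN's domain by densely sampling a manifold (the unit
# sphere here, chosen arbitrarily) and building a Gaussian-kernel graph.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 2000                                    # "very fine" sampling resolution (kept small here)
points = rng.normal(size=(n, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)   # uniform samples on the sphere S^2

epsilon = 0.05                              # hypothetical kernel bandwidth
W = np.exp(-cdist(points, points, "sqeuclidean") / (4 * epsilon))
np.fill_diagonal(W, 0.0)

deg = W.sum(axis=1)
L = np.diag(deg) - W                        # dense graph Laplacian of the kernel graph
# With a suitable epsilon- and n-dependent scaling, this Laplacian approximates
# the Laplace-Beltrami operator of the sphere as n grows.
print("graph Laplacian shape:", L.shape)
```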
Regarding Weakness 3, the reviewer suggests that we should study separate runs of MNNs. We would like to point out that this is very similar to what we have shown in the paper: what the reviewer suggests can be derived as a corollary of Theorem 1 by using the triangle inequality, and we have added this to the appendix. In the experiments, we resample the graphs for each run, as the reviewer suggested, with the number of sampled points large enough to approximate the underlying manifold.
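For completeness, the corollary is just the triangle inequality applied through the limit MNN; in placeholder notation (our symbols, not the paper's):

```latex
% Placeholder notation: Phi_{G_1}, Phi_{G_2} are the GNN outputs on two graphs
% sampled from the same manifold, Phi_M is the limit MNN output, and GA(N) is
% the Theorem 1 bound for a graph with N sampled nodes.
\begin{equation}
  \bigl\|\Phi_{G_1} - \Phi_{G_2}\bigr\|
  \;\le\; \bigl\|\Phi_{G_1} - \Phi_{M}\bigr\| + \bigl\|\Phi_{M} - \Phi_{G_2}\bigr\|
  \;\le\; \mathrm{GA}(N_1) + \mathrm{GA}(N_2).
\end{equation}
```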
We thank the reviewer for the questions and we have added more explanations to the corresponding parts in the updated draft. We hope that the questions were addressed.
Thank you once again for your comments. We would appreciate your feedback as the discussion period is closing soon.
This paper tackles the problem of the generalization error of graph neural networks from a manifold perspective. That is, by assuming that the data are generated from a manifold, the authors investigate how close GNNs can be to manifold neural networks (MNNs). The theoretical investigation is supported by numerical experiments.
- Although the manifold perspective is interesting (and this kind of notion has been considered in the literature), the paper still requires further revision. As the reviewers raised, the writing could be much improved by adding more detailed and clearer descriptions of the meaning of the mathematical notions and statements that the authors claim. There are several unclear points that should be explained in a more consistent way.
- Moreover, the messages derived from the theory are not necessarily well justified. In particular, the generalization-discriminability trade-off is discussed in a hand-wavy way that is not rigorous.
- The connection between the theory and the numerical experiments is also unclear. For example, whether the GCNs employed as the ground truth satisfy the required conditions is not rigorously justified.
For these reasons, I think this paper is not mature enough for acceptance.
Additional Comments on Reviewer Discussion
There have been extensive discussions between authors and reviewers. However, the reviewers still have concerns that were not completely resolved through discussions. I also think the paper requires a few more rounds of revisions so that the paper becomes more readable and rigorous.
Reject