A Local Graph Limits Perspective on Sampling-Based GNNs
Abstract
Reviews and Discussion
This paper proposes to accelerate the training of Graph Neural Networks (GNNs) by training GNNs on small sampled subgraphs instead of the whole large graph. Moreover, the authors show that the parameters of GNNs trained on small subgraphs are within an $\epsilon$-neighborhood of those trained on the whole graph.
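Stated loosely (the symbols below are my own shorthand, not necessarily the paper's notation), the claim is that

$$\big\| \hat{\theta}_{\mathrm{sampled}} - \theta^{*}_{\mathrm{full}} \big\| \le \epsilon,$$

i.e., the parameters learned from small sampled subgraphs land within an $\epsilon$-neighborhood of the full-graph solution, under the paper's assumptions.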
Strengths
- This paper is easy to follow.
- The authors provide a theoretical analysis of the proposed method.
Weaknesses
- Please explain why Assumption 5.3 holds in practice.
- The proposed method aims to train GNNs on large graphs efficiently, yet the authors do not evaluate runtime and memory usage in the experiments.
- The authors claim that no formal models explain why these methods can reduce the computation and memory requirements of GNNs without significantly compromising their performance. However, in my opinion, VR-GCN [1] and LMC [2] explain the phenomena of node-wise sampling and subgraph-wise sampling, respectively, in terms of convergence.
[1] Stochastic Training of Graph Convolutional Networks with Variance Reduction. ICML 2018.
[2] LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence. ICLR 2023.
Questions
See Weaknesses.
This paper proposes a unified framework for sampling-based GNNs, and then uses Local Graph Limits theory to show that training on small sampled subgraphs guarantees convergence to an $\epsilon$-neighborhood of the solution trained on the entire graph.
Strengths
- Timely contribution, as graphs are getting bigger and bigger while computational resources become tighter.
- Nice theory for the convergence.
- Works for different types of tasks.
Weaknesses
- Missing literature, e.g., https://arxiv.org/pdf/2006.13866.pdf and related papers.
- Experiments could be more comprehensive by looking at more datasets and more algorithms.
- The unification is not as novel as advertised. Since available platforms support sampling, implicitly the framework has been established before. See, e.g., https://arxiv.org/pdf/2207.03522.pdf.
Questions
- In Fig 3: why does the sampling-based method perform better than the GNN trained on the entire graph?
In this paper, the authors propose a new framework to analyze the behavior of GNN training using sampled subgraphs of an original, large input graph, which encapsulates many of the currently-used GNN algorithms. In doing so, the authors prove that under this paradigm, the parameters of a model converge to within some bound around the parameters they would have if trained on the original graph. The authors also include empirical results using their sampling-based framework, and show that it is indeed comparable to training on the original graph.
Strengths
The paper's strengths lie in both its theoretical and empirical sides. The paper backs up its claims with solid theory, which is then validated through cohesive experiments. In addition, the paper has strong presentation; the overall flow and quality make it very clear to read and understand.
Weaknesses
The main weakness is the limited novelty and impact the work may have on the field as a whole. Although the ideas presented in the paper are novel, they do not seem like they would have much impact on the future of the field. For example, there could be more discussion on how efficiency in both time and space is improved, practically, using the framework presented. And if the paper is purely focused on theory, perhaps tighter bounds for existing algorithms could be given as an example illustrating the power of the framework.
Questions
- Section 3.2 introduces the general framework. Does the framework only take into consideration "node sampling" and "computational graph sampling"? Would it be possible to easily add other forms of sampling (known currently, or discovered in the future) to this framework?
- In Figure 2, right side, could you offer an intuitive explanation as to why you think GCN struggles against full-graph GCN considerably more in this experiment than in the others?
The paper is interested in sampling subgraphs for training GNNs on large input graphs, rather than taking gradient descent steps on the whole graph at every training step. The authors write a general training pipeline that can be applied to most message-passing GNNs (with AGGREGATE and COMBINE), including GraphSAGE, GCN, and others. They theoretically show that, under some assumptions, the parameters recovered by subgraph training are not too far away from the parameters of full-graph training.
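To make the pipeline concrete, here is a minimal, self-contained sketch of sampled-subgraph training with AGGREGATE/COMBINE message passing. This is my own illustration, not the paper's algorithm: the uniform node sampler, the mean aggregator, the toy readout/loss, and the finite-difference gradient are all placeholder choices.

```python
# Sketch of sampling-based GNN training: at each step, sample a small subgraph,
# run one AGGREGATE/COMBINE message-passing layer on it, and take an SGD step,
# instead of touching the full graph. (Illustrative only.)
import random
import numpy as np

def sample_subgraph(adj, num_nodes, k):
    """Uniform node sampling; keep only edges whose endpoints are both sampled."""
    nodes = random.sample(range(num_nodes), k)
    node_set = set(nodes)
    sub_adj = {v: [u for u in adj.get(v, []) if u in node_set] for v in nodes}
    return nodes, sub_adj

def aggregate(h_v, neighbor_feats):
    """Mean aggregation over neighbor features (GraphSAGE-style placeholder)."""
    return np.mean(neighbor_feats, axis=0) if neighbor_feats else np.zeros_like(h_v)

def combine(h_v, h_agg, W):
    """Combine a node's own features with the aggregated neighbor message."""
    return np.tanh(W @ np.concatenate([h_v, h_agg]))

def forward(x, nodes, sub_adj, W):
    """One AGGREGATE/COMBINE layer evaluated only on the sampled subgraph."""
    return {v: combine(x[v], aggregate(x[v], [x[u] for u in sub_adj[v]]), W)
            for v in nodes}

def numerical_grad(loss_fn, W, h=1e-5):
    """Crude finite-difference gradient, standing in for autodiff."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * h)
    return g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_nodes, d = 200, 4
    x = {v: rng.normal(size=d) for v in range(num_nodes)}        # node features
    y = {v: float(rng.normal()) for v in range(num_nodes)}       # toy regression targets
    adj = {v: rng.choice(num_nodes, size=5, replace=False).tolist()
           for v in range(num_nodes)}                            # random graph
    W = rng.normal(scale=0.1, size=(d, 2 * d))                   # layer weights

    for step in range(20):
        nodes, sub_adj = sample_subgraph(adj, num_nodes, k=32)   # small subgraph

        def loss_fn(W_cur):
            h_out = forward(x, nodes, sub_adj, W_cur)
            return float(np.mean([(h_out[v].sum() - y[v]) ** 2 for v in nodes]))

        W -= 1e-2 * numerical_grad(loss_fn, W)                   # SGD on the subgraph only
```

The point of the sketch is only that each optimization step sees a bounded-size subgraph; any of the samplers the paper unifies could be dropped into `sample_subgraph`.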
Strengths
- General algorithm that can scale many GNNs onto large graphs
- Theoretical insight of convergence
Weaknesses
Experiments
The paper is lacking in experiments -- there are only two datasets: pubmed (small) and ogb-mag (medium). It would be nice to show some wins on more datasets, especially larger ones. I also suggest removing the word "very" from "very large citation networks" in the 3rd paragraph on page 2.
In fact, the SIGN model is also designed to scale GNNs: it runs an untrainable precomputation on the graph and then trains on feature rows without ever looking at the adjacency matrix again, i.e., the graph is only used for pre-processing, after which usual SGD training proceeds.
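A rough sketch of the precompute-then-train pattern the reviewer refers to (my own illustration, not SIGN's actual code; I assume the precomputation is powers of a row-normalized adjacency applied to the node features, and the head is a toy linear model):

```python
# Sketch of the precompute-then-train idea: the graph is used once, to precompute
# diffused feature matrices; afterwards training is plain row-wise mini-batch SGD
# and never touches the adjacency matrix again. (Illustrative only.)
import numpy as np

def precompute_diffusions(A, X, r=2):
    """Return [X | A_hat X | ... | A_hat^r X] for a row-normalized adjacency A_hat."""
    A_hat = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # simple normalization (assumption)
    feats, cur = [X], X
    for _ in range(r):
        cur = A_hat @ cur
        feats.append(cur)
    return np.concatenate(feats, axis=1)  # rows are now independent training examples

def row_sgd(Z, y, lr=1e-2, batch=32, steps=200, seed=0):
    """Plain mini-batch SGD on rows of Z with a toy linear head; no graph needed."""
    rng = np.random.default_rng(seed)
    W = np.zeros(Z.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(Z), size=batch, replace=False)
        grad = Z[idx].T @ (Z[idx] @ W - y[idx]) / batch
        W -= lr * grad
    return W
```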
Comparison with earlier work
The work on GTTF (Markowitz et al., ICLR 2021) is very similar: it develops a general algorithm that can be specialized for training GNNs and skip-gram models (like DeepWalk) by sampling, and it also proves that the parameters learned with sampling are equivalent to the ones learned on the full graph, under some assumptions (the actual assumptions and proofs in GTTF differ).
It would be nice if this paper brought in something new: if the contribution is the analysis, name the paper appropriately. Otherwise, it feels like just another paper that does sampling (like GraphSAGE, GTTF, etc.).
Uninterpretable math
- The notation is not standard. E.g., instead of , it could be -- as written, it looks like a function accepting two arguments: a feature vector and a boolean (is a member of ).
- Further, the expression for  at the end of page 5 is hard to decipher.
- In Section 3.2, there are three conflicting definitions of . It says  but then invokes it on the same line as .
- Having a READOUT function (last line of Alg. 1) is ambiguous: how are the sampled nodes special enough to give a signal for the entire graph?
- Is  a measure or an instance? Definition 4.1 overloads it to be both.
- What is  above Eq. 6? Is it the same as the one defined later in Sec. 4.2?
Minor Improvements
- In Section 2: GNNAutoScale is not a sampling-based GNN; rather, it is based on historical embeddings.
- Below Equation 2, you say "simple aggregation". Why not be explicit and define "simple" right here?
Questions
I have questions listed above.
Although the paper offers intriguing theoretical groundwork, it has encountered criticism regarding its novelty and empirical evaluation. I strongly believe that the results and insights discussed would significantly benefit our research community. However, further refinement of the current manuscript seems necessary before it is ready for publication at ICLR.
Why not a higher score
The novelty and the empirical evaluation are too limited for a clear understanding of how the proposed framework can help or affect GNN training schemes, which limits its interest to the ICLR audience. The reviewers' concerns were not addressed with further explanation or insights.
Why not a lower score
N/A
Reject