PaperHub
6.0 / 10
Poster · 5 reviewers
Ratings: 5, 8, 6, 5, 6 (min 5, max 8, std 1.1)
Confidence: 3.4
ICLR 2024

TEDDY: Trimming Edges with Degree-based Discrimination Strategy

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-04-21

Abstract

Since the pioneering work on the lottery ticket hypothesis for graph neural networks (GNNs) was proposed in Chen et al. (2021), the study of finding graph lottery tickets (GLT) has become one of the pivotal focuses in the GNN community, inspiring researchers to discover sparser GLT while achieving performance comparable to the original dense networks. In parallel, the graph structure has gained substantial attention as a crucial factor in GNN training dynamics, as elucidated by several recent studies. Despite this, contemporary studies on GLT have generally not fully exploited inherent pathways in the graph structure and have identified tickets in an iterative manner, which is time-consuming and inefficient. To address these limitations, we introduce **TEDDY**, a one-shot edge sparsification framework that leverages structural information by incorporating *edge-degree* statistics. Following the edge sparsification, we encourage parameter sparsity during training via simple projected gradient descent on the $\ell_0$ ball. Given the target sparsity levels for both the graph structure and the model parameters, TEDDY facilitates efficient and rapid realization of GLT within a *single* training. Remarkably, our experimental results demonstrate that TEDDY significantly surpasses conventional iterative approaches in generalization, even though it performs one-shot sparsification that solely utilizes graph structures, without taking feature information into account.
Keywords
Graph Lottery Tickets; Graph Compression; Graph Sparsification; Graph Neural Networks

Reviews & Discussion

Review
Rating: 5

This paper introduces an intuitively derived graph sparsification technique based on edge degrees, and integrates network distillation techniques for weight sparsification. The pruning process is independent of IMP and operates in a one-shot manner.

Strengths

S1. This work demonstrates through empirical observations that high-degree edges are less important, which is a significant finding for the LTH community.

S2. Although many prior efforts have explored one-shot GLT, this paper seems to strike a good balance between efficiency and performance.

Weaknesses

W1. The authors have avoided discussions concerning complexity, yet it appears that $T_{edge}$ might necessitate $O(N^2)$ space complexity. Any comments on this?

W2. This work appears to be a fusion of two research lines:

  • From the perspective of graph sparsification, there have already been efforts to prune edges based on edge properties and graph connectivity. The assertions made in [3] closely align with this study, suggesting that the removal of "non-bridge" edges (corresponding to the "low-degree edges") has minimal impact on graph information flow. Furthermore, it provides theoretical support and error bound analysis.
  • From the perspective of weight sparsification, [1, 2] have extensively explored the feasibility of using PGD for parameter pruning. I cannot see the substantial differences or new contributions in this paper.

W3. While the innovations are intuitively appealing, they still lack theoretical substantiation. The assertion that high-degree edges should be pruned is not a trivial conclusion. I find the similar claim in [3], which examines graph connectivity, to be more appealing.

[1] Zhou X, Zhang W, Xu H, et al. Effective sparsification of neural networks with global sparsity constraint. CVPR 2021: 3599-3608.
[2] Cai X, Yi J, Zhang F, et al. Adversarial structured neural network pruning. CIKM 2019: 2433-2436.
[3] Wang L, Huang W, Zhang M, et al. Pruning graph neural networks by evaluating edge properties. Knowledge-Based Systems, 2022, 256: 109847.
[4] Hui B, Yan D, Ma X, et al. Rethinking Graph Lottery Tickets: Graph Sparsity Matters. arXiv preprint arXiv:2305.02190, 2023.
[5] Chen T, Sui Y, Chen X, et al. A unified lottery ticket hypothesis for graph neural networks. ICML 2021: 1695-1706.

Questions

The baseline performance depicted in Fig 6 and 11 appears highly questionable, displaying a significant drop compared to the results reported in [4, 5], as well as my previous replications. Perhaps the authors should consider presenting a more rigorously established and fair baseline performance.

The references can be found in the weakness part.

Ethics Concerns Details

N/A.

Comment

[Baseline Performance]

Q1. We believe that our experiments are conducted under a fair setting. To demonstrate this, we have attached the source code for both our TEDDY and the baselines in the supplementary material.

It's important to note the issue of the official implementation of UGS [4] on GAT (https://github.com/VITA-Group/Unified-LTH-GNN/blob/main/NodeClassification/gnns/gat_layer.py). According to the code, edge masking is applied prior to the softmax normalization of attention coefficients. However, this implementation leads to a notable issue where pruned edges inevitably receive non-zero normalized attention coefficients, incurring the revival of pruned connections during the neighborhood aggregation phase (hence, the corresponding edge is NOT actually pruned).
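To make the issue concrete, below is a minimal, self-contained sketch (our own illustration, not the actual UGS or GAT code) contrasting the problematic masking order with a corrected one:

```python
import torch
import torch.nn.functional as F

# Illustrative only: raw attention logits of one node toward its 3 neighbors,
# where the second edge is supposed to be pruned by a binary edge mask.
scores = torch.tensor([2.0, 1.0, 0.5])
mask = torch.tensor([1.0, 0.0, 1.0])

# Problematic order: multiplying the mask into the logits *before* softmax turns
# the pruned logit into 0, which softmax still maps to a positive weight.
alpha_buggy = F.softmax(scores * mask, dim=0)

# Corrected order: exclude pruned edges from the normalization entirely.
alpha_fixed = F.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=0)

print(alpha_buggy)  # all three weights are non-zero: the "pruned" edge still contributes
print(alpha_fixed)  # the pruned edge receives exactly zero attention
```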

Our experimental results using a revised implementation revealed a significant performance decrease compared to the results reported in [4]. However, we would expect a performance enhancement for TEDDY as well as the baselines if we adhered to the original implementation of UGS, since in that case the amount of information received from neighbors also increases for our method. The same issue pertains to WD-GLT [5] as well, and we posit that this is because its implementation is built on top of UGS. Besides, we detail an additional reproducibility issue of WD-GLT in Appendix B.3.


References

[1] Pruning graph neural networks by evaluating edge properties, Knowledge-Based Systems 2022.

[2] Effective Sparsification of Neural Networks with Global Sparsity Constraint, CVPR 2021.

[3] Adversarial structured neural network pruning, CIKM 2019.

[4] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

[5] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR 2023.

Comment

[On Theoretical Substantiation]

W3. As we discuss in [On Graph Sparsification], [1] introduces the concepts of bridge/non-bridge edges using graph spectral analysis, upon which the authors propose graph sparsification methods; however, it is important to note that the notion of bridge/non-bridge edges in [1] does not entirely align with the low-degree/high-degree edges in our paper. As we note in the theoretical evidence of the remarks in Section 4, the generalization error bounds of several GNN families depend heavily on the adjacency matrix rather than on the model parameters or node features. As a representative example, the error bound for GCN relies on $\Vert A_{\text{sym}}\Vert_{\infty}$ for the symmetrically normalized adjacency matrix $A_{\text{sym}}$, and this norm can be further upper-bounded by $\sqrt{\frac{deg_{max}+1}{deg_{min}+1}}$. Note also that the PGEP method introduced in [1] would be computationally intractable in practice, since it must execute a graph search algorithm (such as DFS) multiple times, for each node, to identify bridge/non-bridge edges throughout the whole graph, which would be even more problematic in a large-scale scenario. On the other hand, our approach TEDDY can identify to-be-pruned edges in a more computationally efficient way in practice (please refer to the performance evaluation and time consumption on large-scale datasets in Section 6.2 and Appendix A.8, respectively).
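For reference, the degree-dependent factor discussed above can be summarized as follows. This is a sketch in our notation, assuming the usual self-loop-augmented symmetric normalization; the precise bound, its assumptions, and constants are stated in Section 4 of the paper:

```latex
% Degree-dependent factor in the GCN generalization bound (sketch):
% the row sums of the normalized adjacency are controlled by the extreme degrees.
\[
  \Vert A_{\text{sym}} \Vert_{\infty}
  \;\le\; \sqrt{\frac{deg_{max}+1}{deg_{min}+1}},
  \qquad
  A_{\text{sym}} = (D+I)^{-1/2}(A+I)(D+I)^{-1/2}.
\]
% Pruning low-degree edges tends to decrease deg_min and hence inflate this factor,
% whereas pruning high-degree edges decreases deg_max and shrinks it.
```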

Comment

[On Graph Sparsification]

W2-1. We appreciate your observation in drawing parallels between our work and the approach presented in [1]. However, we respectfully disagree with the view that our work, TEDDY, is closely related to [1].

There are several clear differences between TEDDY and the methodology proposed in [1], as well as advantages of our approach, which we believe are worth clarifying:

  1. No Direct Conceptual Alignment: Although [1] demonstrates edge importance via spectral analysis, the concept of a bridge is NOT strictly aligned with our high-degree/low-degree edges. According to Definition 2 in [1], a bridge refers to an edge whose deletion increases the number of connected components in the graph, and the reverse corresponds to a non-bridge. However, a bridge in that paper corresponds to neither a low-degree edge nor a high-degree edge, since there could still be common neighbors between the node pairs of both high- and low-degree edges, or between their receptive fields (multi-hop neighbors).
  2. Training Pipeline Distinction: More importantly, the primary distinction between our method and [1] lies in the training procedure. In practice, the PGEP method proposed in [1] prunes two types of edges in a fully trained network: (i) edges $(i,j)$ with different predicted labels $\widehat{y}_i\neq \widehat{y}_j$ (specified as negative edges) and (ii) non-bridge edges. In contrast, TEDDY's edge scores do not rely on label predictions from a trained network; hence, TEDDY is the first approach to achieve effective pruning by utilizing solely the structural information of graph data.

Furthermore, the aforementioned approach in [1] faces several practical issues. Firstly, to identify all non-bridge edges in the entire graph, one needs to execute graph algorithms such as DFS for each node, which can be highly time-consuming on large-scale datasets. Secondly, even if all non-bridge edges are identified, if the total number of non-bridge and negative edges is smaller than what the target sparsity ratio requires, PGEP fails to achieve the desired pruning level. In consequence, this method may not be applicable in high-sparsity regimes. On the other hand, given sparsity ratios for both the graph and the parameters, TEDDY can achieve the target sparsity level exactly and more efficiently. Unfortunately, due to this cumbersome training procedure, we were not able to conduct an additional evaluation of [1] during the rebuttal period. Nevertheless, we will include the experiment in the final version.

Thus, while there are superficial similarities in the definitions, the conceptual grounding and underlying training procedure of TEDDY are markedly different from those of [1]. We believe these differences are significant and contribute to the distinctiveness and novelty of our work in the realm of GNN pruning.


[On Parameter Sparsification]

W2-2. Projected gradient descent (PGD) is a well-known optimization algorithm with rich properties that has long been used throughout the machine learning literature. The characteristics of PGD depend on the appropriate choice of the constraint set. In this regard, the two mentioned works [2, 3] indeed utilize projected gradient descent, but each has a different constraint set from ours. In the case of [2], the projection is performed onto the $\ell_1$-ball, while [3] utilizes a constraint set satisfying an RIP-like condition. Note that both [2, 3] introduce additional hyperparameters, and tuning these hyperparameters is inevitable to reach the target sparsity level. In contrast, our approach directly projects onto the $\ell_0$-ball without introducing additional hyperparameters, making it a novel attempt at sparsification.
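For concreteness, here is a minimal sketch of a projected gradient step onto the $\ell_0$-ball (keep the $k$ largest-magnitude entries, zero out the rest). It is an illustrative example under a placeholder objective, not our actual training code:

```python
import torch

def project_l0(params: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the k largest-magnitude entries and zero the rest (l0-ball projection).

    `sparsity` is the fraction of entries to prune (e.g., 0.8 keeps 20% of weights).
    Illustrative sketch only; ties at the threshold may keep a few extra entries.
    """
    k = max(1, int(params.numel() * (1.0 - sparsity)))
    flat = params.flatten()
    threshold = flat.abs().topk(k).values[-1]  # k-th largest magnitude
    pruned = torch.where(flat.abs() >= threshold, flat, torch.zeros_like(flat))
    return pruned.view_as(params)

# One projected-gradient step on a toy objective (schematic).
w = torch.randn(64, 32, requires_grad=True)
loss = (w ** 2).sum()      # placeholder loss standing in for the GNN objective
loss.backward()
with torch.no_grad():
    w -= 0.01 * w.grad                      # ordinary gradient step
    w.copy_(project_l0(w, sparsity=0.8))    # projection back onto the l0 ball
```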

Comment

We sincerely appreciate the reviewer’s helpful and constructive feedback. We would like to address the reviewer’s concerns below.

[On Complexity]

W1. In practice, $T_{edge}\in\mathbb{R}^{|\mathcal{E}|}$ is computed only for the existing edges; hence it requires only $O(|\mathcal V|+|\mathcal{E}|)=O(N+M)$ space complexity. The computation of $T_{edge}$ involves (1) computing the degree of each node by row-wise sparse summation of the adjacency matrix, (2) mean-aggregating the degree vector with the (sparse) adjacency matrix to obtain $\widetilde g$ via sparse matrix multiplication, and (3) element-wise multiplication between $\widetilde g(v)$ of the nodes $v$ in the rows (i.e., source nodes) and $\widetilde g(u)$ of the nodes $u$ in the columns (i.e., destination nodes) of the edge index. For better understanding, we attached the code snippet for computing $T_{edge}$ in the supplementary material (compute_edge_score function in pruning.py).
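For illustration, the three steps above can be sketched in plain PyTorch as follows, assuming a PyG-style edge_index with both directions of each undirected edge stored. This mirrors the structure of the computation only; the exact scoring in compute_edge_score of pruning.py may include further terms:

```python
import torch

def edge_score_sketch(edge_index: torch.Tensor, num_nodes: int) -> torch.Tensor:
    """Degree-based edge score following the three steps described above (sketch)."""
    src, dst = edge_index
    # (1) node degrees via sparse (index-add) summation over existing edges only
    deg = torch.zeros(num_nodes).index_add_(0, src, torch.ones(src.numel()))
    # (2) mean-aggregate neighbor degrees to obtain g_tilde for every node
    g_tilde = torch.zeros(num_nodes).index_add_(0, src, deg[dst]) / deg.clamp(min=1)
    # (3) element-wise product of g_tilde at the two endpoints of each existing edge
    return g_tilde[src] * g_tilde[dst]

# Toy usage: a 4-node path graph stored with both edge directions.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
print(edge_score_sketch(edge_index, num_nodes=4))  # one score per existing edge
```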

Review
Rating: 8

This paper observes the importance of low-degree edges in the graph and proposes a one-shot edge sparsification framework that leverages edge-degree to find graph lottery tickets (GLT). They achieve superior performances on diverse benchmark datasets.

Strengths

The graph lottery tickets problem is an important direction, and the solution proposed in this paper is simple and elegant by utilizing the low-degree edges. The experimental results are also convincing.

Weaknesses

Could the authors compare the time consumed by the proposed method with that of other baselines to further emphasize its efficiency?

Questions

Could the authors compare the time consumed by the proposed method with that of other baselines to further emphasize its efficiency?

Comment

We sincerely appreciate the reviewer’s helpful and constructive feedback. We would like to address the reviewer’s concern below.

[On Wall-clock Time]

Q1. In response to the reviewer’s valuable recommendation, we compared the wall-clock time of TEDDY against the baselines, averaged over five runs at each of five distinct pruning simulations (the 1-st, 5-th, 10-th, 15-th, and 20-th). For each setting, we ran the baselines and our method on the same GPU model to ensure fair comparisons.

Overall, the results clearly substantiate that TEDDY consistently achieves a significant reduction in pruning time relative to the baselines. This is especially pronounced for GIN on the Pubmed dataset, where TEDDY outperforms WD-GLT [1] by a remarkable margin, with a maximum time saving of 220.72 seconds at the 15-th simulation. A similar trend is observed on the large-scale datasets, where TEDDY’s time consumption is nearly half that of UGS [2]. The maximum gap appears on the Reddit dataset, with TEDDY concluding the last simulation 131.84 seconds faster than UGS on GCN. For your convenience, we attach the tables below, which appear as Tables 3-5 in the revised manuscript.


References

[1] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR 2023.

[2] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

Comment
Wall-clock time (seconds) per pruning simulation:

| Dataset | GNN | Method | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | GCN | UGS | 6.30 ± 0.55 | 5.84 ± 0.39 | 5.57 ± 0.80 | 5.65 ± 0.59 | 5.56 ± 0.73 |
| Cora | GCN | WD-GLT | 54.11 ± 1.28 | 53.61 ± 0.39 | 53.59 ± 0.17 | 53.52 ± 0.23 | 53.77 ± 0.37 |
| Cora | GCN | TEDDY | 2.12 ± 0.08 | 2.15 ± 0.16 | 2.09 ± 0.12 | 2.11 ± 0.13 | 2.18 ± 0.05 |
| Cora | GAT | UGS | 7.58 ± 0.44 | 7.11 ± 0.51 | 6.97 ± 0.59 | 7.03 ± 0.60 | 7.26 ± 0.24 |
| Cora | GAT | WD-GLT | 83.81 ± 10.02 | 79.98 ± 8.09 | 80.76 ± 6.30 | 80.84 ± 6.29 | 79.40 ± 4.81 |
| Cora | GAT | TEDDY | 3.00 ± 0.07 | 2.96 ± 0.14 | 3.04 ± 0.08 | 2.93 ± 0.40 | 3.01 ± 0.05 |
| Cora | GIN | UGS | 4.38 ± 0.81 | 4.11 ± 0.29 | 4.11 ± 0.28 | 4.06 ± 0.30 | 4.01 ± 0.19 |
| Cora | GIN | WD-GLT | 68.55 ± 10.13 | 72.43 ± 10.91 | 70.76 ± 10.45 | 72.04 ± 10.23 | 72.67 ± 11.89 |
| Cora | GIN | TEDDY | 1.86 ± 0.25 | 1.91 ± 0.31 | 1.85 ± 0.28 | 1.86 ± 0.32 | 1.86 ± 0.27 |
| Citeseer | GCN | UGS | 8.21 ± 1.72 | 7.44 ± 0.52 | 7.55 ± 0.53 | 7.36 ± 0.55 | 7.33 ± 0.42 |
| Citeseer | GCN | WD-GLT | 83.73 ± 16.75 | 78.24 ± 14.09 | 77.67 ± 19.85 | 70.75 ± 22.62 | 69.51 ± 17.12 |
| Citeseer | GCN | TEDDY | 2.65 ± 0.33 | 2.72 ± 0.29 | 2.70 ± 0.34 | 2.71 ± 0.29 | 2.80 ± 0.34 |
| Citeseer | GAT | UGS | 8.78 ± 0.84 | 8.09 ± 0.52 | 8.47 ± 0.36 | 8.64 ± 0.77 | 8.41 ± 0.64 |
| Citeseer | GAT | WD-GLT | 93.96 ± 20.92 | 86.23 ± 19.97 | 89.55 ± 22.03 | 89.03 ± 22.84 | 86.74 ± 21.02 |
| Citeseer | GAT | TEDDY | 3.21 ± 0.57 | 3.35 ± 0.46 | 3.51 ± 0.50 | 3.55 ± 0.49 | 3.63 ± 0.51 |
| Citeseer | GIN | UGS | 6.61 ± 1.11 | 6.44 ± 0.77 | 6.05 ± 0.38 | 5.98 ± 0.38 | 6.16 ± 0.42 |
| Citeseer | GIN | WD-GLT | 108.74 ± 3.78 | 109.59 ± 4.25 | 106.69 ± 3.24 | 109.91 ± 4.16 | 107.93 ± 3.35 |
| Citeseer | GIN | TEDDY | 2.51 ± 0.10 | 2.53 ± 0.11 | 2.46 ± 0.09 | 2.49 ± 0.05 | 2.52 ± 0.16 |
| Pubmed | GCN | UGS | 8.40 ± 0.54 | 7.80 ± 0.31 | 7.61 ± 0.55 | 7.95 ± 0.27 | 8.12 ± 0.30 |
| Pubmed | GCN | WD-GLT | 176.48 ± 26.20 | 161.88 ± 31.51 | 160.96 ± 34.56 | 141.58 ± 9.83 | 142.68 ± 12.39 |
| Pubmed | GCN | TEDDY | 3.23 ± 0.20 | 3.27 ± 0.13 | 3.16 ± 0.14 | 3.14 ± 0.21 | 3.08 ± 0.12 |
| Pubmed | GAT | UGS | 9.85 ± 0.91 | 9.35 ± 0.59 | 9.16 ± 0.72 | 9.61 ± 0.67 | 9.41 ± 0.75 |
| Pubmed | GAT | WD-GLT | 69.64 ± 8.76 | 68.69 ± 8.31 | 66.07 ± 5.49 | 64.38 ± 2.74 | 63.66 ± 1.11 |
| Pubmed | GAT | TEDDY | 3.90 ± 0.26 | 3.82 ± 0.18 | 3.58 ± 0.20 | 3.63 ± 0.21 | 3.50 ± 0.30 |
| Pubmed | GIN | UGS | 7.42 ± 1.01 | 7.17 ± 0.41 | 7.11 ± 0.51 | 7.03 ± 0.49 | 6.95 ± 0.41 |
| Pubmed | GIN | WD-GLT | 218.87 ± 6.40 | 214.61 ± 7.44 | 213.63 ± 5.83 | 223.74 ± 4.99 | 216.23 ± 6.28 |
| Pubmed | GIN | TEDDY | 3.20 ± 0.18 | 3.12 ± 0.13 | 3.06 ± 0.10 | 3.02 ± 0.07 | 2.97 ± 0.12 |
Comment
Wall-clock time (seconds) per pruning simulation on the large-scale datasets:

| Dataset | GNN | Method | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arxiv | GCN | UGS | 82.28 ± 19.32 | 77.93 ± 13.14 | 75.26 ± 12.89 | 73.18 ± 11.94 | 71.50 ± 10.81 |
| Arxiv | GCN | WD-GLT | N/A | N/A | N/A | N/A | N/A |
| Arxiv | GCN | TEDDY | 43.45 ± 0.02 | 39.70 ± 0.04 | 36.10 ± 0.05 | 33.35 ± 0.03 | 31.16 ± 0.02 |
| Arxiv | SAGE | UGS | 74.22 ± 3.98 | 73.37 ± 2.61 | 76.18 ± 4.86 | 75.18 ± 5.07 | 70.94 ± 1.00 |
| Arxiv | SAGE | WD-GLT | N/A | N/A | N/A | N/A | N/A |
| Arxiv | SAGE | TEDDY | 45.10 ± 0.02 | 41.96 ± 0.02 | 38.85 ± 0.02 | 36.43 ± 0.01 | 34.58 ± 0.02 |
| Reddit | GCN | UGS | 180.80 ± 69.56 | 176.25 ± 54.61 | 158.97 ± 52.54 | 154.12 ± 42.40 | 155.35 ± 49.46 |
| Reddit | GCN | WD-GLT | OOM | OOM | OOM | OOM | OOM |
| Reddit | GCN | TEDDY | 47.68 ± 0.02 | 40.29 ± 0.03 | 32.92 ± 0.02 | 27.56 ± 0.01 | 23.51 ± 0.10 |
| Reddit | SAGE | UGS | 174.47 ± 34.80 | 173.08 ± 31.65 | 167.41 ± 32.09 | 166.55 ± 30.85 | 161.88 ± 30.77 |
| Reddit | SAGE | WD-GLT | OOM | OOM | OOM | OOM | OOM |
| Reddit | SAGE | TEDDY | 65.53 ± 0.03 | 55.77 ± 0.03 | 46.25 ± 0.03 | 39.12 ± 0.05 | 33.45 ± 0.04 |
Review
Rating: 6

The paper proposes TEDDY, a novel edge sparsification framework that considers the structural information of the graph. TEDDY sparsifies graph edges based on scores designed by utilizing edge degrees, and sparsifies parameters via projected gradient descent on the $\ell_0$ ball. In particular, sparsification of both edges and parameters can be done in one shot, and the edge sparsification part does not need to consider node features. The paper then demonstrates the effectiveness of TEDDY over iterative GLT methods with comprehensive experiments.

Strengths

  1. The paper is overall easy to follow.
  2. The experiments are diverse and thorough.
  3. It is an interesting observation that low-degree edges are important, which is supported both empirically and theoretically. The proposed method is thus principally designed.
  4. The proposed method is one-shot and efficient (but seems still need to train a dense network first; see weaknesses).
  5. The proposed method does more than just preserving the performance of vanilla dense GNNs---it actually improves the original performance in many settings. This is an impressive and interesting result.

Weaknesses

  1. The main part of the proposed method and some discussion of the edge degrees are not clear and confusing.
  • It is not very clear to me how T in eq.(5) is actually used (just dropping the edges with the lowest scores)? Also, it seems that T computes scores for all node pairs. How do you deal with the case where there is no edge between two nodes?
  • Typo in eq.(5)? Should it not include a specific node v instead?
  • The analysis of the effect of edge degree in Section 4 only applies to first-order degrees, yet the paper discusses why higher-order edge degrees need to be considered heuristically via a toy example and claims TEDDY is based on “multi-level consideration of degree information“ in Section 5.1. Then, in the experiments, it never touches upon higher-order edge-degree information and only discusses first-order edge degrees in Figure 8.
  1. The loss objective involves distillation (6), meaning that one still needs to train a dense network on the entire graph first, weakening the efficiency of the proposed method.

Questions

  1. Have you evaluated the zero-shot performance of the proposed edge sparsification method (with dense weights)? Intuitively, it seems that it might be able to work in the zero-shot setting, and in this way one could single out the effect of the sparse subgraph on the boost in performance.
  2. What happens if the distillation term in (6) is removed (so one does not need to train a dense network first)? How significantly would the performance be affected? Could you provide an ablation study on this?
  3. Would TEDDY work if one trains on small graphs and then applies to large graphs?
  4. Any intuition why TEDDY is still able to improve the performance of vanilla GNNs under extreme sparsity?
Comment

[On Smaller Graphs to Larger Graphs]

Q3. The reviewer's question raises a consideration about the scalability of TEDDY, which may be interpreted as its adaptability to an inductive semi-supervised node classification setting, where the validation and test sets are unobserved during training and larger than the training set. In this perspective, GNNs are trained on a partial subset of the entire dataset while being tested on the remaining unseen portions.

In response to the reviewer’s insightful question, we expanded our experiments to assess our model’s performance in inductive scenarios. We adjusted the Cora, Citeseer, and Pubmed datasets to a 20/40/40% split for training, validation, and testing, guaranteeing no exposure to the validation and test nodes as well as edges connected to them during training.

In the corresponding results, illustrated in Figures 15 and 16 of the revised manuscript, TEDDY achieves prominent pruning performance in the inductive setting as well. This is especially pronounced on the Cora dataset with GCN and on the Citeseer dataset across all benchmark architectures, surpassing the accuracy of vanilla GNNs. Hence, the results affirm TEDDY’s capacity to generalize effectively from smaller to larger graphs.


[On Extreme Sparsity]

Q4. Following the reviewer’s insightful question, we believe this is because TEDDY strategically preserves edges around low-degree nodes while pruning focuses primarily on high-degree nodes. This approach is rooted in the inherent design of GNNs, where neighborhood aggregation is a core component. Unlike the vision domain, which often relies on complex, fully connected layers, GNNs typically employ lighter pre-aggregation embeddings. This characteristic is particularly crucial in such networks, where low-degree nodes have fewer neighbors from which to gather information. Excessive pruning that drastically reduces the degree of these nodes, or in extreme cases isolates them (equivalent to the forward pass of an MLP), can severely hamper the model's ability to classify, since the amount of information they can obtain through the message-passing scheme becomes limited. By protecting these low-degree nodes, TEDDY ensures that even under extreme sparsity there is sufficient local structure to facilitate effective classification, hence improving the performance of vanilla GNNs.


References

[1] Fast graph representation learning with PyTorch Geometric, ICLR 2019 (RLGM Workshop).

[2] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

[3] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR 2023.

Comment

We sincerely appreciate the reviewer’s helpful and constructive feedback. We would like to address the reviewer’s concerns below.

[Computation of Edge Score]

W1-1. $T_{edge}\in\mathbb{R}^{|\mathcal{E}|}$ is computed only for node pairs that exist as edges, not for all possible node pairs. In practice, this is achieved by element-wise multiplication between $\widetilde g(v)$ of the nodes $v$ in the rows (i.e., source nodes) and $\widetilde g(u)$ of the nodes $u$ in the columns (i.e., destination nodes) of the edge index provided in PyTorch Geometric [1]. This ensures that scores are obtained only for real edges. To clarify this, we have attached the code snippet of the edge score calculation in the supplementary material (compute_edge_score function in pruning.py).


[Typo in Eq. (5)]

W1-2. We thank the reviewer for the correction. We have fixed the typo in Eq. (5) in the revised paper.


[On Degree Analysis]

W1-3. We apologize to the reviewer for the confusion and provide the following clarification. We empirically validated the necessity of considering higher-order edge degrees in Figure 17 in the appendix. As evidenced by the figure, the multi-level consideration of edge degree clearly improves the performance (specified as Ours) compared to pruning based solely on first-order degree information (specified as 1-hop Degree). While Figure 8 focuses on first-order degrees, its motivation is to emphasize that TEDDY performs degree-aware pruning, reflecting the observation highlighted in Section 4. Conversely, the baselines lack any consideration of edge degree, which correlates with their less optimal performance compared to our method.


[On Distillation]

W2. Our findings, as illustrated in Figures 17 and 18, do indicate a performance improvement due to model distillation. Note, however, that our analysis puts more emphasis on the significance of the 2-hop degree-based edge score $T_{edge}$ in Figure 18 of the revised paper. As demonstrated in the figure, the design of the edge scores is more critical, since no configuration of the baseline, even including distillation, achieves better pruning performance than our method with $T_{edge}$ alone (represented as Ours w/o $\mathcal L_{dt}$).


[On Zero-shot Performance]

Q1. In developing TEDDY, we specifically focused on a one-shot edge sparsification approach, which inherently differs from zero-shot pruning strategies. Consequently, evaluating TEDDY in a zero-shot setting falls outside the scope of our current research. This distinction aligns with the methodology used in previous studies on GNN pruning [2, 3], which typically employ pretrained models as a foundation for iterative pruning. While zero-shot sparsification offers an interesting perspective, our method gains significant efficiency from its one-shot mechanism by eliminating the repeated retraining at each pruning level adopted in [2, 3] (please refer to the wall-clock time analysis of TEDDY against the baselines in Appendix A.8 of the revised paper).


[On Distillation Component Removal]

Q2. In response, we present the ablation study in Figure 17. As illustrated in the figure, removing the distillation term from TEDDY (specified as Ours w/o $\mathcal L_{dt}$) still consistently achieves superior performance compared to the baselines, even those integrated with a distillation component.

Further ablation on all benchmark datasets is detailed in Figure 18 in the revised paper. In the figure, the performance of UGS [2] augmented with the distillation loss remains suboptimal compared to TEDDY without $\mathcal L_{dt}$, regardless of the benchmark dataset. This underscores the significant influence of our degree-based pruning component. Additional studies to further evaluate this effect on other architectures will be included in a later revision.

Review
Rating: 5

This paper introduces several techniques to sparsify graph data and networks in order to build a sparse graph learning model. The important aspects are as follows:

  1. Degree-based graph sparsification to remove low-degree edges and make graphs sparse;
  2. Distillation by matching the classification logits of the model pre-trained on the whole graphs;
  3. Parameter sparsification by using a subset of parameters (i.e., Projected Gradient Descent).

When applied to classical GNNs such as GCN/GAT/GIN, the experimental results validate the proposed method by showing some performance improvements.

Strengths

The proposed methods are generally straightforward and concise. They empirically improve performance compared to the baselines. The whole process is clearly stated in Algorithm 1.

Weaknesses

  1. Baseline GNN models are somewhat outdated. Traditional GNNs such as GCN/GAT/GIN are known to suffer from limited expressivity, resulting in restricted empirical performances in many cases. Several studies have done expressiveness analysis, to name a few references [3,4]. Later works, such as graph transformers and expressive GNNs [1,2,3], have proven to be more powerful and meaningful for empirical studies.

  2. Detailed ablation experiments are lacking, which are necessary to clearly validate the effectiveness of each component.

  3. The theoretical analysis concerning the motivation behind preserving low-degree edges seems somewhat disconnected from the main idea.

  4. Some claims regarding training efficiency and generalization performance appear to be misleading.

These points are detailed in the questions section below.

[1] Recipe for a general, powerful, scalable graph transformer. NeurIPS 2022.
[2] Specformer: Spectral Graph Neural Networks Meet Transformers. ICLR 2023.
[3] Provably Powerful Graph Networks. NeurIPS 2019.
[4] Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks. AAAI 2019.

Questions

  1. Conventional GNNs are empirically seen to suffer from limited expressivity. How do the latest transformer-based GNNs or more expressive models like GPS[1], Specformer[2], and PPGN[3] compare when applying the proposed techniques? These newer models have already surpassed the empirical performance of older baselines like GCN/GAT/GIN. Validating the proposed techniques on these latest methods rather than older models would make the arguments more convincing.

  2. About Figure 1: What is the precise step for removing high-degree or low-degree edges? Given a graph sparsity value, how exactly is the subgraph generated? Is random sampling involved in this process? If so, could you provide the results of multiple runs (e.g., mean and standard deviation) to separate the training-related noise?

  3. Ablation Studies: Could you provide detailed ablation studies on the effectiveness of the proposed three components? a) graph sparsification by removing edges; b) model distillation; c) parameter sparsification with projected gradient descent. It seems that these components could be applied separately to any of the baseline training methods (UGS/WD-GLT) in empirical studies. In Figure 13, there are some comparisons made. Graph sparsification doesn’t seem to improve empirical performance at all. There are no curves representing UGS/WD-GLT without distillation loss. Although considering multi-hop subgraphs appears to be effective in a way, this concept is not new in graph learning literature (e.g., k-WL GNNs [4]).

  4. The theoretical analysis involving the upper bound of the symmetrically normalized adjacency matrix is somewhat unclear. a) Concerning the mentioned analysis, how realistic is the assumption that a Lipschitz constant exists? b) If the aforementioned assumptions hold for the specified models, how plausible is it that removing low-degree edges actually increases the term $\frac{\text{deg}_{\max}+1}{\text{deg}_{\min}+1}$? This aspect may be specific to datasets and could possibly be numerically simulated. c) Conversely, removing high-degree edges appears to reduce the generalization gap by lowering $\text{deg}_{\max}$, according to the preceding analysis. Why is this aspect not emphasized or motivated more within the discussion?

  5. In Eq. (5), the edge score is always symmetric. How does this approach apply to directed graphs?

  6. How crucial is the distillation training? It appears to undermine the purpose of utilizing sparse graphs, as there is a necessity to initially train the model on the full graphs. This approach also seems to contradict the claim of "a single training," a statement that is reiterated numerous times throughout the paper.

  7. The paper asserts that "TEDDY significantly surpasses conventional iterative approaches in generalization." Could you provide more details to substantiate this claim? Which sections or aspects of the experiments corroborate this assertion regarding generalization?

  8. The paper claims that the employed PGD training saves computation time compared to the iterative approach. Could you provide additional results to validate this claim, such as comparing training hours using the same GPU hardware?

[1] Recipe for a general, powerful, scalable graph transformer. NeurIPS 2022.
[2] Specformer: Spectral Graph Neural Networks Meet Transformers. ICLR 2023.
[3] Provably Powerful Graph Networks. NeurIPS 2019.
[4] Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks. AAAI 2019.

Comment

[On Theoretical Claim]

Q4. Note that the generalization error bound for GNNs [10] is not our contribution, but we can offer theoretical evidence for TEDDY based on the results in [10]. In fact, the most challenging component in computing the Lipschitz constant for practical GNN families is the non-smooth ReLU function. Theoretically, [10] considered a slightly relaxed ReLU that makes the activation continuously differentiable. In such cases, since every part of the function becomes smooth, it is possible to compute the Lipschitz constant, and the Lipschitz constant of the original GNN can also be approximated to some extent through this relaxed activation function. Regarding the quantity $\frac{deg_{max}+1}{deg_{min}+1}$, removing low-degree edges (not nodes) tends to increase the generalization bound by monotonically decreasing $deg_{min}$, thereby inducing worse generalization. Similarly, removing high-degree edges monotonically decreases $deg_{max}$; in contrast to the previous case, this tends to reduce the generalization gap, encouraging better generalization.

Comment

We sincerely appreciate the reviewer’s helpful and constructive feedback. We would like to address the reviewer’s concerns below.

[On Experiments for Recent GNNs]

Q1. In response to the reviewer’s insightful suggestion, we expanded our experiments to encompass recent Transformer-based GNN architectures. While integrating TEDDY with these advanced models, we encountered several issues with some architectures suggested by the reviewer.

To begin with, PPGN [1] utilizes a dense adjacency matrix in its input tensor, resulting in an $N\times N$ matrix for a graph with $N$ nodes. In general, adjacency matrices in GNNs are encoded in a sparse format, where the computation depends heavily on the number of edges; this is where graph sparsification can offer a benefit. However, with dense matrix representations such as PPGN's, both zero and non-zero elements are treated equally in terms of storage and computational overhead. Consequently, the inherent characteristics of PPGN's dense matrix approach do not align well with the advantages offered by graph sparsification.

Regarding GraphGPS [2], its primary efficacy is demonstrated in graph-level tasks, diverging from our focus on node classification. Similarly, Specformer [3], while benchmarked against GraphGPS in graph-level tasks, was evaluated against other baselines in node-level tasks. In our experiments, we observed that GraphGPS did not yield satisfactory performance on Cora and Citeseer compared to classic GNNs.

Consequently, we explored alternative graph transformers, specifically UniMP [4], NAGphormer [5], and Specformer [3], which have demonstrated strong performance in node classification without significant degradation on our benchmark datasets. However, both Specformer and NAGphormer replace the input adjacency matrix with eigenvectors and eigenvalues from an eigendecomposition. The baselines' differentiable mask on the adjacency matrix would require backpropagation through these eigenvectors and eigenvalues, introducing a practical challenge with a high complexity of $O(N^3)$ per iteration for $N$ nodes. Given these constraints, we evaluated these two transformer architectures only with TEDDY, while including experiments on TEDDY and all baselines for UniMP [4], which has also been recognized as a graph transformer benchmark in [6].

As depicted in Figures 13 and 14, TEDDY accomplishes stable pruning performance, surpassing all baselines when equipped with UniMP as a backbone. In particular, UGS and WD-GLT show significant degradation as the sparsity increases, whereas the performance of TEDDY remains stable across all benchmark datasets. Note that the differentiable mask approach of the baselines may prune more edges than the pre-defined ratio, due to the possibility of multiple mask elements having the same value. Our method also yields decent performance with NAGphormer and Specformer, notably NAGphormer on the Pubmed dataset, with a marginal accuracy loss of 0.56. Overall, these results demonstrate TEDDY's versatility across diverse foundational architectures.

Comment

Thank you for your feedback. Regarding Figures 13 and 14, there are just minor variations in the y-axis of the performance curves. In similar studies, such as UniMP [4], NAGphormer [5], and Specformer [3], the standard deviation is typically reported, providing a clearer understanding of performance variability. The slight differences observed at different graph sparsity in Figures 13 and 14 could potentially be explained by training noise. Could you offer additional insights on this matter?

Comment

[On Efficiency of PGD]

Q8. The existing iterative approaches to parameter sparsification involve fully training an $\ell_1$-regularized mask to identify crucial parameter coordinates. After training, the $p_\theta$% of coordinates with the smallest signals in the mask are zeroed out, and the obtained mask is used to train a subnetwork to obtain the graph lottery ticket. Therefore, iterative methods inevitably require two rounds of training to obtain the mask, and achieving the desired target sparsity ratio necessitates repeating this training process multiple times. In contrast, TEDDY can obtain sparse parameters in a single training pass using projected gradient descent on the $\ell_0$-ball for any parameter initialization. Furthermore, since the parameters are already sparse during the training process, TEDDY can also reduce the training time. To validate this, we conducted wall-clock time comparisons for parameter sparsification across the methods. Toward this end, we use the original dense adjacency matrix while pruning only model parameters, for fair comparisons. For your convenience, we attach the tables below, which appear as Tables 6-8 in the revised paper.


References

[1] Provably Powerful Graph Networks, NeurIPS 2019.

[2] Recipe for a general, powerful, scalable graph transformer, NeurIPS 2022.

[3] Specformer: Spectral Graph Neural Networks Meet Transformers, ICLR 2023.

[4] Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification, IJCAI 2021.

[5] NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs, ICLR 2023.

[6] Towards Deep Attention in Graph Neural Networks: Problems and Remedies, ICML 2023.

[7] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

[8] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR 2023.

[9] Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, AAAI 2019.

[10] Towards Understanding Generalization of Graph Neural Networks, ICML 2023.

Comment
Wall-clock time (seconds) for parameter sparsification only, per pruning simulation:

| Dataset | GNN | Method | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | GCN | UGS | 6.09 ± 0.81 | 5.42 ± 0.22 | 5.71 ± 0.47 | 5.47 ± 0.29 | 5.51 ± 0.22 |
| Cora | GCN | WD-GLT | 62.78 ± 3.06 | 60.89 ± 0.96 | 61.38 ± 0.93 | 61.82 ± 1.15 | 61.92 ± 0.87 |
| Cora | GCN | TEDDY | 2.39 ± 0.06 | 2.43 ± 0.04 | 2.44 ± 0.10 | 2.41 ± 0.04 | 2.41 ± 0.04 |
| Cora | GAT | UGS | 6.00 ± 1.04 | 5.58 ± 0.34 | 5.60 ± 0.29 | 5.50 ± 0.18 | 5.48 ± 0.19 |
| Cora | GAT | WD-GLT | 61.95 ± 3.52 | 61.45 ± 2.13 | 61.40 ± 1.90 | 59.99 ± 1.48 | 60.27 ± 2.29 |
| Cora | GAT | TEDDY | 2.43 ± 0.09 | 2.47 ± 0.09 | 2.50 ± 0.15 | 2.45 ± 0.08 | 2.44 ± 0.08 |
| Cora | GIN | UGS | 4.33 ± 0.74 | 4.15 ± 0.46 | 4.05 ± 0.18 | 3.99 ± 0.29 | 3.87 ± 0.20 |
| Cora | GIN | WD-GLT | 61.98 ± 3.15 | 60.63 ± 0.39 | 60.38 ± 0.46 | 60.46 ± 0.30 | 60.42 ± 0.45 |
| Cora | GIN | TEDDY | 1.76 ± 0.03 | 1.80 ± 0.05 | 1.82 ± 0.04 | 1.81 ± 0.02 | 1.81 ± 0.01 |
| Citeseer | GCN | UGS | 8.84 ± 2.27 | 7.87 ± 1.16 | 7.56 ± 0.81 | 7.56 ± 0.62 | 7.43 ± 0.65 |
| Citeseer | GCN | WD-GLT | 56.97 ± 2.45 | 56.05 ± 0.86 | 55.96 ± 1.33 | 55.38 ± 0.60 | 55.29 ± 0.53 |
| Citeseer | GCN | TEDDY | 3.27 ± 0.04 | 3.37 ± 0.06 | 3.46 ± 0.03 | 3.50 ± 0.01 | 3.60 ± 0.05 |
| Citeseer | GAT | UGS | 8.42 ± 1.17 | 7.95 ± 0.58 | 7.91 ± 0.70 | 7.89 ± 0.49 | 7.77 ± 0.58 |
| Citeseer | GAT | WD-GLT | 57.67 ± 2.29 | 56.90 ± 0.57 | 56.47 ± 0.80 | 57.30 ± 1.17 | 56.51 ± 0.57 |
| Citeseer | GAT | TEDDY | 3.35 ± 0.05 | 3.39 ± 0.04 | 3.50 ± 0.03 | 3.55 ± 0.01 | 3.66 ± 0.05 |
| Citeseer | GIN | UGS | 8.28 ± 1.00 | 8.37 ± 1.01 | 8.18 ± 0.84 | 8.16 ± 0.95 | 7.81 ± 0.64 |
| Citeseer | GIN | WD-GLT | 57.50 ± 2.02 | 56.11 ± 0.26 | 56.19 ± 0.54 | 56.88 ± 1.40 | 56.38 ± 0.71 |
| Citeseer | GIN | TEDDY | 3.42 ± 0.05 | 3.46 ± 0.04 | 3.57 ± 0.04 | 3.62 ± 0.02 | 3.71 ± 0.05 |
| Pubmed | GCN | UGS | 9.21 ± 0.96 | 8.48 ± 0.46 | 8.66 ± 0.62 | 8.28 ± 0.20 | 8.38 ± 0.42 |
| Pubmed | GCN | WD-GLT | 202.28 ± 1.98 | 201.13 ± 1.54 | 200.82 ± 2.08 | 201.07 ± 1.53 | 201.16 ± 1.22 |
| Pubmed | GCN | TEDDY | 3.91 ± 0.08 | 3.88 ± 0.07 | 3.90 ± 0.11 | 3.89 ± 0.10 | 3.91 ± 0.12 |
| Pubmed | GAT | UGS | 13.69 ± 0.97 | 13.18 ± 0.23 | 13.05 ± 0.24 | 13.07 ± 0.21 | 13.12 ± 0.14 |
| Pubmed | GAT | WD-GLT | 66.03 ± 0.75 | 65.68 ± 0.06 | 65.65 ± 0.12 | 65.66 ± 0.14 | 65.62 ± 0.09 |
| Pubmed | GAT | TEDDY | 6.33 ± 0.05 | 6.32 ± 0.03 | 6.33 ± 0.04 | 6.32 ± 0.04 | 6.33 ± 0.02 |
| Pubmed | GIN | UGS | 8.68 ± 0.78 | 8.38 ± 0.15 | 8.37 ± 0.19 | 8.35 ± 0.11 | 8.26 ± 0.07 |
| Pubmed | GIN | WD-GLT | 157.18 ± 0.48 | 157.03 ± 1.12 | 156.43 ± 0.93 | 157.75 ± 0.59 | 157.88 ± 0.51 |
| Pubmed | GIN | TEDDY | 4.07 ± 0.04 | 4.04 ± 0.03 | 4.06 ± 0.01 | 4.04 ± 0.04 | 4.04 ± 0.06 |
Comment
Wall-clock time (seconds) for parameter sparsification only on the large-scale datasets:

| Dataset | GNN | Method | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arxiv | GCN | UGS | 130.66 ± 15.32 | 125.78 ± 13.97 | 128.90 ± 14.12 | 113.48 ± 11.06 | 119.41 ± 11.84 |
| Arxiv | GCN | WD-GLT | N/A | N/A | N/A | N/A | N/A |
| Arxiv | GCN | TEDDY | 62.68 ± 1.29 | 57.10 ± 1.33 | 51.75 ± 0.97 | 47.55 ± 1.21 | 45.07 ± 1.20 |
| Arxiv | SAGE | UGS | 111.71 ± 7.45 | 104.29 ± 9.21 | 106.31 ± 8.13 | 94.02 ± 7.78 | 90.71 ± 6.59 |
| Arxiv | SAGE | WD-GLT | N/A | N/A | N/A | N/A | N/A |
| Arxiv | SAGE | TEDDY | 64.03 ± 0.16 | 64.03 ± 0.09 | 64.44 ± 0.17 | 64.13 ± 0.07 | 64.34 ± 0.13 |
| Reddit | GCN | UGS | 210.36 ± 51.92 | 200.75 ± 43.13 | 197.71 ± 47.39 | 173.81 ± 39.13 | 170.48 ± 42.65 |
| Reddit | GCN | WD-GLT | OOM | OOM | OOM | OOM | OOM |
| Reddit | GCN | TEDDY | 70.34 ± 0.05 | 65.69 ± 0.11 | 58.43 ± 0.03 | 51.72 ± 0.15 | 47.03 ± 0.13 |
| Reddit | SAGE | UGS | 203.75 ± 29.72 | 194.58 ± 25.91 | 188.02 ± 31.74 | 181.86 ± 35.12 | 182.97 ± 26.94 |
| Reddit | SAGE | WD-GLT | OOM | OOM | OOM | OOM | OOM |
| Reddit | SAGE | TEDDY | 80.63 ± 0.11 | 78.21 ± 0.06 | 67.93 ± 0.13 | 63.11 ± 0.04 | 51.03 ± 0.16 |
Comment

[On Superiority over Iterative Methods]

Q7. Our experimental results demonstrate superior performance compared to the baselines across various graph sizes, underscoring our method's robust generalization ability. This is particularly evident in Section 6.1, where TEDDY, when utilizing GAT as the base architecture, shows a performance enhancement ranging from 12.8% to 20.4% over the best performances of the baseline models. Additionally, TEDDY consistently exhibits stable performance across different datasets, as highlighted in our inductive semi-supervised node classification experiments. For instance, on the Pubmed dataset, TEDDY maintains decent accuracy despite the scarcity of training nodes (20% for training vs. 80% for inference). These results confirm TEDDY's strong generalization ability, marking it as a superior choice across diverse datasets compared to conventional iterative baselines.

Comment

[On distillation]

Q6. We thank the reviewer for raising an important point. Our findings, as illustrated in Figures 17 and 18, do indicate an enhancement in performance due to model distillation. However, our analysis emphasizes the significance of the 2-hop degree-based edge score $T_{edge}$ in Figure 18 of the revised paper. As demonstrated in the figure, the design of the edge scores is more critical, since no configuration of the baseline, even including distillation, achieves better pruning performance than our method with $T_{edge}$ alone (represented as Ours w/o $\mathcal L_{dt}$).

Moreover, the use of pretrained models is a common initial step in iterative pruning approaches as well, since they fully exploit the dense graph and GNNs at the initial pruning stage. Our method is designed with practical scenarios in mind: for example, facilitating the deployment of GNNs on local devices with performance equivalent to fully trained models on central servers. Hence, the term “single training” in our paper refers to the elimination of the iterative cycle of retraining GNNs for each pruning level. Instead, TEDDY accomplishes the sparsification in a single training phase given the sparsity levels and a trained GNN, yielding significant efficiency gains, as displayed in Tables 3-5 in the revised version.

Comment

[On Eq. (5)]

Q5. As we explicitly state in Section 3.1, TEDDY is specifically tailored to undirected graphs and thus does not directly address directed graphs. Although $T_{edge}$ is symmetric, as the reviewer pointed out, directed graphs were out of scope in the design of $T_{edge}$.

Comment

[On Ablation Study]

Q3-1-a. (On graph sparsification) We sincerely appreciate the reviewer’s constructive feedback. However, we would like to clarify that the purpose of graph sparsification is to preserve the original GNN’s accuracy rather than to improve it as a form of regularization. We therefore respectfully disagree with the view that our graph pruning method does not improve empirical performance at all. In fact, in the comparison of graph-sparsification-only performance between our pruning approach and the baselines, illustrated in Figure 19, our approach consistently achieves superior performance. This clearly substantiates that our graph sparsification indeed preserves, and in most cases even surpasses (7 out of 9 settings in Figure 19), the original GNNs’ performance.

Furthermore, following the reviewer’s thoughtful suggestion that the components of TEDDY can be applied separately, we provide comprehensive ablation studies for diverse combinations of our components. Utilizing GAT as the backbone architecture, we evaluated 7 distinct scenarios, detailed in Figure 18:

(Scenario 1) TEDDY with all components (specified as Ours),
(Scenario 2) TEDDY without the distillation loss (specified as Ours w/o $\mathcal L_{dt}$),
(Scenario 3) TEDDY with magnitude-based weight mask pruning (specified as Ours w/ $m_\theta$),
(Scenario 4) TEDDY without weight sparsification, i.e., our graph sparsification method with dense weights (specified as Ours w/o weight spar.),
(Scenario 5) UGS [7] with all components (specified as UGS),
(Scenario 6) UGS with the distillation loss (specified as UGS w/ $\mathcal L_{dt}$),
(Scenario 7) UGS with $\ell_0$-based projected gradient descent (specified as UGS w/ $\ell_0$ PGD).


Q3-1-b. (On the distillation loss) The results, as depicted, do indicate a performance improvement due to model distillation. However, they also clearly demonstrate that UGS, whether augmented with $\mathcal L_{dt}$ or not (scenarios 6 and 5), fails to match the performance of our framework without $\mathcal L_{dt}$ (scenario 2). Similar results for WD-GLT [8] are illustrated in Figure 17. This demonstrates that our edge sparsification method, rooted in low edge degree, has a more significant impact than the distillation component in GNN pruning.


Q3-1-c. (On weight sparsification) Furthermore, our method including $\ell_0$ PGD (scenario 1) shows performance comparable to that of TEDDY with dense weights (scenario 4) across varying pruning simulations, underscoring the robustness of our weight sparsification method. Additional studies to further evaluate this effect on other architectures will be included in a later revision.


[On Novelty of Multi-hop Consideration]

Q3-2. We acknowledge that multi-hop subgraph analysis is a well-established concept in graph learning. Nevertheless, leveraging degree information alone, with multi-level consideration, for the node classification task has been under-explored. The referenced work [9] mainly addresses performance enhancement on graph-level tasks by incorporating a broader range of neighborhood features and structural complexities, rather than concentrating on pruning GNNs while preserving the original performance. Our approach, which specifically considers multi-hop edge degrees for GNN pruning, introduces a novel angle within this domain.

Comment

Thank you for your response. Could you provide insights on the isolated impact of the proposed edge sparsification technique? How does the performance of the proposed edge pruning, as outlined in Equation (5), compare to basic high-degree pruning (as shown in Figure 1, Figure 10, and Table 2) when both the distillation loss and weight sparsification are disabled?

Based on the observations from the top of Figure 18, it appears that weight sparsity has a minimal impact on performance. Additionally, the effectiveness of the distillation loss has been shown to be quite significant. Considering these factors, it would be interesting to explore the specific performance benefits attributable to the proposed edge sparsification technique. This would help in understanding the extent to which the edge pruning method, as a standalone component, contributes to the overall performance gains.

Comment

[Isolated Impact of TEDDY’s Edge Sparsification]

Following the reviewer’s insightful suggestions, we present a comparison of graph-sparsification-only performance between TEDDY’s multi-level degree-based pruning, basic high-degree pruning, and the baselines. We applied the same three random seeds to all settings to guarantee fair comparisons.

As depicted in the table below, the multi-level degree-based pruning of our method demonstrates superiority over all baselines, including naive high-degree pruning, particularly on the Pubmed dataset. The results thus clearly demonstrate the efficacy of our degree-aware sparsification technique, as well as the significance of the multi-level consideration. We will provide additional experimental results on GCN and GAT in a later version.

Again, we are deeply grateful to the reviewer for thoughtful suggestions. Your feedback has been invaluable in enhancing the quality of our work. We hope our overall responses have addressed your concerns.


Table 1. Graph-sparsification-only performance (in percent) of the proposed method, high-degree-based pruning, and the baselines.

| Dataset | Method | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- |
| Cora | UGS | 71.83 ± 2.75 | 69.83 ± 2.96 | 68.27 ± 2.18 | 68.53 ± 2.15 | 67.60 ± 2.49 |
| Cora | WD-GLT | 76.83 ± 0.31 | 75.47 ± 0.05 | 73.80 ± 0.93 | 71.43 ± 1.47 | 70.97 ± 1.23 |
| Cora | Ours | 73.24 ± 0.36 | 77.60 ± 0.24 | 76.52 ± 0.27 | 76.48 ± 0.28 | 75.78 ± 0.19 |
| Cora | High degree | 77.52 ± 0.59 | 77.54 ± 1.86 | 75.10 ± 1.46 | 73.76 ± 0.81 | 70.90 ± 1.01 |
| Citeseer | UGS | 64.80 ± 2.02 | 65.57 ± 3.35 | 65.37 ± 2.08 | 64.47 ± 1.86 | 63.50 ± 1.84 |
| Citeseer | WD-GLT | 63.93 ± 2.90 | 64.63 ± 2.38 | 62.93 ± 2.90 | 63.60 ± 1.92 | 62.57 ± 1.50 |
| Citeseer | Ours | 69.64 ± 0.48 | 68.96 ± 0.15 | 68.78 ± 0.20 | 68.75 ± 0.19 | 68.58 ± 0.17 |
| Citeseer | High degree | 67.92 ± 1.14 | 68.98 ± 0.62 | 68.80 ± 0.58 | 67.08 ± 0.17 | 65.52 ± 1.03 |
| Pubmed | UGS | 72.57 ± 1.93 | 74.70 ± 2.94 | 74.37 ± 2.25 | 73.27 ± 2.01 | 72.43 ± 2.03 |
| Pubmed | WD-GLT | 76.43 ± 1.22 | 76.33 ± 2.19 | 77.50 ± 0.99 | 75.87 ± 0.60 | 75.47 ± 0.66 |
| Pubmed | Ours | 78.66 ± 0.17 | 78.44 ± 0.39 | 78.18 ± 0.35 | 77.54 ± 0.34 | 77.14 ± 0.33 |
| Pubmed | High degree | 78.12 ± 0.43 | 77.36 ± 0.39 | 76.78 ± 1.25 | 74.82 ± 0.58 | 73.00 ± 0.51 |
Comment

Thank you for addressing most of my technical concerns in your responses. I'd like to update my rating to "5: marginally below the acceptance threshold".

However, I am still concerned about the presentation of the distillation effect in the paper. The experimental results clearly indicate that distillation is a crucial and significant component. Yet, this aspect is not mentioned until P6, Section 5.2. The earlier sections, up to Section 5.2, primarily focus on edge sparsification, without even a mention of 'distillation'. This approach raises serious doubts for me regarding the structure and narrative style of the paper.

As reviewer mYP6 also points out,

one still needs to train a dense network on the entire graph first

This is a somewhat significant shortcoming in terms of efficiency. However, it is somewhat hidden from the reader and is not made visible until Page 6 and Section 5.2. To the reviewer, for a paper that emphasizes empirical performance, it's more critical to clearly identify the most effective components in the main text, rather than adopting a storytelling approach. Being straightforward about the key elements contributing to performance should be prioritized over narrative-oriented writing.

Comment

We are deeply grateful to reviewer k6Ge for insightful suggestions and for raising the score of our paper.

Following the reviewer's valuable feedback, we will refine our manuscript to improve clarity and precision, particularly concerning the distillation technique and the utilization of a trained model, in subsequent versions.

Once again, we sincerely thank the reviewer for the reassessment of our work.

Comment

[On Figure 1]

Q2. In the analysis of Figure 1, we define the edge score of each edge $(i,j)$ as $T_{edge}(i, j) =(|\mathcal N(i)|+|\mathcal N(j)|)/{2}$, where $\mathcal N(k)$ denotes the set of neighboring nodes connected to node $k$. For our experiments, we conduct 20 simulations in which GNNs are trained after pruning edges based on their scores. Specifically, given a graph sparsity ratio $p_g$ for each simulation, we pruned the $p_g\%$ of edges having the highest edge score $T_{edge}$ in the high-degree pruning scenario (represented as blue lines in the figure), and pruned the edges with the lowest scores in the low-degree scenario (represented as orange lines in the figure). Hence, TEDDY does not involve any stochastic procedures such as random sampling.
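For clarity, the procedure used for Figure 1 can be sketched as follows. This is an illustrative example with hypothetical tensors, not the exact experiment script, and it treats each stored edge direction independently for simplicity:

```python
import torch

def prune_by_degree(edge_index: torch.Tensor, num_nodes: int,
                    sparsity: float, drop_high: bool) -> torch.Tensor:
    """Drop a `sparsity` fraction of edges by the first-order score
    T_edge(i, j) = (|N(i)| + |N(j)|) / 2; keep the rest (no randomness involved)."""
    src, dst = edge_index
    deg = torch.zeros(num_nodes).index_add_(0, src, torch.ones(src.numel()))
    score = (deg[src] + deg[dst]) / 2.0
    num_drop = int(sparsity * score.numel())
    # Highest-score edges are dropped in the high-degree scenario, lowest in the low-degree one.
    order = torch.argsort(score, descending=drop_high)
    keep = order[num_drop:]
    return edge_index[:, keep]

# Toy usage at 20% graph sparsity on a 4-node path graph (both edge directions stored).
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
print(prune_by_degree(edge_index, num_nodes=4, sparsity=0.2, drop_high=True))
```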

However, to address the reviewer’s suggestion, we conducted additional simulations with 5 different random seeds (0 to 4) and report the average performance in Table 2. In the table, deg_low refers to pruning based on the lowest degrees, while deg_high corresponds to the highest-degree-based pruning. The results clearly demonstrate that pruning low-degree edges, i.e., those with low $T_{edge}$, can significantly deteriorate the performance, especially for GAT on the Pubmed dataset, where the gap between deg_low and deg_high reaches a maximum of 24.1% in the final simulation. For convenience, we attach the table below as Table 2 in the revised manuscript.

Table 2. Accuracy (%) under low-degree (deg_low) vs. high-degree (deg_high) pruning; each cell reports deg_low / deg_high.

| GNN | Dataset | 1-st | 5-th | 10-th | 15-th | 20-th |
| --- | --- | --- | --- | --- | --- | --- |
| GCN | Cora | 80.38 ± 0.19 / 81.28 ± 0.64 | 77.66 ± 0.41 / 77.92 ± 0.34 | 75.36 ± 0.42 / 77.68 ± 0.32 | 74.24 ± 0.16 / 76.02 ± 0.43 | 69.54 ± 0.32 / 72.62 ± 0.32 |
| GCN | Citeseer | 70.44 ± 0.21 / 70.88 ± 0.28 | 67.0 ± 0.27 / 70.60 ± 0.18 | 64.78 ± 0.31 / 67.58 ± 0.72 | 63.46 ± 0.42 / 67.72 ± 0.37 | 63.22 ± 0.51 / 65.70 ± 0.56 |
| GCN | Pubmed | 78.32 ± 0.19 / 79.18 ± 0.12 | 76.66 ± 0.79 / 77.66 ± 0.27 | 77.52 ± 0.28 / 77.32 ± 0.20 | 77.72 ± 0.17 / 76.74 ± 0.10 | 75.20 ± 0.20 / 76.54 ± 0.08 |
| GAT | Cora | 77.64 ± 1.33 / 79.72 ± 1.06 | 68.34 ± 1.26 / 78.50 ± 1.16 | 56.72 ± 0.48 / 72.44 ± 1.10 | 50.12 ± 0.93 / 67.74 ± 0.92 | 44.78 ± 0.80 / 51.64 ± 1.67 |
| GAT | Citeseer | 69.48 ± 1.36 / 70.96 ± 0.75 | 57.02 ± 0.64 / 66.86 ± 0.63 | 42.68 ± 2.14 / 59.74 ± 1.64 | 35.40 ± 1.50 / 49.86 ± 1.82 | 28.40 ± 0.97 / 43.24 ± 1.43 |
| GAT | Pubmed | 78.40 ± 0.61 / 78.14 ± 1.01 | 71.66 ± 0.59 / 76.32 ± 0.55 | 58.48 ± 0.85 / 74.52 ± 0.56 | 49.52 ± 0.64 / 72.12 ± 0.64 | 44.10 ± 0.57 / 68.20 ± 0.61 |
| GIN | Cora | 74.86 ± 1.14 / 76.48 ± 1.66 | 70.70 ± 1.13 / 76.84 ± 1.81 | 66.70 ± 0.33 / 74.70 ± 2.32 | 62.72 ± 1.38 / 72.96 ± 0.55 | 58.82 ± 1.99 / 70.28 ± 0.99 |
| GIN | Citeseer | 67.52 ± 0.99 / 68.78 ± 0.98 | 63.80 ± 1.48 / 69.16 ± 0.96 | 62.32 ± 1.14 / 68.46 ± 0.48 | 60.60 ± 1.27 / 66.92 ± 0.27 | 59.48 ± 2.25 / 65.64 ± 1.09 |
| GIN | Pubmed | 75.64 ± 0.72 / 77.96 ± 0.36 | 75.12 ± 0.66 / 77.38 ± 0.27 | 73.68 ± 1.26 / 77.44 ± 0.10 | 70.74 ± 1.39 / 74.88 ± 0.62 | 70.88 ± 2.72 / 73.30 ± 0.11 |
Comment

[Additional Insights on Training Noise]

In response to the reviewer’s constructive feedback, we report the average pruning performance of TEDDY and the baselines across all sparsities, represented in Figures 13 and 14 in the revised manuscript. To ensure fair comparisons, we adopted the same random seeds for all methods in each setting. Owing to the constraints of the rebuttal period, we present performance results of NAGphormer [1] and Specformer [2] on the Cora and Citeseer datasets, averaged over 3 runs (note that the eigendecomposition of the Pubmed dataset required prohibitive time). Meanwhile, we adopted the same five random seeds for UniMP [3].

As depicted in the figures, TEDDY still accomplishes prominent performance, surpassing all baselines when equipped with UniMP as a backbone. In particular, UGS [4] and WD-GLT [5] exhibit severe instability as the sparsity increases, whereas the performance of TEDDY remains stable across all simulations with notably small standard deviations. Our method also achieves decent performance with NAGphormer and Specformer, notably NAGphormer on the Cora dataset, with a considerable performance enhancement over the original result. Hence, the results clearly demonstrate our method’s versatility and robustness across diverse foundational architectures.


References

[1] NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs, ICLR 2023.

[2] Specformer: Spectral Graph Neural Networks Meet Transformers, ICLR 2023.

[3] Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification, IJCAI 2021.

[4] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

[5] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR 2023.

Review
Rating: 6

This paper proposes a one-shot graph pruning algorithm to find Graph Lottery Tickets (GLTs) by (i) deleting graph edges incident to nodes with higher degrees and (ii) sparsifying network parameters using $\ell_0$ regularization with projected gradient descent. The sparse networks obtained by the authors are shown to improve performance over existing graph lottery ticket methods.

Strengths

  1. The proposed method finds GLTs using a one shot training approach outperforming existing iterative graph pruning algorithms.

Weaknesses

  1. While the proposed idea of sparsifying graph edges based on degree information is simple and makes intuitive sense (as also shown empirically), why do the authors consider only degree information, as opposed to other metrics such as the spectral information of the graph or centrality, which might convey more information about the graph? Is degree information sufficient in this regard to obtain a sufficiently sparse graph in comparison with these other metrics? Recent work (Rong et al. [1]) has shown that randomly dropping graph edges can also help training; how does randomly pruning graph edges compare with the proposed idea?

  2. The parameter sparsification method used is PGD with $\ell_0$ regularization, but no justification is provided for using this particular method. There are other continuous sparsification schemes, such as Kusupati et al. [2] and Louizos et al. [3], that have been shown to achieve highly sparse feedforward networks in a single training run.

  3. The authors use GATs and GCNs in their experiments. However, the structure of GATs allows them to inherit some degree information in the form of attention, while GCNs do not. Does this difference change the performance of the proposed graph pruning criterion for either architecture?

  4. The results in Table 1 might be better visualized via a heatmap showing how much graph sparsity and weight sparsity can be achieved, and whether there is a tradeoff between the two.

[1] Rong, Yu, et al. "Dropedge: Towards deep graph convolutional networks on node classification." arXiv preprint arXiv:1907.10903 (2019).

[2] Kusupati, Aditya, et al. "Soft threshold weight reparameterization for learnable sparsity." International Conference on Machine Learning 2020.

[3] Louizos, Christos, Max Welling, and Diederik P. Kingma. "Learning Sparse Neural Networks through L_0 Regularization." International Conference on Learning Representations. 2018.

Questions

  1. The proposed method uses a pretraining strategy on the full graph before the pruning step. Are the parameters reinitialized at the end of the pruning stage, as in lottery tickets, or does training continue after graph sparsification?

  2. The authors mention the use of multi-level degree information for the graph pruning criterion in Eq. 4. However, it is not clear whether the number of hops considered is determined by the number of layers in the network or is a hyperparameter that is tuned.

Comment

[On Table 1]

W4. We appreciate the reviewer’s helpful feedback. In response to the reviewer’s suggestion, we present the results of our method on GIN under extreme sparsity as a heatmap in Figure 20 of the revised paper. Each heatmap element reports the average performance over 5 random seeds.

Overall, the performance of TEDDY is not uniformly affected by increased sparsity. On the Cora dataset, our method demonstrates resilience to higher graph sparsity levels, surpassing the vanilla GIN’s accuracy of 76.34%. This suggests that a sparser graph generated by our method has the potential to mitigate overfitting and improve generalization. For the Citeseer dataset, the performance remains relatively stable across a range of weight sparsity levels, underscoring robustness to parameter reduction. Furthermore, all configurations exceed the vanilla GIN's accuracy of 68.1%, strongly indicating the effectiveness of TEDDY across diverse sparsities. A similar efficacy is observed on the Pubmed dataset, where all configurations surpass the vanilla GIN’s performance of 77.9%. Interestingly, on Pubmed the performance generally improves with increased sparsity, with the highest accuracy achieved under the most extreme sparsity ($p_g = p_\theta = 85\%$).


[On parameter reinitialization at the end of the pruning stage]

Q1. Our method does not adhere to the standard lottery ticket hypothesis framework, where parameters are reinitialized to their original values after pruning. Instead, our approach is agnostic to the initial parameter values. Following the graph sparsification, we do not revert to the original initialization but continue training from the current state of the parameters, which may be randomly initialized or pre-trained. Hence, TEDDY attains performance as robust as that of the original GNNs, irrespective of the initial starting points of the parameters.


[The number of hops considered for multi-level degree consideration]

Q2. In our method, as outlined in Eq. 3-5, we consider two-hop degree information to calculate the edge-wise scores $T_{edge}$ in all settings. Other choices for the number of hops are possible, but we fixed it at two, since two-hop degree information is sufficient to yield strong performance. Consequently, the number of hops in our pruning criterion is a predetermined choice rather than a tuned hyperparameter.
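
For readers who prefer code, the snippet below gives a minimal sketch of what a two-hop, degree-based edge score can look like. It is an illustration rather than a reproduction of Eq. 3-5: the helper names (`degree_based_edge_scores`, `prune_lowest`) are hypothetical, and the sketch only assumes that the score of an edge combines the (inverse) degree statistics of its endpoints over one- and two-hop neighborhoods, with low-scoring edges pruned first.

```python
import torch

def degree_based_edge_scores(edge_index, num_nodes):
    # Illustrative degree-based edge scoring; NOT the exact T_edge of Eq. 3-5.
    # Assumes edge_index is a LongTensor of shape [2, E] in which each
    # undirected edge appears in both directions (PyG convention).
    row, col = edge_index

    # One-hop degrees: a single scatter-add over the source nodes.
    deg = torch.zeros(num_nodes).index_add_(0, row, torch.ones(row.numel()))

    # Crude two-hop statistic: for each node, the sum of its neighbors' degrees.
    two_hop = torch.zeros(num_nodes).index_add_(0, row, deg[col])

    # Edges between high-degree endpoints (at one and two hops) receive small
    # scores, so they would be pruned first under this sketch.
    eps = 1e-8
    return (1.0 / (deg[row] + eps) + 1.0 / (deg[col] + eps)
            + 1.0 / (two_hop[row] + eps) + 1.0 / (two_hop[col] + eps))

def prune_lowest(edge_index, scores, graph_sparsity):
    # Keep the (1 - graph_sparsity) fraction of edges with the highest scores.
    num_keep = int(edge_index.size(1) * (1.0 - graph_sparsity))
    keep = torch.topk(scores, num_keep).indices
    return edge_index[:, keep]
```

Note that both steps are linear in the number of edges, which is what allows the edge selection to be performed once, before training.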


References

[1] Dropedge: Towards deep graph convolutional networks on node classification, ICLR 2020.

[2] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

[3] Soft threshold weight reparameterization for learnable sparsity, ICML 2020.

[4] Learning Sparse Neural Networks through L_0 Regularization, ICLR 2018.

Comment

I thank the authors for the additional experiments and for revising the manuscript. I am satisfied with their response and will keep my score for acceptance.

I wanted to follow up on W3 above: in the case where the input to the attention is zero, the output of the softmax would still be nonzero. However, if the edge has been masked out, it should not be included in the aggregation, which would imply that the attention value does not matter?

Comment

We sincerely thank reviewer vxbf for the reassessment and the positive opinion of our work. Your constructive feedback has enhanced our work during the rebuttal, and we are glad that our revision and responses have addressed your concerns.

In the original code of UGS [1], the edge mask is only applied to the edge weight, i.e., the pre-normalized attention coefficient, NOT to the node embeddings. Hence, since the softmax-normalized attention coefficient for a pruned edge $(i, j)$ remains non-zero under the original code, the aggregation that updates node $i$ still involves the embedding of node $j$.
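
The effect is easy to reproduce in isolation; the following toy snippet (illustrative, not taken from the UGS code) contrasts multiplying the pre-softmax coefficients by a 0/1 mask with masking via $-\infty$:

```python
import torch

# Unnormalized attention logits of node i over its four incident edges.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
mask = torch.tensor([1.0, 1.0, 0.0, 1.0])   # the third edge is "pruned"

# Masking the pre-softmax coefficients multiplicatively: the pruned edge's
# logit becomes 0 (not -inf), so softmax still assigns it a non-zero weight
# and node j's embedding leaks back into the aggregation for node i.
alpha_masked_before_softmax = torch.softmax(logits * mask, dim=0)
print(alpha_masked_before_softmax)   # all four entries are strictly positive

# Setting the masked logits to -inf before the softmax removes the edge.
alpha_correct = torch.softmax(logits.masked_fill(mask == 0, float("-inf")), dim=0)
print(alpha_correct)                 # the third entry is exactly 0
```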


References

[1] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML 2021.

Comment

We sincerely appreciate the reviewer’s helpful and constructive feedback. We address the reviewer’s concerns below.

[On Motivation]

W1-1. While we agree with the reviewer that considering aspects such as spectral information or centrality is an appealing approach, it is worth noting that for large graphs, such as the Reddit or ArXiv datasets, understanding the spectral properties of the graph Laplacian/adjacency matrix is often computationally infeasible. Exploiting spectral information requires the spectrum itself (e.g., eigenvectors/eigenvalues), and obtaining it necessitates algorithms such as eigendecomposition, with an expensive time complexity of $O(|V|^3)$. Thus, our primary goal is to design a pruning algorithm that can quickly identify to-be-pruned edges even for large graphs. As an initial step toward this goal, we focus on degree information, and we believe our extensive experiments (Section 6) demonstrate that, even with degree information alone, TEDDY can effectively maintain the performance of dense graphs/GNNs. Nevertheless, in line with the reviewer's suggestion, we acknowledge the potential of incorporating a more diverse range of graph information and consider it a future avenue.
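
The cost gap can be made concrete with a small illustration (not taken from the paper's code; the synthetic sizes below are arbitrary): degree statistics require a single scatter-add over the edge list, whereas the full spectrum of a dense $|V| \times |V|$ matrix is out of reach for graphs of this scale.

```python
import torch

num_nodes = 100_000
edge_index = torch.randint(0, num_nodes, (2, 1_000_000))   # synthetic edge list

# O(|E|): degree statistics need only one pass (scatter-add) over the edges.
deg = torch.zeros(num_nodes).index_add_(
    0, edge_index[0], torch.ones(edge_index.size(1)))

# O(|V|^2) memory and O(|V|^3) time: the full spectrum of a dense
# Laplacian/adjacency matrix is infeasible at this scale.
# A = torch.zeros(num_nodes, num_nodes)        # 100k x 100k floats ~ 40 GB
# eigvals, eigvecs = torch.linalg.eigh(A)      # prohibitively expensive
```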


[On Comparison with Random Edge Pruning]

W1-2. In response to the reviewer’s insightful suggestion, we expanded our experiments to include the random dropping strategy of DropEdge [1]. In this experiment, we conducted graph sparsification only, in order to directly compare the impact of the different edge pruning approaches, including our TEDDY.

According to the results illustrated in Figure 19, TEDDY consistently surpasses the baseline approaches, including DropEdge. Notably, our method not only preserves but exceeds the accuracy of vanilla GNNs on the Citeseer and Pubmed datasets across all benchmark GNNs. Figure 19 also demonstrates that the DropEdge strategy results in suboptimal performance, underscoring the necessity of a more sophisticated pruning strategy. Interestingly, in this pure graph sparsification scenario, UGS [2] tends to exhibit performance similar to that of DropEdge.
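
For completeness, the random baseline amounts to the following sketch (a hypothetical helper name, assuming the same `[2, E]` edge-list layout as the earlier snippet): every edge is dropped independently with the same probability, with no degree information involved.

```python
import torch

def drop_edges_randomly(edge_index, drop_rate):
    # DropEdge-style sparsification: each edge is removed uniformly at
    # random, regardless of its degree statistics.
    keep = torch.rand(edge_index.size(1)) >= drop_rate
    return edge_index[:, keep]
```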


[On Parameter Sparsification]

W2. The most crucial aspect of our PGD on the $\ell_0$-ball is its capability for one-shot pruning at any given sparsity level. For example, previous approaches such as UGS and WD-GLT require sensitive tuning of the $\ell_1$ regularization parameter due to the iterative process needed to reach the target sparsity level, whereas our method enables one-shot pruning without any regularization parameter. In a similar sense, the references [3] and [4] that the reviewer brings to our attention also necessitate tuning a regularization parameter to achieve the target sparsity level, which may be burdensome in practice. Additionally, by encouraging sparsity during training, we gain computational advantages in terms of the actual number of operations even during the training phase.
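
Projection onto the $\ell_0$-ball has a simple closed form: after a gradient step, keep the $k$ largest-magnitude entries and zero out the rest, with $k$ fixed directly by the target sparsity. The sketch below is a generic illustration of this scheme (not the authors' exact training loop; the function names are hypothetical), and it makes the point above visible: there is no penalty coefficient to tune.

```python
import torch

def project_onto_l0_ball(param, sparsity):
    # Keep the k largest-magnitude entries and zero the rest; k is fixed
    # directly by the target sparsity (assumed < 1). Ties at the threshold
    # may retain a few extra entries, which is fine for a sketch.
    flat = param.reshape(-1)
    k = max(1, int(flat.numel() * (1.0 - sparsity)))
    threshold = flat.abs().topk(k).values.min()
    mask = (flat.abs() >= threshold).to(flat.dtype)
    return (flat * mask).reshape_as(param)

def pgd_step(param, grad, lr, sparsity):
    # One projected-gradient-descent step at a fixed target weight sparsity.
    return project_onto_l0_ball(param - lr * grad, sparsity)
```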


[On Architectures]

W3. In addressing the differences between GATs and GCNs in our experiments, it is important to note an issue in the public UGS [2] code for GAT (https://github.com/VITA-Group/Unified-LTH-GNN/blob/main/NodeClassification/gnns/gat_layer.py). In that code, the edge mask is applied prior to the softmax normalization of the attention coefficients. This implementation leads to a notable issue: pruned edges inevitably receive non-zero normalized attention coefficients, reviving pruned connections during the neighborhood aggregation phase. Our experimental results using a corrected implementation revealed a significant performance decrease compared to the results reported in [2]. This suggests that removing edges before forwarding through the GAT layer may not significantly impact performance when the attention mechanism inherently captures degree information beneficial for pruning. Given these observations, it remains unclear whether the attention mechanism in GATs offers a substantial advantage over GCNs in encoding degree information.

Comment

Dear reviewers and AC,

We sincerely appreciate your dedicated time and insightful feedback on our manuscript.

As highlighted by the reviewers, we introduce TEDDY, a novel edge sparsification framework in a one-shot approach (vxbf, mYP6, SQ4D, QTmY). This has been appreciated for its simplicity (vxbf, SQ4D), conciseness (k6Ge), and novelty (mYP6). Our method is inspired by a critical observation (SQ4D, QTmY) on the importance of preserving low-degree edges, which we have substantiated both empirically and theoretically (vxbf, mYP6). TEDDY has demonstrated its ability to outperform existing state-of-the-art iterative graph pruning methods (k6Ge, vxbf, SQ4D), effectively balancing efficiency and effectiveness (mYP6, QTmY).

In accordance with your constructive comments, we have carefully revised the manuscript with the following additional discussions and experiments:

  • Elucidation of the novel aspects distinguishing TEDDY from contemporary sparsification techniques.
  • A comparison of pruning durations between TEDDY and other existing approaches.
  • An extensive ablation study, covering various scenarios including:
    • TEDDY with all components integrated.
    • TEDDY without the distillation loss.
    • TEDDY employing magnitude-based pruning.
    • TEDDY with dense weights.
    • Baseline with all components integrated.
    • Baseline with the distillation loss.
    • Baseline employing 0\ell_0-based projected gradient descent.

We have marked the revisions in blue for your convenience. We believe that these adjustments better convey the effectiveness of TEDDY, offering a valuable contribution to the ICLR community.

Thank you once again for your guidance and support.

Authors.

Comment

Dear Reviewers and Area Chair,

We kindly request that you review our responses and the revision at your earliest convenience, since this week (22nd ~ 23rd) marks the end of our interaction period.

We have carefully responded to your feedback and faithfully reflected it in the rebuttal and the revised paper, along with considerable additional experimental results, including comprehensive ablation studies.

We sincerely thank you for your time and effort in reviewing our paper, and for your insightful and constructive comments.

Thank you,

Authors

Comment

Dear Reviewers and Area Chair,

As the rebuttal period is swiftly approaching its conclusion, we kindly remind you to review our responses and revisions at your earliest convenience. Your attention to them will be profoundly valuable to us. We deeply appreciate the time and effort you have dedicated to evaluating our work.

Sincerely,

Authors

AC Meta-Review

The paper introduces a new method for graph sparsification to speed up learning. The method is based on three main techniques: degree-based graph sparsification; distillation by matching the classification logits of a model pre-trained on the whole graph; and parameter sparsification by using only a subset of the parameters.

The paper studies an important problem and introduces a new simple method with good experimental results. Overall, the paper is an interesting contribution and it would be a good addition to the ICLR program.

Nevertheless, the committee strongly suggests a few changes to the presentation before publication:

  • the paper should put more emphasis on the role of distillation, which appears central to the experimental results.
  • the paper should include a discussion of the method's complexity in the final version.

Why not a higher score

The paper is interesting, but it has the few shortcomings listed above.

Why not a lower score

The paper presents a new method for an important problem, with convincing experimental results.

Final Decision

Accept (poster)