PaperHub
5.7 / 10
Poster · 3 reviewers
Lowest 5 · Highest 6 · Std. dev. 0.5
Ratings: 6 / 6 / 5
Confidence: 3.3
ICLR 2024

Graph Lottery Ticket Automated

OpenReview · PDF
Submitted: 2023-09-16 · Updated: 2024-03-12

Abstract

Graph Neural Networks (GNNs) have emerged as the leading deep learning models for graph-based representation learning. However, the training and inference of GNNs on large graphs remain resource-intensive, impeding their utility in real-world scenarios and curtailing their applicability in deeper and more sophisticated GNN architectures. To address this issue, the Graph Lottery Ticket (GLT) hypothesis assumes that a GNN with random initialization harbors a pair of core subgraph and sparse subnetwork, which can yield performance comparable to, and efficiency higher than, that of the original dense network and complete graph. Although GLT offers a new paradigm for GNN training and inference, existing GLT algorithms heavily rely on trial-and-error pruning rate tuning and scheduling, and adhere to an irreversible pruning paradigm that lacks elasticity. Worse still, current methods suffer from scalability issues when applied to deep GNNs, as they maintain the same topology structure across all layers. These challenges hinder the integration of GLT into deeper and larger-scale GNN contexts. To bridge this critical gap, this paper introduces an $A$daptive, $D$ynamic, and $A$utomated framework for identifying $G$raph $L$ottery $T$ickets ($AdaGLT$). Our proposed method derives its key advantages and addresses the above limitations through the following three aspects: 1) tailoring layer-adaptive sparse structures for various datasets and GNNs, thus endowing it with the capability to facilitate deeper GNNs; 2) integrating the pruning and training processes, thereby achieving a dynamic workflow encompassing both pruning and restoration; 3) automatically capturing graph lottery tickets across diverse sparsity levels, obviating the necessity for extensive pruning parameter tuning. More importantly, we rigorously provide theoretical proofs to guarantee that $AdaGLT$ mitigates over-smoothing issues and obtains improved sparse structures in deep GNN scenarios. Extensive experiments demonstrate that $AdaGLT$ outperforms state-of-the-art competitors across multiple graph datasets of various scales and types, particularly in scenarios involving deep GNNs.
Keywords
Lottery Ticket Hypothesis, Graph Neural Networks, Graph Neural Tangent Kernel

Reviews and Discussion

Review
Rating: 6

This paper argues that existing graph lottery ticket algorithms like UGS heavily rely on trial-and-error pruning rate tuning and scheduling, and suffer from stability issues when extended to deeper GCNs. Those limitations call for an adaptive framework to automatically identify the pruning hyper-parameters that improve the scalability of GCNs. To this end, the authors propose a framework called AdaGLT, adaptive graph lottery tickets, to overcome such limitations. Extensive experiments validate the effectiveness of the proposed AdaGLT as compared to previous ad-hoc UGS.

Strengths

  1. The paper is well-written and organized. As compared to previous UGS, the proposed AdaGLT jointly sparsifies both graphs and GNN weights and adopts an adaptive layer sparsification process. In addition, it offers the practitioner the chance to adopt automatic pruning and dynamic restoration for extracting the graph lottery tickets.

  2. The effective combination of automation, layer adaptivity, and dynamic pruning/restoration provides better properties as compared to previous methods.

Weaknesses

Given the contributions of UGS and the other graph lottery ticket works in this series, this automated tool sounds like incremental work. However, the analysis of scalability issues when applied to deep GNNs is worth reading.

Most experiments are conducted on small and transductive graphs/tasks. Ogbn-arxiv is actually a small graph and ogbn-proteins a medium one, yet they are claimed to be large graphs; more reasonably large graphs should be considered. Also, how about inductive settings like GraphSAGE or other SOTA ones?

Other than the above aspects, I think the other perspectives of this paper are clear.

Questions

See weaknesses. I wonder whether the contribution applies to an inductive setting or evolving graphs with increasing nodes or edges.

Details of Ethics Concerns

N/A

Comment

(continued response to Weakness 2)

2. Modification for inductive tasks

We will (1) first discuss why conventional GLT methods [1,2] cannot be extended to inductive settings, and then (2) explain how the components of AdaGLT can be slightly modified for inductive settings.

Allow us to respectfully recall that UGS introduces two differentiable masks $m_A$ and $m_\theta$, with shapes identical to $\mathbf{A}$ and $\mathbf{\Theta}$. This means that UGS is inherently transductive and cannot be applied to scenarios with multiple graphs and varying numbers of nodes. In contrast, AdaGLT adopts an edge explainer to generate a soft edge mask of size $\mathcal{O}(E)$, which means that it can generate edge masks for multiple graphs of arbitrary sizes.

However, another component of AdaGLT, namely the trainable thresholds, hinders its direct transition to inductive settings. It is noteworthy that, for graph pruning, we employ a threshold vector $t^{(l)}_A \in \mathbb{R}^{N}$ for each layer $l$. However, such fixed-length thresholds are inherently transductive.

This obstacle is readily addressed: akin to the transition from a trainable mask $m_A$ in UGS to an edge explainer $\Psi$ in AdaGLT, we can seamlessly replace trainable thresholds with generated ones. Specifically, we employ a simple MLP $\Upsilon$ in each layer to generate the trainable threshold:

$$t_A^{(l)} = \operatorname{Sigmoid}\left(\Upsilon^{(l)}(\mathbf{X})\right) \in \mathbb{R}^{N}, \quad l\in\{0,1,\cdots,L-1\}.$$

Note that $\mathbf{X}$ is the node feature matrix, and $\Upsilon^{(l)}$ followed by a sigmoid function transforms $\mathbf{X}$ into the node-wise thresholds at layer $l$. Practically, we leverage a 2-layer MLP with hidden dimensions $\{64, 1\}$ in the next section.

This modification to AdaGLT involves simply replacing Line 6 in the Algorithm table with the following one:

$$t_A^{(l)} = \operatorname{Sigmoid}\left(\Upsilon^{(l)}(\mathbf{X})\right), \qquad m_{A,ij}^{(l)} = \mathbb{1}\left[s_{ij} < t_{A,i}^{(l)}\right] \prod_{k=0}^{l-1} m_{A,ij}^{(k)}$$

With both the edge masks and thresholds generated rather than being fixed parameters, AdaGLT can be seamlessly incorporated in any inductive scenario. We employ this version of AdaGLT in the subsequent section to report its performance.
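For concreteness, below is a minimal PyTorch sketch of this inductive variant. The module and function names are ours, and the direction of the threshold comparison (keeping an edge whose score is at least the threshold of its source node) is an assumption that mirrors the weight-pruning convention discussed later in this thread; treat it as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ThresholdGenerator(nn.Module):
    """Sketch of the per-layer MLP Upsilon^{(l)}: a 2-layer MLP with hidden
    dimensions {64, 1} that maps node features X to node-wise thresholds
    t_A^{(l)} in (0, 1) via a sigmoid."""
    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [N, F] node features  ->  t_A^{(l)}: [N]
        return torch.sigmoid(self.mlp(x)).squeeze(-1)


def layer_edge_mask(edge_scores, edge_index, thresholds, prev_mask):
    """Sketch of the modified Line 6: an edge (i, j) survives at layer l only if
    its score clears the threshold of node i (assumed direction) AND it has
    survived all previous layers (cumulative product of earlier masks)."""
    src = edge_index[0]                              # [E] source node of each edge
    keep = (edge_scores >= thresholds[src]).float()  # [E]
    return keep * prev_mask                          # O(E) mask, reusable across graphs
```

Because both the edge scores and the thresholds are produced by parameterized modules rather than stored as fixed-size parameters, the same modules can be reused on unseen graphs of arbitrary size, which is exactly what the inductive setting requires.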

3. Performance in graph classification

Experimental Setup. We choose the MNIST [3] and MUTAG datasets to validate AdaGLT's scalability to graph classification tasks. Following [4], we split the MNIST dataset into 55000 train / 5000 validation / 10000 test. Following [5], we perform 10-fold cross-validation on MUTAG.

Table 4. The performance of AdaGLT on MNIST with 3-layer GraphSAGE (Baseline=93.40) compared with random pruning. We report an average of 5 runs.

| Graph Sparsity | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|
| Random | 89.16±0.76 | 87.81±0.58 | 79.46±0.89 | 71.48±1.12 | 67.20±3.64 |
| AdaGLT | 94.88±0.83 | 93.53±1.12 | 93.47±1.33 | 92.22±1.79 | 89.64±1.66 |

Table 5. The performance of AdaGLT on MUTAG with 3-layer GraphSAGE (Baseline=85.78) compared with random pruning. We report the average of 10-fold cross-validation.

| Graph Sparsity | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|
| Random | 84.22±3.49 | 83.78±4.08 | 79.26±3.59 | 73.11±7.65 | 65.28±9.40 |
| AdaGLT | 86.76±2.71 | 86.30±2.96 | 85.41±2.44 | 83.23±3.79 | 79.55±4.27 |

It is noteworthy that AdaGLT extends its impressive performance from node classification to graph classification. Specifically, AdaGLT successfully identifies winning tickets with 40% and 30% graph sparsity on MNIST and MUTAG, respectively.

Question 1: I wonder whether the contribution applies to an inductive setting or evolving graphs with increasing nodes or edges.

We validate the performance of AdaGLT in graph classification in response to Weakness 2 :)

Summary:

Your valuable insights have significantly contributed to advancing our work! In the newly uploaded draft, we have made the following revisions and the revised or newly incorporated text is highlighted in blue:

  • Our response to Weakness 2.1, concerning the results of AdaGLT with Ogbn-Product, is now provided in Appendix J.
  • Our response to Weakness 2.2, regarding modifications for inductive settings, is introduced in Appendix M.
  • Our response to Weakness 2.3, focusing on AdaGLT for graph classification, is incorporated into Appendices G and H.

References:

[1] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML'21

[2] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR'23

[3] Open graph benchmark: Datasets for machine learning on graphs, NeurIPS'20

[4] Benchmarking graph neural networks, JMLR'22

[5] How powerful are graph neural networks?, ICLR'19

Comment

We express our sincere thanks for the detailed and thoughtful review of our manuscript and for the encouraging appraisal of our work. We give point-by-point responses to your comments and describe the revisions we made to address them:

Weakness 1: Given the contribution of UGS and other series of graph lottery tickets work, this automated tool sounds like incremental work. However, the analysis of scalability issues when applied to deep GNNs is worth reading.

Thank you for recognizing our analysis of GLT and deep GNNs! However, we would like to clarify the significance of finding an "automated" ticket: In fact, the impact of the pruning ratios, $p_g$ (graph pruning ratio) and $p_\theta$ (weight pruning ratio), on GLT performance is more substantial than conventionally understood in previous GLT methods.

Table 1. The performance of 10 rounds of iterative UGS on Citeseer + GCN with different $p_g$ and $p_\theta$.

| $p_g$ \ $p_\theta$ | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 |
|---|---|---|---|---|---|
| 0.05 | 69.08 | 69.63 | 70.34 | 69.83 | 68.27 |
| 0.10 | 64.23 | 67.88 | 67.97 | 66.99 | 66.14 |
| 0.15 | 60.23 | 62.55 | 65.43 | 63.45 | 60.96 |
| 0.20 | 59.61 | 61.28 | 62.70 | 61.38 | 60.44 |
| 0.25 | 59.08 | 60.92 | 61.44 | 62.17 | 57.60 |

Table 2. The performance of 10 rounds of iterative UGS on PubMed + GIN with different $p_g$ and $p_\theta$.

| $p_g$ \ $p_\theta$ | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 |
|---|---|---|---|---|---|
| 0.05 | 77.30 | 78.40 | 76.79 | 76.92 | 76.25 |
| 0.10 | 76.09 | 76.18 | 76.44 | 76.04 | 75.18 |
| 0.15 | 75.33 | 74.93 | 75.80 | 75.66 | 73.12 |
| 0.20 | 75.57 | 76.60 | 75.30 | 74.78 | 73.58 |
| 0.25 | 72.18 | 73.18 | 73.62 | 70.44 | 68.40 |

We observe that (1) traditional GLT methods are significantly influenced by the values of $p_g$ and $p_\theta$, resulting in substantial performance disparities of as much as 12.74%; (2) the original setting of UGS ($p_g = 0.05, p_\theta = 0.20$) is not universally optimal; in the case of Citeseer + GCN, the optimal combination is $\{p_g = 0.05, p_\theta = 0.15\}$, and for Pubmed + GIN, it is $\{p_g = 0.05, p_\theta = 0.10\}$.

In this context, the pursuit of an "automated" GLT that is free of pruning ratio tuning becomes particularly meaningful.

Weakness 2: …more reasonably large graphs should be considered. Also, how about inductive settings like GraphSAGE or other SOTA ones?

Thanks for your thoughtful comments that make our results even stronger! Verifying the performance of AdaGLT on larger graphs or in an inductive setting is indeed crucial. We will respond to your inquiries in three aspects:

  • How is the performance of AdaGLT on Ogbn-Products (2,449,029 nodes and 61,859,140 edges) ?
  • How can we (slightly) modify AdaGLT to facilitate its migration to inductive tasks?
  • How does AdaGLT perform on other inductive tasks like graph classification?

1. Performance with Ogbn-Products

Experimental Setup. We opt for Ogbn-Products with Cluster-GCN to validate AdaGLT on a larger graph. Dataset splitting is consistent with that in [3]. The reason for not selecting DeeperGCN, as in the other three OGB datasets, is the excessively time-consuming training of DeeperGCN on Ogbn-Products (nearly 20h/run with NVIDIA V100 32GB).

Table 3. The performance of AdaGLT on Ogbn-Products with 4- and 8-layer Cluster-GCN. We report the average of 3 runs.

| Layer | 4 (Baseline=79.23±0.47) | | | 8 (Baseline=78.82±0.73) | | |
|---|---|---|---|---|---|---|
| Graph Sparsity | 10% | 30% | 50% | 10% | 30% | 50% |
| UGS | 78.44±0.58 | 74.67±0.62 | 73.05±1.02 | 76.30±0.79 | 72.13±1.25 | 71.32±1.14 |
| AdaGLT | 80.35±0.71 | 79.67±0.86 | 76.46±0.69 | 80.22±0.78 | 78.13±1.12 | 74.62±0.90 |

We list two straightforward observations:

Obs. 1. AdaGLT can scale up to large graphs like Ogbn-Products and effectively find GLTs, while UGS completely fails.

Obs. 2. Discovering winning tickets within shallow GNNs is more feasible. Note that under 30% graph sparsity, AdaGLT successfully identifies winning tickets at a 4-layer GCN, while struggling to achieve baseline performance on an 8-layer GCN. This observation aligns with our findings in Section 4.3 Obs. 4 and Appendix J.

Comment

Dear Reviewer 9nRP,

Hope this message finds you well. As the author-reviewer discussion deadline approaches, we respectfully seek your confirmation on the adequacy of our rebuttal in addressing the concerns raised in your review.

We are rather gratified to receive your insightful response! Your feedback on the scalability to larger graphs and inductive setting has significantly enhanced the quality of our manuscript, and we have diligently addressed and responded to your concerns in both the OpenReview response and the updated manuscript.

We really appreciate the substantial time commitment on your part and extend heartfelt thanks for any additional insights. Your expertise is instrumental in refining our projects and positively impacting our research endeavors.

Thank you once again for your time and valuable perspectives. We eagerly await your further guidance with utmost respect.

Best regards,

Authors

Review
Rating: 6

The authors proposed AdaGLT to obtain graph lottery tickets that sparsify both the trainable weights and the graph structure together. AdaGLT learns different masks for the graph and weights at different layers, providing greater freedom in sparsification. To save memory for the trainable mask, an edge explainer is also introduced. Through extensive evaluation, the proposed framework achieves satisfactory results.

Strengths

  1. The evaluation is sufficient to support the claims.
  2. The paper presentation is clear enough to get the main ideas.
  3. AdaGLT can work on deep GNNs and large-scale datasets.

Weaknesses

  1. Section 3.1 is the direct application of Liu et al. 2022, as stated in the paper. The edge explainer is also available. So, the main contribution seems incremental.
  2. The assumption of Theorem 1 may not be true for $G^{(l)}$ in most cases, since $G^{(l)}$ should be different due to the layer-adaptive pruning.
  3. The algorithm does not include the "Dynamic Pruning and Restoration" paragraph. And the statement of equation 10 is unclear to me. What does the restoration refer to?

Questions

  1. What does the "irreversible fashion" refer to? The baseline UGS updates the mask only, and both the original A and weights are stored, so it should be considered as "reversible"?
  2. The authors stated that "Existing GLT methods .... and their lack of flexibility stems from the necessity of hyperparameter tuning". Can the author explain more in detail? The baseline UGS also has the trainable mask, which can be considered as the other version of the trainable threshold.
  3. Could the author describe how to get the equation 10?

Details of Ethics Concerns

N/A

Comment

(continued response to Weakness 1)

2. Edge Explainer

Table 3. The comparison of edge explainer utilized in AdaGLT and previous GNN explanation methods.

| | AdaGLT | PGExplainer [6] | GNNExplainer [7] |
|---|---|---|---|
| Stage | Within training | Post-hoc | Post-hoc |
| Parameterized | Yes (Eq. 6) | Yes (MLP) | No |
| Differentiable Trick | Gradient estimator (STE/SR-STE/LTE) | Reparameterization trick | N/A |
| Target | Prune edges | Sample explanatory graphs | Sample explanatory graphs |
| Layer-adaptive | Yes | No | No |
| Towards Higher Pruning Ratio | Sparse regularization | Budget constraint | No |

As shown in Table 3, our edge explainer is different from previous edge explainers in the following aspects (a small illustrative sketch of such an explainer follows this list):

  • Stage: Our edge explainer is concurrently trained with network parameters. In contrast to post-hoc explanations in [6,7], our approach allows for better integration with the network, yielding more robust explanation scores.
  • Parameterized: Our edge explainer is parameterized like [6], but with a different implementation.
  • Differentiable Trick: Sampling edges from the edge scores provided by edge explainers is inherently not differentiable, and the methods we adopted are different from those in [6,7].
  • Target & Layer-adaptive: The utilization of edge explainer in prior studies typically aimed at extracting explanatory subgraphs (note that here, "subgraph" refers to the same structure across all layers). In contrast, we are the first to explore the use of edge explainer in the domain of graph sparsification to achieve layer-adaptive pruning.
  • High Pruning Ratio: [7] does not explicitly ensure a high sparsity rate for the subgraph, and [6] imposes a budget constraint to obtain a compact explanatory graph. However, they fall short of achieving high graph sparsity. In contrast, AdaGLT achieves state-of-the-art sparsity with no performance compromise through sparse regularization.
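To make the comparison above concrete, the following is a minimal sketch of a parameterized edge scorer of the kind described, trained jointly with the GNN. The architecture and names are illustrative assumptions on our part, not the paper's Eq. 6.

```python
import torch
import torch.nn as nn

class EdgeExplainer(nn.Module):
    """Illustrative edge scorer: maps the two endpoint embeddings of each edge to
    a soft score in (0, 1), yielding an O(E) mask instead of an N x N trainable
    matrix. It is optimized together with the GNN weights (within training),
    unlike post-hoc explainers."""
    def __init__(self, node_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * node_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: [N, d] node embeddings, edge_index: [2, E]
        src, dst = edge_index
        pair = torch.cat([h[src], h[dst]], dim=-1)            # [E, 2d]
        return torch.sigmoid(self.scorer(pair)).squeeze(-1)   # edge scores s_ij: [E]
```

The resulting scores are then compared against the layer-wise trainable thresholds and driven toward higher sparsity by the sparse-regularization term, rather than by a fixed budget constraint.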

Summary:

We have diligently addressed each of the concerns raised, and your comments have significantly improved the manuscript. Hopefully, our responses and revisions meet your expectations for a higher rating.

References:

[1] A Unified Lottery Ticket Hypothesis for Graph Neural Networks, ICML'21

[2] Adversarial Erasing with Pruned Elements: Towards Better Graph Lottery Tickets, ECAI'23

[3] Rethinking Graph Lottery Tickets: Graph Sparsity Matters, ICLR'23

[4] Pruning graph neural networks by evaluating edge properties, KBS'22

[5] Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, IJCAI'23

[6] Parameterized explainer for graph neural network, NeurIPS'20

[7] Gnnexplainer: Generating explanations for graph neural networks, NeurIPS'19

[8] Interpretable and Generalizable Graph Learning via Stochastic Attention Mechanism, ICML'22

Comment

The authors addressed my major concerns, so I increased my score.

Comment

We thank the reviewer 9csi for more strongly supporting our paper! We are glad our revision and rebuttal have addressed your concerns!

Comment

Thank you for your thoughtful and constructive reviews of our manuscript! Based on your questions and recommendations, we give point-by-point responses to your comments and describe the revisions we made to address them. For better reading and comprehension, we slightly rearrange the order of your comments.

Weakness 2: The assumption of Theorem 1 may not be true for $G^{(l)}$ in most cases, since $G^{(l)}$ should be different due to the layer-adaptive pruning.

Thanks for your insightful question. We would like to clarify that $\mathbf{G}^{(l)}$ differs across the layer index $l$. In our setting, the key distinction is in the smallest singular value $\sigma^{(l)}_0 = 1 - \alpha^l$, which varies with the layer index. This setting of the smallest singular value is in accordance with layer-adaptive pruning. Additionally, other singular values can also differ within $\mathbf{G}^{(l)}$. Consequently, the operation matrix $\mathbf{G}^{(l)}$ varies from layer to layer.

Question 1: What does the "irreversible fashion" refer to? The baseline UGS updates the mask only, and both the original A and weights are stored, so it should be considered as "reversible"?

Really sorry for bringing confusion! "Irreversible" refers to the fact that, in the iterative magnitude pruning process of UGS, the elements (either edge or weight) pruned at the end of each iteration (i.e., setting the mask $m_{ij}$ to zero) will not have an opportunity to be restored in the next iteration (i.e., $m_{ij}$ will never be set back to one).

For a more detailed elucidation, we would like to respectfully outline the pruning logic of UGS (take graph pruning for example):

  • Iteration 1: Train GNN with $\mathcal{G}=\{m_g \odot \mathbf{A},\mathbf{X}\}$ for $n$ epochs. Set $p_g\%$ of the lowest values in $m_g$ to zero, thus obtaining $m_g^{(1)}$. We denote $m_g \setminus m_g^{(1)}$ as the pruned edges in iteration 1.
  • Iteration 2: Train GNN with $\mathcal{G}=\{m_g^{(1)} \odot \mathbf{A},\mathbf{X}\}$ for another $n$ epochs. Use the same pruning ratio $p_g\%$ and obtain $m_g^{(2)}$. $m_g^{(1)} \setminus m_g^{(2)}$ denotes the pruned edges in iteration 2.
  • This iterative process continues as shown in our Figure 2 (*Right*).

It is crucial to note that $m_g^{(k-1)} \setminus m_g^{(k)}$, the edges pruned in Iteration $k$, will not be reactivated in any subsequent iteration: once pruned, they are irreversibly set to zero, which is what we term an "irreversible fashion."

Adhering to the conventional LTH, UGS resets the GNN weights to their initialization values at the start of each iteration, which corresponds with the recoverability of initialization (in a reversible way). However, when we refer to "irreversible", we specifically mean the irrecoverability of the pruned elements.
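To make the contrast explicit, here is a minimal sketch (our pseudocode, not the UGS implementation) of the irreversible loop outlined above; `train_fn` stands in for training the GNN together with its trainable mask for $n$ epochs and returning the updated mask magnitudes.

```python
import torch

def irreversible_graph_pruning(mask_values, p_g, rounds, train_fn):
    """Sketch of UGS-style iterative magnitude pruning: each round zeroes out the
    p_g fraction of the lowest-magnitude entries among the edges that are still
    active, and an entry set to zero is never set back to one."""
    binary_mask = torch.ones_like(mask_values)
    for _ in range(rounds):
        mask_values = train_fn(mask_values * binary_mask)   # train GNN + mask for n epochs
        active = binary_mask.nonzero(as_tuple=True)[0]      # edges that survived so far
        n_prune = int(p_g * active.numel())
        order = torch.argsort(mask_values[active].abs())    # rank surviving edges only
        binary_mask[active[order[:n_prune]]] = 0.0          # irreversible: never restored
    return binary_mask
```

In AdaGLT, by contrast, the mask is recomputed from the current scores and thresholds at every epoch, so no element is permanently removed.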

Weakness 3: The algorithm does not include the "Dynamic Pruning and Restoration" paragraph. And the statement of equation 10 is unclear to me. What does the restoration refer to?

Thanks for pointing out this issue! We would like to clarify: the characteristic of dynamic pruning in AdaGLT is not a specifically designed module; rather, it is a natural result of our decision to discard fixed pruning ratios in favor of using trainable thresholds. In the algorithm table (Appendix C), the dynamic pruning is done in Lines 5-6, and restoration is done in Lines 9-10. Namely, the gradient updates of $\mathbf{\Theta}$, $\Psi$, $t_\theta$, and $t_A$ implicitly entail the recovery of certain edges or weights.

With this in mind, we would like to offer a more explicit explanation of Equation 10 and how the restoration is done:

Take weight pruning for example. In epoch $k$, $W \in \mathbb{R}^{N \times M}$ is a learnable matrix. Given certain $i,j$, if $W_{ij}^{(k)} < t_i^{(k)}$, then $m_{ij}^{(k)} = 0$, which means $W_{ij}^{(k)}$ is pruned in epoch $k$. However, possibilities exist that $W_{ij}$ can be recovered or restored, namely $m_{ij}^{(k+1)} = 1$, in epoch $k+1$, as long as the updated $W_{ij}^{(k+1)}$ is greater than $t_i^{(k+1)}$. In other words:

$$W_{ij}^{(k+1)} = W_{ij}^{(k)} - \alpha \cdot \text{grad w.r.t. } W_{ij}^{(k)} > t_{i}^{(k)} - \alpha \cdot \text{grad w.r.t. } t_{i}^{(k)} = t_i^{(k+1)}$$

In the original Equation 10, we rearrange the expression as follows:

$$W_{ij}^{(k)} - t_i > \alpha \cdot \left(\text{grad w.r.t. } W_{ij}^{(k)} - \text{grad w.r.t. } t_{i}^{(k)}\right)$$

The content following the equality sign in Equation 10 directly exhibits the computed gradients, derived in part from Equation 4.
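As a toy numerical check of this restoration condition (the values and gradients are hypothetical, and we use the convention stated above that an entry is kept when it is at least as large as its row threshold):

```python
import torch

def weight_mask(W, t):
    """Mask recomputed from scratch each epoch: entry (i, j) is kept iff the
    current W_ij is at least the current row threshold t_i."""
    return (W >= t.unsqueeze(1)).float()

alpha = 0.1
W, t = torch.tensor([[0.30]]), torch.tensor([0.35])           # epoch k: W_00 < t_0, so pruned
grad_W, grad_t = torch.tensor([[-0.8]]), torch.tensor([0.2])  # hypothetical gradients
W_next, t_next = W - alpha * grad_W, t - alpha * grad_t       # 0.38 and 0.33 after one step

print(weight_mask(W, t))            # tensor([[0.]])  -> pruned at epoch k
print(weight_mask(W_next, t_next))  # tensor([[1.]])  -> restored at epoch k+1
# Consistent with the rearranged inequality:
# W_00 - t_0 = -0.05 > alpha * (grad_W - grad_t) = 0.1 * (-1.0) = -0.1
```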

Hopefully, our explanation of the restoration process provides a clearer understanding of how we achieve "dynamic restoration". We have modified Appendix C to better illustrate the adaptive, dynamic, and automated nature of our algorithm.

Comment

Question 2: The authors stated that "Existing GLT methods .... and their lack of flexibility stems from the necessity of hyperparameter tuning". Can the author explain more in detail? The baseline UGS also has the trainable mask, which can be considered as the other version of the trainable threshold.

Yes, we are glad to clarify! Allow us to give a straightforward explanation of why conventional GLT methods are characterized by a lack of flexibility stemming from the necessity of hyperparameter tuning. As shown in Table 1, the performance of UGS is hugely influenced by the combination of $p_g$ and $p_\theta$, and the original setting ($p_g = 0.05, p_\theta = 0.20$) is actually not the optimal hyperparameter configuration for Citeseer + GCN. On the contrary, our proposed AdaGLT completely eliminates the need for adjusting $p_g$ and $p_\theta$, and it is capable of automatically adjusting how many elements to prune or recover during different training stages.

Table 1. The performance of 10 rounds of iterative UGS on Citeseer + GCN with different $p_g$ (pruning ratio for graph) and $p_\theta$ (pruning ratio for weight).

| $p_g$ \ $p_\theta$ | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 |
|---|---|---|---|---|---|
| 0.05 | 69.08 | 69.63 | 70.34 | 69.83 | 68.27 |
| 0.10 | 64.23 | 67.88 | 67.97 | 66.99 | 66.14 |
| 0.15 | 60.23 | 62.55 | 65.43 | 63.45 | 60.96 |
| 0.20 | 59.61 | 61.28 | 62.70 | 61.38 | 60.44 |
| 0.25 | 59.08 | 60.92 | 61.44 | 62.17 | 57.60 |

Regarding your concern that "the baseline UGS also has the trainable mask, which can be considered as the other version of the trainable threshold", we would like to respectfully clarify that, though $m_A$ and $m_\theta$ in UGS are indeed trainable, they are pruned in an "untrainable" manner: it is up to practitioners to manually define the pruning ratios. We believe this brings inconvenience in real-world applications.

Question 3: Could the author describe how to get the equation 10?

Please refer to our response to Weakness 3 :)

Weakness 1: Section 3.1 is the direct application of Liu et al. 2022, as stated in the paper. The edge explainer is also available.

Thanks for your appreciation of our presentations and experimental designs!

Please allow us to systematically elaborate on the connections and differences between our work and previous ones:

1. Trainable thresholds

Table 2. The comparison of trainable thresholds utilized in AdaGLT and previous work.

| | AdaGLT | A-ViT [5] |
|---|---|---|
| Threshold Level | Threshold scalar/vector/matrix | Threshold scalar |
| Gradient Estimator | STE/SR-STE/LTE | STE |
| Pruning Object | Edge & weight | Attention token |
| High Sparsity | Yes | No |

As shown in Table 2, the thresholds employed in AdaGLT exhibit notable distinctions from those in [5] in at least the following four aspects:

  • Threshold Level: We thoroughly discuss and compare the performance of thresholds at different levels, ultimately selecting the threshold vector, which is not presented in [5]. (in Section 4.5 and Appendix B)

  • Gradient Estimator: We extensively discuss and compare various gradient estimators, confirming that AdaGLT is not sensitive to the choice of estimator and ultimately selecting STE, a comprehensive analysis absent in [5]. (in Section 4.5 and Appendix K; a generic STE sketch is given after this list)

  • Pruning Object: The pruning targets in AdaGLT differ significantly from those in [5]. The former involves joint pruning of $\mathbf{A}$ and $\mathbf{\Theta}$, while the latter focuses on pruning attention tokens. It is noteworthy that applying trainable thresholds to edges with intrinsic values of $\{0, 1\}$ is non-trivial. To address this challenge, we introduced the edge explainer to provide dense edge scores. (in Section 3.2)

  • High Sparsity: The thresholds in [5] cannot achieve the high sparsity desired for GLT. Consequently, we introduced sparse regularization tailored to our application scenario. (in Section 3.3)
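For readers unfamiliar with the estimators mentioned above, the snippet below sketches a generic straight-through estimator (STE) for a thresholded binary mask. This is a textbook STE written by us, not necessarily the exact variant used in the paper; in particular, routing the negative of the incoming gradient to the threshold is a common convention rather than something specified in the text.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Generic straight-through estimator: the forward pass produces a hard
    {0, 1} mask, while the backward pass treats the comparison as the identity
    so that both the scores and the trainable thresholds receive gradients."""
    @staticmethod
    def forward(ctx, scores, thresholds):
        # scores and thresholds are assumed to share the same shape
        return (scores >= thresholds).float()

    @staticmethod
    def backward(ctx, grad_output):
        # pass the gradient straight through to the scores; the threshold enters
        # with a minus sign since raising it can only remove elements
        return grad_output, -grad_output

# usage: mask = BinarizeSTE.apply(edge_scores, node_thresholds[src])
```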

Review
Rating: 5

The paper proposes an Adaptive, Dynamic, and Automated framework for identifying Graph Lottery Tickets to overcome the challenges of integrating GLT into deeper and larger-scale GNN contexts. It attempts to tailor layer-adaptive sparse structures for various datasets and GNNs, thus endowing it with the capability to facilitate deeper GNNs; to integrate the pruning and training processes, thereby achieving a dynamic workflow encompassing both pruning and restoration; and to automatically capture graph lottery tickets across diverse sparsity levels, obviating the necessity for extensive pruning parameter tuning.

Strengths

  1. The paper is well written, with the motivation explicitly explained. The dynamic evolution of both edges and weights during the training process, as well as the flexibility of the prune ratio, are important missing links that are explored.
  2. The introduction of an edge explainer into GNN pruning that ensures interpretability during the pruning process while reducing unimportant edges, is a good way to mitigate the quadratic increase in parameters.
  3. The appendix is rich. I recommend that some large-scale experiments on OGBN-Arxiv and Products be moved to the main draft.

Weaknesses

Although the introduced components like joint sparsification, the edge explainer, etc. make sense, one major concern I have is how these individual components affect the performance of AdaGLT. I appreciate the authors for their extensive experiments, but the role/importance of individual modules is not well explained. How does removing some components affect the performance? Another question: no significant performance benefit is observed for low sparsity ratios, e.g., 30-50%. Any explanation of why it is effective only at high sparsity ratios would improve the manuscript. I am open to increasing my score after the rebuttal discussion.

Questions

See above.

Comment

We sincerely thank you for your careful comments and thorough understanding of our paper! Here we give point-by-point responses to your comments and describe the revisions we made to address them. We would be happy to add additional clarifications and revisions to the paper to address any additional recommendations.

Comment 1: I recommend that some large-scale experiments on OGBN-Arxiv and Products be moved to the main draft.

Thanks for your suggestions! In our newly submitted manuscript, we replace the experimental results in Figure 5 with the performances of AdaGLT applied to Arxiv/Collab/Proteins on 12-layer DeeperGCN for graph sparsity.

Comment 2: Although the introduced components like joint sparsification, the edge explainer, etc. make sense, one major concern I have is how these individual components affect the performance of AdaGLT. I appreciate the authors for their extensive experiments, but the role/importance of individual modules is not well explained. How does removing some components affect the performance?

Thank you for pointing out this issue. In the current draft, we indeed lack ablation experiments regarding the importance of some components. From our perspective, the following components of AdaGLT are amenable to ablation (Ab.) experiments:

Ab.1. Edge explainer: What is the impact on performance if the edge explainer is replaced by a trainable mask employed in UGS?

Ab.2. Layer-adaptive pruning: How does performance change if the same sparsity level is maintained for each layer, as opposed to increasing sparsity with the growth of layers?

Ab.3. [Solved] Gradient estimator: What is the effect of different gradient estimators on AdaGLT's performance? (Refer to Section 4.5 and Appendix K)

Ab.4. [Solved] Threshold level: How do trainable thresholds at different levels affect AdaGLT's performance? (Refer to Section 4.5 and Appendix B)

Automated pruning (without tuning pruning ratios) is another crucial component in our work. However, replacing it with fixed pruning rates is not a trivial task (as it requires setting different pruning rates for each layer and discarding the gradient estimator). Therefore, we do not consider it here.

To address your concerns regarding component ablation, we select Pubmed (for small graphs) and Ogbn-Arxiv (for large graphs) to complement the experiments for Ab1 and Ab2.

Table 1. Ablation study on Citeseer with GCN backbones of $2 \rightarrow 8$ layers and $20 \rightarrow 60$% graph sparsity, evaluating the edge explainer (EE) and layer-adaptive pruning (LP). 'w/o EE' signifies the replacement of the edge explainer with a trainable mask, and 'w/o LP' indicates the maintenance of the same sparsity across all layers. The bold number denotes the highest performance under certain graph sparsity across all AdaGLT variants.

| Layer | 2 | | | 4 | | | 6 | | | 8 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Graph Sparsity | 20% | 40% | 60% | 20% | 40% | 60% | 20% | 40% | 60% | 20% | 40% | 60% |
| AdaGLT | 71.32 | 70.66 | 66.23 | 75.23 | 74.36 | 72.93 | 74.49 | 71.46 | 71.12 | 71.33 | 69.27 | 69.42 |
| AdaGLT w/o EE | 68.69 | 54.16 | 48.05 | 52.60 | 45.77 | 42.28 | 53.17 | 47.10 | 40.85 | 52.91 | 40.66 | 40.66 |
| AdaGLT w/o LP | 71.47 | 71.15 | 65.90 | 75.50 | 71.46 | 65.99 | 69.74 | 67.10* | 62.83 | 67.05 | 60.79 | 55.63 |

Table 2. Ablation study on Ogbn-Arxiv with DeeperGCN backbones of $4 \rightarrow 28$ layers and 60% graph sparsity, evaluating the edge explainer (EE) and layer-adaptive pruning (LP).

| Layer | 4 | 12 | 20 | 28 |
|---|---|---|---|---|
| AdaGLT | 70.13 | 70.08 | 71.34 | 69.79 |
| AdaGLT w/o EE | 60.34 | 52.60 | 47.44 | 42.40 |
| AdaGLT w/o LP | 68.74 | 63.89 | 60.33 | 58.05 |

We list two straightforward observations:

Obs.1. Substituting the edge explainer resulted in a significant performance decline. Specifically, in deeper GNNs, such as 6- and 8-layer GCN/GAT, replacing the edge explainer with a trainable mask rendered the network unable to identify any winning tickets.

Obs.2. Layer-adaptive pruning significantly aids in discovering winning tickets in deeper GNNs. We observe that in 2- and 4-layer GCNs, AdaGLT without layer-adaptive pruning occasionally outperforms original AdaGLT. However, in deep scenarios, the elimination of layer-adaptive pruning results in a sharp performance decline (8.48% under 40% sparsity and 13.79% under 60% sparsity in an 8-layer GCN).

Comment

Comment 3: Another question: no significant performance benefit is observed for low sparsity ratios, e.g., 30-50%. Any explanation of why it is effective only at high sparsity ratios would improve the manuscript.

Thanks again for making our results even stronger! We encountered a similar problem when presenting our experimental results, and we are delighted to share additional insights with you (which are not included in the manuscript). Allow us to humbly recap the features of AdaGLT: To achieve a desired sparsity level, we apply Sparse Regularization to increase the values of the threshold vector (corresponding to higher sparsity, see Section 3.3). Consequently, even when aiming for GLTs of 30%~50% sparsity, the network tends to converge to 70% or even higher sparsity in the end (especially for certain backbones like GAT). This implies that the obtained $m_A$ at lower sparsity levels often does not receive sufficiently thorough training compared to higher sparsity levels, resulting in less significant improvements in the found GLTs.

To address this issue, we initially propose a straightforward solution: adjusting the coefficient $\eta_g$ of Sparse Regularization to control the network's convergence to different sparsity levels. Experimental results demonstrate that this approach can improve GLTs' performance at lower sparsity levels (as shown in the tables below).

Citeseer + GCN (average of 3 runs); baseline = 70.08

| Sparsity | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|
| $\eta_g$ | 1e-3 | 1.2e-3 | 1.8e-3 | 1.8e-3 | 2e-3 |
| Performance (with carefully tuned $\eta_g$) | 72.04 ± 0.85 | 71.33 ± 1.12 | 71.48 ± 0.93 | 69.94 ± 1.34 | 68.03 ± 1.52 |
| Performance (reported in the paper) | 70.32 | 70.18 | 70.34 | 69.02 | 68.12 |

PubMed + GAT (average of 3 runs); baseline = 78.5

| Sparsity | 20% | 30% | 40% | 50% | 60% |
|---|---|---|---|---|---|
| $\eta_g$ | 0.6e-3 | 0.75e-3 | 0.85e-3 | 1e-3 | 1e-3 |
| Performance (with carefully tuned $\eta_g$) | 80.24 ± 1.11 | 80.33 ± 1.52 | 80.03 ± 0.89 | 79.64 ± 1.87 | 79.88 ± 1.01 |
| Performance (reported in the paper) | 79.93 | 78.90 | 78.46 | 80.10 | 78.77 |

However, we ultimately chose not to adopt this strategy, primarily for the following three reasons:

  • We aim to diverge from the conventional GLT methods that extensively adjust the pruning ratio, hence asserting our contribution to "automated pruning." Tuning the sparsity coefficient for optimal performance would introduce another form of hyperparameter tuning, which we seek to avoid.
  • While GLTs at lower sparsity levels did not exhibit significant improvements in certain setups, they showed notable enhancements in more settings (like GIN and GAT backbones in Figure 4).
  • Considering real-world applications, we contend that GLTs with higher sparsity levels hold more value.

Based on these considerations, we present the manuscript as you have seen it, and hopefully this response meets your satisfaction.

We have incorporated responses to Comments 1 & 2 in the updated manuscript. Thank you again for your thoughtful comments that greatly strengthen our paper! We would be delighted to address any further inquiries.

Comment

Dear Reviewer 2ZB4,

We hope this message finds you well. As the deadline for the author-reviewer discussion phase is nearing, we would like to respectfully inquire if our rebuttal has effectively addressed the concerns raised in your review.

Your insightful feedback, especially regarding the ablation study and observations at low sparsity, has been invaluable in refining our work. We have endeavored to address each of these points meticulously in our rebuttal and are eager to know if our responses and subsequent revisions have met your expectations and clarified the issues you pointed out.

We deeply recognize the demands of your time and deeply appreciate any further feedback on our rebuttal. Your expertise is not only vital for the review process but also enriches our ongoing learning and growth in this field.

Thank you once again for your time and invaluable insights. We eagerly await your response.

Best regards,

Authors

Comment

Dear reviewers,

Thank you again for your thoughtful and constructive comments! We are really encouraged to see that the reviewers appreciate some positive aspects of our paper, such as technical novelty (Reviewers 2ZB4, 9nRP), theoretical guarantees (Reviewer 9nRP), thorough experimental validation (Reviewers 2ZB4, 9csi, 9nRP) and superior scalability (Reviewer 9csi, 9nRP).

Your expertise significantly helps us strengthen our manuscript – this might be the most helpful review we have received in years! In addition to addressing your thoughtful comments point by point on OpenReview, we have made the following modifications to the newly uploaded manuscript (all updated text is highlighted in blue):

  1. To Reviewer 2ZB4:

    • Following your Comment 1, we replaced the experimental results in Figure 5 with the performances of AdaGLT applied to Arxiv/Collab/Proteins on 12-layer DeeperGCN for graph sparsity;
    • Following your Comment 2, we have comprehensively supplemented the ablation study regarding edge explainer and layer-adaptive pruning in Appendix L.
  2. To Reviewer 9csi:

    • Following your Weakness 3, we revised the Algorithm table for better comprehension in Appendix C.
  3. To Reviewer 9nRP, following your Weakness 2:

    • we supplemented the results of AdaGLT on Ogbn-Product in Appendix J;
    • we elaborated on the minor modification of AdaGLT for inductive settings in Appendix M;
    • we supplemented the performance of AdaGLT for graph classification in Appendices G and H.

We have made earnest efforts to address the primary concerns raised, and sincerely appreciate the acknowledgment and improved rating from reviewer 9csi. We also respectfully look forward to the thoughtful feedback from the reviewers to further enhance the quality of our manuscript.

Comment

Dear Reviewers and Area Chair,

We hope this message finds you well. We would like to respectfully remind you that the author-reviewer discussion deadline is approaching in under five hours. We are always eager to hear your valuable feedback and provide further responses.

Warm regards,

Authors

AC Meta-Review

This paper proposes a new strategy for pruning graph neural networks. This was a borderline paper; the reviewers liked the approach and found the results to be productive, but they were concerned about the incremental nature of the paper and the opacity of how much each of the proposed techniques contributed to the final method. I'm ambivalent about whether to accept the paper, and I would be open to acceptance or rejection.

Why Not a Higher Score

The paper is borderline. It's incremental and there are some questions about the rigor of the ablations.

Why Not a Lower Score

The technique does still represent a step forward and the reviewers agreed on that.

Final Decision

Accept (poster)