PaperHub
Rating: 4.5/10 (Rejected; 4 reviewers; lowest 3, highest 5, standard deviation 0.9)
Individual ratings: 5, 5, 5, 3
Confidence: 3.8
Correctness: 2.3
Contribution: 2.3
Presentation: 2.0
ICLR 2025

TopGQ: Post-Training Quantization for GNNs via Topology Based Node Grouping

OpenReview | PDF
Submitted: 2024-09-25 · Updated: 2025-02-05
TL;DR

We propose the first PTQ framework for graph neural networks, providing up to 358x less quantization time compared to the existing QAT baselines.

Abstract

Keywords

Graph Neural Networks, Neural Network Quantization

Reviews and Discussion

Review (Rating: 5)

TopGQ proposes a post-training quantization (PTQ) framework for Graph Neural Networks (GNNs), achieving favorable quantization results without the need for backpropagation. To address the challenge of significant diversity in node features, TopGQ introduces a local Wiener index from a grouping perspective, clustering nodes with similar in-degrees and local Wiener indices for quantization. Additionally, TopGQ optimizes the computation of the local Wiener index, enhancing the efficiency of the grouping process. Finally, it employs scale absorption techniques to merge feature scales into the adjacency matrix, resulting in a more uniform feature distribution.

I read the rebuttals and decided to keep my decision. In my eyes, adding too many experiments in the rebuttal should not be encouraged. Besides, this paper still lacks comprehensive discussions on GNN quantization.

Strengths

  1. A framework for GNN post-training quantization (PTQ) without backpropagation is proposed, achieving superior quantization results with reduced calibration time.
  2. The concept of grouping is introduced, utilizing the Wiener index as a replacement for the traditional in-degree metric for effective grouping.
  3. An accelerated algorithm for computing the Wiener index is presented, enhancing parallelism and reducing overhead during the grouping process.
  4. The code is released, and practical deployment is demonstrated on a GPU.

Weaknesses

  1. The literature review is insufficient. Many SOTA works [1-3] are not mentioned and have not been included in the experimental comparisons. In the summary of GNN quantization works, only Degree-Quant, SGQuant, and A2Q (highlighted in red) were compared in the experiments, lacking consideration of newer works.
  2. Section 5.1: The rationale for using the Wiener index as the basis for grouping is not explained. As stated in the paper, there are various metrics to describe the topology of graphs. More theoretical derivation or comparative experiments are needed to demonstrate the advantages of using the Wiener index over other metrics.
  3. Section 5.3: The introduction of scale absorption does not clarify its purpose. According to Equation (16), it merges the scale of X into the adjacency matrix A after quantization. However, the results provided in Section 6.6 show that the distribution before quantization is more uniform and is considered easier for quantization, which appears inconsistent with the method presented in Section 5.3.
  4. Lack of Training and Without-Training Comparative Experiments: Without-training should only be considered a viable option when training performance is average or shows minimal improvement. It is inappropriate to directly present without-training as a contribution, as it is relatively easy to implement.
  5. Section 6.4: The use of Scale Absorption for INT8 generally results in precision loss, which is not explained.
  6. Section 6.5: Although actual inference speedup is provided, INT8 only achieves 1.25 times acceleration compared to FP32, or even lower. However, reference [2] demonstrates 3-4 times or higher inference speedup. Additionally, while INT4 is mentioned, there is no deployment of INT4, requiring a reasonable explanation.
  7. Section 6.5: The presented speed performance of the optimized local Wiener index is compared to very outdated classical methods, which weakens the argument's credibility.

[1] EPQuant: A Graph Neural Network compression approach based on product quantization [NC 2022]

[2] Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation [CIKM 2023]

[3] Haar wavelet feature compression for quantized graph convolutional networks [TNNLS 2023]

Questions

  1. This paper states that TopGQ is the first PTQ framework for GNNs; however, according to the review [1], SGQuant is actually a PTQ work. The paper also claims that SGQuant is a method for Quantization-Aware Training (QAT), which requires verification.
  2. The experimental results of [2] and A2Q [3] exhibit fluctuations (e.g., 81.5±0.7%), while the experimental results presented in this paper are fixed values. Is this a normal phenomenon?

[1] A Survey on Graph Neural Network Acceleration: Algorithms, Systems, and Customized Hardware

[2] Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation [CIKM 2023]

[3] A2Q: Aggregation-Aware Quantization for Graph Neural Networks [ICLR 2022]

Comment

W7. Section 6.5: The presented speed performance of the optimized local Wiener index is compared to very outdated classical methods, which weakens the argument's credibility.

→ We are working on an additional experiment comparing our accelerated Wiener index algorithm against parallel versions of other traditional algorithms. We will promptly notify the reviewer with the updated results as soon as they are available.

Q1. This paper states that TopGQ is the first PTQ framework for GNNs; however, according to the review [1], SGQuant is actually a PTQ work. The paper also claims that SGQuant is a method for Quantization-Aware Training (QAT), which requires verification.

→ SGQuant is a QAT method, according to its paper. The authors of SGQuant explicitly present “GNN quantization finetuning” as a key contribution, detailed in Section III.B. This section describes settings such as SGQuant using the original GNN loss and employing gradient backpropagation via the straight-through estimator (STE). Although the SGQuant framework could, in principle, separate out its training process and be adapted to a PTQ setting, this does not make SGQuant a PTQ-targeted method, as the original SGQuant clearly includes its training process in its proposal.

In addition, several papers acknowledge SGQuant as a QAT method, which we cite below:

[1] Novkin, Rodion, Florian Klemme, and Hussam Amrouch. "Approximation-and Quantization-Aware Training for Graph Neural Networks." IEEE Transactions on Computers (2023).
[2] Ma, Yuxin, et al. "Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression." Proceedings of the VLDB Endowment 17.11 (2024): 2854-2866.
[3] Tao, Zhuofu, et al. "Lw-gcn: A lightweight fpga-based graph convolutional network accelerator." ACM Transactions on Reconfigurable Technology and Systems 16.1 (2022): 1-19.
[4] Wu, Chen, et al. "Skeletongcn: a simple yet effective accelerator for gcn training." 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022.
[5] Saad, Leila Ben, and Baltasar Beferull-Lozano. "Quantization in graph convolutional neural networks." 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 2021.

In conclusion, we would like to highlight that TopGQ is the first work to provide a GNN quantization tailored for post-training application, satisfying the traditional PTQ definition of not updating the weights through backpropagation.

Q2. The experimental results of [2] and A2Q [3] exhibit fluctuations (e.g., 81.5±0.7%), while the experimental results presented in this paper are fixed values. Is this a normal phenomenon?

→ We observed the same fluctuations when running A2Q in our experiments, but we omitted the error bars from the main table for better readability. We present the accuracy table with standard deviations on the citation datasets below. We will add results with error bars in supplementary Section F.

| Method | Bit | Cora GCN | Cora GIN | Cora GS | Citeseer GCN | Citeseer GIN | Citeseer GS | Pubmed GCN | Pubmed GIN | Pubmed GS |
|---|---|---|---|---|---|---|---|---|---|---|
| Degree-Quant | INT4 | 79.02±0.55 | 71.88±5.10 | 73.50±1.23 | 22.34±1.57 | 47.92±7.66 | 17.14±2.96 | 78.62±0.71 | 76.56±10.90 | 78.18±1.81 |
| SGQuant | INT4 | 79.02±0.82 | 70.21±5.22 | 75.30±3.31 | 68.08±0.91 | 46.70±5.82 | 48.34±5.93 | 76.08±0.92 | 65.28±7.01 | 71.08±2.21 |
| A2Q | INT4 | 52.68±5.82 | 64.64±4.14 | 74.16±0.64 | 54.00±6.12 | 46.04±7.75 | 66.22±4.24 | 69.72±4.54 | 51.90±7.66 | 73.92±3.84 |
| TopGQ | INT4 | 81.50±0.44 | 78.58±0.42 | 79.64±0.15 | 71.90±0.37 | 70.14±0.34 | 71.76±0.58 | 79.58±0.12 | 77.70±0.14 | 79.00±0.16 |
| Degree-Quant | INT8 | 81.80±0.70 | 74.64±5.00 | 77.50±1.09 | 69.72±0.69 | 58.34±7.95 | 69.10±4.73 | 79.24±0.78 | 79.70±11.07 | 78.42±1.03 |
| SGQuant | INT8 | 80.51±0.59 | 73.32±4.23 | 75.32±3.86 | 68.34±0.48 | 51.30±5.01 | 54.12±5.15 | 78.06±0.54 | 75.22±2.44 | 73.44±0.62 |
| A2Q | INT8 | 79.96±2.28 | 78.74±2.68 | 76.12±3.09 | 70.48±1.29 | 67.26±5.13 | 66.04±3.04 | 76.44±1.29 | 76.40±0.98 | 75.36±0.60 |
| TopGQ | INT8 | 82.08±0.39 | 78.42±0.53 | 80.30±0.61 | 72.28±0.53 | 70.26±0.60 | 71.96±0.75 | 80.30±0.19 | 78.62±0.74 | 78.94±0.47 |
Comment

W4. Lack of Training and Without-Training Comparative Experiments: Without-training should only be considered a viable option when training performance is average or shows minimal improvement. It is inappropriate to directly present without training as a contribution, as it is relatively easy to implement.

→ In neural network quantization, enabling PTQ (quantization without training) is recognized as a contribution for two reasons: 1) PTQ is considered more efficient than QAT for practical deployment. 2) Enabling PTQ is usually difficult due to a severe accuracy drop compared to QAT, because PTQ has more limited capacity than QAT, which can freely update weights. Thus, simply building a stable PTQ method that minimizes such accuracy loss is difficult and is considered a meaningful contribution.

Nevertheless, we agree with the reviewer that comparing the method with and without training can further enhance the soundness of our paper. Thus we provide TopGQ under QAT settings and compare the results with the original TopGQ. We observe that TopGQ with QAT performs competitively, with several settings close to the original TopGQ. We thank the reviewer for suggesting this experiment and for the resulting finding that the proposed topology-based quantization techniques of TopGQ can also be effective in a QAT setting.

| Method | Bit | Cora GCN | Cora GIN | Cora GS | Citeseer GCN | Citeseer GIN | Citeseer GS | Pubmed GCN | Pubmed GIN | Pubmed GS |
|---|---|---|---|---|---|---|---|---|---|---|
| TopGQ + QAT | INT8 | 81.12 | 78.30 | 76.00 | 70.24 | 69.14 | 69.50 | 79.40 | 78.86 | 78.10 |
| Original TopGQ | INT8 | 82.08 | 78.42 | 80.30 | 72.28 | 70.26 | 71.96 | 80.30 | 78.62 | 78.94 |
| TopGQ + QAT | INT4 | 80.08 | 76.30 | 76.64 | 70.58 | 69.10 | 69.56 | 78.50 | 77.00 | 76.72 |
| Original TopGQ | INT4 | 81.50 | 78.58 | 79.64 | 71.90 | 70.14 | 71.76 | 79.58 | 77.70 | 79.00 |

| Method | Bit | PROTEINS GCN | PROTEINS GIN | PROTEINS GS | NCI1 GCN | NCI1 GIN | NCI1 GS |
|---|---|---|---|---|---|---|---|
| TopGQ + QAT | INT8 | 73.19 | 65.68 | 73.73 | 77.86 | 79.86 | 77.92 |
| Original TopGQ | INT8 | 75.65 | 74.34 | 72.20 | 79.72 | 81.36 | 78.43 |
| TopGQ + QAT | INT4 | 67.36 | 66.31 | 66.61 | 69.74 | 66.66 | 73.12 |
| Original TopGQ | INT4 | 69.94 | 70.92 | 68.93 | 65.88 | 75.37 | 75.98 |

W5. Section 6.4: The use of Scale Absorption for INT8 generally results in precision loss, which is not explained.

→ The precision loss occurs because Scale Absorption quantizes the FP32 quantization scales of X (s_X) in order to enhance the quality of the quantized X. When quantizing the A·X operation in a GNN layer, Scale Absorption improves the precision of X by absorbing its FP32 quantization scales (s_X) into A and then quantizing A. This means the precision gain in the activation X comes at the cost of quantizing s_X together with A, potentially resulting in inaccurate scale parameters.

In INT4 settings, Scale Absorption improves quantized networks as the limited range of 4bit quantization makes feature-wise quantization highly vulnerable to outliers. In INT8 settings, however, the broader integer range may allow preserving information even with outlier nodes, thus sometimes making Scale Absorption less advantageous due to scale compression effects.

Comment

W1. The literature review is insufficient. Many SOTA works are not mentioned and have not been included in the experimental comparisons. In the summary of GNN quantization works, only Degree-Quant, SGQuant, and A2Q (highlighted in red) were compared in the experiments, lacking consideration of newer works.

→ We acknowledge the weakness in the literature review of GNN quantization and have added the recommended papers in the revised version of TopGQ. If the reviewer has additional feedback on the updated version, please let us know.

We are currently working on implementing the recommended SOTA works with our evaluation architecture to enhance our paper. We will promptly let the reviewer know as soon as the tables are ready to report.

W2. Section 5.1: The rationale for using the Wiener index as the basis for grouping is not explained. As stated in the paper, there are various metrics to describe the topology of graphs. More theoretical derivation or comparative experiments are needed to demonstrate the advantages of using the Wiener index over other metrics.

→ To further identify the advantages of the localized Wiener Index for quantization, we compared quantization results on the PROTEINS and NCI1 datasets against other graph properties such as betweenness centrality, closeness centrality, and Katz centrality. The results are shown below.

The other node centrality measures show suboptimal performance compared to the localized Wiener Index in both INT4 and INT8 settings. We believe this result stems from the unique expressiveness of the localized Wiener Index in capturing the local compactness of a node within its k-hop neighborhood: a small value indicates dense connectivity among a node's neighbors and relatively rapid propagation of features via message passing. Therefore, TopGQ can effectively group node features with distinctive ranges, as shown in Figure 2 of the paper, leading to enhanced quantization quality.

| Method | Bit | PROTEINS GCN | PROTEINS GIN | PROTEINS GS | NCI1 GCN | NCI1 GIN | NCI1 GS |
|---|---|---|---|---|---|---|---|
| - | FP32 | 76.19 | 74.79 | 72.87 | 80.41 | 81.46 | 78.46 |
| Degree Centrality only | INT8 | 72.57 | 71.86 | 70.48 | 78.91 | 81.28 | 78.32 |
| + Betweenness Centrality | INT8 | 62.10 | 61.55 | 55.08 | 76.89 | 75.18 | 75.13 |
| + Closeness Centrality | INT8 | 62.48 | 64.96 | 57.33 | 76.49 | 76.68 | 75.85 |
| + Katz Centrality | INT8 | 56.82 | 57.97 | 48.56 | 64.20 | 62.19 | 64.27 |
| + Ours | INT8 | 75.94 | 74.86 | 74.00 | 80.91 | 81.88 | 79.16 |
| Degree Centrality only | INT4 | 56.15 | 45.04 | 50.65 | 60.54 | 69.71 | 75.46 |
| + Betweenness Centrality | INT4 | 59.03 | 54.25 | 50.58 | 63.81 | 67.55 | 70.61 |
| + Closeness Centrality | INT4 | 58.52 | 61.73 | 50.48 | 63.14 | 69.54 | 71.97 |
| + Katz Centrality | INT4 | 53.68 | 55.24 | 44.08 | 57.19 | 57.36 | 57.77 |
| + Ours | INT4 | 70.15 | 70.61 | 69.67 | 67.53 | 78.49 | 76.43 |
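To make the grouping criterion concrete, the sketch below is our own minimal illustration with NetworkX (assuming an undirected graph, so in-degree equals degree, and reading the localized Wiener index as the Wiener index of the k-hop ego subgraph); it is not the paper's accelerated algorithm, only the quantity being grouped on.

```python
import networkx as nx
from collections import defaultdict

def localized_wiener_index(G, node, k=2):
    """Sum of shortest-path distances between all pairs of nodes in the
    k-hop ego subgraph around `node` (one plausible localized Wiener index)."""
    ego = nx.ego_graph(G, node, radius=k)
    lengths = dict(nx.all_pairs_shortest_path_length(ego))
    # Each unordered pair is counted twice in the double sum, so halve it.
    return sum(d for dists in lengths.values() for d in dists.values()) // 2

def group_nodes(G, k=2):
    """Group nodes by (degree, localized Wiener index); in TopGQ's spirit,
    each group would then share one set of quantization parameters."""
    groups = defaultdict(list)
    for v in G.nodes():
        groups[(G.degree(v), localized_wiener_index(G, v, k))].append(v)
    return groups

if __name__ == "__main__":
    G = nx.karate_club_graph()
    for key, members in sorted(group_nodes(G).items())[:5]:
        print(key, members)
```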

W3. Section 5.3: The introduction of scale absorption does not clarify its purpose. According to Equation (16), it merges the scale of X into the adjacency matrix A after quantization. However, the results provided in Section 6.6 show that the distribution before quantization is more uniform and is considered easier for quantization, which appears inconsistent with the method presented in Section 5.3.

→ The purpose of Scale Absorption is to preserve integer-format aggregation speedups while mitigating quantization challenges in GNN activations. GNNs have quantization difficulties induced by the nature of activations in GNN layers: activation outliers occur node-wise due to the message-passing mechanism, where repeated aggregation can amplify values, leading to significant outliers. This observation is illustrated in Figure 5. Applying Scale Absorption prevents activations from being quantized in a poor feature-wise (column-wise) manner and ensures integer operations in aggregation.

We acknowledge the clarity issue for readers and have revised the Scale Absorption section and its analysis in Section 6.7 to further clarify its design objective. If further improvements are needed, we would be happy to discuss them; please let us know.

To address any confusion caused by our description, we also provide an additional explanation of Figure 5 (referenced in Section 6.7) here in the comments. The left figure shows the FP32 format of X_comb, while the right figure depicts its quantized INT8 version after applying scale absorption. In the FP32 representation, significant node-wise outliers are visible. When it is quantized in a feature-wise manner (i.e., activations with the same feature index are quantized together), the outliers force most values onto a few integers, causing a skewed distribution and inefficient use of the integer range (bits).

Scale absorption addresses this issue by enabling node-wise quantization, which isolates node-wise outliers for separate quantization. It ensures more evenly distributed values across the integer range, as seen in the right figure, where the distribution of quantized values is substantially more uniform.
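To make the feature-wise vs. node-wise distinction concrete, here is a minimal NumPy sketch of the idea as we read it (our own toy values and notation, not the paper's implementation): a single node-wise outlier inflates every feature-wise scale, whereas node-wise scales isolate it, and those per-node scales can then be absorbed into the adjacency matrix so the aggregation stays a single matmul over quantized features.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 8, 4
A = (rng.random((N, N)) < 0.3).astype(np.float32)   # toy dense adjacency
X = rng.normal(size=(N, F)).astype(np.float32)
X[3] *= 100.0                                        # node 3 is a node-wise outlier

def q(x, scale, bits=8):
    """Symmetric uniform quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax)

# (a) Feature-wise (column-wise) INT8 scales: the outlier node inflates every
#     column scale and crushes the other nodes' values toward zero.
s_col = np.abs(X).max(axis=0) / 127.0
err_col = np.abs(q(X, s_col) * s_col - X)[np.arange(N) != 3].max()

# (b) Node-wise INT8 scales: one scale per node isolates the outlier.
s_X = np.abs(X).max(axis=1, keepdims=True) / 127.0
Xq = q(X, s_X)
err_node = np.abs(Xq * s_X - X)[np.arange(N) != 3].max()
print(f"max error on non-outlier nodes: feature-wise {err_col:.3f} vs node-wise {err_node:.4f}")

# Scale absorption: fold the per-node scales s_X into the columns of A so that
# A @ X is approximately (A * s_X^T) @ Xq; the absorbed A would itself be
# quantized afterwards so the whole aggregation runs on integers.
A_absorbed = A * s_X.T
print("aggregation error after absorption:", np.abs(A_absorbed @ Xq - A @ X).max())
```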

Comment

We apologize that our response to W6 was unintentionally omitted from the previous comments. We provide it below.

W6. Section 6.5: Although actual inference speedup is provided, INT8 only achieves 1.25 times acceleration compared to FP32, or even lower. However, reference [2] demonstrates 3-4 times or higher inference speedup. Additionally, while INT4 is mentioned, there is no deployment of INT4, requiring a reasonable explanation.

→ As the computational cost of our method and the referenced paper [1] is similar, we believe that the speedup of [1] comes from a highly optimized CUDA kernel from [2]. One of our baselines, Degree-Quant [3], uses the same per-tensor quantization as [1] but reports a 1.1× speedup compared to FP32, considerably lower than reported in [1]. This varying speedup despite similar quantization settings is due to the use of different CUDA kernels, not the efficiency of the algorithms. Building a highly optimized kernel for GNN inference is another line of work that is orthogonal to ours. For now, we focus on reducing the accuracy drop of GNN PTQ.

Accelerated INT4 deployment on GNNs is currently very limited due to the absence of publicly available INT4 sparse matrix multiplication (SPMM) kernels that accommodate widely-used quantization settings with full support of TensorCore. Application of [2] is restricted to naive per-tensor symmetric quantization and lacks compatibility with other quantization methods. We anticipate that future kernels leveraging INT4 operations will better support efficient GNN inference.

Nevertheless, Table 5 of TopGQ shows that TopGQ can accelerate inference in integer formats when paired with appropriate kernels supporting integer operations. Additionally, we improved our inference kernel to resolve speed concerns, and therefore provide a new table that presents enhanced full-batch inference time. This updated table is now included as Table 5 in the revised version of TopGQ.

| Method | Type | Bit | Reddit (s) | Speedup | OGBN-Products (s) | Speedup |
|---|---|---|---|---|---|---|
| - | - | FP32 | 1.41 | - | 1.45 | - |
| Degree-Quant | QAT | INT8 | 1.22 | 1.15× | 1.30 | 1.12× |
| A2Q | QAT | INT8 | 1.30 | 1.08× | 1.78 | 0.82× |
| SGQuant | QAT | INT8 | 1.25 | 1.13× | 1.31 | 1.11× |
| TopGQ | PTQ | INT8 | 1.24 | 1.13× | 1.30 | 1.11× |

[1] Wang, Shuang, et al. "Low-bit quantization for deep graph neural networks with smoothness-aware message propagation." Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023.

[2] Wang, Yuke, Boyuan Feng, and Yufei Ding. "QGTC: accelerating quantized graph neural networks via GPU tensor core." Proceedings of the 27th ACM SIGPLAN symposium on principles and practice of parallel programming. 2022.

[3] Tailor, Shyam Anil, Javier Fernandez-Marques, and Nicholas Donald Lane. "Degree-Quant: Quantization-Aware Training for Graph Neural Networks." International Conference on Learning Representations. 2020.

Comment

As our experiment results are ready, we provide our responses to W7 as below.

W7. Section 6.5: The presented speed performance of the optimized local Wiener index is compared to very outdated classical methods, which weakens the argument's credibility.

→ We updated the baselines with parallel GPU algorithms for all-pairs shortest paths and present their speed in a new table.

Process time (h):

| Method | Reddit | ogbn-proteins | ogbn-products |
|---|---|---|---|
| Dijkstra | 0.16 | 0.11 | 8.52 |
| Parallel Dijkstra | 0.18 | 0.13 | 6.99 |
| Floyd-Warshall | 0.57 | 0.41 | 35.44 |
| Parallel Floyd-Warshall | 0.26 | 0.12 | 1.98 |
| Ours | 0.0004 | 0.0002 | 0.2855 |
| Speedup | 409.67 | 602.30 | 6.93 |

In ogbn-products, both the parallel Dijkstra and parallel Floyd-Warshall methods improve the speed of the traditional approaches but remain slower than our Accelerated Localized Wiener Index computation. This is due to the proposed algorithm, where the distance between any arbitrary node pair is limited by a hop-count of k when calculating localized Wiener indices. This approach is not utilized by the baseline methods (Dijkstra, Bellman-Ford, Floyd-Warshall), which leads to their suboptimal performance in terms of speed.

In the case of Reddit and ogbn-proteins with parallel Dijkstra, which are smaller datasets with relatively low workloads compared to ogbn-products, the overhead of parallelization is likely to outweigh the benefits, causing parallel Dijkstra to perform slower than the traditional method. However, the advantages of parallelism become more evident in ogbn-products.

The sources of the parallel methods are cited below.

[1] Lund, Ben, and Justin W. Smith. "A Multi-Stage CUDA Kernel for Floyd-Warshall." arXiv preprint arXiv:1001.4108, 2010.

[2] Harish, Pawan, and Petter J. Narayanan. "Accelerating large graph algorithms on the GPU using CUDA." International conference on high-performance computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.

Comment

As our experiment results are ready, we provide our responses to W1 as below.

W1. The literature review is insufficient. Many SOTA works are not mentioned and have not been included in the experimental comparisons. In the summary of GNN quantization works, only Degree-Quant, SGQuant, and A2Q (highlighted in red) were compared in the experiments, lacking consideration of newer works.

→ We compare against EPQuant [1] and SMP [2] on the citation datasets with the GCN architecture.

| Method | Bit | Cora Acc. | Cora Q. Time (s) | Citeseer Acc. | Citeseer Q. Time (s) | Pubmed Acc. | Pubmed Q. Time (s) |
|---|---|---|---|---|---|---|---|
| - | FP32 | 82.08% | - | 72.34% | - | 80.32% | - |
| [1] EPQuant | INT8 | 79.87% | 24.07 | 69.39% | 44.06 | 76.46% | 97.09 |
| [2] SMP | INT8 | 81.93% | 28.08 | 69.11% | 32.91 | 80.73% | 49.73 |
| TopGQ | INT8 | 82.08% | 1.12 | 72.28% | 1.11 | 80.30% | 1.08 |
| [1] EPQuant | INT4 | 75.62% | 24.07 | 66.41% | 44.13 | 54.99% | 97.25 |
| [2] SMP | INT4 | 79.33% | 28.26 | 68.00% | 34.75 | 78.67% | 49.76 |
| TopGQ | INT4 | 81.50% | 1.40 | 71.90% | 1.17 | 79.58% | 1.21 |

TopGQ outperforms [1] and [2] in accuracy while having the fastest quantization time. In [1], product quantization is used to compress datasets for reduced memory usage, which accounts for most of the initial quantization time. [2] introduces skewness-aware bitwise truncation and learnable ranges, which require additional computations during feature propagation in model training, resulting in longer training times.

As for [3], it leverages compressed graph wavelet transform convolutions combined with convolution layers for quantization. Due to the complexity and theoretical nature of the method, as well as the unavailability of its implementation code, a direct and fair comparison at this stage remains challenging. We therefore aim to compare with [3] in future work.

[1] Huang, Linyong, et al. "EPQuant: A Graph Neural Network compression approach based on product quantization." Neurocomputing 503, 2022.

[2] Wang, Shuang, et al. "Low-bit quantization for deep graph neural networks with smoothness-aware message propagation." Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023.

[3] Eliasof, Moshe, Benjamin J. Bodner, and Eran Treister. "Haar wavelet feature compression for quantized graph convolutional networks." IEEE Transactions on Neural Networks and Learning Systems, 2023.

Comment

Dear Reviewer fkMV,

We sincerely thank the reviewer for the thoughtful suggestions and concerns regarding TopGQ. The comments significantly helped enrich our paper and led to the following additions.

  • We enhanced the literature review regarding state-of-the-art GNN quantization works.
  • We added the comparison results of the localized Wiener Index and other centralities in Section 6.6, as experimental support of our node grouping method.
  • We added further clarification of Scale Absorption in Section 5.3.
  • We added comparison results of TopGQ with its QAT settings in Appendix G as Table 14 and 15.
  • We updated the experimental results of baselines in citation datasets regarding fluctuations in Appendix F, as in Table 13.
  • We revised the quantized inference time measurement in Table 5 in the main paper.
  • We compared TopGQ against state-of-the-art quantization methods on the citation datasets.
  • We further compared the parallel all-pairs shortest path methods with our accelerated localized Wiener Index calculation algorithm.

The experiments above helped us explore an in-depth analysis of TopGQ, and we deeply thank the reviewer for this enhancement.

As the discussion phase will end soon, we wanted to kindly follow up to confirm whether our response has addressed your concerns. Please feel free to let us know if there are any remaining questions or issues, and we would be happy to provide further clarification or discussion.

We thank the reviewer again for dedicating the time and effort to reviewing our work.

Sincerely, the authors of TopGQ.

Review (Rating: 5)

This paper proposes a post-training quantization method for graph neural networks. It uses both degree information and topological information (the localized Wiener index) to efficiently group nodes with similar embedding magnitudes together. The paper also employs a fast algorithm for calculating the local Wiener index to reduce the quantization and inference overhead. Experiments show that the method maintains the accuracy of the models well.

Strengths

  1. This paper uses post-training quantization without backprop, while the related works are mostly quantization-aware training, or requires backpropagating gradients in the quantization procedure.
  2. It is novel to consider topology information to group the nodes and performs well.
  3. The method has little harm on the accuracy performance.

Weaknesses

  1. The overhead of calculating the topological information is faster compared to other baselines, but it seems still a burden compared with the inference time, especially considering most inference is done in batch.
  2. It looks like that the inference is not done in a batched manner (correct me if I'm wrong), making the inference time for fp32 baseline extremely long. It would be better if batched inference results can be shown and compared to have a full understanding of the capabilities of this method.
  3. There is no discussion of combining this method with sampling methods, like neighbor sampling or subgraph sampling. Real world graph training for node tasks mostly requires sampling. And the inference for neighbor sampling is much faster. I would like to see how the method performs in these cases and how large the overheads are.

Questions

  1. I wonder how the inference is done, because the inference time on ogbn-products is surprisingly long. Is it done in a batched manner or each single node goes through the network individually?
  2. The accuracy of SAGE on ogbn-products Table 1 is abnormally low. On the ogbn-leaderboard, SAGE on ogbn-products is over 78%. Why is this happening?
  3. Are there any other indexes that may be better than Wiener index, like the spectral information?
  4. Can the authors give more information about why quantization for GNN models is important, or in what scenario is it important? The largest dataset used here (ogbn-products) can be trained and inferred in the full-graph manner in one single card (A6000), which is much faster than the inference time number shown here.

I would like to raise my score if my concerns are properly addressed.

Comment

W1. The overhead of calculating the topological information is faster compared to other baselines, but it seems still a burden compared with the inference time, especially considering most inference is done in batch.

→ We want to emphasize that computing the localized Wiener indices is done only once per graph, at quantization time. During inference, we only need to calculate the Wiener indices of unseen nodes. To compare this overhead with the inference time, we provide comparison results in the GCN INT8 setting on the graph datasets PROTEINS and NCI1. The nodes of the test-set graphs are the unseen nodes, and their information has to be computed at inference time.

| k | Dataset | GCN Test Inference (s) | Overhead (s) | Percentage |
|---|---|---|---|---|
| 2 | PROTEINS | 0.0438 | 0.0001 | 0.23% |
| 2 | NCI1 | 0.13321 | 0.0003 | 0.22% |
| 3 | PROTEINS | 0.0438 | 0.0018 | 4.07% |
| 3 | NCI1 | 0.13321 | 0.0048 | 3.63% |

As the table shows, the overhead of calculating the Wiener indices of unseen test nodes accounts for only a very small portion of the total inference time (below 1% for k = 2 and about 4% for k = 3). Note that this is made possible by a specialized algorithm accelerating the computation of the localized Wiener Index, which is another contribution of TopGQ. We added this analysis in Section G.

W2, W3, Q1: How is inference performed in your method, particularly for the FP32 baseline, where the inference time seems extremely long? Is it conducted in a batched manner or node-by-node? Additionally, have you considered evaluating the method with sampling techniques (e.g., neighbor or subgraph sampling) to improve inference efficiency?

→ The long inference times were caused by our unoptimized kernel. We improved our inference kernel to resolve the speed concerns and therefore provide a new table presenting the enhanced full-batch inference times with practical durations. This updated table is now included as Table 5 in the revised version of TopGQ.

| Method | Type | Bit | Reddit (s) | Speedup | OGBN-Products (s) | Speedup |
|---|---|---|---|---|---|---|
| - | - | FP32 | 1.41 | - | 1.45 | - |
| Degree-Quant | QAT | INT8 | 1.22 | 1.15× | 1.30 | 1.12× |
| A2Q | QAT | INT8 | 1.30 | 1.08× | 1.78 | 0.82× |
| SGQuant | QAT | INT8 | 1.25 | 1.13× | 1.31 | 1.11× |
| TopGQ | PTQ | INT8 | 1.24 | 1.13× | 1.30 | 1.11× |

We can see that, like the baselines, TopGQ provides speedups from integer operations. Sampling methods can also be used with TopGQ batched inference. We measured batched inference with neighbor sampling using a fan-out of [25, 10] and a mini-batch size of 4096, and compared its inference time with full-batch inference for TopGQ and the baselines alike.

| Dataset | Method | Inference Time | Slowdown vs. full-batch |
|---|---|---|---|
| Reddit | Degree-Quant (QAT) | 1.47 s | 0.83× |
| Reddit | A2Q (QAT) | 2.23 s | 0.59× |
| Reddit | SGQuant (QAT) | 1.52 s | 0.82× |
| Reddit | TopGQ (PTQ) | 1.49 s | 0.83× |
| ogbn-products | Degree-Quant (QAT) | 35.36 s | 0.037× |
| ogbn-products | A2Q (QAT) | 54.46 s | 0.033× |
| ogbn-products | SGQuant (QAT) | 36.30 s | 0.036× |
| ogbn-products | TopGQ (PTQ) | 35.77 s | 0.036× |

We can observe that the batched inference of TopGQ with sampling incurs a slowdown consistent with the other methods, except A2Q, whose inference speed differs because it performs its quantization parameter search at run time. This shows that the overhead of applying sampling methods to TopGQ inference remains consistent with other GNN quantization methods.
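For reference, below is a minimal sketch of the sampling configuration described above (fan-out [25, 10], mini-batch size 4096), written against the public PyTorch Geometric API; the NeighborLoader/Reddit classes and the `model` argument are our assumptions for illustration and are not taken from the paper's code.

```python
import torch
from torch_geometric.datasets import Reddit
from torch_geometric.loader import NeighborLoader

# Reddit node-classification graph (downloaded to a local folder).
data = Reddit(root="data/Reddit")[0]

# Neighbor sampling with fan-out [25, 10] and mini-batch size 4096,
# matching the setting reported above.
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],
    batch_size=4096,
    input_nodes=data.test_mask,
    shuffle=False,
)

@torch.no_grad()
def batched_inference(model, device="cuda"):
    """Run mini-batched inference; `model` is any (quantized) GNN taking (x, edge_index)."""
    model.eval()
    preds = []
    for batch in loader:
        batch = batch.to(device)
        out = model(batch.x, batch.edge_index)
        # Only the first `batch_size` nodes of each mini-batch are seed nodes.
        preds.append(out[:batch.batch_size].argmax(dim=-1).cpu())
    return torch.cat(preds)
```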

Q2. The accuracy of SAGE on ogbn-products Table 1 is abnormally low. On the ogbn-leaderboard, SAGE on ogbn-products is over 78%. Why is this happening?

→ We believe the accuracy difference comes from the choice of aggregator function. On the leaderboard, the selected aggregator was “mean”, while our experiments used “max” as the aggregator.

We present the quantization results of TopGQ for the GraphSAGE architecture with the “mean” aggregator, whose FP32 accuracy is comparable to the ogbn leaderboard scores. As presented in the table below, TopGQ preserves performance regardless of the aggregator function. We added these results in Section E of the revision.

| Method | Bit | Acc. (%) | Q. time (s) |
|---|---|---|---|
| - | FP32 | 79.00 | - |
| Degree-Quant | INT8 | 78.73 | 482702.99 |
| A2Q | INT8 | 77.43 | 56232.71 |
| SGQuant | INT8 | 41.22 | 114816.30 |
| TopGQ | INT8 | 77.17 | 436.23 |
| Degree-Quant | INT4 | 75.65 | 493003.42 |
| A2Q | INT4 | 45.88 | 58034.59 |
| SGQuant | INT4 | 26.95 | 118242.50 |
| TopGQ | INT4 | 70.23 | 435.12 |
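For clarity, the aggregator difference is a one-argument change in the layer definition; the PyTorch Geometric sketch below uses an illustrative two-layer architecture and dimensions of our own choosing, not the exact model from the paper.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGE(torch.nn.Module):
    """Two-layer GraphSAGE; `aggr` switches between the 'max' aggregator used
    in our experiments and the 'mean' aggregator used on the ogbn leaderboard."""
    def __init__(self, in_dim, hidden_dim, out_dim, aggr="mean"):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim, aggr=aggr)
        self.conv2 = SAGEConv(hidden_dim, out_dim, aggr=aggr)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# ogbn-products has 100-dim features and 47 classes; the hidden size is illustrative.
model_max = SAGE(100, 256, 47, aggr="max")    # setting used in our main tables
model_mean = SAGE(100, 256, 47, aggr="mean")  # leaderboard-style setting above
```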
Comment

Q3. Are there any other indexes that may be better than Wiener index, like the spectral information?

→ To further identify the advantages of the localized Wiener Index for quantization, we compared quantization results on the PROTEINS and NCI1 datasets against other graph properties such as betweenness centrality, closeness centrality, and Katz centrality. The results are below.

The other node centrality measures show suboptimal performance compared to the localized Wiener Index in both INT4 and INT8 settings. We believe this result stems from the unique expressiveness of the localized Wiener Index in capturing the local compactness of a node within its k-hop neighborhood: a small value indicates dense connectivity among a node's neighbors and relatively rapid propagation of features via message passing. Therefore, TopGQ can effectively group node features with distinctive ranges, as shown in Figure 2 of the paper, leading to enhanced quantization quality.

| Method | Bit | PROTEINS GCN | PROTEINS GIN | PROTEINS GS | NCI1 GCN | NCI1 GIN | NCI1 GS |
|---|---|---|---|---|---|---|---|
| - | FP32 | 76.19 | 74.79 | 72.87 | 80.41 | 81.46 | 78.46 |
| Degree Centrality only | INT8 | 72.57 | 71.86 | 70.48 | 78.91 | 81.28 | 78.32 |
| + Betweenness Centrality | INT8 | 62.10 | 61.55 | 55.08 | 76.89 | 75.18 | 75.13 |
| + Closeness Centrality | INT8 | 62.48 | 64.96 | 57.33 | 76.49 | 76.68 | 75.85 |
| + Katz Centrality | INT8 | 56.82 | 57.97 | 48.56 | 64.20 | 62.19 | 64.27 |
| + Ours | INT8 | 75.94 | 74.86 | 74.00 | 80.91 | 81.88 | 79.16 |
| Degree Centrality only | INT4 | 56.15 | 45.04 | 50.65 | 60.54 | 69.71 | 75.46 |
| + Betweenness Centrality | INT4 | 59.03 | 54.25 | 50.58 | 63.81 | 67.55 | 70.61 |
| + Closeness Centrality | INT4 | 58.52 | 61.73 | 50.48 | 63.14 | 69.54 | 71.97 |
| + Katz Centrality | INT4 | 53.68 | 55.24 | 44.08 | 57.19 | 57.36 | 57.77 |
| + Ours | INT4 | 70.15 | 70.61 | 69.67 | 67.53 | 78.49 | 76.43 |

We experimented with other node centralities to substitute the localized Wiener Index and provide quantization results based on new grouping methods with PROTEINS and NCI1 datasets. Results show that the localized Wiener Index performs significantly better than other information centralities, especially in low-bit settings.

Q4. Can the authors give more information about why quantization for GNN models is important, or in what scenario is it important? The largest dataset used here (ogbn-products) can be trained and inferred in the full-graph manner in one single card (A6000), which is much faster than the inference time number shown here.

→ GNN quantization addresses the high memory and computational demands of large-scale graph processing, a challenge unique to GNNs compared to other neural networks, as the hardware requirements of GNN inference are significantly more sensitive to data size. It enables broader deployment of GNNs, especially in domains like traffic forecasting [1], IoT [2], bioinformatics [3], and knowledge graphs [4], where scalability to real-world large graphs is critical. By compressing GNN models with minimal performance loss, quantization broadens the usage and deployment of GNNs in resource-constrained systems.

We want to add that improving quantized model performance on large-scale graphs is still in its infancy in the GNN quantization field. We reported performance on the ogbn-products dataset (a graph with over 2,400,000 nodes), a scale not experimented on in the baseline GNN quantization works. We believe extending quantization to GNNs trained on graphs even larger than ogbn-products is future work worth continuing, and it will contribute to extensive usage of GNN models in the real world.

[1] Jiang, Weiwei, and Jiayun Luo. "Graph neural network for traffic forecasting: A survey." Expert systems with applications 207 (2022): 117921.

[2] Dong, Guimin, et al. "Graph neural networks in IoT: A survey." ACM Transactions on Sensor Networks 19.2 (2023): 1-50.

[3] Li, Yu, et al. "Deep learning in bioinformatics: Introduction, application, and perspective in the big data era." Methods 166 (2019): 4-21.

[4] Chen, Huiyuan, et al. "Tinykg: Memory-efficient training framework for knowledge graph neural recommender systems." Proceedings of the 16th ACM Conference on Recommender Systems. 2022.

[5] Rahmani, Saeed, et al. "Graph neural networks for intelligent transportation systems: A survey." IEEE Transactions on Intelligent Transportation Systems 24.8 (2023): 8846-8885.

Comment

Dear Reviewer 7KRu,

We sincerely appreciate the reviewer’s thoughtful concerns and suggestions regarding TopGQ, which we believe have provided valuable guidance for further enhancing our paper. The comments helped enrich our revised version of TopGQ, leading to the following modifications in our paper.

  • We added experimental results of localized Wiener Index calculation cost of unseen nodes in Appendix H as Table 16.
  • We revised the quantized inference time measurement in Table 5 in the main paper.
  • We provide quantization results of TopGQ on GraphSAGE with “mean” aggregators in Appendix E.
  • We added the comparison results of the localized Wiener Index and other centralities in Section 6.6, as experimental support of our node grouping method.

The above experiments have greatly contributed to enhancing the quality of our work, and we deeply thank the reviewer for this improvement.

As the discussion phase is ending soon, we would appreciate it if the reviewer could let us know whether our response has addressed the concerns raised regarding the paper. We would gladly answer any additional questions if our response is insufficient or the reviewer has unresolved concerns.

We thank the reviewer again for dedicating the time and effort to reviewing our work.

Sincerely, the authors of TopGQ.

Review (Rating: 5)

This paper proposes TopGQ, a post-training quantization (PTQ) framework for GNNs. Unlike existing quantization methods that rely on quantization-aware training (QAT), which involves retraining the model with gradient updates, TopGQ achieves efficient quantization by leveraging the topological information of the graph without requiring any retraining. Specifically, TopGQ groups vertices with similar topology information, including inward degree and localized Wiener index, to share quantization parameters within the group, which can quantize GNNs without backpropagation and accelerate the quantization. To further optimize inference efficiency, TopGQ absorbs group-wise scale factors into the adjacency matrix during aggregation steps, which allows for efficient integer matrix multiplication. Experiments show that TopGQ outperforms SOTA GNN quantization methods in performance with a faster quantization speed.

Strengths

  • The proposed method is innovative. Both indegree information and localized Wiener index are used for node grouping to effectively address the high feature magnitude variance issue in GNN quantization.
  • Accelerating the quantization process of GNNs is necessary.

Weaknesses

  • This work may not achieve a wall-clock speedup on graph-level tasks because it requires grouping the nodes of unseen graphs and computing quantization parameters.
  • The application of this method is limited. Because of the use of scale absorption, it seems hard to apply this method to GAT models.
  • The acceleration of the Accelerated Wiener Index Computation Algorithm is mainly because of parallel computing rather than the proposed algorithm.
  • The third method, scale absorption, is commonly used in network quantization. It should not be a key contribution to this paper.

Questions

  • In Table 6, the baseline methods use the SciPy implementation. It is better to compare the Accelerated Wiener Index Computation Algorithm with a parallel Dijkstra method.
  • Can this method be applied to GAT models? If so, it is better to add more experiments about GAT quantization to show the generalization of TopGQ.
  • In the experiments, the baseline A2Q method also uses uniform quantization, but it is a mixed-precision method. So it would be better to compare a mixed-precision version of A2Q under the same compression or computation constraint.
Comment

Q3. In the experiments, the baseline A2Q method also uses uniform quantization, but it is a mixed-precision method. So it would be better to compare a mixed-precision version of A2Q under the same compression or computation constraint.

→ As the reviewer mentioned, we fixed the bitwidth of A2Q to make the comparison under the same compression and computation constraint. Since we focus on fast and efficient fixed-precision quantization, we believe a comparison in the mixed-precision setting is out of our scope.

We would like to clarify that we set the baseline A2Q method to use fixed-precision quantization not to beat A2Q in a more favorable setting, but to best illustrate that TopGQ targets and addresses a challenge current GNN quantization work faces: a fast and effective fixed-precision quantization method for GNNs. The comparison therefore neither intends nor introduces unfairness. We selected A2Q as a baseline because it represents one of the latest successful works on GNN quantization, in the separate domain of mixed-precision quantization. We additionally note that quantization methods rarely perform well across both fields, as each targets distinct application scenarios, including different bit-width constraints and deployable hardware.

Comment

W1. This work may not achieve a wall-clock speedup on graph-level tasks because it requires grouping the nodes of unseen graphs and computing quantization parameters.

| k | Dataset | GCN Test Inference (s) | Overhead (s) | Proportion |
|---|---|---|---|---|
| 2 | PROTEINS | 0.0438 | 0.0001 | 0.23% |
| 2 | NCI1 | 0.13321 | 0.0003 | 0.22% |
| 3 | PROTEINS | 0.0438 | 0.0018 | 4.07% |
| 3 | NCI1 | 0.13321 | 0.0048 | 3.63% |

→ We provide the inference time for the setting the reviewer is concerned about, on the graph datasets PROTEINS and NCI1 in the inductive setting. As the table shows, the overhead required to compute the topological information of unseen test nodes accounts for only a very small portion of the total inference time (below 1% for k = 2). This is enabled by TopGQ's accelerated localized Wiener Index computation.

W2. The application of this method is limited. Because of the use of scale absorption, it seems hard to apply this method to GAT models. Q2. Can this method be applied to GAT models? If so, it is better to add more experiments about GAT quantization to show the generalization of TopGQ.

→ The reason is that GAT's characteristic attention-based edge weights require dynamic quantization, which cannot be precomputed. Please note that this restriction applies not only to ours but also to all GNN quantization methods. However, we modified our proposed scale absorption method for GAT and provide experimental results below. The results show that TopGQ also performs well on the GAT architecture.

Citation graphs

| Method | Type | Bit | Cora Acc. (%) | Q. time (s) | Citeseer Acc. (%) | Q. time (s) | PubMed Acc. (%) | Q. time (s) |
|---|---|---|---|---|---|---|---|---|
| - | - | FP32 | 82.10 | - | 74.10 | - | 79.42 | - |
| Degree-Quant | QAT | INT8 | 81.70 | 18.30 | 69.80 | 41.31 | 79.20 | 61.02 |
| A2Q | QAT | INT8 | 77.50 | 2.44 | 69.50 | 2.55 | 72.80 | 3.11 |
| SGQuant | QAT | INT8 | 79.90 | 5.71 | 68.40 | 8.72 | 76.00 | 9.74 |
| TopGQ | PTQ | INT8 | 82.02 | 0.86 | 73.70 | 1.11 | 79.32 | 1.26 |
| Degree-Quant | QAT | INT4 | 80.70 | 18.78 | 23.10 | 40.91 | 74.50 | 60.89 |
| A2Q | QAT | INT4 | 76.80 | 2.44 | 61.80 | 2.47 | 70.50 | 3.16 |
| SGQuant | QAT | INT4 | 74.70 | 5.55 | 66.20 | 8.67 | 72.40 | 9.77 |
| TopGQ | PTQ | INT4 | 80.34 | 0.92 | 66.92 | 1.20 | 78.06 | 1.25 |

Graph classification tasks

| Method | Type | Bit | PROTEINS Acc. (%) | Q. time (s) | NCI1 Acc. (%) | Q. time (s) |
|---|---|---|---|---|---|---|
| - | - | FP32 | 75.56 | - | 79.73 | - |
| Degree-Quant | QAT | INT8 | 72.41 | 3580.78 | 74.50 | 7988.47 |
| A2Q | QAT | INT8 | 72.42 | 385.62 | 72.28 | 997.62 |
| SGQuant | QAT | INT8 | 68.82 | 267.65 | 74.42 | 753.09 |
| TopGQ | PTQ | INT8 | 75.74 | 4.87 | 79.48 | 9.49 |
| Degree-Quant | QAT | INT4 | 71.96 | 3626.77 | 74.01 | 8078.41 |
| A2Q | QAT | INT4 | 70.36 | 396.62 | 66.16 | 1002.95 |
| SGQuant | QAT | INT4 | 59.56 | 267.66 | 58.49 | 754.85 |
| TopGQ | PTQ | INT4 | 69.09 | 4.71 | 69.70 | 9.90 |

We also specify how Scale Absorption is applied in GAT models. GAT requires quantization of the edge weights (A) during inference, because FP32 operations are needed to obtain the edge weights. In GAT, Scale Absorption is implemented right before the run-time quantization step as an FP32 element-wise multiplication between A and the precalculated scales.
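Below is a minimal NumPy sketch of this GAT variant as described (synthetic values, our own simplification of the kernel): the run-time FP32 attention matrix is multiplied element-wise by the precalculated per-node scales and only then quantized, so aggregation is an integer matmul followed by a single rescale.

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric per-tensor quantization; returns integer values and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

rng = np.random.default_rng(0)
N, F = 6, 4

# Precomputed offline: per-node scales of X and the node-wise quantized features.
s_X = rng.uniform(0.01, 0.2, size=N)              # one scale per source node
Xq = rng.integers(-127, 128, size=(N, F))         # already-quantized features

# Produced at run time by the GAT layer: FP32 attention coefficients.
A_att = rng.random((N, N)).astype(np.float32)

# Scale absorption for GAT: element-wise multiply the attention matrix by the
# precalculated scales, then quantize it right before aggregation.
A_absorbed = A_att * s_X[None, :]
Aq, s_A = quantize_sym(A_absorbed)

out = (Aq.astype(np.int64) @ Xq) * s_A            # integer matmul + one FP rescale
ref = A_absorbed @ Xq                             # FP reference with absorbed scales
print("max abs deviation from FP reference:", np.abs(out - ref).max())
```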

W3, Q1: The acceleration of the Accelerated Wiener Index Computation Algorithm is mainly because of parallel computing rather than the proposed algorithm. In Table 6, the baseline methods use the SciPy implementation. It is better to compare the Accelerated Wiener Index Computation Algorithm with a parallel Dijkstra method.

→ The acceleration of our localized Wiener Index calculation comes from the proposed algorithm and not mainly from parallel computing. We designed the algorithm around the observation that the distance of any node pair considered is always bounded by the hop count k, which the baseline methods (Dijkstra, Bellman-Ford, Floyd-Warshall) do not exploit. This theoretical bound eliminates excessive searches and path computations, enabling a large speedup.

We are working on an additional experiment comparing our accelerated Wiener index algorithm against parallel versions of other traditional algorithms. We will promptly notify the reviewer with the updated results as soon as they are available.

Comment

We apologize that our response to W4 was unintentionally omitted from the previous comments. We provide it below.

W4. The third method, scale absorption, is commonly used in network quantization. It should not be a key contribution to this paper.

→ Scale Absorption is a unique method, as its purpose is to preserve integer-format aggregation speedups while mitigating quantization challenges in GNN activations. GNNs have quantization difficulties induced by the nature of activations in GNN layers: activation outliers occur node-wise due to the message-passing mechanism, where repeated aggregation can amplify values, leading to significant outliers. This observation is illustrated in Figure 5. Applying Scale Absorption prevents activations from being quantized in a poor feature-wise (column-wise) manner and ensures integer operations in aggregation.

We are deeply interested in understanding the similarities the reviewer perceives between other quantization methods and Scale Absorption, and we look forward to discussing this topic further with the reviewer.

Comment

As our experiment results are ready, we provide our responses to W3 and Q1 as below.

W3, Q1. The acceleration of the Accelerated Wiener Index Computation Algorithm is mainly because of parallel computing rather than the proposed algorithm. / In Table 6, the baseline methods use the SciPy implementation. It is better to compare the Accelerated Wiener Index Computation Algorithm with a parallel Dijkstra method.

→ We updated the baselines with parallel GPU algorithms for all-pairs shortest paths and present their speed in a new table.

Process time (h):

| Method | Reddit | ogbn-proteins | ogbn-products |
|---|---|---|---|
| Dijkstra | 0.16 | 0.11 | 8.52 |
| Parallel Dijkstra | 0.18 | 0.13 | 6.99 |
| Floyd-Warshall | 0.57 | 0.41 | 35.44 |
| Parallel Floyd-Warshall | 0.26 | 0.12 | 1.98 |
| Ours | 0.0004 | 0.0002 | 0.2855 |
| Speedup | 409.67 | 602.30 | 6.93 |

The table makes it clear that the acceleration of the localized Wiener Index computation comes from the algorithm itself rather than from parallel computing. On ogbn-products, both parallel Dijkstra and parallel Floyd-Warshall improve on the speed of the traditional approaches but remain slower than our accelerated localized Wiener Index computation. This is due to the theoretical bound in the proposed algorithm, where the distance between any node pair is limited by a hop count of k when calculating localized Wiener indices; the baseline methods (Dijkstra, Bellman-Ford, Floyd-Warshall) do not utilize this bound, which leads to their suboptimal speed.

In the case of Reddit and ogbn-proteins with parallel Dijkstra, which are smaller datasets with relatively low workloads compared to ogbn-products, the overhead of parallelization is likely to outweigh the benefits, causing parallel Dijkstra to perform slower than the traditional method. However, the advantages of parallelism become more evident in ogbn-products.

The sources of the parallel methods are cited below.

[1] Lund, Ben, and Justin W. Smith. "A Multi-Stage CUDA Kernel for Floyd-Warshall." arXiv preprint arXiv:1001.4108, 2010.

[2] Harish, Pawan, and Petter J. Narayanan. "Accelerating large graph algorithms on the GPU using CUDA." International conference on high-performance computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007.

Comment

Dear Reviewer APES,

We deeply thank the reviewer for the insightful suggestions and concerns regarding TopGQ. The valuable comments have been greatly helpful in improving our paper and led to the new experiments listed below.

  • We added experimental results of localized Wiener Index calculation cost of unseen nodes in Appendix H as Table 16.
  • We provide quantization results of TopGQ on GAT models in Appendix D.
  • We added further clarification of Scale Absorption in Section 5.3.
  • We further compare the parallel all-pair shortest path methods with our accelerated localized Wiener Index calculation algorithm.

The experiments mentioned above have been instrumental in improving the quality of our work, and we sincerely thank the reviewer for this enhancement.

As the discussion phase is approaching its end, we hope to know if our response has addressed the concerns and questions raised by the reviewer. If there are any remaining issues or if our response falls short, we would be glad to discuss them further.

We thank the reviewer again for generously dedicating valuable time to reviewing our work, TopGQ.

Sincerely, the authors of TopGQ.

Review (Rating: 3)

The paper presents a post-training quantization (PTQ) framework tailored for graph neural networks (GNNs) that mitigates quantization error by grouping nodes based on topology (in-degree and local Wiener index). By absorbing group-specific scales into the adjacency matrix, TopGQ enables highly efficient integer matrix multiplication. Experiments demonstrate that TopGQ achieves faster quantization speeds than quantization-aware training (QAT) methods.

Strengths

1-TopGQ introduces a PTQ approach, eliminating the need for gradient computation, which reduces quantization time significantly.

2-Using indegree and local Wiener index to group nodes based on topological similarity seems novel

Weaknesses

1- Lack of motivation and not providing proper application for the work

2-Considerable accuracy drop in 4-bit scenarios.

3-Less detailed justification about the obtained results.

Questions

1- The motivation of the paper is not clear to me. The paper needs to provide applications where fast quantization is urgently needed. From the plots, the QAT takes about 2 hours for large workloads in most cases. Since it needs to be done one time, quantization time cannot be a good motivation to me especially when the proposed method has an accuracy drop. Please provide more applications and cite a few notable references that show quantization time matters. It would be helpful to suggest specific applications or scenarios where fast quantization could be particularly valuable.

2- The motivation behind quantization can be time and storage. However, as Table 5 shows, TOPGQ is not showing any gains in terms of inference time. No results for storage reduction are provided as well. The authors can provide a more comprehensive analysis of the trade-offs between accuracy, quantization time, inference time, and storage requirements.

3- The author didn’t study GAT. Is there any reason behind it?

4- Several studies show that PTQ starts degrading performance when it goes below 4 bits. I think the proposed PTQ technique will not work well compared to QAT below 4 bits unless the authors show some results that nullify the hypothesis. Even for 4-bit, I can see quite a notable accuracy drop compared to FP32 (e.g., around 7% on Reddit GCN and 21% on ogbn-products GCN). The authors need to provide an application where such drops are acceptable.

5- The authors need to provide detailed justification of where PTQ can perform better than QAT and FP32. For example, why, in the case of ogbn-proteins, is TopGQ better than FP32 by a noticeable margin?

6-How sensitive is TopGQ to changes in group sizes or to variations in the rank used in low-rank approximations?

7- Outliers might still exist within topologically grouped nodes, especially in large-scale graphs. How does this affect quantization quality?

8- Table 5 needs to provide results for 4-bit as well.

9- An ablation study comparing the use of indegree and Wiener index with other centrality measures (e.g., closeness or betweenness) would provide insights into the robustness of TopGQ’s topology-based grouping.

10- An analysis of how changing grouping parameters (e.g., group size, hop count for Wiener index) affects quantization error would clarify the stability and adaptability of TopGQ.

11- The provided code is just a .py file without any instructions on how to run it and get results. That limits reproducibility. The code should provide a README file with setup instructions, example commands, and a requirements.txt file for dependencies.

12-Minor Typos:

A– Consider changing "huge" to " significant, substantial, etc"

B- Change "On the quantization time" to "In terms of quantization time."

C- "a fair comparison, we use fixed-precision quantization" – Add "a" before "fixed-precision quantization"

D. with a significant drop in accuracy of 9.0%p!

Comment

W1. Lack of motivation and not providing proper application for the work.

We would like to address the issue of weak motivation and application studies of fast GNN quantization by answering related questions (Q1, Q2) as below.

Q1. The paper needs to provide applications where fast quantization is urgently needed. Please provide more applications and cite a few notable references that show quantization time matters.

→ We provide some typical applications that need fast quantization:

  • GNNs processing temporal graphs with rapid modification over time [1,2]
  • GNNs in edge-device-enabled transportation systems [3,4], and recommendation systems [5,6]
  • GNNs in anomaly-sensitive program designs such as fraud detection in financial transactions [7,8,9]

Despite the state-of-the-art performance of GNNs, the field still has limited usage in practical applications due to the growing size of real-world graphs. Especially on user-end devices with limited computational power and memory, GNNs have to be tailored or compressed to accommodate the hardware constraints. However, real-world graphs such as traffic networks or social media graphs are continuously updated, which requires repeated compression of the GNNs to keep the application up to date. In such cases, fast quantization of GNNs can be the only viable solution. The issue can be even more severe in safety-critical applications such as edge fraud detection in financial transactions, where, upon the discovery of a new security flaw, a rapid safety patch is required. We believe GNNs in these domains are especially sensitive to changes in both the graph and the model and thus require fast adaptation, in the extreme to a real-time level.

We added this in the revised version of our paper, in the introduction section.

[1] Gao, Shihong, et al. "ETC: Efficient Training of Temporal Graph Neural Networks over Large-scale Dynamic Graphs." Proceedings of the VLDB Endowment 17.5 (2024): 1060-1072.

[2] Longa, A., et al. "Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities." TRANSACTIONS ON MACHINE LEARNING RESEARCH (2023).

[3] Jiang, Weiwei, and Jiayun Luo. "Graph neural network for traffic forecasting: A survey." Expert systems with applications 207 (2022): 117921.

[4] Sharma, Amit, et al. "A graph neural network (GNN)-based approach for real-time estimation of traffic speed in sustainable smart cities." Sustainability 15.15 (2023): 11893.

[5] Gao, Chen, et al. "A survey of graph neural networks for recommender systems: Challenges, methods, and directions." ACM Transactions on Recommender Systems 1.1 (2023): 1-51.

[6] Yao, Yuhang, et al. "FedRule: Federated rule recommendation system with graph neural networks." Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation. 2023.

[7] Lu, Mingxuan, et al. "Bright-graph neural networks in real-time fraud detection." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.

[8] Zhou, Hongkuan, et al. "Accelerating large scale real-time GNN inference using channel pruning." arXiv preprint arXiv:2105.04528 (2021).

[9] Liu, Ziqi, et al. "Heterogeneous graph neural networks for malicious account detection." Proceedings of the 27th ACM international conference on information and knowledge management. 2018.

Comment

Q2. The motivation behind quantization can be time and storage. The authors can comprehensively analyze the trade-offs between accuracy, quantization time, inference time, and storage requirements.

| Setting | Accuracy | Inference Time (s) | Inference Speedup | Theoretical Cost | Quant. Time (h) | Quant. Speedup | Theoretical Storage |
|---|---|---|---|---|---|---|---|
| A (FP32) | 78.41% | 1.450 | 1 | $O_{FP}(N^2F_1+NF_1F_2)$ | - | - | $O_{FP}(E+F_1F_2+NF_0)$ |
| B (DQ) | 75.26% | 1.295 | 1.120 | $O_{INT}(N^2F_1+NF_1F_2)+O_{FP_{elem}}(NF_2)$ | 95.95 | 1 | $O_{INT}(E+F_1F_2+NF_0)+O_{FP}(1)$ |
| C (DQ-PTQ) | 46.57% | 1.294 | 1.121 | $O_{INT}(N^2F_1+NF_1F_2)+O_{FP_{elem}}(NF_2)$ | 0.28 | 343 | $O_{INT}(E+F_1F_2+NF_0)+O_{FP}(1)$ |
| D (TopGQ) | 76.94% | 1.304 | 1.112 | $O_{INT}(N^2F_1+NF_1F_2)+O_{FP_{elem}}(NF_2)$ | 0.34 | 282 | $O_{INT}(E+F_1F_2+NF_0)+O_{FP}(N_T+F_2)$ |
  • $O_{FP}(\cdot)$: complexity for floating-point operations / storage complexity for floating-point values
  • $O_{FP_{elem}}(\cdot)$: complexity for elementwise floating-point operations
  • $O_{INT}(\cdot)$: complexity for fixed-point operations / storage complexity for fixed-point values

As shown in the table, TopGQ finds a good balance between reducing quantization time and preserving accuracy, while the alternatives (A), (B), and (C) each show a disadvantage in accuracy, time, or memory. (A) suffers from expensive computation and storage costs. (B) alleviates this cost via quantization, but a long quantization time is required to obtain the benefit. (C) avoids the quantization time problem, but at the cost of severe performance degradation. TopGQ addresses each of these issues by leveraging topological node similarities, with some additional storage cost.

For the theoretical costs, we model GNN layer propagation as the $AXW$ operation, with $A \in \mathbb{R}^{N \times N}$, $X \in \mathbb{R}^{N \times F_1}$, $W \in \mathbb{R}^{F_1 \times F_2}$, and an initial feature matrix of size $N \times F_0$. The computation cost shows that quantization converts the expensive floating-point matrix multiplications into integer operations. The additional floating-point cost of (B)–(D) comes from converting the integer outputs back to floating-point values.
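To make this decomposition concrete, below is a minimal PyTorch sketch of a quantized $AXW$ propagation under the simplifying assumption of symmetric per-tensor scales; the names `A_q`, `X_q`, `W_q`, `s_A`, `s_X`, `s_W` are illustrative, and this is not our deployed kernel. The two integer matrix products correspond to the $O_{INT}$ terms, and the final elementwise rescaling to the $O_{FP_{elem}}(NF_2)$ term.

```python
import torch

def quantized_axw(A_q, X_q, W_q, s_A, s_X, s_W):
    """Illustrative integer GNN propagation (A @ X) @ W followed by a single
    elementwise FP32 rescaling. A_q, X_q, W_q are int8 tensors; s_A, s_X, s_W
    are their (per-tensor) FP32 scales."""
    AX = A_q.to(torch.int32) @ X_q.to(torch.int32)        # O_INT(N^2 * F1)
    AXW = AX @ W_q.to(torch.int32)                         # O_INT(N * F1 * F2)
    return AXW.to(torch.float32) * (s_A * s_X * s_W)       # O_FP_elem(N * F2)
```

This sketch uses dense integer matmul for readability; an actual kernel would use sparse formats and hardware integer instructions.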

The measurement is provided by an improved version of our kernel, and the theoretical analysis is based on [1]. We added this in Section C of the revision.

[1] Zhu, Zeyu, et al. "$\rm A^2Q$: Aggregation-Aware Quantization for Graph Neural Networks." arXiv preprint arXiv:2302.00193 (2023).

W2, W3 & Q4. Considerable accuracy drop in 4-bit scenarios and less detailed justification of the obtained results: I think the proposed PTQ technique will not work well compared to QAT in 4-bit and below unless the authors show some results that nullify the hypothesis. The authors need to provide an application where such drops are acceptable.

TopGQ can be effective in cases where the fast inference speed of GNNs is the main priority, such as real-time applications in:

  • point-cloud processing tasks such as indoor navigation, shape modeling, and 3D object detection [1,2]
  • high-energy physics analysis, where GNNs must decide within nanoseconds whether to collect or discard data from a particle collider in order to capture vital information [3]
  • ride-hailing platforms that must process real-time surrounding traffic data and physical environments for event prediction [4]

In low-bit quantization, some degree of accuracy drop compared to FP32 is inevitable because it essentially limits model capacity. While INT4 TopGQ shows degradation relative to FP32 in some cases, it shows a clear advantage over the baselines in both accuracy and quantization time, making it a better option than existing methods. Finally, we emphasize that reaching the performance of FP32 is a shared goal for all quantization methods, and we aim to further close this gap in low-bit settings in future work.

[1] Shao, Jiawei, et al. "Branchy-GNN: A device-edge co-inference framework for efficient point cloud processing." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Shi, Weijing, and Raj Rajkumar. "Point-gnn: Graph neural network for 3d object detection in a point cloud." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[3] Iiyama, Yutaro, et al. "Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics." Frontiers in big Data 3 (2021): 598927.

[4] Luo, Wenjuan, et al. "Dynamic heterogeneous graph neural network for real-time event prediction." Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020.

Comment

Q3. The author didn’t study GAT. Is there any reason behind it?

\to The reason is that GAT's attention-based edge weights are computed at runtime, so the quantization scales of the adjacency matrix must also be computed at runtime, meaning our method of absorbing the scales into the adjacency matrix cannot be precomputed. However, scale absorption can be modified to accommodate such dynamic quantization scenarios, which we present in the tables below. The results show that TopGQ also performs well with the GAT architecture.

Citation graphs

| Model | Method | Prec. | Cora Acc. (%) | Q. time (s) | Citeseer Acc. (%) | Q. time (s) | PubMed Acc. (%) | Q. time (s) |
|---|---|---|---|---|---|---|---|---|
| FP32 | - | - | 82.10 | - | 74.10 | - | 79.42 | - |
| Degree-Quant | QAT | INT8 | 81.70 | 18.30 | 69.80 | 41.31 | 79.20 | 61.02 |
| A2Q | QAT | INT8 | 77.50 | 2.44 | 69.50 | 2.55 | 72.80 | 3.11 |
| SGQuant | QAT | INT8 | 79.90 | 5.71 | 68.40 | 8.72 | 76.00 | 9.74 |
| TopGQ | PTQ | INT8 | 82.02 | 0.86 | 73.70 | 1.11 | 79.32 | 1.26 |
| Degree-Quant | QAT | INT4 | 80.70 | 18.78 | 23.10 | 40.91 | 74.50 | 60.89 |
| A2Q | QAT | INT4 | 76.80 | 2.44 | 61.80 | 2.47 | 70.50 | 3.16 |
| SGQuant | QAT | INT4 | 74.70 | 5.55 | 66.20 | 8.67 | 72.40 | 9.77 |
| TopGQ | PTQ | INT4 | 80.34 | 0.92 | 66.92 | 1.20 | 78.06 | 1.25 |

Graph classification tasks

| Model | Method | Prec. | Proteins Acc. (%) | Q. time (s) | NCI1 Acc. (%) | Q. time (s) |
|---|---|---|---|---|---|---|
| FP32 | - | - | 75.56 | - | 79.73 | - |
| Degree-Quant | QAT | INT8 | 72.41 | 3580.78 | 74.50 | 7988.47 |
| A2Q | QAT | INT8 | 72.42 | 385.62 | 72.28 | 997.62 |
| SGQuant | QAT | INT8 | 68.82 | 267.65 | 74.42 | 753.09 |
| TopGQ | PTQ | INT8 | 75.74 | 4.87 | 79.48 | 9.49 |
| Degree-Quant | QAT | INT4 | 71.96 | 3626.77 | 74.01 | 8078.41 |
| A2Q | QAT | INT4 | 70.36 | 396.62 | 66.16 | 1002.95 |
| SGQuant | QAT | INT4 | 59.56 | 267.66 | 58.49 | 754.85 |
| TopGQ | PTQ | INT4 | 69.09 | 4.71 | 69.70 | 9.90 |

In our modified scale absorption for GAT, the absorption is performed at runtime right before the quantization operation, simply by adding an FP32 element-wise multiplication between the adjacency matrix and the precalculated scales. We clarify this in Section D in the revised paper.
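For clarity, a minimal sketch of this runtime variant follows (our illustration; tensor names such as `att_adj` and `x_scales` are hypothetical). Since row i of A @ X is a weighted sum of the rows of X, and each row X[j, :] is quantized with its own scale, folding the per-node scales into the columns of the attention-weighted adjacency lets the subsequent multiplication run between the quantized adjacency and the node-wise quantized features.

```python
import torch

def absorb_scales_at_runtime(att_adj: torch.Tensor, x_scales: torch.Tensor) -> torch.Tensor:
    """att_adj: [N, N] FP32 attention-weighted adjacency, known only at runtime.
    x_scales: [N] precalculated per-node quantization scales of X.
    Scaling column j of the adjacency by x_scales[j] absorbs the scale of node j,
    so the absorbed adjacency times the int8 features approximates A @ X."""
    return att_adj * x_scales.unsqueeze(0)  # one FP32 element-wise multiplication

# The absorbed adjacency is then quantized as usual and multiplied with the
# node-wise quantized features using integer arithmetic.
```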

Q5. The author needs to provide detailed justification where PTQ can perform better than QAT and FP32. For example, why in the case of ogbn-proteins, TopGQ is better than FP32 by a noticeable margin?

\to The reason TopGQ can outperform the QAT baselines is that our proposed topology-aware node grouping helps to find better quantization parameters. While QAT has the advantage of being able to train the weights, the existing QAT baselines do not account for the nature of GNN aggregation when minimizing the quantization error of each node feature. Our method, in contrast, directly integrates the nature of GNN aggregation into the quantization parameters by grouping nodes by their k-hop topological structure. We therefore believe the superior performance of TopGQ comes from a better ability to find quantization parameters, which is orthogonal to the PTQ/QAT distinction. In other words, while TopGQ is implemented as PTQ for efficiency, it can bring performance gains in both scenarios. To validate this, we present two variants, Degree-Quant-PTQ and TopGQ-QAT, which are the PTQ and QAT versions of each method, respectively. The experimental results in the table below show that our proposed topology-aware grouping performs better regardless of PTQ or QAT.

| Method (INT4) | Cora GCN | Cora GIN | Cora GS | Pubmed GCN | Pubmed GIN | Pubmed GS |
|---|---|---|---|---|---|---|
| Degree-Quant | 79.00 | 71.90 | 73.50 | 78.60 | 76.60 | 78.20 |
| Degree-Quant-PTQ | 78.42 | 30.46 | 78.54 | 78.34 | 50.20 | 77.64 |
| TopGQ-QAT | 80.08 | 76.30 | 76.64 | 78.50 | 77.00 | 76.72 |
| Original TopGQ | 81.50 | 78.58 | 79.64 | 79.58 | 77.70 | 79.00 |

As for the results that outperform the FP32 accuracies, we believe this phenomenon often occurs when the low-bit format is sufficient to handle the original model complexity. We cite some papers that exhibit the same phenomenon in their experiments: for example, [1] and [2] report better performance in 8-bit settings than in FP32 on several tasks.

We added this explanation in Section 6.2 and Appendix I in the revision.

[1] Wu, Di, et al. "Easyquant: Post-training quantization via scale optimization." arXiv preprint arXiv:2006.16669. 2020.
[2] Shomron, Gil, et al. "Post-training sparsity-aware quantization." NeurIPS. 2021.

Comment

Q6 & Q10. How sensitive is TopGQ to changes in group sizes or to variations in the rank used in low-rank approximations? An analysis of how changing grouping parameters (e.g., group size, hop count for Wiener index) affects quantization error would clarify the stability and adaptability of TopGQ.

\to We conducted sensitivity studies on the value of the hop count k for the Wiener index; the results on the graph datasets are included in the original paper as Table 7. Additionally, we prepared the same study on the citation datasets for generality. The tables are below.

| k | Prec. | Cora GCN | Cora GIN | Cora GS | Citeseer GCN | Citeseer GIN | Citeseer GS | PubMed GCN | PubMed GIN | PubMed GS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | INT4 | 81.00 | 77.12 | 78.82 | 71.86 | 69.10 | 70.86 | 78.20 | 75.42 | 78.50 |
| 2 | INT4 | 81.50 | 78.58 | 79.64 | 71.90 | 70.14 | 71.76 | 79.58 | 77.70 | 79.00 |
| 3 | INT4 | 81.56 | 78.32 | 78.56 | 72.12 | 70.47 | 71.38 | 79.20 | 77.00 | 78.72 |
| 1 | INT8 | 81.96 | 78.36 | 79.92 | 72.24 | 70.18 | 71.84 | 80.18 | 78.34 | 78.90 |
| 2 | INT8 | 82.08 | 78.42 | 80.30 | 72.28 | 70.26 | 71.96 | 80.30 | 78.62 | 78.94 |
| 3 | INT8 | 82.10 | 78.38 | 79.54 | 72.24 | 70.60 | 71.92 | 80.24 | 78.68 | 78.84 |

| k | Prec. | Proteins GCN | Proteins GIN | Proteins GS | NCI1 GCN | NCI1 GIN | NCI1 GS |
|---|---|---|---|---|---|---|---|
| 1 | INT4 | 60.86 | 51.04 | 65.77 | 62.68 | 70.91 | 75.90 |
| 2 | INT4 | 66.06 | 63.96 | 67.01 | 66.14 | 77.33 | 76.50 |
| 3 | INT4 | 70.15 | 70.61 | 69.67 | 65.09 | 78.49 | 76.43 |
| 1 | INT8 | 73.34 | 72.88 | 73.03 | 80.81 | 81.60 | 78.88 |
| 2 | INT8 | 76.05 | 74.61 | 74.22 | 80.86 | 81.84 | 79.10 |
| 3 | INT8 | 75.94 | 74.86 | 74.00 | 80.91 | 81.88 | 79.16 |

In the tables, we can observe that TopGQ is stable in overall quantization performance, with a mild preference for particular values of k in several settings.

As for the question on low-rank approximations, TopGQ does not use any technique associated with that concept. Could we kindly ask the reviewer for more details about the question? Scale Absorption may have caused confusion, as its illustration looks similar, so we include an additional clarification of the method below.

Scale Absorption enables node-wise quantization of XX in the matrix multiplication A×XA \times X to preserve the precision of XX during quantization. Repetitive aggregation in GNN layers induces node-wise outlier activations, as seen in Figures 5 and 6 of our paper. Feature-wise (column-wise) quantization of XX, required for integer matrix multiplication, degrades precision due to outlier-influenced scales in each column. To address this, Scale Absorption integrates the precalculated quantization scale of XX into the edge weights of AA before inference, allowing efficient multiplication between quantized AA and node-wise quantized XX.
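As a concrete illustration of the node-wise quantization described above, here is a small sketch assuming simple symmetric absolute-max scaling per node (the exact quantizer and any asymmetric or group-wise details in the paper may differ):

```python
import torch

def nodewise_quantize(X: torch.Tensor, num_bits: int = 8):
    """Row-wise (per-node) symmetric quantization of the feature matrix X.
    Each node receives its own scale, so a single outlier node no longer
    inflates the scale of an entire feature column."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = X.abs().amax(dim=1).clamp(min=1e-8) / qmax              # [N] per-node scales
    X_q = torch.round(X / scales.unsqueeze(1)).clamp(-qmax, qmax).to(torch.int8)
    return X_q, scales  # the scales are later absorbed into the edge weights of A
```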

Q7. Outliers might still exist within topologically grouped nodes, especially in large-scale graphs. How does this affect quantization quality?

\to We are currently working on illustrating the connection between outliers and quantization quality in large-graph datasets, as we strongly believe the context will provide deeper insights about the performance of TopGQ and enhance its understanding. We will promptly let the reviewer know as soon as the analysis is ready to report.

Q8. Table 5 needs to provide results for 4-bit as well.

\to Practical INT4 deployment is currently a challenge due to the lack of public INT4 sparse matrix multiplication (SPMM) kernels supporting channel-wise asymmetric quantization. Note that building a kernel that can fully utilize the INT4 support of Tensor Cores (which do not support SPMM operations) is difficult enough to be recognized as a contribution worthy of its own paper. [1] has succeeded in building an INT4 kernel, but it only supports naive per-tensor symmetric quantization and would need major modifications to accommodate our scheme.

Nevertheless, Table 5 of TopGQ shows that TopGQ can accelerate inference in integer formats when paired with appropriate kernels supporting integer operations. Additionally, we improved our inference kernel to resolve speed concerns, and therefore provide a new table that presents enhanced full-batch inference time. This updated table is now included as Table 5 in the revised version of TopGQ.

| Method | Type | Bit | Reddit (s) | Speedup | OGBN-Products (s) | Speedup |
|---|---|---|---|---|---|---|
| - | - | FP32 | 1.41 | - | 1.45 | - |
| Degree-Quant | QAT | INT8 | 1.22 | 1.15× | 1.30 | 1.12× |
| A2Q | QAT | INT8 | 1.30 | 1.08× | 1.78 | 0.82× |
| SGQuant | QAT | INT8 | 1.25 | 1.13× | 1.31 | 1.11× |
| TopGQ | PTQ | INT8 | 1.24 | 1.13× | 1.30 | 1.11× |

[1] Wang, Yuke, Boyuan Feng, and Yufei Ding. "QGTC: accelerating quantized graph neural networks via GPU tensor core." Proceedings of the 27th ACM SIGPLAN symposium on principles and practice of parallel programming. 2022.

We anticipate that future kernels leveraging INT4 operations will better support efficient GNN inference.

Comment

Q9. An ablation study comparing the use of indegree and Wiener index with other centrality measures (e.g., closeness or betweenness) would provide insights into the robustness of TopGQ’s topology-based grouping.

\to To further identify the advantages of the localized Wiener index for quantization, we compared the quantization results on the PROTEINS and NCI1 datasets against other graph properties, namely betweenness centrality, closeness centrality, and Katz centrality. The results are below.

The other node centrality measures show suboptimal performance compared to the localized Wiener index, in both INT4 and INT8 settings. We believe this result stems from the unique expressiveness of the localized Wiener index in capturing the local compactness of a node within its k-hop neighbors: a small value indicates dense connectivity among the node's neighbors and relatively rapid propagation of features via message passing. Therefore, TopGQ can effectively group node features with distinctive ranges, as shown in Figure 2 in the paper, leading to enhanced quantization quality. We added this in Section 6.6 of the revised version. (An illustrative sketch of the grouping follows the table below.)

| Method | Bit | Proteins GCN | Proteins GIN | Proteins GS | NCI1 GCN | NCI1 GIN | NCI1 GS |
|---|---|---|---|---|---|---|---|
| - | FP32 | 76.19 | 74.79 | 72.87 | 80.41 | 81.46 | 78.46 |
| Degree Centrality only | INT8 | 72.57 | 71.86 | 70.48 | 78.91 | 81.28 | 78.32 |
| + Betweenness Centrality | INT8 | 62.10 | 61.55 | 55.08 | 76.89 | 75.18 | 75.13 |
| + Closeness Centrality | INT8 | 62.48 | 64.96 | 57.33 | 76.49 | 76.68 | 75.85 |
| + Katz Centrality | INT8 | 56.82 | 57.97 | 48.56 | 64.20 | 62.19 | 64.27 |
| + Ours | INT8 | 75.94 | 74.86 | 74.00 | 80.91 | 81.88 | 79.16 |
| Degree Centrality only | INT4 | 56.15 | 45.04 | 50.65 | 60.54 | 69.71 | 75.46 |
| + Betweenness Centrality | INT4 | 59.03 | 54.25 | 50.58 | 63.81 | 67.55 | 70.61 |
| + Closeness Centrality | INT4 | 58.52 | 61.73 | 50.48 | 63.14 | 69.54 | 71.97 |
| + Katz Centrality | INT4 | 53.68 | 55.24 | 44.08 | 57.19 | 57.36 | 57.77 |
| + Ours | INT4 | 70.15 | 70.61 | 69.67 | 67.53 | 78.49 | 76.43 |
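To make the grouping itself concrete, below is a small NetworkX sketch. It assumes the localized Wiener index of a node is the sum of shortest-path distances over all node pairs in its k-hop ego network, and that nodes sharing the same (degree, index) pair form one group; the exact definition, the use of in-degree, and the binning in the paper may differ.

```python
import networkx as nx

def local_wiener_index(G: nx.Graph, node, k: int = 2) -> int:
    """Sum of shortest-path distances between all unordered node pairs inside
    the k-hop ego network of `node` (assumed definition for illustration)."""
    ego = nx.ego_graph(G, node, radius=k)
    lengths = dict(nx.all_pairs_shortest_path_length(ego))
    return sum(d for dists in lengths.values() for d in dists.values()) // 2

def group_nodes_by_topology(G: nx.Graph, k: int = 2):
    """Group nodes by (degree, localized Wiener index); each group would then
    receive its own quantization parameters."""
    groups = {}
    for v in G.nodes():
        key = (G.degree(v), local_wiener_index(G, v, k))
        groups.setdefault(key, []).append(v)
    return groups

# Toy usage:
# G = nx.karate_club_graph()
# print(len(group_nodes_by_topology(G, k=2)), "groups")
```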

Q11. The provided code is just a .py file without any instructions on how to run it and get results, which limits reproducibility. The code should provide a README file with setup instructions, example commands, and a requirements.txt file for dependencies.

\to We thank the reviewer for the helpful feedback. We have updated the code files with detailed instructions for better reproducibility.

Q12. Minor Typos.

\to We sincerely thank the reviewer for identifying and informing us of the misprints throughout the paper. We have corrected the typos in the revised version.

Comment

Now that our experimental results are ready, we provide our response to Q7 below.

Q7. Outliers might still exist within topologically grouped nodes, especially in large-scale graphs. How does this affect quantization quality?

\to To understand the impact of outliers on the quantization quality of TopGQ, we evaluated how outliers influence GNN layer activation quantization methods across different levels of granularity. In the table, the percentage "k%" indicates that the features of the top-k% outlier nodes were excluded from quantization and retained at their original FP32 precision.

The experiment uses the INT4 quantization setting with the GCN architecture on the Reddit dataset.

| Method | Bits | 0% | 1% | 5% | 10% |
|---|---|---|---|---|---|
| FP32 | - | 91.91% | 91.91% | 91.91% | 91.91% |
| No Node Grouping | INT4 | 6.37% | 39.36% | 62.50% | 65.05% |
| Node Grouping with Only Indegree | INT4 | 78.87% | 80.28% | 81.25% | 82.25% |
| TopGQ | INT4 | 83.02% | 83.09% | 83.49% | 83.98% |

(Columns: percentage of outlier nodes kept in FP32.)

As shown in the table, quantization without any node grouping strategy experiences significant degradation, with performance improving sharply (by up to 58.68%p) as more outlier nodes are excluded from quantization. Quantization with only node indegree information shows a similar trend, with a smaller accuracy gap of 3.38%p. Both settings are therefore relatively sensitive to quantizing the outliers.

In contrast, TopGQ’s node grouping approach exhibits robustness, with an accuracy gap of no more than 1%. This result clearly demonstrates that TopGQ effectively mitigates the impact of outliers on quantization by its node grouping, ensuring stable and high-quality activation quantization even in the presence of extreme values. TopGQ effectively separates and quantizes outliers, maintaining overall quantization quality even with their inclusion in quantization.
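For reference, a hypothetical sketch of how the "top-k% outliers kept in FP32" setting can be constructed (our illustration only; the actual experiment may select outliers differently, e.g., by activation statistics after aggregation):

```python
import torch

def fp32_outlier_mask(X: torch.Tensor, pct: float) -> torch.Tensor:
    """Boolean mask of nodes to keep in FP32: the top `pct` fraction of nodes,
    ranked by their maximum absolute feature value; the rest are quantized."""
    mask = torch.zeros(X.size(0), dtype=torch.bool)
    if pct <= 0.0:
        return mask
    k = max(1, int(pct * X.size(0)))
    outlier_idx = torch.topk(X.abs().amax(dim=1), k).indices
    mask[outlier_idx] = True
    return mask  # True = keep FP32, False = quantize
```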

Comment

I thank the authors for putting in the effort and providing more results. I think the paper needs major modifications from the initial submission. The authors need to apply the given comments to make the paper stronger. With that, I keep my score.

Comment

Dear Reviewer PWvf,

We would like to thank the reviewer for all the constructive feedback which helped improve and analyze TopGQ in various aspects. We hope that the provided experimental results have addressed the questions and concerns raised by the reviewer regarding TopGQ. We also welcome any additional suggestions the reviewer may have for further modification of our paper.

Thank you again for your time and effort.

Public Comment

Dear Authors,

Thank you for your work on TopGQ. Could you share the code or details of the kernel used to measure runtimes for TopGQ, A²Q, SGQuant, and DQ? This would help verify the benchmarks, especially for A²Q, which reports high speedup factors in the original paper.

Comment

Dear Samir Moustafa,

We deeply appreciate your interest in our work, TopGQ. We will gladly release our kernel source once our paper is accepted.

Sincerely, the authors of TopGQ.

AC Meta-Review

This paper proposes a new method for quantizing GNNs without retraining, through node grouping. Compared with existing methods like Degree-Quant, A2Q, and SGQuant, it achieves faster quantization at a lower cost in accuracy. While the authors have provided extensive experiments to address all the reviewers' concerns during the discussion period, some issues remain even with the updated results, especially the decreased margin over the FP32 results after implementing batch inference for the baseline, and the inconsistency of the observed speedup with that reported in the previous work "Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation". Therefore, while I appreciate the effort the authors put into the rebuttal, some weaknesses remain, and I hope to see better comparisons with existing works in the next version.

Additional Comments from Reviewer Discussion

The reviewers (except PWvf) made concrete points about the weaknesses of the paper. While the authors provided convincing experimental results for most of the points, some concerns remain regarding the inconsistency with the speedups reported by previous papers (as in the discussion of reviewer fkMV).

Final Decision

Reject