HubGT: Fast Graph Transformer with Decoupled Hierarchy Labeling
We propose HubGT as a scalable Graph Transformer enabling fast and hierarchical embeddings.
Abstract
Reviews and Discussion
This paper proposes a hierarchical method to improve the efficiency of graph transformers. In particular, the authors provide an algorithm that is linear in the number of graph edges and nodes in both precomputation and training. The training stage completely removes graph-related computation. The paper also provides a thorough analysis of the complexities of related algorithms.
Strengths and Weaknesses
Strengths:
- Thorough design optimizing several aspects in the computation
- Good analysis of the complexity of different methods.
- Significant speed ups with respect to state-of-the-art methods
Weaknesses:
- It remains a bit unclear where the real benefits lie. Table 2 which is the main experimental result shows that for many of the datasets other methods are better in terms of accuracy. In terms of precomputation, time for epoch, and time for infer there is no clear winner. It is also hard to determine what the overall timing means. Is every epoch for the different methods comparable? Do all methods require the same number of epochs?
- It remains unclear what the competing methods especially the ones with similar performance do. There is some more description in the appendix but also there not much is said about the actual differences with the proposed method.
Improvements:
- In the caption of table 1 it would be good to mention all variables that play a role in the complexity. L, n, F, M etc.
- Indicate in bold / underline which method is best in computation time for the different elements.
- Can't you create a scatterplot like visualization with the tradeoff between acc and computation? Probably separate for training and testing. Your main message is that you can reach a similar accuracy in much less time. A table doesn't easily convey this message.
Questions
See weaknesses
Limitations
Yes
Final Justification
The authors have addressed my questions and overall I think this is an interesting paper worthy of presentation at Neurips. Hence I raised my score to accept.
Formatting Issues
No layout problems
We thank the reviewer for the thorough review and suggestions. We elaborate on the comments with further details below.
W1
It remains a bit unclear where the real benefits lie. Table 2 which is the main experimental result shows that for many of the datasets other methods are better in terms of accuracy. In terms of precomputation, time for epoch, and time for infer there is no clear winner. It is also hard to determine what the overall timing means. Is every epoch for the different methods comparable? Do all methods require the same number of epochs?
In this paper, we aim to present an efficient GT with novel tokenization and positional encoding (PE). We elaborate on our considerations of model performance and efficiency following the priorities below:
- Most importantly, the model should be scalable enough, that is, capable of completing training within a reasonable time without incurring OOM. Compared to the baselines, our HubGT achieves favorable GPU memory and time scalability, including simple mini-batching as well as fast precomputation and PE querying. We provide a more thorough discussion regarding the scalability of HubGT in Review 1EVJ.W3.
- Next, we expect the model to achieve competitive accuracy across different datasets, which relates to the effectiveness of the batching strategy and PE. As discussed in Section 6.2, the permutation-invariant batching, the hierarchy in tokens, and the dense PE contribute to HubGT's performance on both homophilous and heterophilous datasets.
- Under comparable accuracy, we comprehensively consider the model efficiency, especially training and inference time. Distinguishing the two is essential for applications with varied frequencies of training/inference. In particular, if model inference is significantly slower than training, as for DIFFormer, PolyNormer, and SGFormer in our evaluation, the efficiency benefit is less practical.
- Regarding training time, the average epoch time is preferred over the total training time. This is because: (1) GTs with similar Transformer architectures usually share a similar convergence speed; in our experiments, most GTs converge within 100–200 epochs, so the epoch time is comparable. (2) The total training time is easily affected by other factors such as additional computation and stopping criteria, rendering it hard to present a fair comparison. The epoch time is also used in previous works such as [40]. We will include more detailed discussions and experimental results in the revised version.
W2
It remains unclear what the competing methods especially the ones with similar performance do. There is some more description in the appendix but also there not much is said about the actual differences with the proposed method.
Overall, the design space of GTs includes batching strategy, tokenization, positional encoding (PE), and Transformer architecture. We categorize the baselines into two categories mainly based on the Transformer architecture. We further compare HubGT with the baselines in each category:
- Kernel-based GTs: These models are signified by the graph kernels in their attention mechanism, i.e., utilizing the graph structure to modify Eq. (2) computation. Different models employ their distinct kernels, resulting in varying complexity and empirical performance. Corresponding to the dependency on topology, they entail neighborhood sampling as the batching strategy.
- Hierarchical GTs: These models use the standard Transformer architecture and choose to retrieve graph information by tokenization and PE. The difference lies in the design of tokenization, exhibiting varied utilization of graph topology. For example, PolyFormer uses the coarsened graph as tokens, while ANS-GT adaptively samples nodes in the induced subgraph to form each token. Before HubGT, most of these models use the graph Laplacian as PE. The common batching strategy is random sampling among the tokens.
- Other models, such as SGFormer, do not include the typical multi-head attention and hence are named non-GT models.

HubGT is considered a hierarchical GT, and is thus mainly compared to this category in the empirical evaluation. The novelty of HubGT lies in tokenization and PE designs highlighting both efficacy and efficiency: it exploits the label graph hierarchy as tokens to introduce global information, and is the first to introduce the dense SPD PE on large graphs, while both parts can be efficiently computed.
In addition, we also highlight the differences in the following aspects:
- Complexity: We consider the time and space complexity as the most inherent differences among the GTs, which can largely predict their empirical efficiency. As analyzed in Table 1, the complexity of HubGT is among the best. For the two most important terms dominating scalability: (1) GPU memory: HubGT completely removes graph-related terms, which is scalable in preventing OOM. (2) Training time: HubGT's complexity is linear in the node size $n$, which is more scalable than models linear in the edge size $m$, as $m$ increases faster on large graphs.
- Empirical performance: As detailed in Section 6.2, overall, the efficiency of HubGT is the best among hierarchical GTs, which are directly comparable. Considering the limitations of other GTs as in W1.(3), its efficiency gains are also significant.
We will include more details in Section 2 and Appendix B in the revised version.
I1&I2
In the caption of table 1 it would be good to mention all variables that play a role in the complexity. L, n, F, M etc.
Indicate in bold / underline which method is best in computation time for the different elements.
We thank the reviewer for the suggestion and will improve the notation in Table 1: $L$ and $F$ are the depth and width of the model, respectively, $d$ is the average degree of the graph, and $s$ is the sample size for subgraph-based methods. The critical terms affecting scalability are mainly the node size $n$ and edge size $m$ of the graph. Between these two, $m$ is more significant as it increases faster and becomes the bottleneck on large graphs.
I3
Can't you create a scatterplot like visualization with the tradeoff between acc and computation? Probably separate for training and testing. Your main message is that you can reach a similar accuracy in much less time. A table doesn't easily convey this message.
We thank the reviewer for the useful comment. As adding new figures is not possible during the rebuttal period, we will add this scatter plot as a visualization of Table 2 in the revised version. We will also provide more comprehensive comparisons distinguishing training and inference performance using a radar plot.
In general, from the efficacy perspective, HubGT achieves competitive accuracy on most datasets, while most kernel-based GTs exhibit accuracy drop on at least one dataset. The efficiency of HubGT is remarkable, with the best memory scalability and fast speed, especially superior to hierarchical baselines. Hence, we believe it achieves a favorable efficiency-accuracy trade-off.
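For illustration, such an accuracy-versus-time scatter could be produced as in the minimal matplotlib sketch below; all model names and numbers are placeholders for layout only, not results from Table 2, and a second panel using inference time would follow the same pattern.

```python
# Minimal sketch of the suggested accuracy-vs-time scatter plot.
# All names and numbers below are placeholders, NOT results from the paper.
import matplotlib.pyplot as plt

models = ["Model-A", "Model-B", "Model-C"]   # placeholder labels
epoch_time = [0.2, 0.5, 1.5]                 # training time per epoch in seconds (placeholder)
accuracy = [86.0, 85.5, 86.3]                # accuracy in percent (placeholder)

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(epoch_time, accuracy)
for name, x, y in zip(models, epoch_time, accuracy):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xscale("log")                         # timings often span orders of magnitude
ax.set_xlabel("Training time per epoch (s)")
ax.set_ylabel("Accuracy (%)")
fig.tight_layout()
fig.savefig("acc_vs_time.png", dpi=200)
```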
The paper introduces HubGT, a novel Graph Transformer (GT) framework that addresses the scalability bottlenecks of existing GTs. The paper introduces a decoupled hub-based hierarchical labeling scheme that supports efficient subgraph sampling and distance-based positional encoding (SPD). The method avoids graph operations during training and enables O(1) SPD querying, making it suitable for large-scale graphs with millions of nodes and edges. Empirical results on 14 benchmark datasets demonstrate state-of-the-art efficiency and strong predictive performance, especially in heterophilous and sparse graphs.
Strengths and Weaknesses
Strengths of the Paper:
- The paper introduces a 3-level hub-labeling scheme (H-0, H-1, H-2) to capture both local and global graph structures. It leverages Pruned Landmark Labeling for efficient index construction and SPD approximation and cleanly decouples precomputation from training to support mini-batch learning without graph traversal.
- The framework achieves linear-time precomputation and training complexity independent of edge count. It consistently demonstrates the fastest inference speed across all large-scale benchmark datasets.
- The paper enables the use of Shortest Path Distance (SPD) as a dense positional encoding, a strategy previously too expensive for large graphs, effectively enhancing representation power, especially under heterophily.
- The model is evaluated on 14 benchmark datasets covering a wide range of scales and homophily levels. Ablation studies confirm the importance of SPD bias, global hub connections, and attention-based node readout for performance gains.
Weaknesses of the Paper:
- The core Transformer module is largely standard. The proposed approach could benefit from integrating efficient attention mechanisms to improve scalability.
- The hub-label index is constructed in a one-time pre-computation phase and remains fixed. This design may be inadequate for dynamic or temporal graphs, where incremental updates to the label hierarchy are necessary.
- While the focus is on Graph Transformers, the paper lacks a direct comparison with recent scalable convolutional GNNs, which also address large-scale and heterophilous settings. The only non-GT baseline included is SGFormer, which limits the scope of empirical benchmarking.
- The paper does not provide an in-depth analysis of hub selection, such as the statistical properties of selected hubs, cross-dataset patterns, or their structural significance, which limits the interpretability of the hub hierarchy used in H-1 and H-2 indices.
- Model performance depends on choices like subgraph size and the balance between "s_in" and "s_out". While the impact is partially evaluated (Figure 7), the method lacks adaptive mechanisms to optimize these parameters across diverse datasets.
Questions
- Can the authors provide an in-depth analysis of hub selection, such as the statistical properties of selected hubs, cross-dataset patterns, or their structural significance?
- Can the authors provide evidence or discuss how the hub label index could be updated incrementally in dynamic or temporal graph settings?
- Could the authors expand the experimental comparison to include scalable non-GT baselines?
- Is it feasible to combine HubGT with efficient Transformer variants such as FlashAttention or Linformer, and what are the anticipated gains or trade-offs?
Limitations
Yes
Final Justification
The authors have addressed part of my concerns on the weaknesses. I would like to keep my current score.
Formatting Issues
No
We sincerely appreciate the reviewer for the insightful comments and detailed suggestions. We address all comments and related questions below.
W1&Q4
The core Transformer module is largely standard. The proposed approach could benefit from integrating efficient attention mechanisms to improve scalability.
Is it feasible to combine HubGT with efficient Transformer variants such as FlashAttention or Linformer, and what are the anticipated gains or trade-offs?
Yes. While we utilize the standard Transformer in the paper, as it is orthogonal to our main contributions on tokenization and positional encoding, HubGT is compatible with other Transformer architectures. As in Section 5.2, this flexibility is a remarkable advantage of the decoupled design, enabling seamless integration with efficient Transformers.
To demonstrate, we implement FlexAttention [R1] to replace the standard Transformer, as prior methods such as FlashAttention and Linformer do not support trainable attention bias. The results are presented below. Overall, FlexAttention achieves a 10–20× speedup and significant reductions in GPU memory footprint, while the accuracy remains comparable, verifying the benefits of improving the Transformer architecture.
While the improvement is promising, the speedup is less substantial compared to [R1] results, mainly because the speed is also affected by I/O and computation of PE in HubGT. As our current implementation is highly straightforward, better pipelining may further improve efficiency performance. We will further enhance the implementation and include more evaluations to showcase the benefits of efficient Transformers in the revised version.
| Dataset | cora | | | | genius | | | |
|---|---|---|---|---|---|---|---|---|
| Model | Epoch (s) | Infer (s) | VRAM (GB) | Acc (%) | Epoch (s) | Infer (s) | VRAM (GB) | Acc (%) |
| HubGT | 0.20 | 0.018 | 2.5 | 86.0 | 23.7 | 2.58 | 14.3 | 89.8 |
| HubGT w/ FlexAttention | 0.14 | 0.013 | 1.8 | 86.9 | 21.5 | 2.15 | 1.5 | 89.1 |
[R1] FlexAttention: A Programming Model for Generating Fused Attention Variants. MLSys'25.
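To make the integration above concrete, the following is a minimal sketch of how a trainable SPD bias could be injected through FlexAttention's score_mod hook (PyTorch 2.5+). It is not the implementation used for the table above; the tensor names, shapes, and the bucketed bias table are assumptions, and in practice flex_attention is typically wrapped in torch.compile for speed.

```python
# Hedged sketch: trainable SPD attention bias via FlexAttention's score_mod.
# Shapes, names, and the bucketed bias table are illustrative assumptions.
import torch
import torch.nn as nn
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 4, 4, 128, 32        # batch of subgraphs, heads, tokens per subgraph, head dim
MAX_SPD = 16                      # assume SPD values are clipped into [0, MAX_SPD)

q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)
spd = torch.randint(0, MAX_SPD, (B, S, S))        # precomputed pairwise SPD per subgraph
spd_bias = nn.Parameter(torch.zeros(H, MAX_SPD))  # learned bias per head and distance bucket

def spd_score_mod(score, b, h, q_idx, kv_idx):
    # add a learned bias that depends on the shortest-path distance between the two tokens
    return score + spd_bias[h, spd[b, q_idx, kv_idx]]

out = flex_attention(q, k, v, score_mod=spd_score_mod)  # [B, H, S, D]
out.sum().backward()                                    # gradients reach the captured bias table
print(spd_bias.grad.shape)                              # torch.Size([4, 16])
```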
W2&Q2
The hub-label index is constructed in a one-time pre-computation phase and remains fixed. This design may be inadequate for dynamic or temporal graphs, where incremental updates to the label hierarchy are necessary.
Can the authors provide evidence or discuss how the hub label index could be updated incrementally in dynamic or temporal graph settings?
Yes. Hub labeling for dynamic graphs has been studied in previous works such as [R2]. In brief, it prevents full index updates and only updates the affected labels on candidate paths.
Due to implementation constraints, we are unable to provide experimental evidence for utilizing the algorithm for dynamic GT. Nonetheless, we consider it a promising direction and will include further discussions in the revised version.
[R2] Dynamic and Historical Shortest-Path Distance Queries on Large Evolving Networks by Pruned Landmark Labeling. WWW'14.
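For intuition only, the sketch below illustrates the general idea of such incremental maintenance on an undirected graph: after inserting an edge, resume pruned searches only for hubs that already label the two endpoints instead of rebuilding the index. This is a rough conceptual illustration, not the algorithm of [R2], and it is not claimed to preserve a minimal or canonical index in all cases; `adj` and `labels` are hypothetical dictionaries (adjacency sets and per-node hub-distance maps).

```python
# Conceptual sketch of incremental label maintenance after inserting edge (a, b).
# NOT the algorithm from [R2]; data structures are hypothetical and simplified.
from collections import deque

def query(u, v, labels):
    # 2-hop query over common hubs (Eq. (4) style)
    best = float("inf")
    for h, d1 in labels[u].items():
        d2 = labels[v].get(h)
        if d2 is not None:
            best = min(best, d1 + d2)
    return best

def resume_pruned_bfs(hub, start, start_dist, adj, labels):
    # resume the hub's pruned BFS from `start`, keeping only labels that improve queries
    frontier = deque([(start, start_dist)])
    while frontier:
        u, d = frontier.popleft()
        if query(hub, u, labels) <= d:   # prune: existing labels already cover (hub, u)
            continue
        labels[u][hub] = d
        for w in adj[u]:
            frontier.append((w, d + 1))

def on_edge_insert(a, b, adj, labels):
    adj[a].add(b); adj[b].add(a)
    # only hubs already labeling an endpoint can gain shorter paths through the new edge
    for hub, d in list(labels[a].items()):
        resume_pruned_bfs(hub, b, d + 1, adj, labels)
    for hub, d in list(labels[b].items()):
        resume_pruned_bfs(hub, a, d + 1, adj, labels)

# toy usage on a path 0-1-2-3 with an existing PLL-style index
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
labels = {0: {0: 0, 1: 1}, 1: {1: 0}, 2: {1: 1, 2: 0}, 3: {1: 2, 2: 1, 3: 0}}
on_edge_insert(0, 3, adj, labels)
print(query(0, 3, labels))   # 1, via the newly inserted edge
```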
W3&Q3
While the focus is on Graph Transformers, the paper lacks a direct comparison with recent scalable convolutional GNNs, which also address large-scale and heterophilous settings. The only non-GT baseline included is SGFormer, which limits the scope of empirical benchmarking.
Could the authors expand the experimental comparison to include scalable non-GT baselines?
We have conducted additional experiments on 4 GNNs, including two heterophilous MPNNs GCNJK and MixHop applicable to neighborhood sampling, and two efficient spectral GNNs TFGNN and S2GNN. The results are presented in Review UKq4.W1. In summary, while convolutional GNNs achieve faster epoch speed, they particularly suffer from issues including slower convergence, less scalable memory overhead, and performance drop under mini-batching.
W4&Q1
The paper does not provide an in-depth analysis of hub selection, such as the statistical properties of selected hubs, cross-dataset patterns, or their structural significance, which limits the interpretability of the hub hierarchy used in H-1 and H-2 indices.
Can the authors provide an in-depth analysis of hub selection, such as the statistical properties of selected hubs, cross-dataset patterns, or their structural significance?
An empirical study regarding the hub labeling properties is presented in Appendix D.3, including the label size and the visual clustering effect. Here, we would like to present more details with varying maximum label size when constructing labels. A similar study on synthetic graphs is presented in Review c6sP.Q2. We note the following observations:
- We compute the average clustering coefficient before and after hub labeling to evaluate the change in graph topology (a minimal sketch of this measurement is given after the list). Similar to the results in Figure 1, hub labeling can flatten the node distribution by increasing the clustering property on less clustered graphs such as cora. Conversely, for highly clustered graphs and larger label sizes, the clustering coefficient is slightly reduced, as additional label connections make the graph denser with a more dispersed edge distribution.
- For small graphs such as cora, the average H-1 label sizes do not vary with different upper bounds of the label size, which implies that the hubs are sufficiently connected to graph nodes. On the larger physics, increasing a small bound can increase the actual label size, effectively incorporating more remote nodes to form the subgraph. However, the increase is less significant for larger bounds, which indicates that the hubs are already sufficiently connected.
- As the H-0 label size is based on the H-1 construction, it also exhibits less increase with a larger bound, as the larger H-1 is able to cover more SPD pairs and suppress the need for H-0. For most configurations, the H-0 size is significantly smaller than the theoretical bound, ensuring empirical space efficiency.
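A minimal sketch of the clustering-coefficient measurement in the first bullet is given below; it assumes an undirected NetworkX graph and treats each hub-label connection as an added edge, with placeholder data standing in for the real datasets and labels.

```python
# Minimal sketch: average clustering coefficient before and after adding label edges.
# The graph and the label connections below are placeholders, not HubGT's actual output.
import networkx as nx

G = nx.karate_club_graph()                    # stand-in for a real dataset
cc_before = nx.average_clustering(G)

label_edges = [(0, 33), (5, 33), (11, 30)]    # placeholder (node, hub) label connections
G_labeled = G.copy()
G_labeled.add_edges_from(label_edges)         # treat each label entry as a "highway" edge
cc_after = nx.average_clustering(G_labeled)

print(f"avg clustering before: {cc_before:.3f}, after labeling: {cc_after:.3f}")
```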
Due to the restriction on including figures during rebuttal, we will present more results regarding the relationship between hub properties and node distribution in the revised paper.
| Dataset | Max Label Size | Avg Degree | Clustering Coef. (raw) | Clustering Coef. (labeled) | H-1 Size | H-0 Size | Index Size |
|---|---|---|---|---|---|---|---|
| cora | 24 | 3.90 | 0.241 | 0.648 | 2.4 | 2.5 | 6.2 MB |
| cora | 48 | 3.90 | 0.241 | 0.621 | 2.4 | 2.5 | 6.2 MB |
| cora | 72 | 3.90 | 0.241 | 0.584 | 2.4 | 2.5 | 6.2 MB |
| physics | 24 | 14.38 | 0.377 | 0.302 | 22.6 | 135.0 | 118 MB |
| physics | 48 | 14.38 | 0.377 | 0.306 | 40.6 | 230.9 | 147 MB |
| physics | 72 | 14.38 | 0.377 | 0.300 | 55.9 | 294.6 | 170 MB |
W5
Model performance depends on choices like the subgraph size and the balance between s_in and s_out. While the impact is partially evaluated (Figure 7), the method lacks adaptive mechanisms to optimize these parameters across diverse datasets.
As detailed in Review c6sP.Q1, we set the subgraph size to be sufficiently large to ensure indexing speed and receptive field. The ratio between s_in and s_out is mainly affected by dataset properties such as heterophily, so we consider it a hyperparameter to be tuned.
In addition, our complexity and experimental analysis of GTs with adaptive schemes such as ANS-GT imply that such mechanisms come with additional overhead. Hence, we consider efficient, adaptive tokenization/positional encoding a promising yet challenging task for GT research.
This paper proposes HubGT, a novel and efficient Graph Transformer architecture that leverages hierarchical hub labeling to enable scalable training and inference on large graphs. The key idea is to decouple the graph computation from the transformer training by introducing a three-level indexing structure for precomputing neighborhood and shortest-path information. The authors demonstrate both theoretical and empirical improvements over existing Graph Transformer baselines, especially on large-scale graphs under both homophily and heterophily settings.
Strengths and Weaknesses
Strengths
- The idea of decoupling graph computation from transformer training is novel and leads to significant efficiency gains.
- The three-level hierarchical hub labeling scheme (local, global, and 2-hop caching) is theoretically well-grounded and empirically effective.
- The paper conducts comprehensive experiments on 14 datasets, including million-scale graphs, showing consistent improvements in speed and accuracy compared to state-of-the-art graph transformers.
Weakness
The evaluation only includes comparisons against Graph Transformer baselines. Stronger message-passing GNNs are missing, which limits understanding of the accuracy-efficiency tradeoff relative to GNNs.
Questions
NA
Limitations
yes
Final Justification
I think my score is reasonable.
Formatting Issues
NA
We thank the reviewer for the thoughtful comments. Below, we present more details on experimental comparisons.
W1
The evaluation only includes comparisons against Graph Transformer baselines. Stronger message-passing GNNs are missing, which limits understanding of the accuracy-efficiency tradeoff relative to GNNs.
In the paper, we include SGFormer as a strong baseline representing message-passing GNNs. Here we conduct experiments on 4 additional GNNs, including two heterophilous MPNNs GCNJK and MixHop with neighborhood sampling (RS), and two efficient spectral GNNs TFGNN and S2GNN. Results on precomputation time, training epoch time, peak GPU memory, and accuracy are shown below. In general, we note the following differences for MPNNs:
- GNNs using the full graph topology for propagation usually entail memory overhead, which is less scalable. This can be observed from the OOM issue on datasets such as reddit. Similar to GTs, this can be addressed by mini-batching with RS, but it risks hindering efficacy, as for GCNJK and MixHop on ogbn-arxiv.
- Empirically, MPNNs entail more training epochs, with up to 1000 epochs in practice. In comparison, most GTs converge within 100–200 epochs. Consequently, while the epoch time seems shorter for MPNNs, the total training time is on par with GTs.
We will include more results extending Table 2&5 in the revised version. As suggested by Review Hvuy.I3, we will add new visualizations to study the accuracy-efficiency tradeoff under varying graph scales in the revised version.
| Dataset | reddit | | | | ogbn-arxiv | | | | penn94 | | | | twitch-gamer | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) |
| GCNJK | / | 0.83 | 12.0 | 89.0 | / | 0.29 | 2.8 | 60.0 | / | 0.35 | 4.6 | 65.9 | / | 0.15 | 7.3 | 61.5 |
| MixHop | / | 0.79 | 2.3 | 91.6 | / | 0.21 | 6.0 | 60.3 | / | 0.31 | 1.5 | 75.0 | / | 0.11 | 1.5 | 52.1 |
| TFGNN | OOM | OOM | OOM | OOM | / | 0.11 | 1.2 | 75.2^ | / | 0.19 | 1.9 | 75.8 | / | 0.15 | 0.7 | 59.5 |
| S2GNN | OOM | OOM | OOM | OOM | 195.4 | 0.34 | 6.3 | 69.9 | 148.1 | 0.42 | 6.9 | 75.1 | 668.6 | 1.2 | 15.9 | 65.9 |
^ Results from the original paper since configurations are not provided.
[GCNJK] Representation learning on graphs with jumping knowledge networks. ICML'18.
[MixHop] MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. ICML'19.
[TFGNN] Polynomial Selection in Spectral Graph Neural Networks: An Error-Sum of Function Slices Approach. WWW'24.
[S2GNN] Spatio-Spectral Graph Neural Networks. NeurIPS'24.
Thank you for the responses. I'm a bit confused about the statement: "we include SGFormer as a strong baseline representing message-passing GNNs." As far as I understand, SGFormer is a graph transformer model, not a typical message-passing GNN.
I also appreciate the inclusion of new comparisons against GNN baselines. However, some of the reported results appear inconsistent with publicly available scores. For example, MixHop is reported to achieve 83.47% accuracy on Penn94 in [1], whereas the authors report only 60.3%.
[1] Lim et al., "New Benchmarks for Learning on Non-Homophilous Graphs", 2021.
We thank the reviewer for the comments. We would like to clarify the following two points.
SGFormer
As mentioned in the SGFormer paper Appendix C, their model includes a global attention module and a convolutional GNN network. Hence, we also consider it as incorporating message-passing mechanisms. In our Appendix B.3, we further clarify that SGFormer does not contain the multi-head attention and encoding modules in common GTs. Therefore, its architecture is largely orthogonal to our focus on GT tokenization and positional encoding. We will amend Appendix B.3 to address the difference more rigorously.
MixHop
We would like to note that our evaluation of MixHop and GCNJK is under the mini-batch setting, which is more relevant to Tables 4 & 8 in [1]. We list and compare our results and those of [1] below. Specifically:
- Our result for MixHop on penn94 is 75.0%, which is closely comparable to the 75.6% in Table 4 of [1]. Note that 60.3% is the result on ogbn-arxiv, not penn94.
- The remaining differences are mainly caused by the batching strategy during test time. [1] adopts mini-batch training and full-graph inference by default. However, this configuration causes OOM in some of our experiments, so we also enable mini-batching uniformly during model inference, which can lead to accuracy drops in some cases.
| Evaluation | Model | ogbn-arxiv | penn94 |
|---|---|---|---|
| [1] | MixHop | 64.7 (Table 8) | 75.6 (Table 4) |
| [1] | GCNJK | 66.2 (Table 8) | 72.8 (Table 4) |
| Ours | MixHop | 60.3 | 75.0 |
| Ours | GCNJK | 60.0 | 65.9 |
[1] Lim et al., "New Benchmarks for Learning on Non-Homophilous Graphs", 2021.
The authors describe HubGT, a Graph Transformer (GT) framework designed to tackle scalability issues that hinder the application of traditional GTs to large graphs. HubGT achieves this by decoupling graph structure processing from model training through an offline precomputation step. This step constructs a 3-level hierarchical index based on hubs, which are specially selected nodes with high connectivity that serve as central reference points within subgraphs. The index, inspired by techniques like Pruned Landmark Labeling (PLL), assigns labels to nodes based on their relationships to hubs, enabling efficient encoding of shortest path distances. During training, this hierarchical index allows shortest path distance (SPD) computation in O(1) time and enriches node tokens with global "hub context", enhancing the model's ability to capture broader graph structures. As a result, HubGT simplifies GT training to resemble that of a standard Transformer, supporting CPU/GPU pipelining, reducing memory overhead, and delivering state-of-the-art performance on graph benchmarks.
Strengths and Weaknesses
Some strengths of the paper are obviously related to the benefits of PLL. For example, HubGT’s hub labeling scheme has constant time for SPDs during training. Moreover a nice property is that the hierarchical indexing enriches node tokens with global “hub context,” capturing long-range dependencies and structural patterns. Finally, the idea of decoupling graph structure processing from model training should enable CPU/GPU pipelining which is an advantageous property.
I do note some possible weaknesses also related to hub-based labeling. For example, the dependence on the quality of hub selection could potentially limit generalizability in graphs with uneven degrees or more complex network topologies (e.g. Erdos-Renyi, Barabasi-Albert, Stochastic Block Model). What about secondary effects like thrashing? For example, it is possible that the cache's fixed size limit could thrash in the face of many alternative paths, especially with denser graphs.
In terms of clarity, mostly ok, but some problematic notation, viz. the terms theorem and property are used interchangeably, so property should be used throughout the text as the observations do not require proofs.
Questions
- For practitioners/application developers, how are the sample size and the in/out-neighbor ratio determined heuristically? Are optimal values related to properties of the entire graph (e.g. diameter, average degree, sparsity, etc.)?
- Your hub selection is based on node degree. How robust is this strategy on graph topologies that lack clear high-degree hubs, such as random graphs or stochastic block models?
- Have you experimented with other centrality measures for hub selection, and how does HubGT's performance change when high-degree nodes are not the most effective structural landmarks?
Limitations
Yes
Final Justification
The main contribution of the paper is the distance PE achieved through hub labeling. The tradeoff is the high complexity of deriving hub labels, both for the full graph through PLL or distributed PLL and then computing favorable labels when sampling a subgraph through a root node (advantageous if shortest paths through the root exist, otherwise just the original hub distances), versus the lower complexity of SP distance calculation once hubs are computed. I still have a couple of concerns: 1) This method should not work well for dynamic graphs and also under inductive settings, because the hub labels and distances will change under dynamism and the preprocessing is likely a wasted overhead. 2) I am concerned about the significance of achieving this through hub labeling as opposed to other measures such as centrality; I wish the authors had some theoretical results to back this up, as everything is empirical. 3) Hub labeling under distributed as well as dynamic graph scenarios is very challenging and requires a very different set of algorithms. Thus I am retaining my original score. The authors should definitely include the explanatory comments about challenges in distributed graphs and the new references in their final version.
Formatting Issues
None noticed.
We thank the reviewer for the thorough review. We address all the comments and questions below.
W1
I do note some possible weaknesses also related to hub-based labeling. For example, the dependence on the quality of hub selection could potentially limit generalizability in graphs with uneven degrees or more complex network topologies (e.g. Erdos-Renyi, Barabasi-Albert, Stochastic Block Model). What about secondary effects like thrashing? For example, it is possible that the cache's fixed size limit could thrash in the face of many alternative paths, especially with denser graphs.
We would like to discuss the possibility of secondary effects like thrashing here, and specifically evaluate the application to random graph topologies in Q2.
As introduced in Appendix A, the goal of hub labeling is to add "highway" connections as labels to the graph in order to form a 2-hop cover, where arbitrary node pairs can be reached within 2 hops. On graphs with dense or non-local edge connections, some existing edges can already be regarded as "highway" edges, linking nodes in different local structures. Hence, the number of additional labels is usually small on dense graphs.
In implementation, this is achieved by the H-2 computation, which utilizes a set of global hubs connecting to all nodes in the entire graph. When H-2 effectively captures most SPD queries, the sizes of the H-1 and H-0 indices are significantly reduced. We further study in Q2 that, on these graphs, it is highly likely that SPD is small and multiple shortest paths exist. In other words, only a small portion of hubs is sufficient to form a 2-hop cover for the majority of graph nodes, which prevents potential thrashing.
W2
In terms of clarity, mostly ok, but some problematic notation, viz. the terms theorem and property are used interchangeably, so property should be used throughout the text as the observations do not require proofs.
We notice that this was caused by a bug in our LaTeX hyperreference. We will carefully correct the notation to "Property" in future versions.
Q1
For practitioners/application developers, how are the sample size and the in/out-neighbor ratio determined heuristically? Are optimal values related to properties of the entire graph (e.g. diameter, average degree, sparsity, etc.)?
- Sample size: The subgraph size in GT learning is similar to the sample size in GraphSAGE, indicating the receptive field of the node representation. While ideally at the scale of the average degree, it may also be affected by factors such as heterophily, where more nodes are favored to capture non-local information. Hence, we set it to be sufficiently large and leave the model to learn the important relationships.
- In/Out-neighbor ratio: As analyzed by the three properties and evaluated in Figure 7, the in- and out-neighbor sample sizes are closely related to hub and fringe nodes, respectively. In practice, a larger ratio is hence preferred when non-local information is important, such as under heterophily. Since it is mainly affected by specific dataset properties, we consider it a hyperparameter to be tuned.
[GraphSAGE] Inductive Representation Learning on Large Graphs. NeurIPS'17.
Q2
Your hub selection is based on node degree. How robust is this strategy on graph topologies that lack clear high-degree hubs, such as random graphs or stochastic block models?
Yes. Hub labeling can be effectively applied to graphs with different topologies. Conventionally, hub labeling is developed for realistic graphs such as road networks and social networks, which assume properties such as sparsity and certain locality/hierarchy. The application to random graphs is usually different and is signified by considerable dispersed connections across different local structures. To demonstrate the effectiveness, we present a preliminary study on synthetic graphs with the same number of nodes and varying generator parameters, while a similar evaluation on realistic datasets can be found in Review 1kw8.W4.
- For BA, ER, and CSBM graphs, both H-1 and H-0 label sizes are relatively small, and there are extreme cases where no H-1 and H-0 labels are required when the edge probability is high and the graph is dense. These cases explain W1: the H-2 computation is especially effective, and a fixed number of global hubs is sufficient for SPD computation. In fact, for most node pairs the SPD is at most 2 when the raw graph is sufficiently connected, implying that it is already a 2-hop cover.
- GenCAT is a graph generator closer to realistic graphs with a power-law degree distribution, where a larger value of the degree-distribution parameter indicates a "steeper" distribution with more hubs and fewer long-tailed nodes. In this case, the label sizes are similar to realistic graphs, where more hubs result in smaller H-1 and larger H-0.
In summary, as noted in Property 2&3, the hierarchy can naturally emerge during the hub labeling process. For random graphs with dispersed/non-local connections, the pattern can be effectively utilized by H-2 to compute SPD. We will include more comprehensive discussions and experiments regarding generalizability on different graphs in the revised version.
| Dataset | Parameters | H-1 Size | H-0 Size | Pre. Time |
|---|---|---|---|---|
| CSBM | | 30.1 | 62.3 | 0.02 |
| CSBM | | 0.0 | 0.0 | 0.02 |
| Erdos-Renyi | | 1.6 | 2.1 | 0.02 |
| Erdos-Renyi | | 0.0 | 0.0 | 0.02 |
| Barabasi-Albert | | 1.7 | 1.7 | 0.02 |
| Barabasi-Albert | | 0.5 | 0.6 | 0.06 |
| GenCAT | | 30.6 | 72.7 | 0.04 |
| GenCAT | | 32.8 | 67.3 | 0.03 |
[GenCAT] GenCAT: Generating attributed graphs with controlled relationships between classes, attributes, and topology. Information Systems'23.
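For reference, the random topologies above can be generated with NetworkX as in the hedged sketch below; the node count and generator parameters are placeholders rather than the (omitted) values from the table, the SBM covers only the topology of a CSBM (contextual features are not modeled), and GenCAT requires its own package and is omitted.

```python
# Hedged sketch of generating the synthetic topologies; all parameters are placeholders.
import networkx as nx

n = 10_000  # assumed node count
graphs = {
    "Erdos-Renyi": nx.gnp_random_graph(n, p=0.001, seed=0),
    "Barabasi-Albert": nx.barabasi_albert_graph(n, m=5, seed=0),
    "SBM (topology of a CSBM)": nx.stochastic_block_model(
        [n // 2, n // 2], [[0.002, 0.0005], [0.0005, 0.002]], seed=0
    ),
}
for name, g in graphs.items():
    print(f"{name}: {g.number_of_nodes()} nodes, {g.number_of_edges()} edges")
```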
Q3
Have you experimented with other centrality measures for hub selection, and how does HubGT's performance change when high-degree nodes are not the most effective structural landmarks?
In the following table, we present experimental results of hub selection based on other metrics such as closeness centrality and PageRank.
- Closeness centrality results in larger average label sizes for both H-1 and H-0 indices, which is in line with the observation in [33].
- Degree and closeness centrality lead to similar GT accuracy, while PageRank is less representative for selecting useful hubs.
- As noted in Appendix A.1, the main consideration in hub selection is indexing efficiency. While other measures may be useful, there is additional overhead for computing these scores, which makes them less applicable to large graphs.
- Hence, we still recommend degree centrality, which can be efficiently acquired. Alternatively, the number of hub/fringe nodes in the subgraph can be controlled by the in/out-neighbor ratio as in Q1.
| Dataset | Measure | H-1 Size | H-0 Size | Pre. Time | Acc |
|---|---|---|---|---|---|
| cora | Degree | 2.4 | 2.5 | 0.02 | 86.0 |
| cora | Closeness | 9.9 | 12.9 | 0.03+2.59 | 86.0 |
| cora | PageRank | 2.3 | 2.4 | 0.04+0.04 | 85.8 |
| chameleon | Degree | 1.3 | 1.3 | 0.02 | 38.9 |
| chameleon | Closeness | 4.7 | 6.5 | 0.02+0.74 | 39.9 |
| chameleon | PageRank | 1.1 | 1.1 | 0.02+0.06 | 36.0 |
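For reference, the alternative hub orderings compared above can be obtained as in the sketch below; the graph is a stand-in, and the resulting ranking is what would be fed to the labeling procedure in place of the degree order.

```python
# Sketch: ranking candidate hubs by degree, closeness centrality, and PageRank.
import networkx as nx

G = nx.karate_club_graph()   # stand-in for a real dataset

orders = {
    "Degree":    sorted(G.nodes, key=lambda v: G.degree[v], reverse=True),
    "Closeness": sorted(G.nodes, key=nx.closeness_centrality(G).get, reverse=True),
    "PageRank":  sorted(G.nodes, key=nx.pagerank(G).get, reverse=True),
}
for measure, order in orders.items():
    print(measure, "top-5 hubs:", order[:5])
```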
Thanks to the authors for the responses to the questions. As noted, the key contribution of this paper is (Algs 1-3) the possible acceleration due to advantageous labels for the subgraph around some ego node, where such labels only exist if the ego node is on the shortest path between a node pair in the induced subgraph of its hubs. If not, then there is no benefit since the SPD has to be computed via the original PLL hub scheme. Given this, my main concern was the significance of the hub labeling methodology that could uniquely lead to better encoding of important nodes in subgraphs as opposed to other techniques like centrality, since the hub nodes play equivalent roles to clusterheads and it is highly likely that central nodes are hubs. As the authors also note, the label set in PLL is quite dependent on the initial ranking methodology; degree ranking is just a commonly used heuristic. In the case of heterophilous graphs, or even in general for arbitrary ego nodes in random graphs, it is quite possible that the subgraph around the ego node has a majority of nodes that do not produce advantageous labels (shortcuts through the ego node). Thus it is surprising that the subgraph size and the in/out-neighbor ratio are chosen just as hyperparameters and not through some analysis of the type of graph. Since the results are all empirical and not theoretical, it may just be an artifact of the data sets used in the evaluations. Given that the main contribution of the paper is a set of algorithms for hub labeling of subgraphs around a root node on top of the existing PLL-based hubs, I wonder if the authors can go beyond empirical results and provide some theoretical arguments as to the expected number of label shortcuts (and therefore the speedup benefits) for random vs homophilous vs heterophilous graphs around the subgraph of an ego node. What is the expected set of advantageous labels if the ego node is a low-degree node? Clearly, it will not be on many shortest paths in the subgraph and so the overhead should be higher.
The authors in one of the responses claim that this method is amenable to multi-GPU/distributed implementation. When the subgraph is partitioned among multiple distributed nodes/GPUs, what is the overhead of communication across nodes? Is there a need for synchronization between nodes to construct the distributed labeling for the subgraph oriented around the root? PLL on large graphs distributed among multiple nodes has its own extensive body of literature due to challenges in constructing the labels, eliminating duplication of labels, etc., along with synchronization and the overhead of message passing among nodes. Given that, do the authors have a distributed version of the main algorithm, since the claim about the efficiency of a distributed implementation is without proof? For example, when the subgraph is split across multiple nodes and the two vertices of a query pair are on one node while their hub is on another, there seems to be no benefit whether the shortest path goes through that hub or otherwise, since it involves internode communication. Is there a theoretical analysis of optimal hub distribution across nodes/threads and overhead analysis?
We appreciate the comments from the reviewer. We would like to clarify some misunderstandings first:
Given that the main contribution of the paper is a set of algorithms for hub labeling of subgraphs around a root node on top of the existing PLL-based hubs
Our contribution mainly lies in the novel tokenization and PE, which enables efficient GT learning on large graphs. To facilitate this, the subgraph SPD problem and Algs 1-3 are proposed.
If not, then there is no benefit since the SPD has to be computed via the original PLL hub scheme.
The original PLL query of Eq. (4) is not required in Alg 3 querying. Instead, the SPD is precomputed by Alg 2 as H-0, which has a lower construction complexity than the original PLL.
since the hub nodes play equivalent roles to cluster heads and it is highly likely that central nodes are hubs.
Hub labeling is inherently different from clustering, as it adds a number of label edges, while clustering methods only utilize the raw graph and are more sensitive to data distribution.
In the case of heterophilous graphs, or even in general for arbitrary ego nodes in random graphs, it is quite possible that the subgraph around the ego node has a majority of nodes that do not produce advantageous labels (shortcuts through the ego node).
Clearly, it will not be on many shortest paths in the subgraph and so overhead should be higher.
As evaluated in Q2, the overhead in random graphs is actually lower, since global hubs can effectively capture most queries.
C1.Hub Selection
We understand there are concerns that the hub selection process may be affected by data distribution. In our approach, we utilize global and local hubs with considerations of both efficiency and expressiveness, with the primary design goal of facilitating GT learning.
In terms of efficiency, starting from low-degree nodes does not greatly affect the indexing speed, as the theoretical complexity bound still holds. However, it increases the index size. Note that despite varied graph patterns, high-degree nodes are more likely to be on SPs. Hence, even if the indexing starts from low-degree nodes, high-degree ones are still explicitly added to a large number of labels.
As discussed in Sec 3, the effectiveness in GT learning mainly comes from the hierarchy beyond adjacency. Instead of crafting specific subgraph selection strategies for different graph patterns, we choose to build a sufficiently large subgraph and include a wide range of nodes, allowing the model to learn the important relationships through attention. This is also our intuition regarding the hyperparameter values: they are less sensitive as long as there is sufficient information to learn.
Regarding the effectiveness on different graph patterns, hub labeling is closer to graph rewiring, which is well-studied for addressing heterophily [R1-R2], than to clustering. As analyzed in [R3-R4], heterophily is mitigated by increasing overall homophily through edge addition, which is in line with our observation in Sec 3&D.1.
[R1] Make Heterophilic Graphs Better Fit GNN: A Graph Rewiring Approach. TKDE'24.
[R2] Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns. KDD'21.
[R3] Towards Understanding and Reducing Graph Structural Noise for GNNs. ICML'23.
[R4] Revisiting Robustness in Graph Machine Learning. ICLR'23.
C2.Multi-device
The setting in Review 1EVJ.W3 is mainly for multi-device training, where the index and input attributes can be shared across devices once precomputed. In this case, HubGT is superior to previous models especially kernel-based ones, since only node-wise inputs are required and can be easily batched.
In distributed settings where the indices also need to be partitioned for training, part of the index can be divided among devices in a node-wise manner, while the remaining labels and attributes need to be either (1) mirrored or (2) stored centrally and communicated during training, both also in a node-wise manner. (1) is a common practice in distributed hub labeling [R5], where the relevant label rows are copied to the devices holding the corresponding nodes. (2) shares a similar setting with the distributed training of sampling-based GNNs.
Regarding distributed indexing, hub labeling also benefits from its node-centric design [R6]. As mentioned by the reviewer, there is an extensive body of literature, which is largely orthogonal to our contribution on incorporating hub labeling into GTs. Hence, we only mention its possibility in this paper.
[R5] Planting trees for scalable and efficient canonical hub labeling. VLDB'19.
[R6] Pruned Landmark Labeling Meets Vertex Centric Computation: A Surprisingly Happy Marriage! arXiv.
The paper is an approach to deal with scalability of Graph Transformers (GT). This paper uses graph structure to solve two separate issues for scaling GT.
- Since a large graph cannot be held in memory and the attention mechanism is quadratic, naive attention does not work. Mini-batching is a way to deal with large graphs. Batching might only capture parts of the graph, which can be problematic for heterophilous graphs.
- Positional encoding is an important problem to solve for graphs. On regular sequences, the relative position of the tokens leads to natural positional encodings, which do not work for the permutation-invariant structure of graphs.
The area of Graph Transformers uses many techniques from the graph theory literature to achieve efficient representation learning and improved accuracy on large graphs. This paper uses the Pruned Landmark Labelling (PLL) approach by Akiba et al. [33,34] to represent the graph topology. PLL was introduced in 2013-14 in the context of large (web, social, etc.) graphs to answer shortest-path queries between pairs of nodes. It uses the idea of landmark nodes to quickly compute the distance between a pair of nodes. These are used both for batching, to include representation from other nodes (especially in heterophilous graphs), and for positional encoding.
Even though HubGT does not outperform many of the methods (Table 2), it is close to the performance metrics of the other good methods. The idea proposed is good for the field and I would like the authors to address the weaknesses described below.
Strengths and Weaknesses
Strengths: The paper is well written. The time complexity analysis is very useful.
Weaknesses/potential future directions: (I) The paper compares against NAGphormer, SGFormer and DIFFormer, which are good benchmarks for mini-batching. There are at least a couple of approaches that are not cited/compared against in the paper. (1) The first is Exphormer (Shirzad et al. 2023), which uses (a) virtual nodes and (b) an overlay graph (like in HubGT, but using a different intuition). It would be interesting to see the mixing properties due to the two overlay graph approaches. How many additional edges are needed in HubGT, and are they comparable to the degree of the expander graph in Exphormer? How many virtual nodes smooth the network in Exphormer vs. in HubGT? Exphormer might not scale well with minibatching, but SpExphormer might be better in that aspect. So a comparison to HubGT will be valuable.
(2) The second paper that might be relevant is Graphormer (Ying et al., 2021), where structural features (centrality and spatial encoding) for GT was proposed. Graphormer was before the introduction of Long Range benchmarks. I am not sure if there have been studies in applying it to the LongRange benchmark. So a study here will be valuable to understand which structural approaches are more valuable.
For the final version, a discussion and some simple comparison would be valuable. And further work will be useful to understand the space well.
(II) Description of the HLL algorithm. I found it hard to follow the labelling ideas for HubGT. I went back and forth between section 4 and appendix, but had to read the original paper to understand the approach better. The complexity of BFS approach for SPD as described in the paper is clear. I would like an easier-to-follow description of the algorithm --- how pruning works, what labels represent (Appendix A refers to algorithm 1, which uses a 3-tuple. It is too hard to follow). An easier English description would be valuable.
(III) Scaling properties. The authors spend a lot of time on the complexity analysis of the paper. Scaling to REDDIT, ogbn-mag and Pokec is impressive. However, I would also like to see the scalability in terms of memory, compute times and graph sizes. I would like a study (instead of brief paragraphs) on the memory and scalability issues. From the description it is not clear what the limitations are on a single GPU or in a multiGPU setting. A thorough description of the experimentations would be very helpful.
Questions
Described in the weaknesses section above.
Limitations
Not applicable
Formatting Issues
none
We sincerely appreciate the constructive feedback and acknowledgment of our main contributions from the reviewer. We reply to all the points below.
W1
The paper compares against NAGphormer, SGFormer and DIFFormer, which are good benchmarks for mini-batching. There are at least a couple of approaches that are not cited/compared against in the paper. (1) The first is Exphormer (Shirzad et al. 2023), which uses (a) virtual nodes and (b) an overlay graph (like in HubGT, but using a different intuition). It would be interesting to see the mixing properties due to the two overlay graph approaches. How many additional edges are needed in HubGT, and are they comparable to the degree of the expander graph in Exphormer? How many virtual nodes smooth the network in Exphormer vs. in HubGT? Exphormer might not scale well with minibatching, but SpExphormer might be better in that aspect. So a comparison to HubGT will be valuable.
(2) The second paper that might be relevant is Graphormer (Ying et al., 2021), where structural features (centrality and spatial encoding) for GT was proposed. Graphormer was before the introduction of Long Range benchmarks. I am not sure if there have been studies in applying it to the LongRange benchmark. So a study here will be valuable to understand which structural approaches are more valuable. For the final version, a discussion and some simple comparison would be valuable. And further work will be useful to understand the space well.
(1) Exphormer
Difference with HubGT
The large-scale variant of Exphormer is based on GraphGPS [7] according to its paper, exploiting the kernel-based approach to reduce overhead and sharing a similar complexity to that in our Table 1. Additionally, the original evaluation mostly focuses on graphs under homophily, and the performance under heterophily is not presented.
The expander degree in Exphormer is set to a small constant for node classification, i.e., adding a few edges for each node; for comparison, the empirical average label sizes in HubGT on cora and penn94 remain small. Exphormer utilizes a global virtual node, while HubGT uses a set of global hubs. We would like to remark that the selection of both hyperparameters in HubGT is additionally affected by the consideration of indexing efficiency and may be further reduced in GT learning by down-sampling.
Experiments
The efficiency (precomputation time, training epoch time, and peak GPU memory) and accuracy evaluations of Exphormer are presented in the table below, complementary to Table 2&5 in our paper. Note that to prevent OOM in our settings (24GB VRAM cap vs 40GB in the original paper), we reduce the model size, so the results may differ from those originally reported. It can be observed that:
- The efficiency of Exphormer is mostly comparable to kernel-based methods evaluated in our paper, including fast training and less scalable GPU footprint.
- The model accuracy, especially under heterophily, is limited due to the dependency on graph topology and the relatively small model size.
- Empirically, applying the expander graph and virtual nodes (VN) leads to significant overhead in precomputation and training, respectively. We will present more discussions regarding the trade-off between efficiency and accuracy of different Exphormer variants in the revised paper.
| Dataset | cora | | | | ogbn-arxiv | | | | twitch-gamer | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) | Pre (s) | Epoch (s) | VRAM (GB) | Acc (%) |
| Exphormer-w/o VN | 0.1 | 0.05 | 1.2 | 85.3 | 23.3 | 0.8 | 14.8 | 67.9 | 50.8 | 0.9 | 16.3 | 62.3 |
| Exphormer-w/ VN | 0.1 | 0.07 | 1.5 | 85.8 | 32.0 | 1.2 | 18.8 | 67.9 | 41.5 | 1.3 | 19.4 | 64.2 |
(2) Graphormer
Difference with HubGT
Graphormer [2] is a pioneering work introducing structural encoding and the global virtual node. As analyzed in Table 1, it is a full-graph GT, which is usually applied to many small graphs such as molecules.
In Table 5 of the Graphormer paper, it is demonstrated that SPD is better than Laplacian as positional encoding, which aligns with the intuition from our motivating study 2 in Section 3.
Experiments
Even with neighborhood sampling and mini-batching, Graphormer incurs OOM on small datasets such as cora in our environment. The same issue is also observed in other studies such as [9]. Therefore, we are unable to provide further experimental results.
We will include more comprehensive comparisons for both models in the revised version based on the current discussions and experiments.
W2
Description of the HLL algorithm. I found it hard to follow the labelling ideas for HubGT. I went back and forth between section 4 and appendix, but had to read the original paper to understand the approach better. The complexity of BFS approach for SPD as described in the paper is clear. I would like an easier-to-follow description of the algorithm --- how pruning works, what labels represent (Appendix A refers to algorithm 1, which uses a 3-tuple. It is too hard to follow). An easier English description would be valuable.
We will enhance the representation of the hub labeling algorithm in Section 4 and Appendix A to improve clarity. As an overview, we elaborate on the PLL algorithm and the differences with our Algs 1-3 below.
PLL overview
The classical PLL algorithm proposed in [33] aims to address the conventional full-graph SPD query: given two arbitrary nodes $u$ and $v$, the SPD between them can be answered by visiting their labels following our Eq. (4). This is achieved by a two-stage process of indexing as precomputation and then querying.
In the indexing phase, it constructs a label $L(v)$ for each node $v$ by pruned BFS. Each label is a set of tuples $(h, d)$ denoting a hub $h$ and the SPD $d$ between $h$ and $v$. The BFS starting from a hub $h$ and visiting a node $v$ aims to compute $d(h, v)$ based on the existing labels and add it to $L(v)$ during the search. The search is pruned at $v$ when the SPD computed from existing labels is no larger than the BFS distance from $h$, which implies that the shortest path between $h$ and $v$ has already been covered through a better (higher-ranked) hub processed before $h$.
As it intends to answer arbitrary queries, the querying phase needs to iterate over all labels of $u$ and $v$ and find the SPD through the best common intermediate hub.
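As a concrete illustration of the two phases above, the following is a simplified sketch of classical PLL on an unweighted, undirected graph (not HubGT's Algs 1-3); labels are stored as per-node dictionaries mapping a hub to its SPD.

```python
# Simplified sketch of classical PLL: pruned BFS indexing and the 2-hop query (Eq. (4) style).
from collections import deque

def query(u, v, labels):
    common = labels[u].keys() & labels[v].keys()
    return min((labels[u][h] + labels[v][h] for h in common), default=float("inf"))

def build_pll_index(adj, order):
    """adj: {node: set(neighbors)}; order: nodes from highest to lowest rank (e.g. by degree)."""
    labels = {v: {} for v in adj}
    for hub in order:
        frontier, dist = deque([hub]), {hub: 0}
        while frontier:
            u = frontier.popleft()
            d = dist[u]
            if query(hub, u, labels) <= d:   # prune: pair already covered by earlier hubs
                continue
            labels[u][hub] = d               # record (hub, SPD) in u's label
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    frontier.append(w)
    return labels

# toy usage on a path graph 0-1-2-3
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
labels = build_pll_index(adj, order=[1, 2, 0, 3])
print(query(0, 3, labels))                   # 3
```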
Hierarchical Labeling in HubGT
In our paper, we design Algs 1-3 to address a different subgraph SPD problem formulated in line 158: given a root node, we aim to efficiently compute the SPD for node pairs in its induced subgraph.
To achieve this, we include additional elements in the label index in Alg 1, which are used to compute the SPD by Eq. (3). Intuitively, these elements denote the relative positions of neighbors in the subgraph with respect to the root, and can indicate potential "shortcuts" in SPD computation without increasing the label size. The auxiliary variable in Alg 1 represents the search distance between the visited node and the root during construction, and determines which of the two neighbor sets (in- or out-neighbors) a node is added to.
In this case, the prune conditions are lines 11 & 15 in Alg 1, where the former corresponds to the classical PLL, and the latter further prunes when a shortcut through the root exists. The difference between Alg 1 and the bit-parallel PLL [33] is that the strategy in [33] only applies to global hubs for full-graph SPD, while in our setting, each root node can be regarded as a "local" hub within its subgraph. Hence, it can be integrated into the labels to boost querying.
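Schematically, the querying logic described above can be pictured as in the sketch below; this is an illustration of the root-shortcut idea only, not Alg 3 itself, and the per-root data structures (dist_to_root, h0) are hypothetical.

```python
# Schematic sketch of subgraph SPD querying: root shortcut first, H-0 fallback otherwise.
# Not Alg 3 itself; `dist_to_root` and `h0` are hypothetical per-root structures.
INF = float("inf")

def subgraph_spd(u, v, dist_to_root, h0):
    via_root = dist_to_root[u] + dist_to_root[v]    # valid when a shortest path passes the root
    cached = h0.get((u, v), h0.get((v, u), INF))    # H-0 stores the pairs the shortcut misses
    return min(via_root, cached)

# toy usage for a subgraph rooted at r with members {a, b, c}
dist_to_root = {"a": 1, "b": 1, "c": 2}
h0 = {("a", "b"): 1}                                # a and b are directly connected
print(subgraph_spd("a", "b", dist_to_root, h0))     # 1
print(subgraph_spd("a", "c", dist_to_root, h0))     # 3 (through the root)
```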
W3
Scaling properties. The authors spend a lot of time on the complexity analysis of the paper. Scaling to REDDIT, ogbn-mag and Pokec is impressive. However, I would also like to see the scalability in terms of memory, compute times and graph sizes. I would like a study (instead of brief paragraphs) on the memory and scalability issues. From the description it is not clear what the limitations are on a single GPU or in a multiGPU setting. A thorough description of the experimentations would be very helpful.
- GPU Memory: Evaluation of GPU memory on some datasets is available in Table 5, while the others will be included in the revised version. In general, it aligns with our complexity analysis in Table 1, showing that HubGT is highly scalable, as the GPU memory can be arbitrarily adjusted by the batch size to prevent OOM. Our additional experiments with FlexAttention in Review 1kw8.W1 further show that the main GPU memory bottleneck is the standard Transformer architecture, and can be addressed by incorporating more efficient designs. In comparison, GTs such as ANS-GT and Exphormer in W1 incorporate graph-scale data, which leads to OOM on large datasets.
- Computation Time and RAM Memory: The time scalability of HubGT is mainly dominated by the precomputation time, especially the H-1 and H-0 indexing corresponding to the dominant term in the complexity. The RAM memory is dominated by the H-0 index, which is still considerable for large datasets. As suggested by Review Hvuy.I3, we will add a new figure to study model performance under varying graph scales in the revised version.
- Multi-GPU Setting: As HubGT employs the fully decoupled architecture, it benefits multi-GPU and even distributed training. Once the precomputation is completed, data parallelism can be employed with multiple Transformers, each handling a batch of tokens, since node tokens are mutually independent and SPD querying is highly local in storage. One potential limitation may be related to the large H-0 index, rendering index duplication less efficient.
(a) Scientific Claims and Findings
The paper claims to tackle the scalability issue of Graph Transformers by proposing HubGT, a framework that decouples graph processing from model training. Its main findings are: (1) A novel hierarchical hub labeling scheme can efficiently pre-compute and store graph structural information, including local and global connections. (2) This pre-computed index enables O(1) querying of Shortest Path Distance (SPD), making it feasible to use dense SPD as an effective positional encoding on large-scale graphs for the first time. (3) The resulting training process is independent of graph-scale operations, achieving linear complexity, favorable GPU utilization, and the fastest inference speed among baselines on large graphs.
(b) Strengths
- High Efficiency and Scalability: The core strength is the decoupled architecture, which separates the heavy graph computation into a one-time pre-processing step. This design significantly improves training and inference efficiency, reduces GPU memory bottlenecks, and makes the model highly scalable for large graphs.
- Novel and Effective Graph Representation: The paper innovatively uses a hierarchical hub labeling scheme for both node tokenization and enabling dense SPD-based positional encoding. This provides a richer representation of graph structure beyond local adjacency, proving effective on both homophilous and heterophilous graphs.
- Strong Empirical Results: The paper is supported by extensive experiments on 14 benchmark datasets, demonstrating state-of-the-art efficiency (especially inference speed) while maintaining competitive or superior accuracy. The complexity analysis is also thorough.
(c) Weaknesses in the initial rebuttal stage
- Static Graphs Limitation: The pre-computation-based approach is inherently unsuitable for dynamic graphs, as any change in the graph structure would require a costly re-computation of the entire index.
- Hub Selection Strategy: The method relies on a simple degree-based heuristic for hub selection. Reviewers noted that this might not be optimal for all graph topologies and that a deeper analysis or comparison with other centrality measures was initially missing.
- Limited Initial Baselines: The initial submission lacked comparisons to some relevant and strong baselines, including other scalable GTs (e.g., Exphormer) and recent message-passing GNNs.
(d) Reasons for Acceptance
The paper is accepted because it makes a significant and practical contribution to the field of Graph Transformers. Its core idea of decoupling graph computation is powerful and effectively addresses the critical challenge of scalability. The proposed HubGT framework is technically sound and is validated by strong empirical evidence, showing a superior efficiency-accuracy trade-off, especially its state-of-the-art inference speed. During the rebuttal, the authors provided a comprehensive response that successfully addressed the main weaknesses pointed out by the reviewers. They added new experiments against stronger baselines, provided deeper analysis on design choices like hub selection, and integrated a more advanced attention mechanism to demonstrate the framework's flexibility. This thorough rebuttal strengthened the paper significantly and demonstrated the robustness of their method.
(e) Rebuttal Discussion Summary / Rebuttal
The rebuttal period was very productive. The authors provided detailed responses and new experimental results that addressed the reviewers' primary concerns.
• For Reviewer 1EVJ: The reviewer was concerned about missing baselines (Exphormer, Graphormer), the clarity of the HLL algorithm, and the lack of a detailed scalability study. The authors addressed this by adding experimental results for Exphormer, explaining OOM issues with Graphormer, providing a clearer explanation of their HLL algorithm in relation to classic PLL, and elaborating on the model's scalability in terms of memory and computation time.
• For Reviewer c6sP: The reviewer questioned the generalizability of degree-based hub selection and the choice of hyperparameters. The authors responded by evaluating their method on various synthetic random graphs to demonstrate robustness and comparing different centrality measures for hub selection. They also clarified the heuristics for setting hyperparameters. In the final justification, the reviewer maintained their score, noting concerns about dynamic graphs and the empirical nature of the findings, but the authors had already acknowledged the former as a limitation and provided extensive empirical validation for the latter.
• For Reviewer UKq4: The main weakness identified was the lack of comparison with strong message-passing GNNs. The authors added experiments against four additional GNN baselines (GCNJK, MixHop, etc.). A follow-up discussion clarified a results discrepancy, which the authors attributed to different inference settings (mini-batch vs. full-graph) due to hardware constraints.
• For Reviewer 1kw8: The reviewer pointed out several weaknesses, including the standard Transformer core, limitations on dynamic graphs, and missing baselines. The authors demonstrated the flexibility of HubGT by integrating it with FlexAttention, showing significant performance gains. They acknowledged the limitation regarding dynamic graphs as a future direction and addressed the other points in responses to other reviewers. The reviewer was satisfied with the response and maintained their high score.
• For Reviewer Hvuy: The reviewer initially found it hard to see the clear benefits from the tables. The authors elaborated on their evaluation priorities (scalability, accuracy, then efficiency) and the novelty of their approach. They also agreed to add better visualizations (e.g., scatter plots) to convey the accuracy-efficiency trade-off more clearly in the final version. This clarification was convincing, and the reviewer raised score to "Accept".