PaperHub
Score: 6.4 / 10
Decision: Poster · 4 reviewers
Ratings: 5, 5, 4, 2 (lowest 2, highest 5, standard deviation 1.2); average 4.0
Confidence
Novelty: 2.3
Quality: 3.0
Clarity: 2.8
Significance: 2.3
NeurIPS 2025

Relieving the Over-Aggregating Effect in Graph Transformers

OpenReview · PDF
Submitted: 2025-04-16 · Updated: 2025-10-29

Abstract

Keywords
graph transformers, message passing, graph neural networks

Reviews and Discussion

Review
Rating: 5

This paper identifies a new phenomenon, over-aggregating, in which a large volume of messages is aggregated with little discrimination. To address it, the authors propose Wideformer, a plug-and-play method for graph attention. Wideformer divides the aggregation over all nodes into parallel processes and guides the model to focus on specific subsets of these processes.

Strengths and Weaknesses

  • Strengths
  1. The paper identifies a novel phenomenon called "over-aggregating" in graph transformers, where a large volume of messages is aggregated into a single node with less discrimination, leading to potential information loss. This insight is important for understanding and improving graph transformer models.
  2. The Wideformer method is an effective approach to relieve the over-aggregating effect. By dividing the aggregation of all nodes into parallel processes and guiding the model to focus on informative subsets, Wideformer can maintain the global receptive field while reducing information loss.
  3. The paper provides a theoretical analysis to justify the over-aggregating effect and the effectiveness of the proposed Wideformer method. This enhances the credibility of the claims.
  • Weaknesses & Questions
  1. In Sec. 3.3, I noticed that over-smoothing and over-aggregating have different definitions. My first concern: is their influence on graph learning different? I also noticed the sentences in P4.L156-157. Since the authors state that higher entropy suggests similar attention scores, I believe the representations of the nodes/tokens will become similar after this layer. Isn't that a low-pass filtering effect?
  2. When and how do the authors use Algorithm 1? Before training or during training? Are centers in different layers different?
  3. The paper uses a simplistic KMeans++ clustering method to divide the nodes into clusters. A more sophisticated clustering approach based on learned node representations may lead to better performance.
  4. While the experiments cover a diverse set of datasets, the evaluation on large-scale graphs is limited. Evaluating Wideformer on graphs with millions of nodes would provide insights into its scalability.
  5. Question about Algorithm 1. In the loop, $\mathbf{C}$ becomes a matrix, not a vector anymore. I'm confused about it.

Questions

Above.

Limitations

yes

Formatting Concerns

None.

Author Response

Thank you very much for your support and insightful suggestions that improved our manuscript. We hope our response can adequately address your concerns. The discussion will be added to the revision.

W1: Differences between over-smoothing and over-aggregating.

Response:

  • Over-smoothing and over-aggregating have different influences on graph learning.
    • Over-smoothing indicates low-pass filtering among the locally connected nodes.
    • Over-aggregating indicates low-pass filtering among the query-key similar nodes.
    • For instance, consider a high-entropy case where a subset of nodes (set a) receives zero attention, and the rest (set b) are assigned uniform attention weights. In this case, the node representations are averaged over set b, leading to feature smoothing. However, if the nodes in set b are not adjacent in the input graph, the actual locally connected nodes will not be smoothed, and there will not be a low-pass filtering effect on the input graph.
  • Solutions for over-smoothing cannot tackle over-aggregating. Designing high-pass filters can tackle over-smoothing. Conversely, even when a high-pass behavior is desired during over-aggregating, the optimization of attention weights may still be hindered by the overwhelming number of source nodes. One must tackle this issue regarding the input node volumes.
  • L156-157. Thank you for pointing this out. The low-pass filtering in L157 refers to the local smoothing effect on the input graphs. To avoid confusion, we revised L157 as 'does not necessarily imply a low-pass filtering effect among the locally connected nodes as over-smoothing.'

W2&W3: The use of Algorithm 1 and ablation on the center selection methods.

Response: Thank you very much for enlightening us to discover a better center selection method!

  • Algorithm 1 is applied to the target nodes before each attention-based aggregation during training, and thus, there will be different centers in different layers.
  • To ensure computing efficiency, we did not apply the iterative clustering process in the Wideformer but only initiated the centers based on query features via a K-means++-like algorithm. The selection process is based on the node representations.
  • Following your suggestions, we study whether applying clustering iterations will improve the performance.
    • Setting: We compare the simple initiating choice with (1) Multi-step iteration clustering and (2) directly modeling centers as parameters. The results are averaged on amazon-ratings, questions, CoauthorCS, and AmazonPhoto.
    • Results in Tab.R1 show that further applying clustering iterations can achieve better results, indicating the efficiency-performance trade-offs. We also discovered that directly modeling centers as parameters can address this trade-off, which does not require clustering iterations and achieves the best results.
Table R1. Clustering Method Comparison
Method | GraphGPS | SGFormer | Polynormer
Original | 78.91 | 79.54 | 80.84
+Wideformer | 79.26 | 79.98 | 81.07
+iter=2 | 79.93 | 80.58 | 81.44
+iter=4 | 79.95 | 80.42 | 81.43
+iter=6 | 79.85 | 80.55 | 81.52
+as parameters | 80.05 | 80.60 | 81.54

W4: Results on larger graphs.

Response: Following your suggestion, we conducted experiments on ogbn-products and ogbn-proteins [R1] in Tab.R2. Results show that Wideformer can constantly benefit the backbones on these larger datasets (with more nodes or edges), further demonstrating its effectiveness.

Table R2. Comparison on Large-scale Datasets
Dataset | ogbn-products | ogbn-proteins
#nodes | 2,449,029 | 132,534
#edges | 61,859,140 | 39,561,252
GCN | 72.80 | 72.51
GraphGPS | 74.90 | 71.69
+Wideformer | 76.03 | 72.67
SGFormer | 74.47 | 73.53
+Wideformer | 75.32 | 75.44
Polynormer | 77.21 | 73.18
+Wideformer | 78.89 | 73.97

W5 (Q): The use of $\mathbf{C}$ in Alg.1.

Response: Thank you for bringing this to our attention.

  • The 'update centers' step in Algorithm 1 is a concatenation operation. Therefore, after the first concatenation, $\mathbf{C}$ becomes a matrix.
  • To avoid potential confusion, we revise the description of the algorithm by initiating $\mathbf{C}$ as an all-zero $m \times d$ matrix and updating centers by assigning the $t$-th center to the $t$-th row. This ensures that $\mathbf{C}$ is a matrix of the same shape throughout the algorithm.
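For concreteness, below is a minimal sketch of a K-means++-style seeding that keeps the center matrix as a fixed-shape $m \times d$ tensor throughout, in the spirit of the revised description above. The function name, the use of torch, and the sampling details are our assumptions, not the authors' Algorithm 1.

```python
import torch

def seed_centers(Q: torch.Tensor, m: int) -> torch.Tensor:
    """Q: (n, d) query features of the target nodes; returns centers C of shape (m, d)."""
    n, d = Q.shape
    C = torch.zeros(m, d, dtype=Q.dtype, device=Q.device)  # all-zero m x d initialization
    C[0] = Q[torch.randint(n, (1,)).item()]                 # first center: a random query
    for t in range(1, m):
        # squared distance from every query to its nearest already-selected center
        dist2 = torch.cdist(Q, C[:t]).pow(2).min(dim=1).values
        weights = dist2.clamp_min(1e-12)                    # K-means++: sample proportional to distance^2
        C[t] = Q[torch.multinomial(weights, 1).item()]      # assign the t-th center to the t-th row
    return C
```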

[R1] Open Graph Benchmark: Datasets for Machine Learning on Graphs.

Review
Rating: 5

The paper identifies over-aggregating, a tendency of graph-transformer attention to assign nearly uniform weights when many nodes are present, which “dilutes” salient messages. It measures the effect with attention entropy and shows that the entropy lower bound grows with graph size. To counter this, the authors introduce Wideformer, a plug-and-play module that divides the global receptive field into m query-based clusters using a K-means++ seeding routine (Alg. 1) and aggregates each cluster independently. It guides attention by re-weighting and sorting the cluster outputs with a second softmax so each node can focus on informative clusters. Wideformer is inserted atop three linear graph-transformer backbones (GraphGPS, SGFormer, Polynormer) and evaluated on 13 node-classification datasets spanning 1K–170K nodes. It consistently lowers attention entropy and yields accuracy or ROC-AUC gains of ~0.2–1.2 pp on most benchmarks, with linear time/space O(nm) and modest wall-time overhead. An ablation shows both division and guidance are needed for peak gains. Limitations and societal impact are discussed.
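To make the divide-and-guide scheme summarized above concrete, here is a minimal dense-attention sketch (the actual backbones are linear-attention models). The tensor names, scaling, and cluster-scoring choices are illustrative assumptions, not the authors' released Wideformer code.

```python
import torch
import torch.nn.functional as F

def divide_and_guide(Q, K, V, C):
    """Q, K, V: (n, d) node features; C: (m, d) query-based cluster centers."""
    n, d = Q.shape
    m = C.shape[0]
    # Divide: assign each source node to its most similar center.
    assign = (K @ C.t()).argmax(dim=1)                       # (n,) cluster id per source node
    outputs = torch.zeros(n, m, d, dtype=Q.dtype, device=Q.device)
    scores = torch.full((n, m), float("-inf"), dtype=Q.dtype, device=Q.device)
    for c in range(m):
        idx = (assign == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue                                         # empty cluster keeps zero weight
        att = F.softmax(Q @ K[idx].t() / d ** 0.5, dim=1)    # attention restricted to cluster c
        outputs[:, c] = att @ V[idx]                         # per-cluster aggregation
        scores[:, c] = (Q * outputs[:, c]).sum(-1) / d ** 0.5
    # Guide: a second softmax re-weights the cluster outputs for every target node.
    guide = F.softmax(scores, dim=1)                         # (n, m)
    return (guide.unsqueeze(-1) * outputs).sum(dim=1)        # (n, d) guided aggregation
```

The "guide" softmax over cluster outputs lets each target node down-weight uninformative clusters while every source node still contributes to some cluster, which is how the global receptive field is preserved.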

Strengths and Weaknesses

Quality Strengths: Solid empirical study on 13 datasets; ablations and complexity analysis support claims; open-sourced code and reproducibility checklist. Weaknesses: Experiments focus only on node classification; no link-prediction or graph-level tasks; statistical significance of gains not reported. Limitation of K-means++ choice. The divide step uses K-means++, which may be suboptimal due to its assumptions. Have the authors considered differentiable clustering or adaptive-k strategies? An ablation comparing clustering algorithms or at least discussing these trade-offs would strengthen the paper.

Clarity Strengths: Motivation, method and equations are easy to follow; limitations clearly stated. Weaknesses: (i) Several figures are hard to interpret: e.g., in Figure 6 the numbers actually label clusters, but this is never explained in the caption or legend, so a reader cannot tell what they mean without digging into the text. More descriptive legends or color-coding would help. (ii) Fonts in many plots are tiny. (iii) Multiple spelling glitches (“Cluter”, “Large”, “compexity”) and inconsistent capitalization of “Wideformer.”

Significance Strengths: Tackles a practical performance bottleneck in linear graph transformers and can be adopted by many models; modest but consistent gains on real data. Weaknesses: Incremental relative to existing sparse-attention or cluster pooling work; gains may be negligible in applications where entropy is already low.

Originality Strengths: Introduces the specific notion of over-aggregating and couples a two-step divide-and-guide scheme not present in prior work. Weaknesses: Concept overlaps with known uniform-attention phenomena; division uses standard K-means++ rather than a novel algorithm.

Questions

The work is technically sound, addresses a practical issue, and shows consistent improvements with minimal cost. Expansion to broader tasks and deeper statistical analysis would elevate its impact.

Generality beyond node classification. Evaluate on graph-level tasks or link prediction.

Statistical robustness. Add paired t-tests or confidence intervals; some gains lie within one standard deviation. Figure interpretability. Revise Figure 4 (it requires a description); Figure 6 lacks an informative legend explaining that the numeric labels denote clusters. For similar plots, enlarge the fonts.

Proofreading. Fix spelling errors ("Cluter", "compexity", "Targe") and standardize “Wideformer” capitalization.

The brief description of over-smoothing in §3.3 is hard to follow. Please consider rewriting this paragraph with a concise definition and one illustrative example.

Limitations

Authors explicitly acknowledge that (i) the current division and guidance strategies are simple and may fail in some regimes and (ii) only a limited search over clustering and weighting strategies is explored. I concur and suggest emphasizing scalability to million-node graphs and tasks beyond classification. Yes, limitations are adequately discussed.

Final Justification

The authors have cleared my concerns with additional experiments. They have promised to update the paper accordingly. As such I have raised my score from borderline accept to accept.

Formatting Concerns

No concerns

Author Response

Thank you very much for your support and valuable suggestions that further broaden the potential application of our method to other tasks. We hope our response can adequately address your concerns. The discussion will be added to the revision.

Quality 1&Q1&Q2: More results for different tasks.

Response: We further evaluate backbones with Wideformer for graph classification on peptides-func [R1] and CIFAR10 [R2], and for graph regression on peptides-struct [R1] in Tab.R1. Results show that Wideformer can consistently benefit backbones on these tasks.

Table R1. Comparison on Graph-level Tasks
Method | peptides-func ↑ | peptides-struct ↓ | CIFAR10 ↑
GCN | 59.30 | 0.3496 | 55.71
GraphGPS | 63.77 | 0.2615 | 71.63
+Wideformer | 64.02 | 0.2530 | 73.40
SGFormer | 60.14 | 0.2796 | 55.93
+Wideformer | 60.53 | 0.2687 | 56.51
Polynormer | 61.39 | 0.3089 | 65.34
+Wideformer | 61.95 | 0.3002 | 66.11

Quality 2&Q3-1: Statistical significance of gains.

Response: Thank you for your suggestion, which enables us to prove the significance of our performance gains statistically. We compute the performance gains between backbones with Wideformer and backbones only, and then compare the gains with zero. Paired t-tests on all backbones consistently give rise to p-values smaller than 0.05, and the confidence interval results do not contain 0, demonstrating the statistical significance of our gains.
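A minimal sketch of the significance test described above, pairing per-dataset scores with and without Wideformer; the numbers are placeholders and the exact settings (confidence level, datasets) are assumptions on our part.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset scores (placeholders, not results from the paper).
backbone  = np.array([81.2, 77.5, 90.1, 68.4, 73.9])
with_wide = np.array([81.9, 77.9, 90.6, 69.1, 74.6])
gains = with_wide - backbone

t_stat, p_value = stats.ttest_rel(with_wide, backbone)            # paired t-test on the gains
ci_low, ci_high = stats.t.interval(0.95, df=len(gains) - 1,
                                   loc=gains.mean(), scale=stats.sem(gains))
print(p_value < 0.05, (ci_low, ci_high))  # significant if p < 0.05 and the CI excludes 0
```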

Quality 3&Originality 2: Ablation on center selection methods.

Response: Thank you very much for enlightening us to discover a better center selection method!

  • Setting: Following your suggestions, we studied the trade-off between computing efficiency and model performance. In the manuscript, to ensure computing efficiency, we did not apply the iterative clustering process in the Wideformer but only initiated the centers via a K-means++-like algorithm. Therefore, we only evaluated whether applying clustering iterations results in better performance during the rebuttal, instead of considering more complex clustering strategies. Specifically, we compare the simple initiating choice with (1) Multi-step iteration clustering and (2) directly modeling centers as parameters. The results are averaged on amazon-ratings, questions, CoauthorCS, and AmazonPhoto.
  • Results in Tab.R2 show that further applying clustering iterations can achieve better results, indicating the efficiency-performance trade-offs. We also discovered that directly modeling centers as parameters can address this trade-off, which does not require clustering iterations and achieves the best results.
Table R2. Clustering Method Comparison
Method | GraphGPS | SGFormer | Polynormer
Original | 78.91 | 79.54 | 80.84
+Wideformer | 79.26 | 79.98 | 81.07
+iter=2 | 79.93 | 80.58 | 81.44
+iter=4 | 79.95 | 80.42 | 81.43
+iter=6 | 79.85 | 80.55 | 81.52
+learnable | 80.05 | 80.60 | 81.54

Clarity&Q3-2&Q4: Legend description, tiny fonts, and typos.

Response: Thank you for your suggestions. We fixed these problems in the revision by increasing the font size, correcting the typos, adding a legend description to Fig.6 (the shape of the points indicates the model type, where '.' denotes the whole model and '*' denotes the Wideformer part; the color of the points indicates the number of clusters $m$), and adding a description to Fig.4 (the informative source node ratio denotes the ratio of nodes assigned to the cluster with the highest attention score relative to the total number of graph nodes).

Significance 1: Comparison with sparse attention and cluster pooling.

Response: Thank you for your suggestions. We have added the discussion to the revision.

  • Sparse attentions approximate fully-connected message passing via sparse connections. These methods shrink the receptive fields of the target nodes, which leads to sub-optimal message pathways for downstream tasks. In contrast, Wideformer maintains the global receptive field for target nodes. The importance of maintaining a global receptive field can be supported by the results in Tab.1-3, where combining global attentions with Wideformer gives rise to better performances than sparse attentions.
  • Cluster pooling methods focus on learning hierarchical features of input graphs. They can collect global information by stacking multiple layers. In contrast, Wideformer can collect global information in a single layer, giving rise to shallow architectures for large graphs.

Significance 2: Performance on low-entropy scenario.

Response: Wideformer can also achieve statistically significant gains on small-scale graphs (Cora and Citeseer in Tab.3 in the original manuscript), which typically have lower attention entropy. This demonstrates the effectiveness of Wideformer across different node scales.

Originality 1: Comparison with uniform-attention phenomena.

Response: Our work differs from previous uniform-attention phenomena by providing

  • Proof as a common issue: Unlike the occasional observations of uniform attention in previous works [R3], we discovered that uniform attention with high attention entropy consistently exists across different global graph attention methods on different datasets.
  • Systematic study: We termed this phenomenon over-aggregating, systematically providing quantitative, theoretical root-cause, and influence analyses.
  • Effective solution: We also proposed Wideformer to tackle over-aggregating, which is tailored for large-scale graphs and achieves consistent performance gains for different backbones on different datasets.

Q5: Definition and example for over-smoothing.

Response: Thank you for your suggestion. We will rewrite this part with a concise definition and examples.

  • Cause: Over-smoothing occurs when applying low-pass filtering to aggregate messages on the input graphs, such as the adjacent matrix with self-loops. In contrast, over-aggregating is caused by the large number of nodes blocking the optimization of the attention scores.
  • Solution: Designing high-pass aggregating functions can tackle over-smoothing. Conversely, for over-aggregating on graph transformers, even when a high-pass behavior is desired, the optimization of attention weights may still be hindered by the overwhelming number of source nodes involved in aggregation. One must tackle this issue regarding the input node volumes.

[R1] Long Range Graph Benchmark. NeurIPS'22

[R2] Benchmarking Graph Neural Networks. JMLR'23

[R3] Scratching Visual Transformer's Back with Uniform Attention. ICCV'23

Comment

Thank you for the response. I'll keep my score.

Comment

Thank you for your time and thoughtful suggestions on our manuscript. We sincerely appreciate your feedback and your decision to maintain a positive rating.

Review
Rating: 4

The paper defines the concept of attention entropy, proposes a new algorithm that divides global aggregation by clusters and guides the aggregation by the significance of the clusters, which is supposed to reduce the attention entropy. Experiments show that the algorithm is effective in improving the performance of several graph transformers on various datasets.

Strengths and Weaknesses

Strengths: Define the concept of attention entropy; propose a new algorithm that divides global aggregation by clusters and guides the aggregation by the significance of the clusters. It is written in plain language and has a handful of plots that attempt to corroborate the effectiveness of its method.

Weaknesses:

  1. Theorem 3.1 (at line 122) is a correct but not meaningful statement. You can always show that entropy is lower bounded by 0, which is obtained when $\alpha$ has a point-mass distribution. This statement does not suffice to support the exact claim (at line 126) "graph attention will have higher attention entropy when aggregating more graph nodes", and Fig.1(b) does not conform to the monotonicity in the Theorem when the # of nodes increases from 1e+4 to 1e+6.
  2. The paper does not show a clear causal relation (or correlation) between reduction of attention entropy and better generalization. Empirically, the algorithm reduces the attention entropy of all three architectures significantly, but the test accuracy is improved marginally. Moreover, the most severe reduction occurs on polynormer, but the improvement of polynormer is not superior to that of the other two architectures.
  3. The idea of using cluster information in the network data is not unique, even in my package of reviewed assignments.

Questions

  1. Is there a possibility that over-aggregation is not a significant factor for the generalization of graph transformer models, or, on the contrary, why is it a significant one?
  2. Can you find an approach to demonstrate that reducing the attention entropy deterministically improves the generalization of the model?
  3. In your abstract, you mentioned that your method reduces information loss. Can you be more specific about the information loss in the original transformer structures? How do you measure the information loss? To what extent does your algorithm help reduce it?

Limitations

There is no negative societal impact of the work.
The limitations of the paper: See the questions and weaknesses

Final Justification

Questions are resolved.

Formatting Concerns

No major formatting issues detected.

Author Response

Thank you very much for your valuable suggestions that help us improve our paper. We hope our response can adequately address your concerns. The discussion will be added to the revision.

W1: Lower bound of the attention entropy.

Response:

  • Tight lower bound of Theorem 3.1: We would like to clarify that Theorem 3.1 is not based on the trivial lower bound of 0, but a non-trivial, tighter lower bound derived in Appendix A. This bound increases with the number of aggregated nodes.
  • Lower bound monotonicity: Theorem 3.1 characterizes the monotonicity of the attention entropy lower bound to the number of nodes, rather than the monotonicity of the actual learned attention entropy.
    • Regarding Fig.1(b): The overall trend of Fig.1(b) aligns with the theoretical prediction, where a higher lower bound makes it more likely for the learned attention scores to exhibit higher entropy given more aggregated nodes.
    • Regarding Line 126: We agree that the original phrasing may be misleading. We will revise the statement to “more likely to have higher entropy” to better reflect the theoretical prediction.
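For reference, a minimal sketch of the attention-entropy measurement discussed throughout this thread, assuming the normalized entropy is each target node's Shannon attention entropy divided by $\log n$ (our reading of the metric, not the authors' code):

```python
import math
import torch

def normalized_attention_entropy(att: torch.Tensor) -> torch.Tensor:
    """att: (n_target, n_source) row-stochastic attention matrix."""
    p = att.clamp_min(1e-12)
    entropy = -(p * p.log()).sum(dim=1)      # Shannon entropy of each target node's attention row
    return entropy / math.log(att.shape[1])  # normalize by log(n_source) so values lie in [0, 1]
```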

W2&Q1&Q2: Relation between attention entropy and model generalizability.

Response: Thank you for your suggestions, which help us to better understand the working mechanics of Wideformer.

  • Reduction of attention entropy relates to better generalization
    • Regularizing entropy improves model performance: In Tab.S5 in the manuscript, we show that directly reducing the attention entropy during training can achieve better model performance.
    • Reduced entropy results in smaller generalization gap: We further analyze the performance gap between the training set and the test set with different weights on the entropy regularization term. Results are averaged on GraphGPS, SGFormer, and Polynormer across amazon-ratings, questions, AmazonPhoto, and CoauthorCS. Tab.R1 shows that reduced entropy can reduce the performance gap between training and inference. This confirms the positive relation between the reduction of attention entropy and better generalizability.
    • Attention entropy should be compared within the same architecture: Due to distinct design principles, different architectures inherently prioritize different attention patterns for optimal representation learning. While high attention entropy consistently indicates over-aggregating, the optimal entropy level varies across models. Thus, although Polynormer exhibits the most severe entropy reduction, this may still fall short of what is needed for its optimal performance, explaining why its gain is not superior to the other two.
  • Statistical analysis demonstrates the significance of performance gains: Following the suggestions of reviewer j4ry, we compute the performance gains between backbones with Wideformer and backbones only, and then compare the gains with zero. Paired t-tests on all backbones consistently give rise to p-values smaller than 0.05, and the confidence interval results do not contain 0, demonstrating the statistical significance of our gains.
Table R1. Relation between Entropy and Generalizability
Reg weight | 0 | 0.01 | 0.05 | 0.1
Cluster attention entropy | 0.1837 | 0.0921 | 0.0812 | 0.0509
Test-Train loss | 0.1235 | 0.1065 | 0.0892 | 0.0225
Train-Test acc | 10.27 | 3.34 | 2.55 | 0.69
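A minimal sketch of the entropy regularization evaluated in Tab.S5 and Tab.R1 above, assuming the regularizer simply adds the mean cluster-attention entropy to the task loss with weight reg_weight; the exact formulation is our assumption.

```python
import torch.nn.functional as F

def loss_with_entropy_reg(logits, labels, cluster_att, reg_weight):
    """cluster_att: (n_target, n_cluster) row-stochastic cluster-attention matrix."""
    task_loss = F.cross_entropy(logits, labels)
    p = cluster_att.clamp_min(1e-12)
    entropy = -(p * p.log()).sum(dim=1).mean()     # mean cluster-attention entropy
    return task_loss + reg_weight * entropy        # reg_weight: 0, 0.01, 0.05, 0.1 above
```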

W3: Differences between Wideformer and previous clustering-based methods.

Response: The main difference between Wideformer and previous clustering-based methods is that Wideformer is tailored to tackle over-aggregating.

  • Node division tailored for message dilution in over-aggregating:
    • Tackling over-aggregating requires distinguishing informative source nodes from less informative ones.
    • Selecting the truly informative sources for the targets requires the modeling of the source-target relations.
      • Previous methods fail by focusing solely on either target nodes (query) or source nodes (key).
      • We learn from the relation by selecting cluster centers among the targets and assigning source nodes based on their similarity to the target centers. This strategy enables effective discrimination of informative nodes while avoiding excessive computational overhead.
  • Attention guiding tailored for uniform attention in over-aggregating: To enable target nodes focusing on the informative clusters, cluster aggregation results are further weighted with cluster attention.

Except for these model-level differences, we also showed in the manuscript that over-aggregating is a common issue across different global graph attention methods in different datasets, and systematically studied its root cause and influence with different numbers of nodes.

Q3: Information loss during the over-aggregating.

Response: The information loss refers to message dilution in over-aggregated attention, where signals from informative nodes are overwhelmed by noisy or irrelevant inputs.

  • Quantitative measure: We can measure this loss by summing the attention scores assigned to informative nodes. Ideally, if all attention is allocated to the informative nodes, the sum equals 1. A sum < 1 indicates message dilution, i.e., some attention is "wasted" on uninformative nodes, leading to loss of useful information.
  • Wideformer reduces information loss via aggregation division
    • Dividing the aggregation reduces the total number of nodes being aggregated.
    • Fewer nodes, better optimization: Based on the derivation in sec 3.2, fewer source nodes benefit the optimization of the attention scores.
    • Better optimization, more focused attention: The optimization empowers the target nodes to focus on the informative nodes, where informative nodes get larger attention scores and thereby reduce information loss.
  • Reduced attention entropy demonstrates the reduced information loss: Reduced attention indicates that target nodes successfully select their informative source nodes and assign large attention scores to these nodes. Therefore, the sum of the attention scores assigned to informative nodes becomes large, demonstrating less information loss.
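A minimal sketch of the quantitative measure described above: the attention mass a target node assigns to its informative source nodes, where a value below 1 indicates dilution. Which nodes count as informative is supplied by the caller and is an assumption here.

```python
import torch

def informative_attention_mass(att: torch.Tensor, informative: torch.Tensor) -> torch.Tensor:
    """att: (n_target, n_source) attention matrix; informative: (n_source,) boolean mask."""
    # per-target attention mass assigned to the informative nodes (1.0 = no dilution)
    return att[:, informative].sum(dim=1)
```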
Comment

Thank you for your comments. It may be a novel algorithm that improves the performance, but I am still not sure that the logic in your argument in the paper is right.

As soon as you assume a uniform lower bound of the $\alpha$'s, your theoretical result Lemma A.1 holds for all kinds of entropies (not just for your attention entropy), which is why I say this is a correct statement. However, generally, there is no such uniform lower bound of $\alpha$. It is possible that one of the $\alpha$ goes to zero in the training, and $\epsilon$ becomes very close to zero, which then makes the entropy independent of $n$. Intuitively, more nodes will have higher entropy because of the chaoticity in real networks, but the theorem in its current form is not rigorous enough to support or motivate your method. To my knowledge, a valid theorem needs at least some assumptions on the connectivity pattern of the network. Then it seems not convincing to connect the effectiveness of the cluster division to the entropy reduction. As I said, there is no causal relation between entropy reduction and the division, and it is natural to ask the hidden reason behind the effectiveness of your method.

Some questions:

  1. Maybe a "general network + entropy reg" (without other interference terms) experiment can resolve the reader's doubt to some extent about the effect of reducing entropy in improving generalization. Table S5 only shows the effectiveness of wideformer+reg, not of the proposed attention entropy.
  2. Table 4 shows that attention guidance does not have a positive effect in three settings. What is a legitimate explanation to that?
Comment

Thank you very much for your feedback. We hope the following response can address your concerns.

Q1: How does Theorem 3.1 support the proposed method?

Response:

  • (Larger n, higher lower bound) Theorem 3.1 with Lemma A.1 shows that the lower bound increases monotonically with $n$. Any entropy value the graph gains from training (including your mentioned scenario "one of the $\alpha$ goes to zero in the training") is bounded by this lower bound, instead of being independent of $n$.
    • We base our Theorem on the general entropy formulation rather than assuming any network connectivity pattern because the attention graph does not assume any connectivity pattern.
  • (Larger n, higher actual entropy) Theorem 3.1 indicates that graph attention will be more likely to have higher entropy when aggregating more nodes. Figure 1 validates this inferred conclusion.
  • (Larger n and higher entropy, worse optimization) We show in sec 3.2 and Fig.1(c) that models with more nodes to be aggregated are typically initialized with higher attention entropy compared to sparse attention, and a large number of graph nodes will attenuate the optimization process of the attention.
  • Based on the observations above, we conclude that reducing the number of nodes in every aggregation will be more likely to have small attention entropy and benefit the optimization process, and thus we propose Wideformer.

Q2: Relation between entropy reduction and the division.

Response: Thank you very much for bringing this to our attention. There is some potential confusion here.

  • Division reduces entropy: Node division reduces the number of source nodes for each aggregation by assigning zero attention to specific nodes, and thus, reducing the attention entropy. The attention entropy of the actual aggregation matrix is presented in Tab.R2. We will replace Fig.3(a) with all the results in Tab.R2 and add the following discussion.
  • Attention guidance to ensure distinction.
    • Why high entropy? The attention entropy in Fig.3(a) is computed by comparing the attention values across different clusters (Entropy of mixing clusters w/o. guide in Tab.R2). However, it is not the actual attention matrix employed for aggregation, as we aggregate nodes within each cluster.
    • Why is this high entropy not ideal? If we aggregate source nodes from different clusters at the same time (which is not what actually happened in Wideformer), the target nodes may not be able to distinguish informative ones from the others. This can lead to false cluster assignments for the aggregation division. To ensure the separation of the informative ones from the others, we further apply attention guidance to the clustering results.
  • Reasons for effectiveness:
    • Dividing the aggregation can avoid message dilution, where only a small ratio of the source nodes participate in the aggregation.
    • Attention guiding ensures the focus on the informative nodes.
Table R2. Attention Entropy Comparison
Method | GraphGPS | Polynormer | SGFormer
Original | 0.9938 | 0.8254 | 0.9941
Entropy of the aggregation matrix | 0.2891 | 0.1673 | 0.6974
Entropy of mixing clusters (w/o guide) | 0.9928 | 0.8268 | 0.9887
Entropy of mixing clusters (w/ guide) | 0.2857 | 0.1406 | 0.6658

Q3: Backbone with entropy reg on the original attention.

Response: Thank you for your suggestions. We only apply entropy reg on the cluster attention due to the memory overhead of explicitly computing the whole attention matrix during training.

Following your suggestion, we further apply entropy reg to the original attention on Cora. Results in Tab.R3 show that applying entropy reg to the original attention can reduce the performance gap and improve the generalization.

Table R3. Relation between Entropy and Generalizability
Reg weight | 0 | 0.01 | 0.05 | 0.1
Attention entropy | 0.4661 | 0.4038 | 0.3962 | 0.3933
Test-Train loss | 1.0045 | 0.9912 | 0.9361 | 0.8659
Train-Test acc | 0.1897 | 0.1887 | 0.1868 | 0.1859

Q4 Why does attention guidance show degradation in three settings in Tab.4?

Response: This is because attention guidance applies cluster attention to the whole cluster.

  • Cause: When target nodes have fewer informative source nodes, there may be falsely assigned less informative nodes to the informative cluster (rank-1 cluster in the reordering). Mixing of the informative and less informative nodes reduces the average attention of the whole cluster and attenuates the signal by multiplying the cluster attention.
  • Future direction: Improving the node division strategy to ensure more accurate cluster assignment is a potential solution for this degradation. Primary results are presented in Tab.R1 for reviewer FdpH, showing that improving the clustering strategy can benefit the performance.
Comment

I said from the beginning that your Theorem 3.1 is a correct one under the assumption of $\epsilon > 0$. However, finding a lower bound dependent on $n$ under an assumption is not a sufficient condition for the claim that a smaller $n$ has a smaller lower bound of entropy in general. I can assume that there are only a fixed number $k$ of non-zero $\alpha_j$'s (for simplicity we omit the first index in the subscript of your formula); then the lower bound of the entropy becomes $\log k$, which is independent of $n$.

Response to "Any entropy value the graph gains from training (including your mentioned scenario..., instead of being independent of $n$": When $\alpha_{i,j} = 0$ for some $j$, the $\epsilon$ in Lemma A.1 has to be zero, and then the lower bound of the entropy equals 0, which is independent of $n$.

Comment

Thank you for your feedback. We would like to clarify that, under standard softmax-based attention, each attention weight $\alpha_i$ is computed as $\alpha_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} > 0, \quad \forall x_i \in \mathbb{R}$. As a result, all $\alpha_i$ are strictly positive, and it is not possible for only a fixed number of them to be non-zero. Therefore, $\epsilon$ is always strictly greater than zero.

Comment

Sure, please ignore my "zero" examples.

Comment

We are glad that we have reached a consensus on this point, and we hope that our responses have addressed all your concerns. Please feel free to let us know if there are any remaining issues—we would be happy to further clarify or elaborate as needed.

Comment

Apart from the logic of argument in your paper, I still have questions on your "over-aggregating" concept. I still don't know when the overaggregation or the entropy determines the generalization of the model. We can let each node be a cluster (m=n), in which case the attention entropy will be zero. Does this way of clustering degrade the performance of the model? What is the condition of overaggregation being a significant influence on your model?

My zero examples are the limiting case for small $\epsilon$ when you consider one cluster; but globally, your clustering of the nodes is an exact approach to make the attention sparse. It seems that the combination of graph attention sparsity and smaller neighborhoods is more likely to be the essential reason for the model's effectiveness than the relief of overaggregation.

Comment

Thank you for your thoughtful comments, which have helped us refine our presentation of the "over-aggregating" phenomenon. Below, we address each of your questions in turn.

Q1: Does assigning each node to be a cluster degrade the performance?

Response: Yes, this leads to large attention entropy and over-aggregating at the cluster level.

In Sec 5.1 and Tab.R2, we consider attention entropy both within and across clusters. Merely reducing intra-cluster entropy is not sufficient to mitigate over-aggregating. If each node is assigned to be a separate cluster, the cluster-level aggregation degenerates to the node level, defeating the purpose of clustering and reintroducing over-aggregating among clusters.

Thus, there exists a necessary trade-off between reducing over-aggregating at the node level and avoiding its reemergence at the cluster level. Our ablation study (Fig.3(a) and Fig.7) shows that setting the number of clusters to around 3~5 strikes this balance most effectively, offering the best performance across datasets.

Q2: What is the condition of over-aggregating being a significant influence on your model? When does over-aggregating determine the generalization of the model?

Response: Based on empirical evidence, we have identified the following indicators that over-aggregating is negatively impacting the model:

  • High normalized entropy (> 0.7). Fig.4 shows that informative source nodes only constitute a small ratio of all nodes. When the normalized attention entropy exceeds 0.7, it consistently indicates over-aggregating of noisy or irrelevant information, which harms generalization.
  • Relative entropy drop (before vs. after applying Wideformer). Different architectures exhibit different entropy levels due to their inherent design. For example, architectures such as Polynormer show initial entropy in the 0.4–0.7 range (Fig.1(b)). However, Wideformer can still benefit Polynormer on these datasets. In these cases, relative entropy before and after employing Wideformer indicates whether the model encounters over-aggregating (Fig.3(a)).
  • Large node scale (>1k): From the consistent performance gains observed on datasets with over 1,000 nodes, we find that global-attention-based backbones at this scale remain susceptible to over-aggregating for node-level tasks.

Q3: Is the combination of graph attention sparsity and a smaller neighborhood the essential reason to tackle over-aggregating?

Response: It's not sufficient. The major distinction between Wideformer and sparse attention strategy is that Wideformer can reduce the number of input nodes per aggregation and maintain the global receptive field for the target nodes. Although each cluster only aggregates a small ratio of the source nodes, all nodes are aggregated within different clusters. Only a combination of graph attention sparsity and a smaller neighborhood is not sufficient to tackle over-aggregating. The importance of the global receptive field can be validated in Tab.1~3, where combining global attention with Wideformer achieves better performance than sparse attention methods.

Comment

Thank you, I appreciate your defence, but

  1. I give a counterexample because I believe that large overaggregation does not necessarily count as a reason of bad performance of the model. Your answer justifies my point: reducing overaggregation potentially degrades the performance; there are trade-offs.

  2. Note that Q3 is not my exact question. What I'm arguing is that although relieving overaggregation positively (or negatively) correlates with the performance improvement (e.g., the authors showed p-values), it is not the reason for the performance improvement. No evidence provided so far by the authors demonstrates that overaggregation "causes" performance degeneration. Mitigating overaggregation seems to be a side effect of your Wideformer in the classification.

Comment

Thank you for your comment to empower us to better elaborate on the over-aggregating phenomenon.

Q1: Does reducing over-aggregating potentially degrade the performance in the counterexample?

Response: In your counterexample, the over-aggregating still exists, as we demonstrate in the last reply. Reducing the over-aggregating partially within each cluster does not reduce the actual over-aggregating for each node. The trade-off is that one has to balance the reduced over-aggregating at both levels to alleviate the actual over-aggregating. It cannot be justified that reducing over-aggregating potentially degrades the performance.

Q2: Evidence for over-aggregating causes performance degeneration.

Response: The direct evidence to justify this is that directly reducing the cluster attention entropy (as optimization target) during training can achieve better model performance in Tab.S5 in the manuscript.

We also show that reducing the attention entropy can improve the generalization of the models in Tab.R1 and R2. To further validate this, we present the absolute results of directly reducing the attention entropy during training (without Wideformer) in Tab.R4, where gains are achieved.

Table R4. Attention Entropy Regularization
Method | GraphGPS | SGFormer | Polynormer
Original | 86.03 | 80.83 | 78.70
+Entropy Reg | 86.21 | 81.02 | 79.29
+Wideformer | 86.70 | 81.20 | 79.90
Comment

Thank you for your feedback! We are happy to provide further empirical results. Due to the memory overhead of explicitly computing the attention matrix during training, we can only choose small-scale datasets, including Cora, Citeseer, AmazonPhoto, and minesweeper.

Q1: Does 3~5 minimize the overall attention entropy?

Response: Thank you for your suggestions! Following your guidance, we compared the attention entropy of the models with different numbers of clusters. Average results across GraphGPS, SGFormer, and Polynormer are presented in Tab.R5. We can see that the minimal entropy values are gained in the 3~5 range.

Table R5. Entropy Comparison with Different $m$
m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Cora | 0.8729 | 0.5422 | 0.4603 | 0.2116 | 0.5318 | 0.4295 | 0.3935 | 0.4710
Citeseer | 0.8986 | 0.7806 | 0.7736 | 0.3237 | 0.4131 | 0.7246 | 0.7041 | 0.8319
AmazonPhoto | 0.9791 | 0.7896 | 0.6532 | 0.2814 | 0.2786 | 0.2931 | 0.3728 | 0.3719
minesweeper | 0.9889 | 0.9006 | 0.2384 | 0.2441 | 0.4789 | 0.5528 | 0.3039 | 0.3524

Q2: Can you plot the attention entropy against the test performance during training on different types of datasets?

Response: Due to the strict prohibition on uploading figures during rebuttal, we present the training results of SGFormer with Wideformer in Tab.R6. Results show that the training process can reduce the entropy across different types of datasets.

Table R6. Attention Entropy During Training
Epoch | 1 | 10 | 15 | 20 | 30 | 50 | 100 | 200 | 500
Cora, normalized entropy | 0.8725 | 0.8041 | 0.7307 | 0.6978 | 0.567 | 0.5188 | \ | \ | \
Cora, test acc | 14.02 | 31.5 | 66.91 | 76.68 | 80.93 | 81.19 | \ | \ | \
Citeseer, normalized entropy | 0.8756 | 0.7675 | 0.7702 | 0.6275 | 0.5584 | 0.6041 | \ | \ | \
Citeseer, test acc | 46.51 | 69.3 | 69.33 | 69.36 | 70.06 | 70.62 | \ | \ | \
AmazonPhoto, normalized entropy | 0.9685 | 0.9417 | 0.9155 | 0.9062 | 0.8535 | 0.7105 | 0.6157 | 0.6795 | 0.6349
AmazonPhoto, test acc | 21.77 | 28.04 | 41.11 | 67.25 | 77.78 | 89.61 | 92.68 | 95.23 | 95.65
minesweeper, normalized entropy | 0.8433 | 0.5954 | 0.6213 | 0.5927 | 0.5825 | 0.5644 | 0.5197 | 0.5098 | 0.5193
minesweeper, test ROC-AUC | 63.09 | 77.13 | 77.29 | 76.84 | 79.15 | 82.73 | 87.72 | 88.83 | 89.03

Q3: Can you compare the entropy reg to SOTA sparse GAT methods with standard deviation? Is the reason behind the effectiveness of entropy reduction the concentration of attention on a small portion of nodes?

Response:

  • Results in Tab.R7 show that entropy regularization can benefit backbones on these datasets. Exphormer and GAT are selected as the best-performing sparse methods. The best results among the three backbones GraphGPS, SGFormer, and Polynormer are reported due to character limitations.
  • Why is entropy reg effective for benefiting backbones? Entropy regularization indeed helps target nodes to assign large attention scores to a small ratio of nodes. However, the better performance of the original global method compared to the sparse method shows that maintaining a global receptive field is also important. Therefore, we need to design a method that can both maintain global receptive fields for target nodes and reduce the volume of nodes for each aggregation.
Table R7. Results of Entropy Regularization
Method | Cora | Citeseer | AmazonPhoto | minesweeper
GAT | 83.00±0.70 | 72.10±1.10 | 93.87±0.11 | 93.91±0.35
Exphormer | 82.77±1.38 | 71.63±1.19 | 95.35±0.22 | 86.24±0.61
Global method | 86.03±0.54 | 77.96±0.37 | 95.47±0.53 | 97.13±0.17
+reg | 86.23±0.32 | 78.12±0.39 | 95.52±0.44 | 97.20±0.32
+Wideformer | 86.70±0.83 | 78.61±0.35 | 95.64±0.36 | 97.26±0.01
Comment

The authors have addressed most of my concerns. There are still two points I care about:

  1. The theorem is not a legitimate motivation for your division of the network. It is misleading.

  2. Using entropy reg gives only a marginal improvement over the original method.

However, the authors dedicate themselves to the rebuttal and modifications, and I believe the paper will qualify as a conference paper after the following revision.

  1. Remove the theorem; the marginal improvement of entropy reg is better as a motivation for your proposed wideformer. Emphasize the key advantage of the wideformer over entropy reg.

  2. Fully compare entropy reg and your wideformer on all the datasets used in your paper instead of the four in your rebuttal.

Comment

Thanks.

Response to Q1: I apologize for the misleading counterexample (I did overlook the inter-cluster part). Let me rephrase. You mentioned in the penultimate comment that when m is around 3~5 the performance is good. Does 3~5 minimize the overall attention entropy? Can you plot the attention entropy against the test performance during training on different types of datasets? I wonder if that is a decreasing shape or not.

Response to Q2: I think the entropy reg experiments are acceptable. But the reg results seem marginal (on whatever dataset you pick). To evaluate the effectiveness of the reg, can you run several datasets with standard deviations? Moreover, can you also compare the entropy reg to SOTA sparse GAT methods? I ask for this since I think that the reason behind the effectiveness of entropy reduction is the concentration of attention on a small portion of nodes.

Comment

Dear Reviewer,

We would like to provide an update regarding the implementation of your suggestions.

To address the out-of-memory (OOM) issue associated with the entropy regularization baseline, we developed a solution that divides the target nodes into mini-batches of size $b$. Each batch is sequentially processed on the GPU to compute attention scores with the source nodes. This approach reduces the peak memory overhead to $O(bn)$, enabling us to evaluate entropy regularization on most of the datasets in the paper (though the training time for large-scale graphs is significantly increased). Experiments on ogbn-arxiv and twitch-gamer encounter OOM even when setting $b$ to 10.
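A minimal sketch of the mini-batched computation described above, assuming the regularizer only needs per-row attention entropies so that a $b \times n$ slice of the attention matrix is materialized at a time; implementation details are assumptions.

```python
import torch
import torch.nn.functional as F

def batched_attention_entropy(Q, K, b):
    """Q: (n, d) queries, K: (n, d) keys; returns the mean attention entropy over target nodes."""
    n, d = Q.shape
    total = Q.new_zeros(())
    for start in range(0, n, b):
        att = F.softmax(Q[start:start + b] @ K.t() / d ** 0.5, dim=1)  # only a (b, n) slice in memory
        p = att.clamp_min(1e-12)
        total = total + (-(p * p.log()).sum(dim=1)).sum()
    return total / n
```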

The results are presented in Table R8. The best results among the three backbones, GraphGPS, SGFormer, and Polynormer, are reported due to character limitations. Full results will be included in the revision. We can see that entropy regularization improves performance on small-scale datasets (< 30k nodes), confirming that high attention entropy can hinder model performance. However, for larger graphs, the performance gains are marginal. This is consistent with our analysis in Section 3.2: since entropy regularization does not divide the aggregation process, aggregating over a large number of source nodes still suffers from optimization challenges.

Table R8. Comparison with Entropy Regularization
Method | Cora | Citeseer | AmazonPhoto | AmazonComputers | CoauthorCS | CoauthorPhysics | minesweeper | tolokers | roman-empire | amazon-ratings | questions
GAT | 83.00±0.70 | 72.10±1.10 | 93.87±0.11 | 90.78±0.13 | 93.61±0.14 | 96.17±0.08 | 93.91±0.35 | 83.78±0.43 | 88.75±0.41 | 52.70±0.62 | 76.79±0.71
Exphormer | 82.77±1.38 | 71.63±1.19 | 95.35±0.22 | 91.47±0.17 | 94.93±0.01 | 96.89±0.09 | 86.24±0.61 | 83.77±0.78 | 89.03±0.37 | 53.51±0.46 | 73.94±1.06
Global method | 86.03±0.54 | 77.96±0.37 | 95.47±0.53 | 91.98±0.09 | 95.44±0.03 | 96.87±0.08 | 97.13±0.17 | 85.09±0.21 | 91.83±0.16 | 54.71±0.17 | 78.66±0.50
+reg | 86.23±0.32 | 78.19±0.39 | 95.52±0.44 | 92.08±0.07 | 95.50±0.19 | 96.88±0.09 | 97.20±0.32 | 85.21±0.33 | 91.94±0.34 | 54.79±0.23 | 78.69±0.24
+Wideformer | 86.70±0.83 | 78.61±0.35 | 95.64±0.36 | 92.39±0.07 | 95.84±0.07 | 97.39±0.19 | 97.26±0.01 | 85.33±0.23 | 92.16±0.24 | 55.05±0.08 | 79.00±0.20

Based on these results, the advantages of Wideformer over entropy regularization are summarized as follows:

  • Lower memory overhead: Wideformer requires $O(nm)$ memory, where $m \ll n$. In contrast, entropy regularization incurs $O(n^2)$ memory, making it less scalable and forcing a trade-off between memory and time.
  • Better mitigation of message dilution: Wideformer separates the aggregation into parallel processes. Each process only involves a small ratio of the source nodes to be aggregated, preventing the dilution of the key messages. In contrast, entropy regularization still aggregates over all source nodes, leading to attenuated optimization signals and the risk of over-aggregating on large graphs.

Thank you again for your valuable suggestions, which help us to provide a more comprehensive evaluation of our method and validate the necessity of the specific design. We will follow your suggestions to revise our manuscript and incorporate the results into the revision.

Comment

Thank you very much for your feedback. We are happy to have addressed most of your concerns. For your remaining two concerns:

Point 1: Employing entropy regularization results instead of Theorem 3.1 as a motivation for our proposed method. (L122-126)

Response: Thank you for your suggestions. Regarding entropy regularization results, we will add the results to the manuscript to show that over-aggregating with higher entropy does harm the model performance. This can be employed as direct evidence to complement the analysis in Sec 3.2.

While the entropy regularization results provide direct empirical evidence that high entropy can hurt performance, they do not reveal what factors cause the entropy to increase. We aim to provide a more complete picture by combining these results with Theorem 3.1 and Sec 3.2, which highlights the relationship among entropy, the number of aggregated nodes, and the attenuated optimization process.

Regarding Theorem 3.1, we understand that placing Theorem 3.1 early in the paper (L122–126) may have created the impression that it serves as a direct motivation. We will revise this part to clarify that Theorem 3.1 only provides a theoretical insight into why attention entropy tends to increase with the number of aggregated nodes, complementing the empirical findings in Fig.1(b). The actual architectural design of Wideformer to reduce the number of nodes being aggregated is motivated by the analytical insights in Sec 3.2.

Point 2: Employing entropy regularization on larger datasets.

Response: We have attempted to apply entropy regularization to the remaining datasets, but encountered out-of-memory issues due to its $O(n^2)$ memory cost. Such memory overhead, along with the relatively marginal gains, suggests that entropy regularization is not a viable solution. In contrast, Wideformer achieves more substantial improvements while avoiding quadratic memory overhead.

Thank you again for your patient guidance throughout the process. We sincerely appreciate your time and your valuable suggestions.

Review
Rating: 2

This paper presents Wideformer, a plug-and-play enhancement to graph attention mechanisms designed to address a newly identified challenge in graph transformers: over-aggregating. The authors observe that as the number of nodes increases, attention scores tend to become uniformly distributed, leading to diminished discriminatory power during message aggregation. Wideformer mitigates this issue by employing a divide-and-conquer strategy, in which parallel aggregation modules each focus on restricted subsets of nodes. This design reduces the information burden per aggregation unit and alleviates message dilution, thereby enhancing the model’s ability to capture salient information in large graphs.

Strengths and Weaknesses

[Strengths]

  1. I personally enjoyed Wideformer’s design—partitioning attention aggregation and enforcing selective focus—which mirrors established strategies in managing information overload in attention-based models (e.g., mixture-of-experts). The method is described clearly with intuitive steps: parallel aggregation, cluster-wise processing, and attention-guided selection.

  2. The paper is built upon a solid motivation which effectively situates Wideformer among two dominant solution categories for attention scalability: sparse attention and linear attention. It convincingly argues that while these approaches reduce complexity, they may exacerbate over-aggregation under large node counts.

  3. The paper is supported by strong experimental results, which validate that Wideformer achieves state-of-the-art performance across 13 well-known benchmarks in graph processing.

[Weaknesses]

  1. Although "over-aggregating" is introduced as a distinct problem, its symptoms resemble aspects of over-squashing, over-smoothing, and over-dilution (where important messages are lost due to limited representation capacity). I have not been convinced that "over-aggregating" is a distinct and unique problem that requires special solution sets.

  2. Wideformer mitigates over-aggregation by increasing the number of parallel aggregation units. But I would have enjoyed a more thorough analysis of the computational overhead and memory cost. How does it compare in runtime efficiency, given its "late-aggregation" style where it runs on subsets first and aggregates later? A brief discussion or analysis of efficiency vs. accuracy trade-offs is necessary, especially since scalability is a key motivation.

  3. The most major concern is that while the problem is introduced in graph neural networks, this is a central issue in both graph-based [1] and Transformer-based methods, which can be considered fully connected graphs. In this respect, I personally do not think that Wideformer is either unique or novel, since simply restricting the attention scope to a subset of the input tokens has been presented in multiple previous works on Transformer attention [2,3,4].

[1] Even Sparser Graph Transformers (NeurIPS 25, https://neurips.cc/virtual/2024/poster/95681)
[2] Stabilizing Transformer Training by Preventing Attention Entropy Collapse (ICML'23, https://proceedings.mlr.press/v202/zhai23a/zhai23a.pdf)
[3] Attention is not all you need: pure attention loses rank doubly exponentially with depth (ICML'21, https://arxiv.org/pdf/2103.03404)
[4] In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness
[5] ENACT: Entropy-based Clustering of Attention Input for Reducing the Computational Needs of Object Detection Transformers (https://arxiv.org/abs/2409.07541)
[6] Scratching Visual Transformer's Back with Uniform Attention (ICCV'23, https://arxiv.org/pdf/2210.08457)

Questions

My major concerns are written in the weakness section above.

Limitations

No. Refer to my weaknesses please.

Formatting Concerns

None.

Author Response

We greatly appreciate your valuable suggestions that help us revise our manuscript, and hope our response can adequately address your concerns. The discussion will be added to the revision.

W1: Differences between over-aggregating and over-squashing/smoothing/dilution.

Response: Thank you for bringing over-dilution to our attention. We would like to confirm whether our understanding is correct that the reviewer is referring to the over-dilution discussed in [R1]; if this is not the case, we kindly ask the reviewer to clarify. We have added the differences between over-dilution and over-aggregating to the existing discussion in Sec 3.3.

  • Over-smoothing
    • Different graph structures: Over-smoothing occurs on the input graph structures, while over-aggregating occurs on the attention graph.
    • Different causes: Over-smoothing stems from the low-pass filtering of the aggregation function, while over-aggregating is caused by the attention optimization issue when the number of source nodes becomes large.
    • Different solution: Designing high-pass filters can tackle over-smoothing. Conversely, even when a high-pass behavior is desired during over-aggregating, the optimization of attention weights may still be hindered by the overwhelming number of source nodes. One must tackle this issue regarding the input node volumes.
  • Over-squashing
    • Different causes: Over-squashing stems from the imbalance between the volumes of a node’s input and output messages, whereas over-aggregating is only associated with its input volume.
    • For instance, consider an attention matrix where edge weights are uniformly distributed across the fully-connected attention graph.
      • This case reflects the most severe form of over-aggregating, where the attention entropy reaches its maximum (i.e., $\log n$).
      • Over-squashing does not arise, since each node in such a graph has perfectly balanced input and output volumes.
  • Over-dilution
    • Intra-node dilution focuses on the volumes of the node features, while over-aggregating focuses on the volumes of the input nodes.
    • Inter-node dilution focuses on whether a node's own information is preserved during aggregation, whereas over-aggregating emphasizes the preservation of messages from the informative nodes, regardless of whether that includes the node itself.

W2: Runtime and memory cost.

Response: Thank you for your suggestion. Based on our analysis of model complexity (sec 5.3.2) and cluster ablation (sec 5.3.3), Wideformer results in

  • Efficiency-accuracy trade-off regarding the number of clusters $m$.
    • Wideformer time/space complexity: $T + O(nm)$. $T$ is the complexity of the backbone, and $n$ is the number of graph nodes.
    • A larger $m$ value ($m \le 8$) gives rise to higher accuracy, but increases the memory and runtime costs linearly.
  • Affordable increase in time and memory cost: Fig. 7 shows that Wideformer achieves the largest gains with a small number of clusters $m$ in {3, 4, 5}. The average cost increase ratio (calculated as [cost(m=4) - cost(m=1)] / cost(m=1)) across different numbers of graph nodes is presented in Tab.R1. We can see that the additional costs are relatively minor, amounting to only ~30% of the original time cost and less than 11% of the memory cost.
Table R1. Cost Increase Ratio
Model | Setting | Time (s) | Memory (MB)
GraphGPS | m=1 | 0.0973 | 13870
GraphGPS | m=4 | 0.1408 | 15546
GraphGPS | increase ratio | 30.93% | 10.78%
SGFormer | m=1 | 0.0258 | 50206
SGFormer | m=4 | 0.0413 | 51332
SGFormer | increase ratio | 37.56% | 2.19%
Polynormer | m=1 | 0.1043 | 29338
Polynormer | m=4 | 0.1435 | 30180
Polynormer | increase ratio | 27.37% | 2.78%

W3: Differences with previous works.

Response: Thank you very much for bringing these interesting works to us. The major differences between our work and previous works are that,

  • Novel phenomenon: We discovered the novel over-aggregating phenomenon as a common issue across different global graph attention methods in different datasets, and systematically studied its root cause and influence with different numbers of nodes.
  • Novel solution: To specifically mitigate over-aggregating, we propose Wideformer as a plug-and-play solution, including
    • Node division tailored for message dilution in over-aggregating:
      • Tackling over-aggregating requires distinguishing informative source nodes from less informative ones.
      • Selecting the truly informative sources for the targets requires the modeling of the source-target relations.
        • Previous methods fail by focusing solely on either target nodes (query) or source nodes (key).
        • We learn from the relation by selecting cluster centers among the targets and assigning source nodes based on their similarity to the target centers. This strategy enables effective discrimination of informative nodes while avoiding excessive computational overhead.
    • Attention guiding tailored for uniform attention in over-aggregating: To enable target nodes focusing on the informative clusters, cluster aggregation results are further weighted with cluster attention.
  • Effective solution: Wideformer can effectively tackle the over-aggregating effect, while previous work cannot. (Indicated in Tab. R1 and R2)

Specifically to the works you mentioned in the review, the differences can be summarized as:

  • [1][4] validate the prerequisite of our work: These works and other sparse attention methods prove the prerequisite of our work, that not all source nodes are informative to the target nodes. As a result, large attention entropy in graph transformers leads to indiscriminate aggregation and information loss.
    • Why not sparse attention? Wideformer maintains the global receptive field for target nodes, while sparse attention shrinks the receptive fields, leading to sub-optimal message pathways for downstream tasks. This is supported by the results in Tab.1-3, where global attentions with Wideformer give rise to better performances than sparse attentions.
  • [2][3] study other issues in attention: [2] reformulates the weight matrix to tackle the training instability of transformers with lower attention entropy, and [3] decomposes the transformer forward pass into an ensemble of successive attention heads to tackle the attention collapsing.
  • [6] cannot be applied to graph transformers: [6] requires explicitly computing the whole attention matrix, which demands a space complexity of $O(n^2)$ and is not scalable to large graphs.
  • [5] cannot effectively address over-aggregating:
    • Methodology:
      • Awareness of source-target relation: [5] clusters the source nodes based on their own features, while Wideformer clusters nodes based on the source-target relation, which better enables the selection of the informative sources for the targets.
      • Finer-level aggregation: [5] only applies attention aggregation at the cluster level, while Wideformer also applies attention aggregation at the node level within each cluster. Node-level aggregation allows the model to adjust the attention scores at a finer level and capture the node feature distribution within each cluster.
    • Experiment:
      • We implemented ENACT [5] in graph transformers by directly clustering source nodes based on their own features, and only applying cluster-level attention aggregation.
      • By comparing Wideformer with ENACT in terms of the resulting attention entropy (Tab.R2) and node classification performance (Tab.R3), we can see that ENACT cannot effectively tackle over-aggregating and gains worse performance than Wideformer. This demonstrates the effectiveness of our method in tackling over-aggregating and benefiting backbone performance.
Table R2. Attention Entropy
Method | GraphGPS | SGFormer | Polynormer
Original | 0.9846 | 0.9743 | 0.8138
ENACT | 0.9673 | 0.9619 | 0.8125
Wideformer | 0.2863 | 0.1352 | 0.6529
Table R3. Classification Results
Method | amz-ratings | cs | photo | questions
GCN | 48.70 | 92.92 | 92.70 | 76.09
Exphormer | 53.51 | 94.93 | 95.35 | 73.94
GraphGPS | 49.73 | 95.44 | 94.98 | 75.48
+ENACT | 48.52 | 95.16 | 94.76 | 75.74
+Wideformer | 49.96 | 95.85 | 95.55 | 75.69
SGFormer | 52.38 | 93.50 | 95.47 | 76.81
+ENACT | 53.32 | 93.21 | 95.53 | 76.53
+Wideformer | 53.47 | 93.86 | 95.64 | 76.94
Polynormer | 54.71 | 94.53 | 95.47 | 78.66
+ENACT | 54.75 | 94.60 | 95.55 | 78.78
+Wideformer | 55.05 | 94.94 | 95.64 | 79.00
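For reference, a minimal sketch of the ENACT-style baseline as we read the description above (source nodes clustered on their own key features, attention applied only at the cluster level); this is an illustrative reconstruction, not code from [5].

```python
import torch
import torch.nn.functional as F

def enact_style_attention(Q, K, V, assign, m):
    """Q, K, V: (n, d); assign: (n,) long tensor of cluster ids from k-means on K; m clusters."""
    d = Q.shape[1]
    onehot = F.one_hot(assign, m).to(Q.dtype)        # (n, m) cluster membership
    counts = onehot.sum(dim=0).clamp_min(1.0)        # nodes per cluster (avoid divide-by-zero)
    K_c = (onehot.t() @ K) / counts.unsqueeze(1)     # (m, d) mean key per cluster
    V_c = (onehot.t() @ V) / counts.unsqueeze(1)     # (m, d) mean value per cluster
    att = F.softmax(Q @ K_c.t() / d ** 0.5, dim=1)   # (n, m) cluster-level attention only
    return att @ V_c                                 # no node-level aggregation within clusters
```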

[R1] Understanding and Tackling Over-Dilution in Graph Neural Networks. KDD'25

Comment

Dear Reviewer 6GqH,

Thank you again for your valuable time and thoughtful suggestions on our submission.

We would like to kindly follow up to check whether our response has addressed your concerns. If there are any remaining questions or suggestions, we would greatly appreciate the opportunity to clarify further.

Your feedback is valuable to us and has been helpful in improving the quality of our work. Thank you again for your engagement.

Best regards,

Authors of Submission 2796

Final Decision

This paper identifies the novel problem of "over-aggregating" in graph transformers and proposes an effective plug-and-play solution, Wideformer. The method demonstrates consistent performance gains across numerous benchmarks, and the authors' comprehensive rebuttal successfully resolved most reviewers' concerns. Given the strong empirical validation and the overall positive consensus, the paper is a good contribution and warrants acceptance.