PaperHub

Rating: 4.7 / 10 (scores 4, 3, 7; min 3, max 7, std 1.7)
Poster · 3 reviewers
Confidence 3.7 · Correctness 2.0 · Contribution 2.3 · Presentation 2.7
NeurIPS 2024

Even Sparser Graph Transformers

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

Attention scores on graphs are consistent across network widths; find which edges matter with a narrow network first

Abstract

Keywords
Graph Transformers, Expander Graphs, Sparsification

Reviews and Discussion

Review
Score: 4

This paper proposes a two-phase training process for Graph Transformers to address the memory and computational inefficiencies associated with large graph datasets. The first phase involves training a small-width network to estimate attention scores. These scores are then used to construct sparse interaction graphs for a larger network, which is trained in the second phase. The authors claim that this approach maintains performance while significantly reducing memory requirements. They provide theoretical justifications for their sparsification method and validate their approach through experiments on various graph datasets.

Strengths

  1. The paper is generally well-written and structured. The methodology and experimental setups are clearly explained, making it easier for readers to understand the proposed approach.

  2. The paper introduces a novel two-phase training methodology. This approach is creative and has the potential to address a significant limitation in current graph transformer models related to scalability and memory usage.

  3. The theoretical analysis provided in the paper gives a solid foundation for the proposed method.

Weaknesses

  1. While the two-phase training process is innovative, it primarily builds on existing Exphormer. It mainly adjusts the flexibility of the expander graph and adds a sampled attention mechanism. The novelty might be seen as incremental rather than groundbreaking.

  2. It was recently shown that the reported performance gap between Graph Transformers (GTs) and MPNNs on node classification is overestimated due to suboptimal hyperparameter choices [1,2]. My concern is whether we truly need global information propagation for node classification in large graphs and whether models need to approximate a full Transformer. Currently, from an experimental perspective, MPNNs with residual connections outperform GTs, even basic models like GCN or GAT. In this context, the authors need to compare the memory usage and runtime of MPNNs and the proposed method to demonstrate its efficiency.

    [1] Bag of Tricks for Node Classification with Graph Neural Networks.

    [2] Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification.

  3. The authors are encouraged to include an ablation study to clarify the necessity of the two-phase training process. This study should illustrate the effects of not utilizing a wider network trained on a much sparser graph.

  4. The theoretical analysis is primarily focused on the first layer of the network. It would be more compelling if the analysis were extended to deeper layers to provide a more comprehensive understanding of the method’s effectiveness.

  5. Minor Issues: Lines 268-277 are somewhat confusing, particularly line 275, where the number of query nodes might be incorrect.

Questions

The authors claim that the fixed number of neighbors for each node enables matrix multiplication calculations instead of edge-wise calculations (gather operations), improving the speed of the model. How much faster is this approach in terms of training time compared to Exphormer and MPNNs?
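To make the mechanism in this question concrete, here is a minimal, self-contained sketch with toy random data and hypothetical neighbor lists (not the authors' implementation): with a fixed neighbor count c per node, per-layer attention reduces to dense batched operations over an [n, c] score tensor rather than per-edge gather/scatter followed by a segment-wise softmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, c = 1000, 64, 8                  # nodes, hidden dim, fixed neighbor count per node
X = torch.randn(n, d)                  # toy node features
nbr = torch.randint(0, n, (n, c))      # hypothetical c-regular neighbor lists

Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Because every node has exactly c neighbors, keys/values can be gathered once into
# dense [n, c, d] tensors, and attention becomes regular batched matmuls.
K_nbr, V_nbr = K[nbr], V[nbr]                                   # both [n, c, d]
scores = torch.einsum('nd,ncd->nc', Q, K_nbr) / d**0.5          # [n, c]
out = torch.einsum('nc,ncd->nd', F.softmax(scores, dim=-1), V_nbr)
print(out.shape)                                                # torch.Size([1000, 64])
```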

Limitations

Limitations are discussed.

Author Response

Thanks for your valuable review. Here are our responses. Please also refer to the rebuttal PDF for some extra tables and figures.

Innovation

As mentioned in lines 97-102, we use Exphormer to interpolate between MPNNs and full-attention Transformers. Lower expander degrees make the model similar to MPNNs (particularly, e.g., GAT), while the highest degree (n-1) is equivalent to a full Transformer. None of our theory relies on the expander degree or even the existence of the expander graph, and the experimental results include expander degrees from 30 to 200. The main idea of our work is to build a sparse and scalable method, and it made sense to start from an already sparse method such as Exphormer and build on it to sparsify further. Sparsifying from the full Transformer can be a little tricky, because the attention scores can be very small and unreliable (due to the large number of nodes each node attends to).

In addition, our paper has other contributions of additional interest, such as the batching technique (a particular fit for our sparsification, but usable in other settings), and our theoretical results regarding the compressibility of the network.

Comparison with GNNs

Thanks for pointing to these papers. The second paper mentioned here was published after our submission, and so we were not aware of it before submitting. However, there are a few important things to consider:

  • The networks they have used for their experiments are much larger than the ones used in our work, in terms of the number of parameters. For example, for Minesweeper they use 15 layers with hidden dimension 512, while we use four layers with hidden dimension 32. We have also focused on memory efficiency, and most of the small datasets are trained under a 4GB GPU memory limit. Their models have been trained on a workstation with 8 RTX 3090 GPUs (each of which typically has 24GB of memory).

  • While our work is focused on memory efficiency, we also do batching based on the task and the requirements of the problem (relying on the attention scores from the initial network). These works, however, rely on random-subset batching, which as shown in our Figure 4 and Appendix E can sometimes behave very poorly. As the ratio of the batch size to the whole graph goes to zero for a fixed-degree graph, the probability of ever seeing any edge in the training process also goes to zero (a back-of-the-envelope calculation follows this list). By contrast, our batching is independent of the ratio of graph size to batch size, and allows trading off time and memory without biasing the gradients of the problem parameters.

  • These GNNs have been well studied for many years, while Graph Transformers are relatively new; thus, there is probably also potential to find new tricks that improve GT performance. Many key advantages of GTs can also be brought to the GNN world, as most of these techniques appear in the papers you have mentioned: long-range dependencies can be handled by deeper GNNs, and oversmoothing/oversquashing and simplistic message functions can be handled by residual connections, jumping knowledge, layer normalization, and larger hidden dimensions. Even so, much smaller GTs “naturally” handle many of these problems, and so the potential for better performance by finding appropriate tricks seems solid.
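To spell out the vanishing-probability claim in the second bullet above, here is our back-of-the-envelope formalization (assuming each batch B is an independent uniform subset of b of the n nodes and a fixed number T of training steps; this calculation is ours, not the paper's):

```latex
\Pr[u \in B \text{ and } v \in B]
  \;=\; \frac{\binom{n-2}{b-2}}{\binom{n}{b}}
  \;=\; \frac{b(b-1)}{n(n-1)} \;\approx\; \Big(\frac{b}{n}\Big)^{2},
\qquad
\Pr\big[(u,v) \text{ ever induced in } T \text{ batches}\big]
  \;\le\; T\Big(\frac{b}{n}\Big)^{2} \;\xrightarrow{\;b/n \to 0\;}\; 0 .
```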

Ablation studies

Thanks for this suggestion; we have added an ablation study including this in our rebuttal pdf.

The theoretical analysis

Indeed, this is a limitation of our theory at time of submission. Since submission, we’ve continued to work on that, and have been able to extend the result to deeper layers for a scheme similar to (but not the same as) the narrow network we use in practice: a narrow attention dimension, but dimensions for the other parts of the network (e.g., the MLP layers) the same as the target network. (This “in-between” scheme saves in the most expensive part of the network.)

Because of the space limitation in this reply, please refer to our response to reviewer UFNp for the details.

This expanded version of the theorem helps extend our guarantees about the possibility of approximation to deeper layers. Thank you for pointing out this important issue.

Minor Issues

Thanks for mentioning this. We will fix the mistake and make the writing clearer in the next revision, as well as add pseudocode to clarify.

Speed improvement

To give you some examples, the training time per epoch for the Tolokers dataset when training on the sparsified graph improves from 0.56s to 0.47s, and for the Photo dataset, it improves from 0.43s to 0.36s on average. This is a significant improvement, considering that many other parts of the network, such as Q, K, and V mappings and the feed-forward networks, are fixed and have similar overhead in both cases, with only the attention score estimation part changing.

The time improvement is even more considerable when comparing the Exphormer model to our sparsified regular graph. For example, the whole epoch time (neighbor sampling + training + validation) improves from 6.2s/epoch to 1.1s/epoch on the Tolokers dataset, and from 1.7s/epoch to 0.5s/epoch on the Photo dataset.

Comment

Thank you for your response. I have carefully reviewed all the content, particularly the comments from other reviewers. While I acknowledge the theoretical contributions of the paper, I remain unconvinced by the experimental results. When I closely examined Tables 1 and 2 in the original paper, I noticed that SGFormer [1] was not included. This inconsistency raises concerns about the validity of the experimental findings. The authors should conduct a thorough comparison with the latest scalable transformers [1,2], including performance, memory usage, and runtime across a broader range of datasets in the original paper (performance across all datasets). Given the current results, I will be maintaining my initial score.

[1] SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations.

[2] Polynormer: Polynomial-Expressive Graph Transformer in Linear Time.

Comment

Dear Reviewer,

Thank you for your response. We wish we had heard about this concern earlier during the discussion period to address it more thoroughly. Given the time constraints, we would like to briefly address some points regarding your response:

  1. SGFormer has indeed been our main baseline for the large-graph datasets in Table 2 (see the row just before our model). While we can add this baseline to Table 1, that is not the primary purpose of Table 1: its goal is to demonstrate that the sparsification from Exphormer maintains competitive results, and the other rows provide context for typical results on those datasets. If it satisfies you, we can add SGFormer to this table as well.

    Regarding large graphs, our model uses around 2-3 GB of memory, which is comparable to the SGFormer model. For instance, on Amazon2M, our model uses 3 GB while SGFormer uses 2.7 GB. Notably, our model allows for batch size reduction to train with less memory without sacrificing accuracy. However, as mentioned in Appendix E and Figure 4, random subset batching has its flaws, and reducing the batch size can significantly hinder the model’s performance.

  2. Regarding the Polynormer, we have already outlined several reasons why we are not comparing our model with this baseline, and we are happy to include these reasons in our paper if it helps. To summarize, the reasons are:

    • Polynormer introduces an orthogonal idea that can be applied to many models, including ours. This idea enables a polynomial mixture of features and node embeddings. However, this process introduces several complexities, such as a two-stage training process and the requirement for a very large network. Our final model is sublinear with respect to graph size and achieves competitive results against linear or superlinear models. We are not pursuing every possible extension, such as adding polynomial characteristics, merely to chase state-of-the-art results. Our work has a specific purpose, and we explore ideas around it, comparing our model with reasonable baselines.

    • As is customary with deep learning models, models with a similar number of parameters should be compared. While our model is a much smaller network, Polynormer uses a hidden dimension of 512, and a single linear map in their network has more parameters than most of our networks. Although they do not explicitly mention the number of parameters, they typically use between 6 and 13 layers, each with many linear mappings. Additionally, while our model and our main baseline, SGFormer, aim to train on a 4 GB budget, Polynormer uses 48 GB GPU devices for training and still applies random-subset batching.

    • Not every model must achieve state-of-the-art results to be valuable. Our model aims to be memory-efficient and extend sparse pattern Graph Transformers to large graphs, offering a memory-time trade-off. We have demonstrated its effectiveness and compared it to relevant models that prioritize memory efficiency. In the NLP context, sparse pattern Transformers and low-rank models have always been distinct development threads, each excelling in different aspects.

    • We have addressed the limitations of a baseline mentioned in Figure 4 and Appendix E, and our model avoids these pitfalls. Even if model sizes are not convincing, we emphasize that different models should be developed in parallel. Suppressing a research direction because it does not immediately achieve top results is not scientifically sound. Different models have unique advantages, and the lack of benchmarks or not achieving the best results should not deter their development. After all, if neural networks had been abandoned early due to lack of best results, we would not enjoy their benefits today.

Review
Score: 3

This paper proposes a method to reduce the memory complexity of Exphormer's transformer layers by learning a sparser attention pattern. Given some graph learning tasks, the authors first train a small-width, single-head proxy model on CPUs with large memory resources. After training, the attention scores of the smaller proxy model appear to reflect which interaction edges in Exphormer's attention network are important to solve the given task. These attention scores are used to sample (in a layer-wise fashion) sparser, regular subgraphs of the original interaction graph. These give rise to layer-wise sparse attention patterns which can be used to train a larger model on GPUs with constrained memory resources. This allows scaling Exphormer to larger graphs.

Strengths

  • It seems that the idea of using smaller proxy models to find sparse attention patterns is novel. Given that this approach could also be applicable to other domains, it may be relevant to the community.
  • The authors appear to have some theoretical results motivating their method.

Weaknesses

  • The paper claims that graph transformers typically suffer from quadratic memory complexity. While this holds true for dense attention mechanisms (including self-attention), efficient implementations like FlashAttention achieve linear complexity. The authors do not explore such implementations for their sparse attention approach, which could potentially lead to similar benefits. Notably, while the sparsity pattern itself requires $O(|E|)$ memory, it's unclear whether its storage dominates the memory usage compared to the potentially inefficient computation of attention scores as seen in Exphormer (e.g., here). A more in-depth discussion on this trade-off would be valuable.
  • Furthermore, the evaluation focuses solely on memory consumption, neglecting runtime performance. Since runtime can be traded off for memory (e.g., via activation checkpointing), the authors should analyze the runtime implications. If the proposed method increases runtime, a detailed justification explaining the memory-runtime trade-off compared to existing approaches is necessary.
  • A potential bias in the evaluation is also observed. The authors use a higher expander degree (deg=30) compared to the original Exphormer paper (deg=6). This choice reportedly benefits their method (as noted in lines 524-526). However, the Exphormer baseline also uses this higher degree (lines 205-206). It's crucial to investigate how this skews the results and whether Exphormer with the original degree would have similar memory limitations.
  • Regarding Table 2, where Exphormer runs out of memory, stronger baselines are needed. For instance, Polynormer [1] might outperform Spexphormer on the pokec dataset. Additionally, studies like NAGphormer [2] that evaluate stronger baselines on pokec could provide evidence that even GCN might outperform Spexphormer.
  • To further validate the effectiveness of the proposed proxy models, including a baseline with random sparsification would be beneficial. This would demonstrate that the proxy models offer an advantage beyond simple pattern reduction.
  • The baseline distributions in section 5.1 appear overly simplistic. Utilizing distributions based on the actual models' behavior would provide a more realistic picture.
  • Ablation studies investigating the impact of value normalization (line 212) and variable temperature (line 218) would be insightful.

Minor Points

  • Appendix B: This seems to contain verbatim citation. It also appears like this in the Exphormer paper, but I do not know whether that is the primary source. I think one should mark appropriate paragraphs as verbatim.
  • l162-l165: I feel like this is making a strong claim ("$\mathbf{E}^j$ is insufficient") with a very lax argument ("assuming the columns of $\mathbf{K}$ and $\mathbf{Q}$ are distributed independently and uniformly ..."). I would consider removing this strong claim or adding an ablation that supports it.
  • l242-250: This paragraph is quite unclear. The phrasing "almost the same features" and "lead to the same result" are too unspecific for a research paper, even if one can infer what is meant. Overall, the paragraph seems to try to give an intuitive understanding why sampling should be better than selecting the top-k neighbors. Unless there are experimental results to support it, I am skeptical whether one should make the claim that one is better than the other.
  • In the equation in l142-143, some information regarding dimensions and the meaning of $\sigma$ could be helpful to readers.
  • Big Bird appears twice in references (l467, 470). Similar issue in lines 460, 462
  • Landau symbols are not typeset consistently (sometimes \mathcal is missing)
  • l180-181: Being pedantic: it probably is possible to fit the transformer on the GPU, but it is not possible to train it.
  • l251-252: What does "exponentially large" mean?
  • l258-259: Do you mean $\mathcal{O}(|\mathcal{N}_H(i)|)$ complexity?
  • l259: what do you mean by trial and error?
  • l268: no space after comma
  • l289: I assume it should be $|\mathcal{V}_l|$ instead of $|\mathcal{V}|$
  • l511: socal -> social
  • l536: missing reference

References

[1] Deng, C., Yue, Z., & Zhang, Z. (2024). Polynormer: Polynomial-Expressive Graph Transformer in Linear Time. ICLR 2024

[2] Chen, J., Gao, K., Li, G., & He, K. (2023). NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. ICLR 2023

Questions

  • The paper focuses on memory consumption for small graphs. It's expected that Spexphormer's benefits would be most significant on large graphs. To strengthen the analysis, consider estimating Exphormer's memory consumption on large graphs (e.g., by running a single training step on CPU).
  • The paper claims a complexity reduction for Spexphormer layers, from $\mathcal{O}((m + n)d^2)$ to $\mathcal{O}(nd^2 + ndc)$ (Eq. 31 & Eq. 77). This improvement hinges on $c$ scaling sublinearly with the average node degree. While theoretical indications for this possibility are mentioned, are there empirical results supporting it?
  • Section 4.2.1 suggests that Spexphormer allows for memory-efficient batching due to slower subgraph growth compared to Exphormer (line 42). Has this advantage been evaluated experimentally?
  • The claimed complexity of $\mathcal{O}((m + n)d^2)$ for Exphormer layers (line 31) is unclear. The Exphormer paper's discussion is also missing. Specifically, why is there an $md^2$ term instead of $md$ (there are $m$ dot products between $d$-dimensional vectors)? Additionally, why does the $d^2$ term change to $d$ in the Spexphormer complexity (line 77)?
  • Typically (including Exphormer), the equation would be $\mathbf{V}^j_i = \mathbf{W}_O \mathbf{W}^{j}_{V} \mathbf{X}_{\mathcal{N}_H(i)}$, where $\mathbf{W}_O \in \mathbb{R}^{d \times m}$, $\mathbf{W}_V^j \in \mathbb{R}^{m \times d}$, and $m$ is the head dimension. This serves to reduce the parameter count. Is this also the case for you and just an accidental omission, or are you actually learning several $d \times d$ matrices $\mathbf{W}_V^j$?

Limitations

While the authors identify high main memory usage as a limitation, a more comprehensive analysis of limitations is warranted. Here are some key considerations:

  • Efficiency of attention implementations: The paper doesn't explore the potential benefits of using efficient attention implementations like FlashAttention, which could potentially achieve similar memory savings as the proposed method. Evaluating Spexphormer's performance with such implementations would be valuable.
  • Impact of Expander degree: The evaluation compares Spexphormer to an Exphormer baseline that uses a higher expander degree than the original Exphormer paper. This potentially favors Spexphormer. A more balanced comparison would involve evaluating Exphormer with the original degree to understand if it suffers from the same memory limitations.
  • Runtime considerations: The evaluation focuses solely on memory consumption, neglecting the potential impact on runtime. Since runtime can be traded off for memory, it's crucial to analyze how Spexphormer affects runtime performance.
Author Response

Thanks for your very thorough review; sorry for brevity (character limit). Please also see rebuttal pdf for updated results.

quadratic memory complexity

FlashAttention computes either full-attention or sparse-block attention mechanisms, with memory-aware loading of relevant parts of data. The use of full-attention, block-sparse attention (Big Bird), or linear approximate attention mechanisms (Performer) has been explored in the GraphGPS paper; Exphormer usually gets superior accuracies, probably thanks to its inductive bias.

A full-graph transformer with FlashAttention should work (with somewhat-worse accuracy expected), but would highly rely on positional encodings, standard versions of which are difficult to compute for large graphs. This is a feasible model to explore, but would require significant innovation and be very different from our work.

The block-sparse version of FlashAttention without considering the structural biases of the graph seems a poor fit for graph structures (cf BigBird results from GraphGPS).

Doing FlashAttention on an Exphormer-type attention pattern seems inefficient, since the block structure seems important to FlashAttention's efficacy, while Exphormer-style graphs can have nodes of wildly varying degrees (if the original graph’s degree varies, as usual).

Something like FlashAttention might be more feasible on regular graphs, as from Spexphormer. But this potential computational improvement to our approach seems to require deep thought about GPU memory structures and difficult implementation work, as did FlashAttention; we don't think this should be necessary for an initial paper exploring a new approach.

evaluation focuses solely on memory

Where we can train the original network, our approach is faster than Exphormer in general. For Photo, Exphormer takes 1.7s/epoch, our initial network 1.1s/epoch, final Spexphormer 0.5s/epoch. For Tolokers, Exphormer takes 6.2s/epoch, our initial network 4.6s/epoch, final Spexphormer network 1.1s/epoch. This advantage is larger for denser graphs. We also have added some experiments on the trade-off between RAM and GPU, as we can trade-off memory and time without any advanced GPU-level implementations (Figure 2 in rebuttal pdf).

potential bias in the evaluation

Even using a smaller degree, Exphormer uses much more memory than our higher-degree Spexphormer; see Fig. 1 in our pdf for some numbers. Since Spexphormer can effectively use these higher degrees, it is more likely to be able to handle long-range dependencies. For example, on the ogbn-proteins dataset, an expander degree of 30 barely achieved an AUC of around 0.78 with a thorough hyperparameter search. Increasing the degree to 200 yielded an average AUC of 0.806.

Table 2 ... stronger baselines are needed

Thanks for highlighting this. We mostly followed the experimental setup and setup code from SGFormer; while they state they use the standard 80-10-10 train-val-test split, we had not previously noticed that their code (and ours) actually uses a 50-25-25 split. The numbers in Table 2 are correct for this split, but exacerbate the difference vs Polynormer due to the smaller training set; e.g., SGFormer's GCN has accuracy 62%, while Polynormer's is 75%. We only just realized this, and are rerunning with the standard split; we'll try to provide numbers before the discussion period ends.

It's also worth noting Polynormer uses many more parameters than our models: 7 local + 2 global layers, each with hidden dimension 512, compared to our 2-layer network with hidden dimension 64. They train with batching based on a random subset of nodes, which as mentioned in Figure 4 and Appendix E can fail catastrophically; on Pokec in particular, though, the neighborhood summary is the most important feature, and the graph structure is not especially relevant, so this does not hurt much.

The idea of polynomial layers can also be directly dropped into most common architectures, including ours; a SpexPolynormer combination model would be straightforward to write down. Their added complexity and hyperparameters, however, including a warm-up phase, a mixture of global and local layers, etc., make this cumbersome to include when exploring other new ideas.

Also, all of our current large-graph experiments now run with under 4GB of GPU memory, showcasing that our approach works well with very small batch sizes – unlike random-subset approaches which do not scale to small batches. (See the rebuttal pdf for more.)

Moreover, our first phase of training uses high CPU memory, but later training and evaluating the learned model on a given node can be done with low memory. Polynormer, SGFormer, and even GCN all batch for low memory during training, but evaluate with high CPU memory (loading the whole graph).

Baseline with random sparsification; baseline distributions appear overly simplistic

Also, uniform neighbor sampling is a common approach, used famously, e.g., in GraphSAGE, which often gets competitive results. In Table 1 of the pdf, we see that using our network instead of random sparsification helps a lot on some datasets.

We’re not sure what you mean by “distributions based on the actual models’ behavior,” other than exactly the distributions that we use in Spexphormer. Do you have a specific baseline in mind to compare to?

Ablation studies

Good idea; also in new Table 1.

Appendix B

Thanks for noticing this; sorry for the mistake which we overlooked in submission. We will rewrite this section.

l162

Our claim is not that it is always impossible (though it is at initialization) for a model to decide to e.g. ignore expander edges, but that bias variables make it much easier. We’ll clarify.

l242

You are correct that we are making an intuitive argument for why we chose sampling over top-k selection; we'll clarify.

Other minor points

Thanks; we'll fix these.

Questions

Please see the separate author response above; thank you for your detailed review!

Comment

We again thank the reviewer for the detailed review. Here are a few new confirmations:

Sampling Importance: Using the maximum-ranked edges instead of sampling can lead to poor results, especially seen on heterogeneous datasets. Here are some numbers:

| Dataset | Sampling with Attention Score Ratios | Maximum Attention Neighbors |
| --- | --- | --- |
| Minesweeper | 90.72 ± 0.06 | 87.92 ± 0.26 |
| Tolokers | 83.15 ± 0.12 | 80.85 ± 0.23 |
| Photo | 95.24 ± 0.12 | 95.07 ± 0.20 |

This confirms that sampling, rather than taking the maximum, is usually a better approach. The size of the effect can be very different among datasets, though.
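For concreteness, the two selection rules compared above can be sketched as follows (toy scores and shapes; not the authors' code):

```python
import torch

torch.manual_seed(0)
c = 8                                  # neighbors kept per node
scores = torch.rand(1000, 50)          # hypothetical positive attention scores: 1000 nodes x 50 candidates

# Deterministic top-k: always keep the c highest-scoring candidate neighbors.
topk_idx = scores.topk(c, dim=-1).indices                      # [1000, 8]

# Weighted sampling without replacement: draw c distinct candidates per node with
# probability proportional to their scores, so lower-scoring but still informative
# neighbors are occasionally kept.
sample_idx = torch.multinomial(scores, c, replacement=False)   # [1000, 8]

print(topk_idx.shape, sample_idx.shape)
```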

Subgraph Growth: For the efficiency of batching, as we argued earlier:

When $c$ is smaller than the expander degree $k$, as usual in our work, the neighborhood expansions are strict subsets of those from Exphormer. We haven't yet done explicit experimentation on this but will try to do so during the discussion period, or else for the revised version.

It is easy to see that our expansion should be much smaller than, for example, batching that considers the whole neighborhood. For the Photo dataset, starting with 5 nodes and averaging over 10 repetitions of sampling 5 initial nodes and extending the neighborhood, our method grows the neighborhood size (number of nodes ± std) across layers as follows:

Layer 0: 5.0 ± 0.0
Layer 1: 28.9 ± 1.1
Layer 2: 164.8 ± 7.4
Layer 3: 818.1 ± 37.1
Layer 4: 2587.7 ± 94.1

However, whole-neighborhood expansion gives the following node counts:

Layer 0: 5.0 ± 0.0
Layer 1: 274.1 ± 45.1
Layer 2: 6172.3 ± 356.0
Layer 3: 7650.0 ± 0.0
Layer 4: 7650.0 ± 0.0

The graph has 7650 nodes in total; within 2 layers the expansion already covers the whole graph. This expansion can be made smaller by reducing the expander degree, but the expander graph approximates a full-attention Transformer, and thus if the expansion does not reach the whole graph, the expander graph will obviously fail to approximate the full Transformer.
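A growth measurement of this kind can be reproduced in a few lines; the sketch below uses a toy random attention pattern rather than the real Photo graph or our per-layer sampled graphs, so its printed numbers are only illustrative:

```python
import random

def neighborhood_growth(neighbors, seeds, num_layers):
    """Number of nodes needed to compute `num_layers` layers of outputs for `seeds`,
    when `neighbors[v]` lists the nodes v attends to in one layer."""
    frontier = set(seeds)
    sizes = [len(frontier)]
    for _ in range(num_layers):
        frontier = frontier | {u for v in frontier for u in neighbors[v]}
        sizes.append(len(frontier))
    return sizes

# Toy comparison on a random 20-regular attention pattern over 7650 nodes,
# versus keeping only c = 5 sampled neighbors per node.
random.seed(0)
n, deg, c = 7650, 20, 5
full = {v: random.sample(range(n), deg) for v in range(n)}
sampled = {v: random.sample(full[v], c) for v in range(n)}
seeds = random.sample(range(n), 5)
print("sampled:", neighborhood_growth(sampled, seeds, 4))
print("full:   ", neighborhood_growth(full, seeds, 4))
```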

This is not the only advantage: as we noted previously in the response to Reviewer vT4j, the regularity of our model yields over a 16% speed-up in the Tolokers and Photo computations compared to a sparse but non-regular computation scheme. The memory saving over the full neighborhood, even without batching (i.e., without exploiting our method's slower expansion in this comparison), is explored in Figure 1 of our rebuttal PDF.

Pokec Dataset and Polynormer Model Baseline: For the Pokec dataset, we can confirm that we have used SGFormer's data split, which is different from the one used in the Polynormer paper. Our experiments are valid, and we are comparing with numbers from SGFormer's table, so all the numbers use the same split. Our method is not comparable with the GCN, NAGphormer, and Polynormer numbers mentioned by the reviewer due to the different data splits, and if you compare our model with GCN under the SGFormer split, our method has far superior results.

Regarding the comparison with Polynormer in general: as is common in the machine learning community, methods with a similar number of parameters should usually be compared. Our model for the Pokec dataset has only 83,781 parameters, and while Polynormer does not report the number of parameters explicitly, it uses a hidden dimension of 512, so even a single linear map needs 262,144 parameters. They have 9 layers in total for this dataset, each with multiple linear mappings. Thus the scale of the parameter counts is not comparable at all.

We have made our best effort to answer all the reviewer's concerns. We would greatly appreciate it if the reviewer could read our rebuttal and let us know whether any concerns remain. We would be happy to answer, though the time remaining in the discussion period is very short.

Comment

Thank you for your detailed response. However, I remain unconvinced by the justification provided regarding the applicability of FlashAttention to memory-efficient sparse attention. More importantly, I do not think different data splits or model sizes are valid reasons to avoid comparing with stronger baselines. Therefore, I will maintain my initial score.

Comment

Dear Reviewer,

Thank you for your response. We hope that you are satisfied with the other parts of our rebuttal. Given the time constraints, we would like to address two points regarding your comment:

  1. Regarding the FlashAttention type implementation, we would like to remind you that this implementation requires CUDA-level optimization, which is not straightforward in the highly irregular space of graph structures. Please note that the maximum graph size here is much larger than the context length typically used in Transformers for NLP, making full attention with simple masking not a viable option. While the direction you mentioned is worth pursuing, it has not been the focus of our work and is beyond its scope.

  2. Regarding the baseline, not every model must achieve state-of-the-art results to be valuable. Our model aims to be memory-efficient and extend sparse pattern Graph Transformers to large graphs, offering a memory-time trade-off. We have demonstrated its effectiveness and compared it to relevant models that prioritize memory efficiency. In the NLP context, sparse pattern Transformers and low-rank models have always been distinct development threads, each excelling in different aspects.

    Our final model is sublinear with respect to graph size and achieves competitive results against linear or superlinear models. We are not pursuing every possible extension, such as adding polynomial characteristics, merely to chase state-of-the-art results. Our work has a specific purpose, and we explore ideas around it, comparing our model with reasonable baselines.

    We have addressed the limitations of a baseline mentioned in Figure 4 and Appendix E, and our model avoids these pitfalls. Even if model sizes are not convincing, we emphasize that different models should be developed in parallel. Suppressing a research direction because it does not immediately achieve top results is not scientifically sound. Different models have unique advantages, and the lack of benchmarks or not achieving the best results should not deter their development. After all, if neural networks had been abandoned early due to lack of best results, we would not enjoy their benefits today.

Review
Score: 7

This work studies sparsifying Graph Transformers, which, when quadratic, are not scalable even on medium-sized graphs. It builds on recent works such as Exphormer, GraphGPS, and SAN, and proposes a new two-stage procedure named Spexphormer. It is designed to reduce memory complexity for node classification tasks. The two-stage process helps address the scalability limitations of existing work such as Exphormer. First, a narrow network is trained on a fully augmented graph to estimate attention scores, and these are used without edge features. These attention scores are then used to train a wider network on a much sparser graph. Experiments conducted on 11 datasets show that Spexphormer achieves competitive accuracy with lower memory usage.

Strengths

  • The question of how graph transformers can be made more scalable through sparsification is addressed, although there are multiple directions for scaling such networks, which are discussed in the literature review section.
  • The two-stage training forms the basis for sparsifying the graph even further than Exphormer. Several challenges can arise in doing so, and the paper addresses them to a decent extent.
  • Experimental analysis shows a clear reduction in memory usage for the proposed Spexphormer compared to the base model.
  • The method comes with a theoretical justification of why the sparsification approach makes sense, though under assumptions.

Weaknesses

Discussed together with the Questions below.

Questions

  • Although the sparsification is addressed, the two-stage process can be complicated to implement without expertise. Moreover, the reliance on the first narrow network could mean error propagation to the actual model. The paper mentions this limitation and provides a study of the approximation in Section 5.1; however, it may not hold universally.
  • The experiment section shows reduced memory usage for the smaller datasets, and in general the maximum dataset size used is in the ~2 million node range. I would be curious to see the trends on the datasets of Table 2, e.g., the Amazon2M products dataset.
  • In Section 4.3 and later in the appendix, it seems the approximation aspect is studied for the first layer only, understandably for simplification. What are the implications of this with respect to layer-wise sparsification in terms of the approximation guarantees?
  • Writing format: for better readability, Figure 4 could perhaps be placed before the experiments, as it relates to the motivation of the method.

Limitations

Yes, the paper discusses the limitations of the method.

Author Response

Thanks for your valuable feedback! Here you can find our response; we also encourage you to check our rebuttal pdf for some new experiments.

the two stage process

Even for one-stage training, hyperparameter tuning is a significant part of the training. In our setup, the first stage is relatively easy to tune: we only need one training run that converges for the initial network, and as the number of layers and the hidden dimensions are fixed, usually just tuning the learning rate works. So the “overhead” of training two networks is not so large compared to the usual process of hyperparameter tuning. We think this, plus the additional complexity (mitigated by us releasing documented code for the two-stage pipeline), is not so extreme, and often worth it for the benefits it brings.

Also, the training process of the second network is independent of the initial network given the attention scores. We intend to share the attention scores we have from our initial network’s training publicly for possible future uses.

error propagation

It is true that the error can propagate. However, this is not too drastic because the second network learns its own attention scores. The neighborhood sampling is done with the initial network, and the second network just gets the updated list of the neighbors and learns its own attention scores. As long as the initial network gives reasonably high attention scores to important neighbors, the second network can adjust the attention scores. Even if some of the selected neighbors are not very informative, the second network can ignore them by giving a lower attention score. If an informative link is missed, it is likely that the needed information can propagate through other (longer) paths. Also, different neighborhood sampling per Transformer layer helps to mitigate the possibility of missing important links. Even if some of the important neighbors are underestimated in the initial network, increasing the number of sampled neighbors can also help bring them back to the sampled neighborhood.

Our experimental results in Table 1, compared to Exphormer, show that the total amount of introduced error is not that large, and our approach can give competitive results.

The experiment section shows reduced memory usage ...

In Figure 3, memory is reported from actual runs of the methods on the smaller datasets. It is not feasible to train the original Exphormer on a 40GB GPU even with only 1 layer and a hidden dimension of 4 or 8. Among the tested datasets, the largest feasible for Exphormer is ogbn-arxiv, where we see a ~5x reduction in memory usage. The memory saving is considerable even without any batching. Training the model with two layers and hidden dimension 8 requires nearly 200GB for the Amazon2M dataset; training it with hidden dimension 256, without advanced techniques that trade memory for time, requires between 32 (256/8) and $32^2$ times more memory, which is not feasible even on most CPU machines.

theory

Indeed, this is a limitation of our theory at time of submission. Since submission, we’ve continued to work on that, and have been able to extend the result to deeper layers for a scheme similar to (but not the same as) the narrow network we use in practice: a narrow attention dimension, but dimensions for the other parts of the network (e.g., the MLP layers) the same as the target network. (This “in-between” scheme saves in the most expensive part of the network.) The result is as follows:

Theorem: Assume we have a Transformer network $\mathcal{T}$ with arbitrarily large hidden dimension $d_l$ and $L=O(1)$ layers, and that in all layers of this network we have $\lVert h_\cdot\rVert_2 \leq \sqrt{\alpha}$ and $\lVert W_\cdot \rVert_{\mathrm{op}} \leq \beta$. Then there exists a Transformer $\widehat{\mathcal{T}}$, in which for every layer $\mathbf{W}_Q$ and $\mathbf{W}_K$ are in $\mathbb{R}^{d_s \times d_l}$ with $d_s=\mathcal{O}(\frac{\log n}{\epsilon^2})$ for sufficiently small $\epsilon$, such that for all $i \in [n]$, $\lVert\mathcal{T}(X)_i - \widehat{\mathcal{T}}(X)_i\rVert_2 = \mathcal{O}(\epsilon)$.

The proof idea is to use Johnson-Lindenstrauss for the attention scores, bound them by relative error, and then bound the Euclidean distance of the other vectors based on the errors in the attention scores. The difference from the previous proof is that error can also propagate through the Euclidean distance of the layer's input. But we bound all these errors and show that the total error is still $\mathcal{O}(\epsilon)$.
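For reference, the inner-product form of Johnson-Lindenstrauss that this sketch appeals to can be stated as follows (a standard statement added here for context, with an unspecified absolute constant; not quoted from the paper):

```latex
% For R with i.i.d. N(0, 1/d_s) entries and fixed vectors q, k:
\Pr\!\Big[\,\big|\langle Rq,\,Rk\rangle-\langle q,k\rangle\big|
      > \varepsilon\,\lVert q\rVert_2\,\lVert k\rVert_2\Big]
  \;\le\; 4\,e^{-c\,\varepsilon^{2} d_s}
  \qquad\text{for an absolute constant } c,
% so a union bound over the at most n^2 query-key pairs (all with norm at most sqrt(alpha))
% gives d_s = O(log n / eps^2) with every attention logit preserved up to an additive
% error of eps * alpha simultaneously.
```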

This proof only works for narrow attention score estimation, and assumes the other parts of the network have the same dimensions. This is, however, the most memory- and time-intensive part of a Transformer architecture: the remaining parts are node/token-wise and linear with respect to the number of nodes. The attention score estimation part of a full-Transformer layer requires $\mathcal{O}(n^2 d)$ operations, and $\mathcal{O}(md)$ operations are required for a sparse Transformer with $m$ attention edges. Thus this change effectively changes the computational complexity of the model.

Another interesting aspect of this proof is that it not only shows there is a network with similar attention scores but a narrow hidden dimension for the attention calculation; by bounding the error in the output, it also shows that this network is nearly as good as the large one, and thus if the large network is optimal, the lower-dimensional one is at least near-optimal.

This expanded version of the theorem helps extend our guarantees about the possibility of approximation to deeper layers. Thank you for pointing out this important issue.

Writing format

We very much agree with this comment and we will definitely move that figure for the next revision.

Comment

Thank you for your responses. While the limitations of the two-stage process and the error propagation may still hold to the best of my understanding, I believe the answers indicate that they do not alter (that much) the overall advantages brought by sparsifying an Exphormer-like architecture in the proposed way. The memory trends on Amazon2M-sized datasets also expose how memory-demanding prior graph transformers are, necessitating further sparsification. Taking into consideration the points raised in other reviews, and pending revision based on the replies, I adjust my score to accept.

Comment

We would like to thank the reviewer for acknowledging our rebuttal and raising the score.

The memory trends on Amazon2M-sized datasets also expose how memory-demanding prior graph transformers are, necessitating further sparsification.

Indeed, this has been one of the main points of our work. Thank you for pointing this out.

As promised in our rebuttal, we will ensure to publish a cleaned-up version of our code along with our attention scores to make implementation easier for the community and enhance the usability of our work.

Author Response

New experiments

All reviewers: please see the PDF linked below, which has some new experimental results.

Further response to Reviewer X47Y

Due to character limits in responding to your very thorough review, these are continued here; thanks again for your work in reviewing our paper!

To strengthen the analysis, consider estimating Exphormer's memory consumption on large graphs (e.g., by running a single training step on CPU).

We use 200GB of memory to train on a dataset such as Amazon2M with only hidden dimension 8. Training the network with hidden dimension 256 would require at least 32 times the memory, which we did not have available, even on CPU. Alternatively, the memory and time could be traded here, as you mentioned earlier, but training for 150 epochs with hidden dimension 8 takes almost a day; checkpointing or similar techniques could easily make the training take a week or more.

The paper claims a complexity reduction for Spexphormer layers, from $\mathcal{O}((m+n)d^2)$ to $\mathcal{O}(nd^2+ndc)$ (Eq. 31 & Eq. 77). This improvement hinges on $c$ scaling sublinearly with the average node degree. While theoretical indications for this possibility are mentioned, are there empirical results supporting it?

Spexphormer improves this asymptotic complexity if $m > nc$, or $c < m/n = \rho$, where $\rho$ is the average degree of the graph. This is in fact always the case in our setup. If we want to have the benefit of the expander graph, a regular expander graph of, say, degree $k$ is added; in all our experiments $c < k$, and thus it is guaranteed that $c < \rho$. In most cases we also have $c < \rho - k$, i.e., the condition holds even if we did not use the expander graph at all. $c/\rho$ values are as low as (0.08, 0.05) for the two layers of Proteins, and on average around 0.1 for the homophilic datasets in Table 1.
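Spelling out the step between the two bounds quoted in the question (our restatement, in the question's notation):

```latex
n d^{2} + n d c \;\le\; (m+n)\,d^{2}
\;\Longleftrightarrow\;
n d c \;\le\; m d^{2}
\;\Longleftrightarrow\;
c \;\le\; \tfrac{m}{n}\,d \;=\; \rho\, d ,
```

so the condition $c < \rho$ above is the stronger, sufficient one: it compares edge counts directly and ignores the extra factor of $d$ saved per edge by the edge-type embedding discussed later in this response.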

In addition to the lower degree, the regularity of the sampled graph avoids many complexities arising from varying node degrees, and helps with the batching process and more efficient calculations, as mentioned in Section 4.2.2.

Section 4.2.1 suggests that Spexphormer allows for memory-efficient batching due to slower subgraph growth compared to Exphormer (line 42). Has this advantage been evaluated experimentally?

When $c$ is smaller than the expander degree $k$, as usual in our work, the neighborhood expansions are strict subsets of those from Exphormer. We haven't yet done explicit experimentation on this but will try to do so during the discussion period, or else for the revised version.

The claimed complexity of $\mathcal{O}((m+n)d^2)$ for Exphormer layers (line 31) is unclear. The Exphormer paper's discussion is also missing. Specifically, why is there an $md^2$ term instead of $md$ (there are $m$ dot products between $d$-dimensional vectors)? Additionally, why does the $d^2$ term change to $d$ in the Spexphormer complexity (line 77)?

We briefly discussed this in lines 149-153, but will clarify. The $md^2$ term is because of the line E = self.E(edge_attr) in ExphormerAttention.forward, in the file Exphormer/graphgps/layer/Exphormer.py of the Exphormer repo. This edge feature mapping is a $d \times d$ multiplication, because Exphormer is used on datasets with many types of varying edge features. In Spexphormer, we only have three possible values of edge features, one for each type of edge; thus we can replace that mapping with an edge type embedding, reducing the complexity to $md$. For all the memory comparisons, we applied this fix to Exphormer as well. This is a very small change, but it drastically saves memory, as usually $m \gg n$.
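A minimal sketch of this replacement (illustrative tensor shapes; the edge-type names in the comment are hypothetical, and only the self.E(edge_attr) call is quoted from the Exphormer repository):

```python
import torch
import torch.nn as nn

m, d = 100_000, 64                     # number of attention edges, hidden dimension

# Exphormer-style: arbitrary d-dimensional edge features go through a d x d linear map,
# costing O(m * d^2) per layer (the `E = self.E(edge_attr)` line quoted above).
edge_attr = torch.randn(m, d)
edge_linear = nn.Linear(d, d)
e_feat = edge_linear(edge_attr)        # [m, d]

# Spexphormer-style, as described above: with only a handful of edge types (hypothetically
# e.g. graph edge / expander edge / self-loop), a lookup table of type embeddings suffices,
# costing O(m * d) per layer.
edge_type = torch.randint(0, 3, (m,))
edge_embedding = nn.Embedding(3, d)
e_feat = edge_embedding(edge_type)     # [m, d]
print(e_feat.shape)
```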

Typically (including Exphormer), the equation would be ...

Our attention formulation is the same as Exphormer's. We will revise this for future versions of our paper. However, we should clarify that $\mathbf{W}_O$ is actually of size $d \times d$ (not $m \times d$) in both Exphormer and the originating paper. $\mathbf{W}_O$ is applied to the concatenation of the heads, to mix the heads together. In our case, it can be combined with the feed-forward network following the attention mechanism (as is common in Transformer architectures). The reason Exphormer cannot do that is that it combines its model with an MPNN, and the feed-forward network applies to the combined representations coming from the Transformer part and the MPNN. Since we remove the MPNN part, we don't have this problem, and representationally $\mathbf{W}_O$ can be absorbed into the next layer.
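As a simplified illustration of the absorption (ignoring residual connections and normalization, writing $h = [\,h^{1}\,\Vert\cdots\Vert\,h^{H}\,]$ for the concatenation of head outputs, and assuming the usual two-layer FFN):

```latex
\mathrm{FFN}\big(\mathbf{W}_O\, h\big)
  \;=\; \mathbf{W}_2\,\sigma\!\big(\mathbf{W}_1 \mathbf{W}_O\, h + b_1\big) + b_2
  \;=\; \mathbf{W}_2\,\sigma\!\big(\widetilde{\mathbf{W}}_1\, h + b_1\big) + b_2,
  \qquad \widetilde{\mathbf{W}}_1 := \mathbf{W}_1 \mathbf{W}_O ,
```

so $\mathbf{W}_O$ need not be stored as a separate parameter.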

Comment

Dear reviewers,

We would like to sincerely thank you for your valuable reviews.

We are approaching the end of the discussion period. We have tried to address all the questions and concerns to the best of our ability, provided the requested information regarding our experiments as well as some new results, and supplemented our work with additional theoretical support. We would greatly appreciate it if you could provide some feedback and respond to our rebuttal. An earlier response from you would give us sufficient time to address any further concerns that may arise before the end of the discussion period.

Comment

We are in the last few hours of the discussion period, and have not had a reply from any reviewers. Having invested a significant amount of time responding to these reviews, including running several new experiments and otherwise spending a large amount of time thoroughly addressing concerns, we can only hope that the area chair and the reviewers will engage with our rebuttal in the coming decision-making period. Since we can no longer be involved in that phase, we would like to leave you with a summary of the merits of our work and some highlights of the rebuttal, to aid your assessment of whether we have appropriately addressed the various concerns raised in review.

Our work addresses several current challenges in the research area:

  1. Our empirical analysis of attention scores reveals a previously unnoticed consistency across models whose downstream performance is not consistent. To our knowledge, this has not been observed in previous work.

  2. We provided theoretical justification for this consistency, to our knowledge the first to analyze hidden dimension compressibility of graph transformers in a transductive setting. Initially, our proof applied to the first layer, but in the rebuttal, we provide a new theorem that extends the analysis to a version of the entire network. This is a novel theoretical result in this area that can also be applied to other GNNs.

  3. We present a task-dependent graph subsampling and batching technique, with significant advantages over most state-of-the-art methods in this area.

  4. We introduce techniques to enhance the interpretability of our attention scores and to improve subsampling and batching time by applying reservoir sampling (a generic sketch follows this list). This can pave the way for other works hindered by the slow weighted sampling methods typically used in conventional libraries.

  5. Our method excels on dense graphs by sampling a small ratio of the original edges (around 0.1 for most graphs, and potentially smaller for denser graphs not currently available as benchmarks). As the community has primarily focused on sparse graphs for large-graph analysis, most publicly available graph benchmarks are also sparse. On the other hand, real-world graphs, such as social networks, can be much denser, with average degrees in the thousands and maximum degrees in the millions. We believe our work, which extends the capability of current models to new settings, should not be punished for the lack of publicly available dense graph benchmarks.
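Regarding the reservoir sampling mentioned in item 4, a generic one-pass weighted sampler of the standard Efraimidis-Spirakis ("A-Res") form is sketched below; this is our illustration of the technique with hypothetical attention-score weights, not the authors' implementation.

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    """One-pass weighted sampling without replacement (Efraimidis-Spirakis 'A-Res'):
    each (item, weight) gets key u**(1/weight) with u ~ Uniform(0, 1); keep the k largest keys."""
    heap = []                                   # min-heap of (key, item)
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Toy usage: keep 3 of a node's candidate neighbors, weighted by attention scores.
random.seed(0)
candidates = [("a", 0.50), ("b", 0.30), ("c", 0.15), ("d", 0.04), ("e", 0.01)]
print(weighted_reservoir_sample(candidates, k=3))
```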

Some highlights from our rebuttal:

  1. We retrained our network for large graphs using less than 4GB of GPU memory, with memory usage and batch sizes reported in our rebuttal PDF. This demonstrates how memory usage can be reduced without sacrificing accuracy, a quality not shared by current batching techniques like random node subsets. While Reviewer X47Y correctly notes that some recent Transformer works have achieved this trade-off, no such implementation exists for current GNNs, especially given the challenges of highly varying node degrees and unstructured space. We emphasize that our graphs are much larger than typical NLP Transformer context lengths, making a full Transformer with masking infeasible. Even with improved implementations, our regular degree subsampled graph is a superior option and enables faster methods.

  2. Our method is much faster than already fairly fast models such as Exphormer on the datasets where we could run both models. (On Tolokers, Exphormer takes 6.7s/epoch, vs 1.1s/epoch for our model.)

  3. During the rebuttal, we provided ablation studies that show many parts of our work are essential and hugely improve the results (Table 1, rebuttal PDF).

  4. As is customary in deep learning, comparisons should be made across models with similar parameter counts. Most mentioned baselines have more parameters in a single linear map than our entire network. Nevertheless, we achieve competitive results on most datasets with our much smaller network.

  5. Figure 1 in our rebuttal PDF shows that even with an improved Exphormer or small expander degree Exphormer, our model hugely saves on memory, and this saving gets more and more critical as the graph size grows.

  6. Figure 2 of our rebuttal PDF shows that our work has an excellent time-memory trade-off (please note that the x-axis, batch size, is on a logarithmic scale while time is on a linear scale, and the slope of the memory decrease is steeper than that of the time increase, especially for the ogbn-proteins dataset).

Our rebuttal also addressed many other questions asked by the reviewers. We would kindly request that the reviewers and the area chair consider these points during the future phases of review.

Final Decision

This work introduces Spexphormer, a two-stage process designed to address the scalability issues of Graph Transformers on large datasets by reducing memory complexity. Building on prior work (Exphormer), the method first trains a narrow network on a fully augmented graph to estimate attention scores, which are then used to create sparser interaction graphs. These sparser graphs are employed to train a larger network, improving scalability without sacrificing performance. Experiments demonstrate some of Spexphormer's claims.

Weaknesses

  • The work essentially bills itself as a smaller-memory footprint Exphormer. This was seen as too incremental by some reviewers.
  • Once we add attention to the edges of the expander graph, it is obviously no longer guaranteed to be an expander. This quirk of the method should be made clear. The initial shallower GNN will benefit from the expander, but maybe not the attention module.
  • The sampling and expander will make the embeddings positional, losing equivariance (which is a very common drawback of graph transformers).
  • The experimental analysis on scalability focuses on memory reduction, with less emphasis on runtime performance. This limited scope weakened the overall evaluation.
  • The work should have talked more about the challenges of alternative efficient attention implementations like FlashAttention. It is important for the community to understand precisely why sparse attention methods are incompatible with directly applying FlashAttention.

Strengths

  • Scalability of graph models is an important topic.
  • Working in layers, where simpler models reduce the workload of more complex models, is a good direction to follow in general. Even though this approach is applied to a narrow scenario (improving the memory footprint of Exphormer), it may inspire others.
  • Well-written and structured paper, with clear explanations of methodology and experimental setups.
  • The rebuttal has improved the paper. The new results should be included in the final paper, if accepted.