NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 5.8/10 (min 4, max 7, std 1.1); individual ratings: 7, 4, 6, 6
Confidence: 3.5, Correctness: 2.8, Contribution: 2.5, Presentation: 3.3

Supra-Laplacian Encoding for Transformer on Dynamic Graphs

Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

New spectral spatio-temporal encoding for fully connected Dynamic Graph Transformer in dynamic link prediction

Abstract

Keywords
Dynamic graphs · Link prediction · Transformer · supra-Laplacian encoding

Reviews and Discussion

Review
Rating: 7

This paper introduces a new method called Supra-Laplacian Encoding for spatio-temporal Transformers (SLATE) to deal with dynamic graph challenges. Its core approach is to enhance the graph transformer (GT) architecture by integrating spatio-temporal information more efficiently. It deploys a new technique to convert discrete-time dynamic graphs into multilayer graphs and exploit the spectral properties of their associated supra-Laplacian matrices. SLATE also implements a cross-attention mechanism to explicitly model pairwise relationships between nodes, allowing it to capture the dynamic nature of graphs more accurately. SLATE provides a powerful tool for applications ranging from social network analysis to understanding complex biological networks.

Strengths

1. SLATE applies spectral graph theory to the dynamic graph domain in a novel way.
2. The quality of this study is evident in the rigorous experimental setup and the comparison with SOTA methods; it outperforms many existing models on nine datasets.
3. The authors provide a detailed explanation of the method and the underlying theoretical concepts, and the open-source code and instructions for reproducing the results enhance clarity and accessibility.

Weaknesses

  1. The experimental results on CanParl in Table 2 are not very good.
  2. The permutation setting of SLATE may limit its ability to generalise to unseen nodes and large-scale graph data.

Questions

How does the model perform as the size of the graph increases?

Limitations

Yes

Author Response

We thank reviewer n3Tf for their meaningful and valuable comments.

Q.1) How does the model perform as the size of the graph increases?

Scaling our SLATE method to graphs with ~10,000 nodes has been one central objective of our submission. Through engineering techniques such as FlashAttention and a temporal window, we were able to achieve strong performance on the HepPh (15k nodes) and Flights (13k nodes) datasets, outperforming the baselines on both. Following reviewer n3Tf's request about applying SLATE to even larger graphs, we conducted several experiments using Performer, a method that approximates the attention computation with linear complexity in the number of nodes instead of the quadratic complexity of a classic Transformer. This allowed us to significantly reduce memory usage while maintaining stable performance, as shown in Table 2 of the global answer. SLATE/Performer still significantly outperforms state-of-the-art results on USLegis and Trade. We also successfully scaled our model to the Facebook dataset containing ~45k nodes, with a memory usage of 31 GB on our NVIDIA RTX A6000 (49 GB) card. Due to the dataset's size, we only had time to run a single trial of our model, achieving an AUC of 77.5%. This is a very promising result, which still outperforms strong baselines such as EvolveGCN and DySAT on this large-scale dataset (as per the results reported for those methods in the HTGN paper [11]).

W.1) Regarding performance on CanParl:

While SLATE's performance on the CanParl dataset is not the highest, it consistently outperforms other models on the remaining datasets, such as USLegis, Flights, Trade, and UNVote. Overall, SLATE achieves an average performance that is 13 points higher than the second-best baseline across all datasets. This demonstrates SLATE's robustness and effectiveness in capturing the dynamics of various graph structures.

[11] Menglin Yang, Min Zhou, Marcus Kalander, Zengfeng Huang, and Irwin King. Discrete-time Temporal Network Embedding via Implicit Hierarchical Learning in Hyperbolic Space. ACM SIGKDD 2021.

Review
Rating: 4

This paper proposes SLATE, a novel method for link prediction in dynamic graphs. SLATE transforms dynamic graphs into multi-layer networks and generates a unified spatio-temporal encoding by leveraging the spectral properties of the supra-Laplacian matrix. It uses a fully connected transformer architecture to capture long-range dependencies between nodes across multiple time steps. The authors introduce a cross-attention-based edge representation module for dynamic link prediction. They claim that SLATE significantly outperforms existing state-of-the-art methods on several benchmark datasets.

Strengths

  1. The idea of transforming dynamic graphs into multi-layer networks and utilizing the supra-Laplacian is innovative.

  2. Extensive experiments were conducted on various datasets and baselines.

  3. The method shows superior performance compared to state-of-the-art approaches on multiple datasets.

Weaknesses

W1. The explanation for adding temporal connections in the supra-Laplacian construction stage seems insufficient.

W2. The description of how to construct the supra-Laplacian is not comprehensive enough.

W3. The characteristics of the SLATE model are not clearly defined. For example, the necessity of each step (a)-(d) in Figure 2 lacks convincing arguments.

W4. This paper discloses all data and code only upon acceptance, which limits the ability to verify its reproducibility.

Questions

Q1. The depiction of adding temporal connections in Figure 2 is difficult to understand. The explanation for adding temporal connections seems inadequate, especially the description of AddTempConnection in Algorithm 1.

Q2. When adding virtual nodes, are multiple virtual nodes created? The explanation regarding virtual nodes appears to be insufficient.

Q3. What are the theoretical advantages of Supra-Laplacian encoding compared to existing graph positional encoding methods?

Q4. What would be the impact of using sparse attention mechanisms instead of fully connected transformers?

Limitations

This paper discloses all data and code only upon acceptance, which limits the ability to verify its reproducibility.

Author Response

We thank reviewer G1C1 for their questions and remarks; we hope the explanations below will answer their concerns.

Q1 & W1.) The explanation for adding temporal connections seems inadequate, especially the description of AddTempConnection() in Algorithm 1. The explanation for adding temporal connections in the supra-Laplacian construction stage seems insufficient.

Having temporal connections between a node and itself at previous timesteps allows us to build a connected graph, whose spectral analysis provides relevant information about the spatio-temporal structure of the dynamic graph. If this multi-layer graph is connected, a single eigenvector, namely the first one associated with a non-zero eigenvalue, already incorporates information about the dynamics of the graph (see Figure 1 left). In the simple case where there are no isolated nodes, this amounts to stacking the adjacency matrices diagonally, as illustrated in Equation 7 (Appendix A.1). In practice, we only add a temporal connection if the nodes are not isolated; that is the role of AddTempConnection() in Algorithm 1 (Appendix A.2). We checked the correctness of this algorithm.

Additionally, we have included a visualization (numbered 3) in the PDF attached to the rebuttal to better illustrate the importance of temporal connections. This visualization shows that the spectrum associated with the discrete dynamic graph does not capture spatio-temporal structure, unlike when inter-layer connections are added to model temporal dependencies between nodes.

More generally, the core of our method is to transform a discrete-time dynamic graph (DTDG) into a multilayer graph in order to exploit its rich spectral properties, as motivated in lines 42 to 45 of our paper. A dynamic graph can be seen as a multilayer graph in which each snapshot is represented by one layer. In such graphs, identical nodes are connected to each other across layers [8]: in a dynamic graph, this is represented by a connection between a node and its past.
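To make this construction concrete, here is a minimal illustrative sketch (our own code and naming, not the authors' implementation) of assembling a supra-adjacency matrix from a window of snapshot adjacency matrices, with temporal self-links added only between non-isolated occurrences of a node, as described above:

```python
import numpy as np

def build_supra_adjacency(snapshots):
    """Block-diagonal stacking of W snapshot adjacency matrices (N x N each),
    plus inter-layer links between a node and itself at the previous timestep,
    added only when the node is non-isolated in both snapshots (the role we
    understand AddTempConnection() to play)."""
    W, N = len(snapshots), snapshots[0].shape[0]
    supra = np.zeros((W * N, W * N))
    for t, A in enumerate(snapshots):
        supra[t * N:(t + 1) * N, t * N:(t + 1) * N] = A      # intra-layer (spatial) edges
    for t in range(1, W):
        active_prev = snapshots[t - 1].sum(axis=1) > 0       # non-isolated at t-1
        active_curr = snapshots[t].sum(axis=1) > 0           # non-isolated at t
        for i in np.where(active_prev & active_curr)[0]:
            u, v = (t - 1) * N + i, t * N + i                # temporal self-link
            supra[u, v] = supra[v, u] = 1.0
    return supra
```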

W2.) The description of how to construct the supra-Laplacian is not comprehensive enough.

The construction of the supra-Laplacian is detailed in Section 3.1 of the paper and illustrated in Figure 2. We give theoretical and technical precisions in Appendix A (especially Sections A.1 and A.2), and we also rely on common background knowledge on (multi-layer) graphs, e.g., the definition of a supra-Laplacian matrix [8,9].

Q2.) When adding virtual nodes, are multiple virtual nodes created?

As indicated on lines 151-152, we add as many virtual nodes as there are snapshots; if $W = 3$, then we add 3 virtual nodes. The purpose of these virtual nodes is to make each snapshot connected in case it contains several clusters. We have illustrated the need for virtual nodes in Figure 2.a (one-page pdf, global response): at time $t$, there are two clusters, so adding a virtual node connects them. We also show in Table 5 of the answer to RnC2C the experimental improvement brought by adding virtual nodes.

W3.) The characteristics of the SLATE model are not clearly defined. For example, the necessity of each step (a)-(d) in Figure 2 lacks convincing arguments.

Figure 2 is the main figure describing the details of the method. Each step is detailed in the text, with references to the relevant sections at the top of the columns of the figure (e.g., step (c) is detailed in Section 3.2). In each of those sections, we motivate the necessity of the corresponding step (e.g., in Section 3.3, we explain how we compute an edge representation from the output of the fully-connected transformer).
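As an illustration of how such an edge representation could be computed from per-timestep node representations, here is a hedged sketch (module structure, names, and pooling choices are ours, not SLATE's exact architecture):

```python
import torch
import torch.nn as nn

class EdgeCrossAttention(nn.Module):
    """Illustrative cross-attention edge readout: given the transformer outputs
    of two nodes over a window of W timesteps, each node attends over the
    other's history, and the pooled result is scored as a link logit."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_u, h_v):              # h_u, h_v: (batch, W, dim)
        z_uv, _ = self.attn(h_u, h_v, h_v)    # u attends over v's history
        z_vu, _ = self.attn(h_v, h_u, h_u)    # v attends over u's history
        edge = torch.cat([z_uv, z_vu], dim=1).mean(dim=1)   # pool over time
        return self.scorer(edge)              # link logit
```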

Q3.) What are the theoretical advantages of Supra-Laplacian encoding compared to existing graph positional encoding methods?

As detailed in lines 142-145 of the submission, the main objective of our pre-processing steps is to obtain a connected graph. This ensures that the spectral analysis of the supra-Laplacian matrix provides relevant information about the global dynamics of the graph, which can then be used as a powerful spatio-temporal encoding in graph transformers. Our supra-Laplacian encoding offers a unified approach that captures spatial and temporal information simultaneously, unlike previous graph positional encoding methods, which compute spatial and temporal encodings separately (e.g., TADDY [10] in the context of anomaly detection). As demonstrated in our ablative experiment (Table 3 of the paper), this unified spatio-temporal encoding significantly enhances performance.
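For intuition, a minimal sketch of this unified encoding idea (our own illustration, not the paper's code): eigenvectors of the normalized supra-Laplacian assign every node-timestep token a position in a single spectrum, rather than concatenating separate spatial and temporal encodings.

```python
import numpy as np

def supra_laplacian_encoding(supra_adj, k=8):
    """Spectral spatio-temporal encoding sketch: take the k eigenvectors of the
    symmetric normalized supra-Laplacian after the trivial (near-zero) mode, so
    each of the W*N node-time tokens gets one k-dimensional position."""
    deg = supra_adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    L = np.eye(len(deg)) - d_inv_sqrt[:, None] * supra_adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)
    keep = np.argsort(eigvals)[1:k + 1]     # skip the trivial constant eigenvector
    return eigvecs[:, keep]                 # shape: (W * N, k) positional encoding
```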

W4.) This paper discloses all data and code upon acceptance, which limits the ability to verify the reproducibility of this paper.

All datasets used in our study come from public sources, such as those presented in [6]. In our paper, we have also taken care to detail all the parameters we optimized (see Table 8 in the appendix). SLATE's source code will be released to the community. We are currently working on providing the Area Chair with an anonymized and easily executable version of the code to verify the performance of our model.

Q4.) What would be the impact of using sparse attention mechanisms instead of fully connected transformers?

As indicated in l-262, we have compared our method to state-of-the-art approaches using sparse attention mechanisms, either discrete models such as DySAT, or continuous ones, such as TGAT, TGN and TCL.

Our experimental results show that SLATE performs significantly better than those methods (e.g., see Tables 1 and 9 in the paper for SLATE vs DySAT, and Tables 2 and 12 for SLATE vs TGAT).

[8] Radicchi, et al. Abrupt transition in the structural formation of interconnected networks. Nature Physics 2013

[9] Cozzo et al. Multilayer networks: metrics and spectral properties. arxiv preprint (2015)

[10] Liu et al. Anomaly Detection in Dynamic Graphs via Transformer. IEEE Transactions on Knowledge and Data Engineering 2021.

Comment

Thank you for your detailed response. However, I still have questions regarding Q2:

In the figures provided, it's not clear which nodes are virtual nodes. Could you please identify specifically which node numbers in the visualizations represent virtual nodes? Additionally, while Algorithm 1 in Appendix mentions AddVirtualNode(), it doesn't provide details on how this is done. Could you explain the specific algorithm or rules for adding virtual nodes? For instance, how do you determine where to add these nodes, and how many are typically added?

Comment

Thank you for your reply to our rebuttal; we hope we have been able to clarify as many of your questions as possible.

About the figures:

Due to space constraints (1 page) and for better clarity in the figures, we did not include virtual nodes in the visualizations. A virtual node is connected to all other nodes in the snapshot, which can overcrowd the figures and make them visually complex. It is important to note that the purpose of the visualization is to highlight the importance of having a connected graph. Figure 2 demonstrates the significance of removing isolated nodes (RemoveIsolatedNodes), and Figure 3 shows the impact of adding temporal connections (AddTemporalConnections). Due to space limitations, we were unable to include a fourth figure illustrating the benefit of a virtual node in a snapshot with multiple connected components. However, we plan to add these visualizations using the toy dataset in the appendix of the final paper if it is accepted.

Clarification on virtual nodes:

Where do we add these nodes? We add one virtual node per snapshot (l. 152).

How many are typically added? If we have $W$ snapshots, we add $W$ virtual nodes.

Explanation of the AddVirtualNode() function using an example: Assume a dynamic graph $\mathcal{G} = [G_1, G_2]$ containing two snapshots, where $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. Let $V_1 = (v_1, v_2, v_3, v_4)$ and $E_1 = ((v_1, v_2), (v_3, v_4))$. Now let's add a virtual node $v_{n_1}$ to the graph $G_1$: we have $V_1' = V_1 \cup \{v_{n_1}\}$.

A virtual node is a node connected to all the other nodes in $G_1$, so we add the following edges: $E_1' = E_1 \cup \{(v_{n_1}, v_1), (v_{n_1}, v_2), (v_{n_1}, v_3), (v_{n_1}, v_4)\}$. We apply the same procedure to the next snapshot (i.e., $G_2$): we add a second virtual node $v_{n_2}$ and connect it to the nodes of $G_2$. In the end, we will have added $v_{n_1}$ to the graph $G_1$ and $v_{n_2}$ to the graph $G_2$. In the adjacency matrix, as shown in Figure 2a, this adds a new row and column with 1's in every position corresponding to a connection with an existing node. We hope this example helps to clarify your understanding of virtual nodes.
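The same example can be written as a short sketch (our own illustrative helper, mirroring what we describe AddVirtualNode() as doing for each snapshot); calling it once per snapshot adds $W$ virtual nodes for $W$ snapshots, matching the count given above.

```python
import numpy as np

def add_virtual_node(A):
    """Illustrative sketch: append one virtual node to a snapshot's N x N
    adjacency matrix and connect it to every existing node, which makes the
    snapshot connected even when it contains several components."""
    N = A.shape[0]
    A_aug = np.zeros((N + 1, N + 1))
    A_aug[:N, :N] = A
    A_aug[N, :N] = 1.0   # new row: virtual node -> all existing nodes
    A_aug[:N, N] = 1.0   # new column: all existing nodes -> virtual node
    return A_aug
```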

We hope that we have addressed all your concerns, and that you’ll be able to increase your rating. We remain at your disposal.

Comment

Thanks for answering my follow-up question. The additional explanation helped me understand your virtual nodes. I think it would be good if you could revise the paper to make this part a little more clear. I raise my score to 4.

Comment

Thank you for engaging in the discussion and raising your score. We'll be updating the final paper to make the explanations of the various steps as clear as possible. We'll also include the rebuttal visualizations and an additional visualization of the virtual nodes in the supplementary.

Review
Rating: 6

This paper proposes a spatial-temporal encoding for transformers on dynamic graphs. Specifically, graphs at each time step are treated as a single multilayer graph and packed into a larger adjacency matrix, with temporal self-connections between each node and its past. Eigenvectors of the constructed Laplacian are used as positional encoding and concatenated with node features. A standard transformer encoder layer then generates all representations for each node at each time step within a selected time window. To predict a link between two nodes, the model does cross-attention between their representations within the time window. Experimental results show that the proposed model performs better than existing methods.

Strengths

  • The positional encoding proposed by this paper aims to jointly model spatial and temporal dependencies, which is a plausible improvement over existing methods.
  • The proposed model shows strong empirical performance compared with existing approaches. In particular, the proposed positional encoding works better than simply concatenating LapPE and sin/cos encodings.

Weaknesses

  • Scalability/efficiency may still be a concern on large graphs, though the paper shows good engineering (e.g., Flash-Attention) can help.
  • Some finer-grained ablation studies are missing. For example:
    • Instead of removing isolated nodes in preprocessing, can we keep the disconnected graph and just use eigenvectors corresponding to non-zero eigenvalues?
    • The transformer itself can already get global information and I see no strong reason to use virtual nodes additionally. How would the model behave without virtual nodes? How would "virtual nodes + GNN" behave?

Questions

Please see my 2nd point in Weaknesses

Limitations

N/A

Author Response

We thank reviewer nC2C for their meaningful and valuable comments.

W1 on scalability:

Like any Transformer model, SLATE's main bottleneck in terms of scalability is the quadratic complexity of the attention matrix. However, as you pointed out, we use FlashAttention to mitigate this issue. Thanks to this efficient engineering technique, we manage to apply SLATE to dynamic graphs with more than 10k nodes on a widely available GPU (NVIDIA RTX A6000 with 49 GB of memory). We show in Table 1 of the global response, on Flights (13,169 nodes), that the memory demand of SLATE is comparable to several state-of-the-art methods for dynamic link prediction, e.g., EvolveGCN, DySAT, and ROLAND-GT with FlashAttention.

To further challenge the scalability of SLATE on larger-scale graphs and explicitly address reviewers' concerns, we propose to apply attention approximations to reduce our memory consumption. Specifically, we explore the use of the Performer method [2], which approximates the softmax calculation using Random Fourier Features, thus reducing the quadratic complexity to linear complexity. Using Performer, SLATE is able to scale to datasets like Facebook, containing over 45,000 nodes, with the same NVIDIA RTX A6000 (49 GB). We conducted preliminary experiments on this Facebook dataset using Performer's default hyperparameters and reached a 77.5% AUC score, which is already above EvolveGCN and DySAT. We expect to significantly improve these results with a more careful parametrization.
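For readers unfamiliar with this approximation, here is a small numpy sketch of the random-feature mechanism behind Performer (our own illustration under the stated assumptions, not the code used in these experiments):

```python
import numpy as np

def performer_attention(Q, K, V, m=64, seed=0):
    """Minimal sketch of the positive-random-feature idea behind Performer:
    approximate softmax attention so time and memory scale linearly with the
    number of tokens (here, node-time pairs) instead of quadratically."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    W = rng.standard_normal((d, m))               # random projection directions
    def phi(X):
        Xs = X / d ** 0.25                        # rescale so phi(q).phi(k) ~ exp(q.k / sqrt(d))
        return np.exp(Xs @ W - 0.5 * (Xs ** 2).sum(-1, keepdims=True)) / np.sqrt(m)
    Qf, Kf = phi(Q), phi(K)                       # (n, m) feature maps
    KV = Kf.T @ V                                 # (m, d): cost linear in n
    norm = Qf @ Kf.sum(axis=0)                    # (n,): row normalizers
    return (Qf @ KV) / norm[:, None]              # approx. softmax(QK^T / sqrt(d)) V
```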

We also applied Performer to several datasets used in the submission, i.e., AS733, USLegis, and Trade, and show that the performance of SLATE/Performer remains close to that of SLATE and significantly above state-of-the-art results on those datasets (see Table 2 of the global answer). We believe that such attention approximations are a very promising way of scaling SLATE to larger datasets while maintaining excellent performance. We would be glad to include these new results and this discussion in the final paper if accepted.

W2 on ablation studies: Following RnC2C's request, we provide finer-grained ablation studies regarding the supra-Laplacian computation. We would be glad to include those results in the final paper.

W2.1 on pre-processing isolated nodes:

This is an excellent question, and one that we have studied extensively. To fulfill RnC2C's request, we evaluated the spectral analysis obtained when keeping isolated nodes but only considering eigenvectors associated with non-zero eigenvalues. Qualitatively, we provide visualizations (see visualization 2 of the pdf file attached to the rebuttal) showing that the resulting spectrum is significantly more noisy than that obtained after removing isolated nodes, in particular because many eigenvectors encode a projection on the kept isolated nodes. Those projections do not capture meaningful information on the spatio-temporal structure of the dynamic graph. To consolidate the comparison, we also provide a quantitative ablation study in Table 4 below on the USLegis dataset: there is a huge drop in performance (~30 pts in AUC) when keeping isolated nodes but ignoring the zero eigenvectors ('SLATE with isolated nodes 0') compared to removing them as in SLATE. This clearly validates the importance of this pre-processing step.

Table 4: AUC Performance of SLATE Compared to SLATE with Isolated Nodes, Considering Only Eigenvectors Associated with Strictly Positive Eigenvalues.

Models | Colab | USLegis
SLATE with isolated nodes 0 | 86.73 | 66.57
SLATE | 90.84 | 95.80
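For completeness, a minimal sketch of the removal step itself (our own illustrative helper, assuming a dense N x N numpy adjacency matrix per snapshot):

```python
import numpy as np

def remove_isolated_nodes(A):
    """Sketch of a RemoveIsolatedNodes-style pre-processing step: drop the
    rows/columns of nodes with no edge in this snapshot and return the kept
    node indices so features can be realigned."""
    keep = np.where(A.sum(axis=1) > 0)[0]
    return A[np.ix_(keep, keep)], keep
```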

W2.2 on virtual nodes:

Transformers can indeed capture global information without virtual nodes (VNs). Although recent works use virtual nodes with static GNNs to obtain global information [7], the purpose and motivation in our paper are fundamentally different. We use virtual nodes as one step to ensure that our multi-layer graph is connected, which is crucial for designing a relevant spatio-temporal encoding based on the supra-Laplacian matrix and for extracting meaningful information from its spectral analysis. In this context, the eigenvector associated with the second smallest eigenvalue captures the graph's dynamics. This is particularly important when a snapshot contains multiple clusters of nodes and remains disconnected even after removing isolated nodes. Following RnC2C's recommendation, we experimentally demonstrate the importance of virtual nodes. As shown in Table 5 below, adding them yields a gain of +1.21 points in AUC on the Enron dataset.

Table 5: AP and AUC Performance of SLATE With and Without Virtual Nodes on the Enron dataset.

Models | AP | AUC
SLATE w/o VN | 93.74 | 95.18
SLATE | 95.40 | 96.39

To answer RnC2C's request on "virtual nodes + GNN", we implemented the addition of one virtual node per snapshot in a standard DTDG architecture already included in the article, namely GRUGCN, which combines a spatial GCN and a temporal RNN. However, in contrast to results reported on static graphs, we were not able to improve performance over GRUGCN in our dynamic context.

[7] Cai et al. On the Connection Between MPNN and Graph Transformer. ICML 2023

Comment

I thank the authors for their response and additional experimental results. I think they are good additions to the paper and I suggest they be included in the updated version. I'm willing to increase my rating to 6.

Comment

We would like to thank you for your reply and for the positive response to our work. We are delighted that the additional experiments and the rebuttal were able to provide you with answers. We'll be adding the scalability experiments as well as the detailed preprocessing experiments for the supra-adjacency matrix.

Review
Rating: 6

This work introduces Supra-Laplacian encoding for spatio-temporal Transformers (SLATE), which aims to learn both spatial and temporal information in a dynamic graph with a transformer architecture. The key is to convert Discrete Time Dynamic Graphs into multi-layer networks and then extract the spectral features of their supra-Laplacian matrix to improve upon existing dynamic graph transformer designs. Additionally, SLATE employs a cross-attention mechanism to accurately model nodes' pairwise relationships, improving dynamic link prediction. The proposed SLATE model performs competitively with both CTDG and DTDG methods on discrete graphs.

Strengths

  • originality: connecting DTDGs into a multi-layer graph and then computing the spectral properties of a supra-Laplacian matrix is a novel approach in the literature. The empirical performance also demonstrates that this approach can outperform existing methods with its spatio-temporal reasoning capabilities.

  • extensive evaluation: The proposed SLATE method compares favorably to both CTDG and DTDG methods on discrete datasets under the existing evaluation of testing 1 positive edge against 1 negative edge. In addition, model analysis experiments and ablation studies provide insights into the model components and choices. Additional experiments with hard negative samples are also included in the appendix.

  • clear presentation: the paper is easy to follow and the main idea is presented well

Weaknesses

  • scalability: my main concern is the scalability of the method, which the authors also point out as a limitation. Even with the time window (which truncates the history of the temporal graph), the $N^2 w^2$ complexity remains very high and is only feasible for networks with up to thousands of nodes. In addition, there is a large amount of precomputation needed for the supra-Laplacian and its eigenvectors.

  • window size: one of the core hyperparameters of SLATE is the choice of window size; the study in Figure 4 shows that there is a common optimal window size for the CanParl, Colab, and USLegis datasets. These datasets mostly contain a small number of snapshots, which might be why 4 is a good choice. In practice though, it might be difficult to tell which window size is optimal without extensive experiments to select it. It would also be interesting to see whether the length of the window is related to other factors in the architecture, such as the size of the multi-layer network, the transformer dimension, etc.

Questions

  • In the ROLAND paper, which is a close competitor to this work, the MRR metric is used for evaluation. Why is AUROC adopted in this work instead, given that it has been shown to be very limited and biased towards the single negative sample that is compared against each positive test edge?

  • there are typos in the paper, for example "loose" on line 5 should be "lose"

  • how are the CTDG methods applied on the discrete datasets? Some performance numbers look low.

Limitations

The limitations are discussed sufficiently in the paper. More discussion of negative societal impact could be included, such as downstream applications of temporal graph learning methods in anomaly detection or recommendation systems.

Author Response

We thank reviewer kt1z for their meaningful and valuable comments.

W1 on scalability: Like any Transformer model, SLATE's primary scalability bottleneck is the quadratic complexity of the attention matrix. However, as noted by reviewer nC2C, we mitigate this with FlashAttention, allowing SLATE to handle dynamic graphs with over 10k nodes. As shown in Table 1 of the global response, SLATE's memory usage on Flights is comparable to that of several state-of-the-art dynamic link prediction methods, e.g., EvolveGCN and DySAT.

To further test SLATE's scalability on larger graphs, we use Performer (applied for static graphs in GraphGPS), which approximates softmax calculations with Random Fourier Features and reduces attention's complexity from quadratic to linear. With Performer, SLATE scales to datasets like FB with over 45k nodes. In preliminary experiments using Performer's default hyperparameters, SLATE achieved a 77.5% AUC score on the FB dataset, above EvolveGCN and DySAT. We expect to improve these results with more careful parameter tuning.

We also applied Performer to several datasets used in the submission, such as AS733, Legis, and Trade. As shown in Table 2 of the global response, SLATE/Performer's performance remains close to SLATE's and significantly exceeds state-of-the-art results on these datasets. We believe attention approximation is a promising way to scale SLATE to larger datasets while maintaining excellent performance. We would be glad to include these new results and discussions in the final paper if accepted.

Regarding the scalability of the eigenvector calculation, its complexity is $\mathcal{O}(k^2 N')$ [3], with $k$ the number of kept eigenvectors and $N'$ the number of nodes in the supra-adjacency matrix after removing isolated nodes. Note that $k$ is relatively small in our experiments (around 10-15). In practice, the eigenvector calculation represents a small fraction of SLATE's computation: on the Facebook dataset, our model ran in 303 seconds per epoch ($k=6$) without pre-computing the eigenvectors, compared to 577 seconds for VGRNN.
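As a concrete illustration of why this stays cheap, a sketch using SciPy's partial eigensolver on a random sparse stand-in matrix (our own example, not the authors' pipeline):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Only the k smallest eigenpairs of the sparse supra-Laplacian are needed for
# the encoding, so a partial (Lanczos-type) solver suffices and no full
# eigendecomposition has to be pre-computed. The matrix below is a random
# symmetric stand-in, not a real supra-Laplacian.
A = sp.random(2000, 2000, density=0.002, random_state=0, format='csr')
A = ((A + A.T) > 0).astype(float)              # symmetric 0/1 supra-adjacency stand-in
deg = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(deg) - A                          # unnormalized supra-Laplacian
vals, vecs = eigsh(L, k=8, which='SM')         # 8 smallest eigenpairs only
```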

W2 on time window selection

The time window is indeed an important hyper-parameter; however, we obtained stable results using a fixed window size $W$ of 3 for all experiments. An extensive grid search across all datasets could improve performance; for example, on CanParl, a window size of 4 yields better results. To explicitly fulfill the reviewer's request regarding the impact of $W$ on datasets with a larger number of snapshots, we conducted several experiments on UNVote, which contains 72 snapshots. As shown in Table 3 below, we observed trends for SLATE similar to those in Figure 4 of the paper; the optimal window size in these runs is 3. To address the reviewer's question about the relationship between the model and the window size, we evaluated models of different nature and complexity from SLATE. We observed a similar trend among the different models, i.e., a local temporal window is preferable to keeping the full history; e.g., DySAT's performance drops by 9 points with $W = \infty$, where $\infty$ means that all snapshots are used for predictions. We would be glad to elaborate on this interesting discussion in the final paper.

Table 3: AUC Performance Based on Temporal Window Size on the UNVote Dataset

Model | # of parameters | W = 1 | W = 2 | W = 3 | W = 4 | W = 5 | W = ∞
SLATE | 2.1 M | 96.68 | 98.77 | 98.74 | 95.90 | 95.79 | 92.24
EGCN | 1.8 M | 86.96 | 86.48 | 86.74 | 87.66 | 85.26 | 86.14
DySAT | 1.8 M | 83.93 | 81.90 | 86.15 | 88.71 | 80.08 | 77.04

Q1 on MRR:

The choice of evaluation metrics is crucial and reflects the specific task we are addressing. In our work, we focus on dynamic link prediction as a classification task, while the Mean Reciprocal Rank (MRR) metric is typically used for ranking tasks. Dynamic link prediction aims to classify whether a link between nodes exists. For such tasks, metrics like AUROC are preferred because they provide a threshold-independent measure of model performance across all classification thresholds [4,5]. This is the standard evaluation of models like TGAT or TGN.
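For reference, a tiny example of this threshold-independent, classification-style evaluation with scikit-learn (toy labels and scores, not results from the paper):

```python
from sklearn.metrics import roc_auc_score

# Each positive test edge is paired with one sampled negative edge; AUROC
# summarizes how well the scores separate the two classes across all thresholds.
y_true = [1, 1, 1, 0, 0, 0]                       # hypothetical edge labels
y_score = [0.92, 0.81, 0.40, 0.35, 0.22, 0.60]    # hypothetical model scores
print(roc_auc_score(y_true, y_score))             # ~0.89 for these toy scores
```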

MRR is suitable for ranking tasks, such as user-item recommendation (bipartite graphs), when the objective is to rank nodes according to some relevance measure. Papers like JODIE employ MRR to evaluate dynamic link prediction within bipartite graphs like Wikipedia or LastFM, where ranking tasks differ from our classification focus.

Extending SLATE's evaluation to other tasks, such as link ranking, would be greatly insightful. Due to the short duration of the rebuttal period, it was challenging to obtain results for these additional tasks. For link ranking, we would need to compute $N \times (N-1)$ links per node. However, we look forward to exploring this in future work.

Q2 on typo:

Thank you for catching that typo. We have corrected it.

Q3 on CTDG performances:

In the evaluation of CTDG models, as described in [4,6], discrete datasets like CanParl and USLegis are treated as continuous datasets, with timestamps at regular intervals. For instance, in a dataset with 3 snapshots, the timestamps might be $t_1 = 100$, $t_2 = 200$, and $t_3 = 300$. CTDG models (e.g., TGAT, DyRep) use these integer values to calculate temporal embeddings, in the same way that they process continuous datasets. The performance of CTDG models on these discrete datasets is taken from [6], as mentioned in l. 243-247 of our paper, and we followed the same evaluation protocol.
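To make the conversion concrete, a toy sketch of how a discrete dataset can be fed to a CTDG model as timestamped events (our own illustration; the snapshot edge lists are hypothetical):

```python
# Every edge of snapshot s receives the timestamp 100 * (s + 1), matching the
# regular-interval example above.
snapshots_edges = [[(0, 1), (2, 3)], [(0, 2)], [(1, 3), (2, 3)]]
events = [(u, v, 100 * (s + 1))
          for s, edges in enumerate(snapshots_edges) for (u, v) in edges]
print(events)  # [(0, 1, 100), (2, 3, 100), (0, 2, 200), (1, 3, 300), (2, 3, 300)]
```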

[3] Kreuzer et al. Rethinking Graph Transformers with Spectral Attention. NeurIPS 2021

[4] Poursafaei et al. Towards Better Evaluation for Dynamic Link Prediction. NeurIPS 2022

[5] Yang et al. Evaluating Link Prediction Methods. Knowledge and IS 2015

[6] Yu et al. Towards Better Dynamic Graph Learning: New Architecture and Unified Library. Neurips 2023

Comment

I thank the authors for their detailed rebuttal. Here are comments from my end:

  • W1: scalability. I agree that using transformers with linear complexity can address the scalability challenge to a certain extent. Some larger datasets with millions of nodes, such as those seen in OGB [1] and TGB [2], might still prove to be a problem. Nonetheless, my concern here is mostly addressed.

  • W2: addressed.

  • Q1: I understand the limitation on time, and a proper evaluation of ranking links requires significant computational time. Although much prior work treats temporal link prediction as binary classification due to the low computational cost of this evaluation, it is not a realistic measure for practical applications: in any recommendation system, there is no guarantee of having half positive links and half negative links (as assumed in binary classification). I hope the authors can include ranking results in future versions of the paper.

Overall, I decide to retain my current score.

[1] Hu W, Fey M, Zitnik M, Dong Y, Ren H, Liu B, Catasta M, Leskovec J. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems. 2020;33:22118-33.

[2] Huang S, Poursafaei F, Danovitch J, Fey M, Hu W, Rossi E, Leskovec J, Bronstein M, Rabusseau G, Rabbany R. Temporal graph benchmark for machine learning on temporal graphs. Advances in Neural Information Processing Systems. 2024 Feb 13;36.

Comment

Thank you for your response to our rebuttal and for the positive opinion of our work. We are delighted to have been able to clarify certain concerns about our method. We prioritized the experiments on scaling, temporal windows, and preprocessing, as these were requested by multiple reviewers. We are pleased that these additional experiments have addressed your questions. The experiments for the ranking task are currently underway, and we plan to include them in the final paper.

Author Response

Global response to reviewers

We would like to thank all the reviewers for their excellent feedback, their relevant questions, the enthusiastic reception of our SLATE method and their encouragement.

We would like to clarify here some key points raised by the reviewers along the two main lines of concerns: i) the scalability of SLATE and ii) the construction of the multilayer graph from the discrete dynamic graph. We also individually respond to each reviewer below their reviews.

Scalability (RNc2c, Rn3Tf, Rkt1z)

Reviewers raised concerns on the scalability of our SLATE model due to the quadratic complexity of the Transformer's attention matrix computation.

Firstly, we want to highlight that, based on efficient engineering techniques, especially FlashAttention [1], we manage to apply SLATE to dynamic graphs with more than 10k nodes on a widely available GPU (NVIDIA RTX A6000 with 49 GB of memory). In this setting, the memory demand is comparable to that of several state-of-the-art methods for dynamic link prediction, e.g., EvolveGCN, DySAT, and ROLAND-GT with FlashAttention, as shown in Table 1 below.

Table 1: Memory demand for different models on the Flights dataset (13k nodes), with an NVIDIA RTX A6000 (49 GB)

Models | Memory usage | Time / epoch
EvolveGCN | 46 GB | 1828 s
DySAT | 42 GB | 1077 s
VGRNN | 21 GB | 931 s
ROLAND-GT w/o Flash | OOM | -
ROLAND-GT | 44 GB | 1152 s
SLATE w/o Flash | OOM | -
SLATE - Flash | 48 GB | 1354 s
SLATE - Performer | 17 GB | 697 s

To further challenge the scalability of SLATE on larger-scale graphs and address reviewers' concerns, we propose to apply attention approximations to reduce memory consumption. Specifically, we explore the Performer method [2], which approximates the softmax calculation using Random Fourier Features, thus reducing the quadratic complexity to linear complexity. Using Performer, SLATE scales to datasets like Facebook with over 45,000 nodes on the same NVIDIA RTX A6000 (49 GB), while all tested methods but VGRNN run out of memory on this device. Preliminary experiments on this Facebook dataset using Performer's default hyperparameters reached a 77.5% AUC score, which is already above the EvolveGCN and DySAT results reported in the HTGN paper [11]. We expect to improve these results with careful parametrization. We also applied Performer to several datasets used in the submission, i.e., AS733, USLegis, and Trade, and show that SLATE/Performer's performance remains close to that of SLATE and significantly above state-of-the-art results on those datasets (see Table 2 below). We believe such attention approximations are a promising way of scaling SLATE to larger datasets while maintaining excellent performance. We would be glad to include these results and this discussion in the final paper if accepted.

Table 2: SLATE with a Transformer encoder vs a Performer encoder

Models | AS733 | USLegis | Trade
SLATE - Transformer | 97.46 | 95.80 | 96.73
SLATE - Performer | 95.28 | 95.02 | 96.49

Clarifications on supra-Laplacian computation (RG1C1, RnC2C)

To compute the supra-Laplacian matrix from our dynamic graph, we apply 3 main pre-processing steps, as shown in Figure 2a: removing isolated nodes, adding virtual nodes, and adding temporal connections. We recall that the main objective of these pre-processing steps is to obtain a connected graph, for which the spectral analysis of the supra-Laplacian matrix contains relevant information on the global graph dynamics that can be used as a powerful spatio-temporal encoding in graph transformers (l. 142-145 in the submission).

The reviewers raised questions related to the importance of those 3 pre-processing steps.

In the specific answer to RG1C1, we provide clarifications regarding the computation for adding temporal connections. To provide additional insights on the importance of those temporal connections, we add a one-page pdf with a visualization of the spectrum on a toy dynamic graph, which complements Figure 1 of the submission. Here, we show that not including temporal connections yields a spectrum whose eigenvectors are unable to capture the spatio-temporal structure of the dynamic graph.

RnC2C requested finer-grained ablation studies to validate the importance of removing isolated nodes and adding virtual nodes.

Regarding the removal of isolated nodes, and to fulfill RnC2C's request, we evaluated the spectrum obtained when keeping isolated nodes but only considering eigenvectors associated with non-zero eigenvalues. Qualitatively, we provide new visualizations (see the pdf file attached to the rebuttal) showing that the resulting spectrum is significantly more noisy than that obtained after removing isolated nodes, in particular because many eigenvectors encode a projection on the kept isolated nodes. Those projections do not capture meaningful information on the spatio-temporal structure of the dynamic graph. We also provide a quantitative ablation study on the USLegis dataset showing a massive drop in performance (30 pts) when keeping isolated nodes but ignoring the zero eigenvectors (see the specific response to RnC2C).

Regarding virtual nodes, we clarify in the response to RnC2C that their motivation is fundamentally different from works adding them to GNNs, and that this pre-processing step is therefore not redundant with the full connectivity of transformers. To further validate the importance of virtual nodes, we provide new ablation studies showing degraded performance when they are not included.

We hope that the work done in this rebuttal helps to clarify the reviewers’ concerns. We remain at their disposal for further discussions on this work.

[1] Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arxiv preprint (2022).

[2] Choromanski et al. Rethinking Attention with Performers. ICLR 2021

Final Decision

This paper proposes SLATE (Supra-Laplacian encoding for spatio-temporal Transformers), which aims to learn both spatial and temporal information in a dynamic graph with a transformer architecture. SLATE converts discrete-time dynamic graphs into multi-layer networks and extracts the spectral features of their supra-Laplacian matrix to improve upon existing dynamic graph transformer designs. To predict a link between two nodes, the model performs cross-attention between their representations within the time window. Experimental results show SLATE performs competitively with both CTDG and DTDG methods on discrete graphs. Reviewers think the proposed approach is novel, the experimental results are extensive and promising, and the presentation is clear. The main concerns include the efficiency/scalability of the approach on large graphs, even with a truncated time window, and that the improvement over baselines in some settings seems mediocre.