Temporal Graph Neural Tangent Kernel with Graphon-Guaranteed
Abstract
Reviews and Discussion
This is a theoretical paper that studies graph neural tangent kernels in the context of temporal graphs. The authors propose a temporal GNTK model and prove rigorous error bounds. This work also links the convergence of the temporal GNTK to the graphon GNTK.
Strengths
To the best of my knowledge, this is the first work that attempts to apply the NTK theory to continuous-time dynamic graphs. Despite the issues I raise below, I enjoyed the paper, which provides interesting insights into how NTK can be used to analyze temporal graph learning. However, this approach may not be SOTA.
Weaknesses
- The theory is an incremental modification of [7] to temporal graphs, obtained by including a time index for each graph.
- It is unclear how this approach compares with traditional temporal graph learning methods that capture temporal relationships.
- The assumptions made are quite strong, and some claims are dubious.
Questions
- Equation (1) is confusing. Are node features not given as the input in your discussion of CTDG in Section 2? How do you deal with the node features in (1)?
- To obtain (6), you have not mentioned that the weights are initialized as i.i.d. standard normals. Furthermore ...
- To obtain (6) and (13), aren't you actually training a different GNN for each time?
- Your notation should be and , otherwise it misleads the reader. I.e., (possibly based on ) is trained using .
- This is not the usual way temporal GNNs are trained. It also requires the strong assumption that all are synchronized (in terms of the inference data and task) in snapshot times.
- The kernel (12) does not capture the temporal relationships between different time instances. I am not sure how this models or compares with typical temporal graph learning approaches that may use LSTMs, RNNs, or transformers to capture temporal interactions.
- In relation to the above points, why do you not consider a kernel that takes the whole temporal graph sequences as inputs? I can understand that this is more sophisticated and complex, but it seems like a more fruitful approach.
- Table 1: I fail to see why Temp-G3NTK is computationally cheaper than GraphMixer. (Also, you have not defined the corresponding quantity for GraphMixer.) For graphs with a large number of nodes, Temp-G3NTK is more expensive according to your table.
- GraphMixer is not included in Figure 1 because it is "designed to directly process continuous-time dynamic graphs, which require observing the entire length of the input temporal graph instead of processing snapshots". I don't think this claim is accurate. You can easily implement GraphMixer by restricting it to the snapshots.
Minor:
- Theorem 5.1: should be "has timestamps".
- Can you explain the meaning of the model name? What is "Graphon-Guaranteed"? It also does not sound grammatically correct.
Limitations
I did not find a discussion on this.
Thanks very much for the review! We greatly appreciate that you enjoyed our paper's insights, theoretical analysis, and experiments. We address your questions in the form of Q&A below.
Moreover, due to the length limit, we distill our answers into brief but essential responses; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.
Q1. How do you deal with the node features in Equation (1)?
The reason we omit node features in Equation (1) is that prevalent benchmarks only support time-aware edge features.
But if node features are given, our framework still works. Theoretically, Equation (1) can be extended to additionally aggregate the node features of each temporal neighbor at the corresponding timestamp.
Adding node features will not affect our conclusions or framework, as the construction of our kernel depends on the temporal node representation as a whole, rather than on its individual components, such as node / edge features. Moreover, Theorem 5.1 and Theorem 5.2 still hold if node features are given.
Q2. How the weights W in Equation (6) are initialized
For Equation (6), the weights are initialized i.i.d. from the standard normal distribution.
Q3. To obtain (6) and (13), aren't you actually training a different GNN for each time?
We want to clarify that our method does not involve training GNNs.
The formulation in Equation (6) is only a starting point, illustrated with a temporal graph neural network for ease of interpretation, since temporal graph neural networks predate the proposed temporal graph neural tangent kernel (i.e., ours).
The derivation of our kernel function is given in Sec. 4, especially Equation (12), which relies only on the covariance matrix computation: Equation (12) builds on Equation (11), which in turn builds on Equations (10), (9), (8), and finally (7).
A typo was found during the rebuttal in the last term of Eq. (11); the corrected term will be given in the updated paper.
We will strengthen the above derivation in the updated paper.
Q4. The kernel (12) does not capture the temporal relationships between different time instances. I am not sure how this models or compares with typical temporal graph learning approaches ...
Temporal dependencies are captured in Equation (1), where we construct the temporal node representation for a node at time t by aggregating information (node features, edge features, and time differences) from its previous temporal neighbors. Intuitively, this models the current state of the node's neighborhood at time t based on its neighborhood at earlier times. Since we want a single vector representation for the node, we perform vector summation over the information of all temporal neighbors (see the sketch below).
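For illustration, here is a minimal Python sketch of this style of temporal neighborhood aggregation; the function names and the concatenate-then-sum layout are assumptions for readability, not the exact form of Equation (1) in the paper.

```python
import numpy as np

def time_encode(delta_t, dim=8):
    # Fixed (non-trainable) encoding of a time difference into a vector;
    # the cosine form and frequencies here are illustrative assumptions.
    freqs = 1.0 / (10.0 ** np.arange(dim))
    return np.cos(delta_t * freqs)

def temporal_node_repr(u, t, temporal_neighbors, edge_feat, node_feat=None):
    """Aggregate u's temporal neighbors observed strictly before time t.

    temporal_neighbors: list of (v, t_uv) interactions of u with node v at time t_uv < t
    edge_feat[(u, v, t_uv)]: feature vector of that interaction
    node_feat[v]: optional node features
    """
    parts = []
    for v, t_uv in temporal_neighbors:
        msg = [edge_feat[(u, v, t_uv)], time_encode(t - t_uv)]
        if node_feat is not None:
            msg.append(node_feat[v])
        parts.append(np.concatenate(msg))
    if not parts:
        return None  # node has no interaction history yet
    # Vector summation over all temporal-neighbor messages, as described above.
    return np.sum(parts, axis=0)
```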
It is true that there is an established line of work in temporal graph learning that uses recurrent architectures and attention to learn temporal dependencies, such as TGN and TGAT, but there are also works employing simpler architectures, such as GraphMixer, which aggregates information from recent temporal neighbors and processes it with MLP-Mixer layers.
Q5. Why do you not consider a kernel that takes the whole temporal graph sequences as inputs? I can understand that this is more sophisticated ...
If we understand your question correctly, your concern is that, in Equation (13), the prediction at the current time depends only on other samples at the current time.
We agree that taking the historical behavior into account, as you suggested, is a promising approach, and this is in fact what we do in the paper. Rather than treating each past snapshot individually, recall that in Equation (1) temporal dependencies are captured by constructing the temporal node representation for a node at time t through aggregation over its previous temporal neighbors. Hence, the representation at the current time is the accumulation of the past, and from this angle the whole temporal graph sequence is observed by our method.
Q6. Table 1: I fail to see why Temp-G3NTK is computationally cheaper than GraphMixer. (Also, you have not defined the corresponding quantity for GraphMixer.) ...
First, the undefined quantity for GraphMixer is the padding length, i.e., the maximum number of recent interactions a node can sample.
The theoretical complexity comparison is not straightforward if one looks only at the exponent on the number of nodes. As we state in the paper, GraphMixer requires full neural computation such as gradient descent and backpropagation, whereas our method does not; that is why we claim our method is faster.
To verify this claim in depth, we added an empirical running-time comparison of all baselines across all datasets. The results are reported as Table 1 in the 1-page PDF in the "general rebuttal". According to these results, our method is roughly 10x-20x faster than GraphMixer, and also faster than most baselines.
We will include these new results in our paper to further support this claim.
Q7. GraphMixer is not included in Figure 1 because it is "designed to directly process continuous-time dynamic graphs, which require observing the entire length of the input temporal graph instead of processing snapshots". I don't think this claim is accurate. You can easily implement GraphMixer by restricting it to snapshots.
Technically, GraphMixer can process discrete-time dynamic graphs by treating the available snapshots as a continuous-time dynamic graph, although it was originally designed for continuous-time graphs.
Specifically, we can use the snapshot timestamp as the edge timestamp for every edge in that snapshot. We are now running GraphMixer in this setting and will add its plot during the author-reviewer discussion phase; a minimal conversion sketch is given below.
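For concreteness, a minimal sketch of this snapshot-to-edge-stream conversion is shown below; the `snapshots` data layout is an assumed structure for illustration, not GraphMixer's actual input API.

```python
def snapshots_to_ctdg(snapshots):
    """Convert a list of discrete snapshots into a timestamped edge stream.

    snapshots: list of (timestamp, edges), where edges is an iterable of
               (src, dst, edge_feature) tuples belonging to that snapshot.
    Returns a chronologically sorted list of (src, dst, timestamp, edge_feature):
    every edge inherits its snapshot's timestamp, so a CTDG model such as
    GraphMixer can consume the snapshot sequence as a continuous event stream.
    """
    events = []
    for ts, edges in snapshots:
        for src, dst, feat in edges:
            events.append((src, dst, ts, feat))
    events.sort(key=lambda e: e[2])  # order events by timestamp
    return events
```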
Thank you for the clarifications. Can you address the weakness on "It is unclear how this approach compares with traditional temporal graph learning methods that capture temporal relationships."? E.g., ROLAND, DGNN, SSGNN, EvolveGCN, etc.
Thanks for your follow-up question! We are very glad to answer this question.
Below, we first give a brief answer that highlights the important differences between our method and the methods you recommended.
Second, we give a one-by-one detailed illustration (in chronological order) of each recommended work to support the first point.
Third, we base our analysis on a recent survey’s taxonomy and point out where our method stands.
Answering this question helps expand the scope of our paper's literature review, and we will add all the answers to our paper.
1. Brief but important differences
1.1 Neural representation learning vs. Neural tangent kernel
The traditional temporal graph learning methods you recommended, like DGNN, EvolveGCN, ROLAND, and SSGNN, rely on neural network training (e.g., stacking neural layers, gradient descent, and backpropagation) to obtain neural representations to support corresponding graph downstream tasks.
Our temporal graph neural tangent kernel method does not rely on neural network structure but can achieve the expressive power of graph neural networks, as our Theorems and experiments demonstrate.
1.2. Complexity and neural architecture
This point extends the previous one.
To be specific, DGNN, EvolveGCN, ROLAND, and SSGNN belong to the category of recurrent graph neural architectures that handle temporal information (e.g., at the learnable-weight level, as in EvolveGCN, or at the hidden-representation level, as in ROLAND). This is indeed an effective direction, but it incurs heavy time complexity.
To address this problem, an emerging direction has appeared, i.e., MLP-Mixer on static graphs [1] and GraphMixer on temporal graphs [2]. In particular, GraphMixer aggregates information from recent temporal neighbors and processes it with MLP-Mixer layers. Motivated by this direction, we propose our temporal graph neural tangent kernel. We also note that, even without recurrent neural architectures, temporal information can still be preserved in our method.
To be more specific, in our method temporal dependencies are captured in Equation (1), where we construct the temporal node representation for a node at time t by aggregating information (e.g., node features, edge features, and time differences) from its previous temporal neighbors. The entire process involves no neural training and depends only on deterministic mathematical time kernel functions. In other words, this process records the current state of a node based on its neighborhood at earlier times and can be reused for future state computation. Beyond the theoretical derivation, Figure 1 visualizes our method's effectiveness.
To support our above statement, we list the detailed illustration below.
2. Detailed illustration of each recommended work in chronological order
2.1. DGNN: SIGIR 2020
- Task:
- Targets link prediction (ours is mainly for temporal graph classification and can be easily adapted to temporal node classification)
- Details in Methodology:
- There are two types of nodes: interacting nodes and influenced nodes.
- If two nodes are involved in a (directed) interaction (u, v, t), then u and v are interacting nodes, and nodes that are nearby this interaction are referred to as "influenced nodes".
- When two nodes interact at a certain time, DGNN updates their temporal representations. First, it processes each node separately by applying a recurrent architecture to its previous temporal representation, combined with a time encoding of the difference between the current time and that node's last interaction time. Then it merges the two results to obtain new representations for both nodes.
- When two nodes interact, nearby nodes ("influenced nodes") are also affected. DGNN updates the influenced nodes as well, i.e., it applies a recurrent architecture to their previous representations, combined with the two interacting nodes' representations and the time encoding of the difference between the current time and their last interaction time.
2.2. EvolveGCN: AAAI 2020
- Task:
- Mainly on link prediction and node classification
- Details in Methodology:
- Operates on graph snapshots; uses a recurrent architecture (e.g., an LSTM) to update the weights of each neural layer across time.
- At each time step, node representations are obtained by applying the current snapshot's adjacency matrix and the evolved learnable weights to the representations from the previous layer.
2.3. ROLAND: KDD 2022
- Task:
- Link prediction
- Details in Methodology:
- Different from EvolveGCN, in ROLAND the recurrent architecture is applied to the hidden representation vectors rather than to the learnable weights across timestamps. The recurrent components are simplified in ROLAND and can be based mainly on MLPs and simple GNNs.
2.4. SSGNN: AAAI 2023
- Task:
- Time Series Forecasting
- Details in Methodology:
- SSGNN uses a deep randomized recurrent neural network to encode the history of each node into high-dimensional vector embeddings, and then uses powers of the graph adjacency matrix to build informative node representations of the spatiotemporal dynamics at different scales.
- The decoder maps the node representations into the desired output, e.g., future values of the time series.
3. The position of our method in temporal graph learning
A recent temporal graph learning survey [3] reviewed many temporal graph learning methods, including EvolveGCN, SSGNN, and DGNN, in their taxonomy, as plotted in Figure 2.
In that taxonomy, to the best of our knowledge, our method belongs to the "Event-based" category, as a child of "Temporal Embedding" and "Temporal Neighborhood"; this position is close to the work TGL [4].
However, unlike TGL, which is a framework for large-scale temporal graph neural network training, our method is, to the best of our knowledge, the first temporal graph neural tangent kernel method.
Reference
[1] A Generalization of ViT/MLP-Mixer to Graphs, ICML 2023
[2] Do We Really Need Complicated Model Architectures For Temporal Networks, ICLR 2023
[3] Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities, TMLR 2023
[4] TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs, VLDB 2022
I agree that your proposed approach using NTK is attractive and has lower time complexity than some of the recurrent models out there. What I meant was what is the performance comparison? Is there a trade-off somewhere? I haven't seen any theoretical or empirical comparisons on this.
We are happy to report that the additional experiments required to run GraphMixer alongside the remaining 7 baselines are now finished.
Due to the rebuttal policy, we cannot add an external link for you to check the figure. Instead, we describe the overall pattern and provide the important data points in table form below. The new figure and the corresponding analysis will be added to our paper.
- After adding GraphMixer, the general pattern in Figure 1 is unchanged, i.e., our method still outperforms the baselines, and the gap is clear.
- To provide specific data points, we extract the key comparison, GraphMixer vs. ours, into the table below. According to the table, our method outperforms GraphMixer in most cases.
Table. Temporal graph classification accuracy at different stages of the temporal graphs, after adding GraphMixer to Figure 1 of the paper.
| Dataset | Method | 1 / 5 | 2 / 5 | 3 / 5 | 4 / 5 | 5 / 5 |
|---|---|---|---|---|---|---|
| Infectious | Temp-G3NTK (Ours) | 0.640 | 0.620 | 0.650 | 0.740 | 0.590 |
| Infectious | GraphMixer | 0.525 | 0.545 | 0.480 | 0.495 | 0.500 |
| | Temp-G3NTK (Ours) | 0.600 | 0.700 | 0.564 | 0.620 | 0.530 |
| | GraphMixer | 0.539 | 0.546 | 0.580 | 0.564 | 0.561 |
Thanks for your appreciation, and we are glad to answer your follow-up question!
Performance of complex neural architectures and simple neural architectures
According to the most recent temporal graph learning works, e.g., MLP-Mixer on graphs [1] and GraphMixer [2], there is a trend toward replacing complex neural architectures in the representation learning process with simplified architectures that maintain or even boost performance.
According to GraphMixer [2], learning a positional encoding plus simple MLPs can achieve very competitive performance. For example, on the link prediction task on the Reddit dataset, GraphMixer achieved 99.93% precision, outperforming the 99.30% of JODIE [3] (which uses a recurrent architecture) and the 99.80% of TGN [4] (also recurrent).
Motivated by this direction, we designed our temporal graph neural tangent kernel. We further simplify the computation relative to SOTA temporal graph neural networks such as GraphMixer [2], TGN [4] (recurrent architecture), and DyRep [5] (recurrent architecture), and outperform them in:
- theoretical complexity (as shown in Table 1 of the paper);
- empirical complexity (as shown in Table 1 of the 1-page PDF file in the "general rebuttal");
- theoretical effectiveness (as shown in Theorem 5.1, with the full proof in Appendix B);
- empirical effectiveness (as shown in Table 2 of the paper for graph-level tasks and Table 3 of the paper for node-level tasks).
Moreover, besides our two existing baselines that already use recurrent architectures (TGN [4] and DyRep [5]), we also added one of your recommended works, EvolveGCN, to our comparison; the result is shown below.
Table. Temporal Graph Classification Accuracy on Infectious Dataset
| Method | Accuracy |
|---|---|
| WL-Subtree | 0.600 ± 0.044 |
| Shortest Path | 0.670 ± 0.075 |
| Random Walk | 0.670 ± 0.073 |
| Graph2Vec | 0.565 ± 0.081 |
| NetLSD | 0.625 ± 0.061 |
| GL2Vec | 0.545 ± 0.051 |
| GraphMixer | 0.500 ± 0.000 |
| TGN | 0.520 ± 0.019 |
| EvolveGCN | 0.521 ± 0.093 |
| Temp-G3NTK (Ours) | 0.740 ± 0.058 |
From the above table, we can see that EvolveGCN performs similarly to another recurrent-architecture baseline, TGN, which aligns with the analyses in [1,2] that complex neural architectures can be replaced by simpler ones in the temporal graph learning community. Given this replaceability, we next discuss the possible trade-off.
Possible Trade-off
Here, we would like to share the trade-off behind this evolution in the temporal graph learning community. Simplifying the neural architecture while, most importantly, maintaining (or even boosting) performance requires researchers to devote time and effort to proposing new paradigms, just as we did in this paper by replacing neural computation with exact mathematical expressions while ensuring theoretical and empirical effectiveness.
On the other hand, our method also points to future directions for the temporal graph learning community. Even though we already achieve theoretical and empirical effectiveness and efficiency, there is still room to do better in future research. For example, in Equation (13), making a prediction for a test sample usually requires a large set of observed training samples. Even though our computation is simple and involves no neural training such as gradient descent or backpropagation, we would still like to reduce the number of required training samples for cases where training data is inadequate.
Here, we would like to share a possible solution: introducing prototype learning in a non-neural way, so that a group of training samples can be represented by a single prototype representation. Hence, the number of terms in Equation (13) can be decreased and efficiency can be greatly increased. This is an interesting topic, and we would like to explore it as a future research direction.
Reference
[1] A Generalization of ViT/MLP-Mixer to Graphs, ICML 2023
[2] Do We Really Need Complicated Model Architectures For Temporal Networks, ICLR 2023
[3] Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks, KDD 2019
[4] Temporal graph networks for deep learning on dynamic graphs, arXiv 2020
[5] Dyrep: Learning representations over dynamic graphs, ICLR 2019
Thanks for your new results, which have clarified my doubts. I am happy to increase the score.
Thanks again for your appreciation! Answering your insightful questions helps us improve the quality of the paper. We will include the above answer in our paper. Thanks very much for your time and review!
This paper proposes a graph neural tangent kernel method that extends the advantages from static graphs to temporal graphs. Theoretical analyses are conducted to prove the transferability and robustness of the model. Experiments are performed on graph-level tasks as well as node-level tasks.
Strengths
1: The motivation for the paper is strong. Temporal graphs are quite a hot topic these days, and transferring the graph tangent kernel method to temporal graphs shows clear intuition.
2: The theoretical analysis regarding the generalization bound, convergence, and time complexity is comprehensive.
3: Experiments show competitive results, which demonstrate the potential and effectiveness of the proposed method.
Weaknesses
The related work section is a bit condensed, which may decrease the persuasiveness of the paper.
The explanation regarding the simplicity and interpretation ability of the proposed method is unclear. I wonder if the authors can further clarify.
Questions
This is good work. Is it possible to use more graph neural architecture-based methods in the experiments for both levels of tasks? Also, the dataset selection could be more inclusive.
Limitations
N/A
Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's motivation, theoretical analysis, and experiments. We address your questions in the form of Q&A below.
Moreover, due to the length limit, we distill our answers into brief but essential responses; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.
W1. Expanding the related work section
On the one hand, we expand the detailed introduction to graphons, graphon neural networks, and the graphon neural tangent kernel.
For example, the following details will be added to Section C.1 of our Appendix.
Graphons. A graphon is a symmetric, bounded, and measurable function $W : [0, 1]^2 \rightarrow [0, 1]$ that acts as the limit object of dense graph sequences and defines a family of similar graphs. Specifically, given a graph sequence $\{G_n\}$ whose $i$-th graph $G_i$ has $i$ nodes, $\{G_n\}$ converges to $W$ if, for every undirected graph $\mathbf{F}$, $\lim_{n \rightarrow \infty} t(\mathbf{F}, G_n) = t(\mathbf{F}, W)$, where $t(\mathbf{F}, G_n)$ denotes the homomorphism density of $\mathbf{F}$ in $G_n$, and $t(\mathbf{F}, W)$ is defined similarly. In this way, $W$ can also be regarded as a generative model for a certain graph family.
Graphon Neural Networks (WNNs) [26]. Similar to how GNNs perform graph convolutions on a graph given its adjacency matrix $A$ and node / edge features $H$, Graphon Neural Networks (WNNs) perform graphon convolutions using $W$ and $X$, where $X: [0, 1] \rightarrow \mathbb{R}$ (node-level) or $X: [0, 1]^2 \rightarrow \mathbb{R}$ (edge-level) is the graphon signal. Intuitively, we can consider WNNs as the counterpart of GNNs that operates on graphons. As proven in [26], as the size of the graphs in the sequence $\{(G_n, H_n)\}$ generated by $W, X$ grows, i.e., $n \rightarrow \infty$, the output of GNNs operating on $\{(G_n, H_n)\}$ converges to the output of the WNN operating on $W, X$. More details can be found in Section C.1 of our Appendix.
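To illustrate the "generative model" view of graphons mentioned above, the following minimal sketch samples an n-node graph from a graphon W; the specific W used here is a hypothetical example, not one from the paper.

```python
import numpy as np

def sample_graph_from_graphon(W, n, rng=None):
    """Sample an n-node undirected graph from a graphon W: [0,1]^2 -> [0,1].

    Each node i receives a latent position x_i ~ Uniform[0,1]; edge (i, j)
    is included independently with probability W(x_i, x_j).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(size=n)                      # latent node positions
    probs = W(x[:, None], x[None, :])            # pairwise edge probabilities
    upper = rng.uniform(size=(n, n)) < probs     # Bernoulli draws
    A = np.triu(upper, k=1)                      # keep only pairs with i < j
    A = (A | A.T).astype(int)                    # symmetrize, no self-loops
    return A

# Example graphon (hypothetical): dense, with probability decaying in |u - v|.
W_example = lambda u, v: 0.8 * np.exp(-3.0 * np.abs(u - v))
A = sample_graph_from_graphon(W_example, n=200)
```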
On the other hand, we include discussion of more temporal graph representation learning methods in our paper, such as CAWN [1] and DySAT [2].
For example, the following details will be added to the Related Work section of our paper.
* CAWN obtains causal walks for each node, anonymizes the node indices on each walk, processes these walks with recurrent architectures, and determines the final temporal node representation by aggregating the walk encodings.
* DySAT first transforms the given CTDG into discrete temporal graph snapshots, then applies a Graph Attention Network (GAT) to each snapshot to obtain spatial information $\mathbf{X}(t)$, where $\mathbf{X}(t)$ corresponds to the graph snapshot at time $t$, and the spatial information for node $u$ in the $t$-th snapshot is denoted $[\mathbf{X}(t)]_u$. Then $u$'s temporal representation is obtained by applying a Transformer to its spatial information at different timestamps: $h_u(t_k), h_u(t_{k - 1}), \dots, h_u(t_0) = \text{Transformer}([\mathbf{X}(t_k)]_u, [\mathbf{X}(t_{k - 1})]_u, \dots, [\mathbf{X}(t_0)]_u)$.
Finally, we would be more than willing to discuss any related work the reviewer recommends during the upcoming author-reviewer discussion phase.
Reference:
[1] Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks, ICLR 2021
[2] DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks, WSDM 2020
W2. More explanation regarding the simplicity and interpretability of the proposed method
Regarding simplicity, as stated in the paper, our kernel method does not involve complex neural architectures or training algorithms. Moreover, all terms appearing in the method can be computed in an iterative manner, as shown in Equations (7) through (11).
Additionally, we added the empirical running time comparing the efficiency of our method and 8 baselines across 4 datasets, as shown in Table 1 of the 1-page PDF in the "general rebuttal". Overall, the empirical running time aligns with our theoretical time complexity analysis in Table 1 of the paper. That is, our method belongs to the graph kernel category, where node-wise comparison is usually inevitable, yet our time complexity is lower; compared to the neural baselines, our method is still efficient since it does not rely on complex neural training such as backpropagation.
Regarding interpretability, our temporal graph neural tangent kernel relates to the training dynamics of the temporal graph neural architecture defined in Section 3.1. Specifically, in the limit where the MLP layer widths go to infinity, graph prediction with the architecture defined in Section 3.1 is equivalent to kernel regression with our kernel, as stated in Equation (13) (see the sketch below).
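For concreteness, a minimal sketch of this kernel-regression prediction rule is given below; it assumes the Temp-G3NTK values have already been computed into Gram matrices, and the small ridge term `lam` is a numerical-stability assumption rather than part of Equation (13).

```python
import numpy as np

def kernel_regression_predict(K_train, y_train, K_test_train, lam=1e-6):
    """Kernel regression with a precomputed kernel.

    K_train:      (n_train, n_train) Gram matrix of kernel values among training graphs
    y_train:      (n_train,) training labels
    K_test_train: (n_test, n_train) kernel values between test and training graphs
    lam:          small ridge term for numerical stability (assumption, not Eq. (13))
    """
    # Solve (K + lam * I) alpha = y, then predict as a weighted sum of kernel values.
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y_train)), y_train)
    return K_test_train @ alpha
```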
Q1. This is good work. Is it possible to use more graph neural architecture-based methods in the experiments for both levels of tasks? Also, the dataset selection could be more inclusive.
As stated in Section 6.1, for graph-level datasets we cover a variety of domains: DBLP (citation graph), Facebook (social graph), Infectious (human-contact graph), and Tumblr (social graph). Moreover, in Appendix D.4, we also show how our method scales to larger datasets (Wiki, Reddit, MOOC, LastFM) in terms of numbers of nodes and edges.
Our baselines include static graph kernel methods, static GNN methods, and temporal GNN methods, while ours represents the temporal graph kernel category.
During the rebuttal, we also evaluated all baselines on a new graph dataset from https://chrsmrrs.github.io/datasets/docs/datasets/; the performance is shown in Table 2 of the 1-page PDF in the "general rebuttal". According to the new results, our method still achieves the best performance.
I would like to thank the authors for the hard work addressing my concerns and considering my opinion. I've increased my score. Good luck with the submission.
Thanks very much for your insightful review! We enjoyed answering your questions and addressing your concerns! The above answers will be added to our paper. Thanks again!
The proposed Temporal Graph Neural Tangent Kernel with Graphon-Guaranteed, or Temp-G3NTK, method is a novel temporal graph learning method with provable error bounds. It extends the simplicity and interpretability of the Graph Neural Tangent Kernel to the temporal graph setting, leading to rigorous error bounds for the temporal graph classification task. Furthermore, in the case of growing networks, the proposed method converges in the limit to the graphon NTK value. The Temp-G3NTK method achieves strong performance on both temporal graph classification and the node property prediction task.
Strengths
- Novel method with theoretical bounds: the proposed method is one of the first in temporal graph learning with provable error bounds. This is novel and opens up research for more future methods with provable bounds.
- Strong performance: Temp-G3NTK shows strong performance on the temporal graph classification task and competitive performance on the node property prediction task from TGB. Additional ablation studies are also provided in the appendix.
- Clear presentation: overall, the paper is clearly presented, including a discussion of the method's time complexity.
Weaknesses
The main concern I have is the time complexity. The complexity is quite high, thus limiting the scalability of the method to networks with millions of nodes. The node-pairwise covariance matrix would be expensive to compute.
Additionally, there is no limitation and societal impact discussion in the paper.
Questions
- It might be good to add plots or a table showing the compute time for each dataset.
- Is it possible to adapt the method to continuous-time dynamic graphs (CTDGs)?
- How would the model convergence behave if the network both grows and shrinks over time? This is quite common in real-world temporal graphs.
Limitations
Missing discussion of limitations of the current work as well as of societal impact. Note that temporal graph learning methods can lead to negative societal impact when deployed in applications such as anomaly detection, brain activity classification, and more.
Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's novelty, presentation, and model performance. We address your questions in the form of Q&A below.
Moreover, due to the length limit, we distill our answers into brief but essential responses; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.
Q1. Might be good to add plots or tables showing the compute time for each dataset
We have added the empirical running time of our method and 8 other baselines on 4 benchmark datasets. The results are reported as Table 1 in the 1-page PDF file in the "general rebuttal".
Overall, the empirical running time aligns with our theoretical time complexity analysis in Table 1 of the paper. That is, our method belongs to the graph kernel category, where node-wise comparison is usually inevitable, yet our time complexity is lower; compared to the neural baselines, our method is still efficient since it does not rely on complex neural training such as backpropagation.
Given that our method achieves the best classification accuracy, as shown in Table 2 of the paper, the corresponding running times reported in Table 1 of the uploaded 1-page PDF show that our method is (1) roughly 10x-20x faster than complex temporal graph neural network methods such as GraphMixer [6] and TGN [24]; (2) similarly efficient to simple kernel methods such as WL-Subtree [27] and Shortest Path [4] and to embedding methods such as NetLSD [30] and GL2Vec [5]; and (3) only Graph2Vec [21] runs faster than our method, but our accuracy is roughly 1.4x better.
Q2. Is it possible to adapt our method to continuous time dynamic graphs (CTDGs)?
The answer is yes.
In the paper, we choose discrete time just for a better illustration of the proposed method. The transformation between discrete time and continuous time is possible, as described in lines 63-65 of the paper.
In other words, given two CTDGs, suppose that we have already computed the Temp-G3NTK value between them at the current time. If new links occurring at a later time are added to either graph, then we update the neighborhood of each affected node and re-compute the Temp-G3NTK value for the pair at the new time (a minimal update sketch is given below).
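Here is a minimal sketch of this update loop; `temp_gntk_value` is a placeholder for the kernel computation of Equations (7)-(12), and the neighbor-list data structure is an assumption for illustration.

```python
from collections import defaultdict

def update_neighborhood(neighbors, new_events):
    """Append newly arrived interactions (u, v, t, feat) to each endpoint's history.

    neighbors: defaultdict(list) mapping node -> list of (other_node, timestamp, feat).
    """
    for u, v, t, feat in new_events:
        neighbors[u].append((v, t, feat))
        neighbors[v].append((u, t, feat))
    return neighbors

def recompute_kernel(g1_neighbors, g2_neighbors, new_events_g1, new_events_g2,
                     t_now, temp_gntk_value):
    """Re-compute the kernel value after new links arrive in either CTDG.

    temp_gntk_value(g1_neighbors, g2_neighbors, t_now) is a placeholder for the
    kernel computation of Equations (7)-(12); it is not defined here.
    """
    update_neighborhood(g1_neighbors, new_events_g1)
    update_neighborhood(g2_neighbors, new_events_g2)
    return temp_gntk_value(g1_neighbors, g2_neighbors, t_now)

# Usage sketch: g1 = defaultdict(list); g2 = defaultdict(list); feed events as they arrive.
```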
An extreme case needs more caution, namely a sparse continuous-time dynamic graph in which, for example, only one edge is updated per timestamp. This scenario makes our method run slower because many more snapshots emerge. However, multiple edge updates per timestamp can be handled seamlessly, and a single edge update per timestamp is very rare in real-world settings.
Q3. How would the model convergence be if the network both grows and shrinks over time. This is quite common in real world temporal graphs.
In brief, our method handles the case where the graph shrinks at intermediate timestamps, and our theoretical findings hold as long as the life-long pattern is growth.
In other words, for a graph that has some growing stages and some shrinking stages over its life, if its number of nodes follows an overall growing trend, then Theorem 5.2 holds.
To be more specific, let the number of nodes of the temporal graph snapshot at time $t$ be $n_t$; then Theorem 5.2 holds as long as $n_t \rightarrow \infty$ as $t \rightarrow \infty$. Intuitively, the error term in Theorem 5.2 can be decomposed and bounded by terms that depend on the difference between the graphon (with its signal function) and the induced graphon (with its induced signal) corresponding to the temporal snapshot. These terms are in turn bounded by quantities that depend on the number of nodes $n_t$, and they vanish as $n_t \rightarrow \infty$.
Moreover, according to real-world temporal graph benchmarks such as SNAP (https://snap.stanford.edu/data/index.html), especially social graphs and citation graphs, a growing pattern is usually the more common case.
Q4. Possible negative societal impact
In general, graph deep learning often involves analyzing complex networks, which can include sensitive personal information. If not managed properly, this can lead to privacy violations, especially when dealing with social networks, financial systems, or medical data. Also, if the data used to train graph models contains biases, these biases can be propagated and even amplified by the algorithms.
To the best of the authors' knowledge, we do not foresee specific negative societal impacts caused by the proposed method. However, we will add the above general discussion regarding the risks of graph deep learning to our updated paper.
I thank the authors for their detailed discussion and addressing my concerns. I will retain my score and believe this work will be of interest to the wider temporal graph learning community.
We sincerely thank you for your helpful review and your appreciation of our work. We are glad that our answers addressed your concerns. The above answers will be added to our paper. Thanks!
This paper introduces a graph neural tangent kernel for temporal graphs, thus enabling learning tasks on time-evolving graph structures. The authors generalize the concepts of the graph neural tangent kernel to incorporate time-domain information. They derive a generalization bound for the corresponding kernel predictor and show that their kernel approximates the graphon neural tangent kernel.
Strengths
The authors generalize the concepts of the graph neural tangent kernel to incorporate time-domain information. They derive a generalization bound for the corresponding kernel predictor and show that their kernel approximates the graphon neural tangent kernel.
Weaknesses
See the questions.
Questions
- Formula (1) defines the node feature. In this definition, the time instance is encoded by a function. From the construction, it seems that this function aims to turn the time difference into a vector. However, the time difference is already simple enough as a scalar. Can you explain the motivation for using this encoding instead of the scalar directly? What does it indicate?
- As the node feature is defined by formula (1), the authors suppose that there are no node features given at the beginning (from the dataset), and only edge features are accessible. Does the whole framework still work well if node features are given by the dataset? In other words, do the conclusions in this paper rely on the specific construction of the node features (i.e., formula (1))?
- Formula (12) provides the definition of the proposed "kernel", but the authors did not prove that this "kernel" satisfies the usual definition of a kernel, i.e., that it is symmetric and positive semi-definite.
- Theorems 5.1 and 5.2 assume that we indeed use the designed kernel and formula (13) to learn the task. However, there should be theoretical proofs that the proposed method in (1) and (2) indeed converges to the Gaussian process or kernel method, as in [12].
Limitations
N.A.
Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's theoretical analysis. We address your questions in the form of Q&A below.
Moreover, due to the length limit, we distill our answers into brief but essential responses; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.
Q1. Why use the time encoding function instead of the scalar time directly? What does it indicate?
Theoretically, mapping a discrete scalar to a continuous high-dimensional vector space has two unique advantages. First, the embedding vector can capture more complex information, such as non-linear relationships. Moreover, the embedding does not hurt the original scalar representation, i.e., close time scalars still have similar embedding vectors, and the embedding vectors can also capture periodic behaviors [1]. Second, the embedding vector works naturally with deep learning models, i.e., time embeddings can be used within any neural architecture's representation learning process, which is not possible with raw discrete time scalars.
In our case, the time encoding maps a scalar time value to a d-dimensional vector whose entries are fixed deterministic functions of the time value at a range of scales controlled by two positive constants, so that different time scales are distinguished across the entries of the vector.
Moreover, there are many possible choices for time encoding, but most require trainable parameters that add extra computational burden. Our encoding relies only on mathematical properties and fixed parameters, which makes it well suited to kernel-based methods.
Empirically, as shown in Table 4 of our paper, using the relative time difference directly (third row of Table 4) yields lower accuracy than using the time encoding, justifying its effectiveness.
Reference: [1] Time2Vec: Learning a Vector Representation of Time
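To make this concrete, below is a minimal sketch of one fixed, parameter-free time encoding in the spirit described above; the cosine form and the constants `alpha`, `beta` follow a GraphMixer-style encoding and are assumptions for illustration, not necessarily the exact definition in our paper.

```python
import numpy as np

def fixed_time_encoding(t, d=32, alpha=None, beta=None):
    """Map a scalar time (difference) t to a d-dimensional vector with no trainable
    parameters. The cosine form and default alpha = beta = sqrt(d) follow the
    GraphMixer-style fixed encoding (an assumption, not necessarily the paper's choice).
    """
    alpha = np.sqrt(d) if alpha is None else alpha
    beta = np.sqrt(d) if beta is None else beta
    i = np.arange(d)
    freqs = alpha ** (-i / beta)   # geometrically decaying frequencies
    return np.cos(t * freqs)

# Nearby times typically get similar encodings, while distant times diverge
# in the high-frequency entries.
print(np.linalg.norm(fixed_time_encoding(10.0) - fixed_time_encoding(10.5)))
print(np.linalg.norm(fixed_time_encoding(10.0) - fixed_time_encoding(500.0)))
```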
Q2. Does the whole framework still work well if the node features are given by the dataset?
The reason we omit node features in formula (1) is that prevalent benchmarks only support time-aware edge features.
But if node features are given, our framework still works. Theoretically, formula (1) can be extended to additionally aggregate the node features of each temporal neighbor at the corresponding timestamp.
Adding node features will not affect our conclusions or framework, as the construction of our kernel depends on the temporal node representation as a whole, rather than on its individual components, such as node / edge features. Moreover, Theorem 5.1 and Theorem 5.2 still hold if node features are given.
Q3. Proof of symmetry and positive semi-definiteness.
We have proved that our kernel is symmetric and positive semi-definite. Due to the rebuttal length limit, the detailed proof will be uploaded during the author-reviewer discussion phase via the "official comment" channel.
Q4. However, there should be theoretical proof that the proposed method (1) and (2) indeed converge to the Gaussian process or kernel method, like in [12].
In short, our proposed method naturally enjoys the convergence results of [12].
First, we provide some preliminaries.
Let $f(x; \theta)$ denote the output of a fully connected neural network, with parameters $\theta$ and $x$ as the input.
Then the Neural Tangent Kernel (NTK) between any pair of inputs $x$ and $x'$ is $\Theta(x, x') = \langle \nabla_\theta f(x; \theta), \nabla_\theta f(x'; \theta) \rangle$.
Suppose that $f$ has $L$ layers, with the $l$-th layer's width denoted $n_l$. Theorem 1 from [12] states that, in the limit $n_1, \dots, n_{L-1} \rightarrow \infty$, the NTK converges in probability to a deterministic limiting kernel.
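As a small numerical illustration of this definition (not of our temporal kernel itself), the sketch below computes the empirical NTK of a one-hidden-layer ReLU network with explicit gradients; the architecture and the $1/\sqrt{m}$ scaling are textbook choices assumed for illustration.

```python
import numpy as np

def empirical_ntk_1hidden(x1, x2, W1, w2):
    """Empirical NTK Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)> for
    f(x) = w2 . relu(W1 x) / sqrt(m), a one-hidden-layer ReLU network.
    """
    m = W1.shape[0]
    def grads(x):
        pre = W1 @ x                                     # pre-activations, shape (m,)
        act = np.maximum(pre, 0.0)                       # ReLU activations
        d_w2 = act / np.sqrt(m)                          # gradient w.r.t. w2
        d_W1 = np.outer(w2 * (pre > 0), x) / np.sqrt(m)  # gradient w.r.t. W1
        return np.concatenate([d_W1.ravel(), d_w2])
    return float(grads(x1) @ grads(x2))

rng = np.random.default_rng(0)
d, m = 5, 4096                            # input dimension, hidden width
W1 = rng.standard_normal((m, d))          # i.i.d. standard normal initialization, as in [12]
w2 = rng.standard_normal(m)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(empirical_ntk_1hidden(x, xp, W1, w2))  # approaches a deterministic limit as m grows
```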
Next, we analyze these terms in our paper's context.
If we apply Theorem 1 from [12], then the NTK (as defined in [12]) associated with a deep neural network of finitely many layers applied to the temporal node representations converges, with high probability, to a deterministic limiting scalar kernel as the width of every MLP layer goes to infinity. To be clear, the kernel value between two temporal graphs in Equation (13) is defined as the summation of this limiting scalar kernel taken over all node pairs.
Regarding the Gaussian process, based on Proposition 1 from [12], the output of the deep neural network operating on the temporal node representations defined in Section 3.1 converges to a Gaussian process whose covariance is given by the covariance terms defined in Equation (9).
To sum up, our method is constructed directly from the limiting objects given by the NTK convergence theory.
We will add the above derivation to our updated paper.
[Proof of symmetry]
We need to prove that the kernel value is unchanged when its two temporal graph arguments are swapped.
Given our proposed kernel function in Equation (12), we first write down the same quantity with the order of the two input graphs flipped.
We then show the base case: expanding the definition of the initial covariance term (Equation (7)) step by step, each intermediate expression is invariant to swapping the two arguments, so the base term is symmetric.
Next, supposing that the covariance terms at one layer are symmetric (the inductive hypothesis), the same expansion applied to the next layer's terms (Equations (8)-(11)) shows that they are symmetric as well.
Hence, by induction over the layers, every term entering Equation (12) is symmetric, and therefore so is the kernel itself.
The proof is completed.
[Proof of positive semi-definiteness]
We need to prove that the proposed kernel is positive semi-definite.
Given our proposed kernel function in Equation (12), we first show that the initial covariance term (Equation (7)) is positive semi-definite by expanding its definition step by step.
As visualized in Figure 1 of the uploaded 1-page PDF file in the "general rebuttal", the quantity that enters the subsequent nonlinearity must be constrained to its valid input interval; this constraint can be achieved by normalizing the temporal node representations.
Supposing that the covariance terms at one layer are positive semi-definite (the inductive hypothesis), the quantities derived from them at the next layer (Equations (8)-(11)) remain positive semi-definite, since each derivation step preserves positive semi-definiteness.
By induction over the layers, every term entering Equation (12) is positive semi-definite, and since a sum of positive semi-definite kernels is positive semi-definite, the proposed kernel itself is positive semi-definite.
The proof is completed.
First of all, we want to sincerely thank all reviewers and chairs for their time and reviews, and we are very glad to learn of the reviewers' appreciation for this paper.
The questions raised by the reviewers are also very actionable and helpful for improving our paper. We have uploaded the answers individually to each reviewer.
Here, we upload a 1-page PDF file containing the figures and tables requested by reviewers for additional theoretical analysis and experiments.
Thanks again!
This paper introduces the Temporal Graph Neural Tangent Kernel (Temp-G3NTK) for learning on time-evolving graph structures. The method extends the Graph Neural Tangent Kernel (GNTK) to temporal graphs, incorporating time-domain information and deriving generalization bounds. Additionally, the work establishes a connection between the convergence of the temporal GNTK and the graphon GNTK. While none of the reviewers opposed the acceptance of the paper, two expressed reservations. Their concerns centered on (1) the method being seen as an incremental extension of GNTK, and (2) doubts about the effectiveness of how temporal information is represented, despite the authors' detailed explanations.
I believe these concerns are not sufficient grounds for rejection. Most reviewers acknowledged that this is one of the first attempts to generalize GNTK to temporal graphs, and they recognized the theoretical rigor, solid results, and competitive performance of the method, which were further supported by additional experiments provided during the discussion.
Overall, I recommend this work for acceptance.