PaperHub
Rating: 6.0 / 10 (Poster; 4 reviewers; min 5, max 7, std. dev. 1.0)
Individual ratings: 5, 7, 7, 5
Confidence: 3.8 | Soundness: 3.0 | Contribution: 2.8 | Presentation: 2.8
NeurIPS 2024

Temporal Graph Neural Tangent Kernel with Graphon-Guaranteed

OpenReview | PDF
Submitted: 2024-05-14 | Updated: 2024-12-19

Abstract

_Graph Neural Tangent Kernel_ (GNTK) fuses graph neural networks and graph kernels, simplifies the process of graph representation learning, interprets the training dynamics of graph neural networks, and serves various applications like protein identification, image segmentation, and social network analysis. In practice, graph data carries complex information among entities that inevitably evolves over time, and previous static graph neural tangent kernel methods may be stuck in the sub-optimal solution in terms of both effectiveness and efficiency. As a result, extending the advantage of GNTK to temporal graphs becomes a critical problem. To this end, we propose the temporal graph neural tangent kernel, which not only extends the simplicity and interpretation ability of GNTK to the temporal setting but also leads to rigorous temporal graph classification error bounds. Furthermore, we prove that when the input temporal graph grows over time in the number of nodes, our temporal graph neural tangent kernel will converge in the limit to the _graphon_ NTK value, which implies the transferability and robustness of the proposed kernel method, named **Temp**oral **G**raph **N**eural **T**angent **K**ernel with **G**raphon-**G**uaranteed or **Temp-G$^3$NTK**. In addition to the theoretical analysis, we also perform extensive experiments, not only demonstrating the superiority of Temp-G$^3$NTK in the temporal graph classification task, but also showing that Temp-G^3NTK can achieve very competitive performance in node-level tasks like node classification compared with various SOTA graph kernel and representation learning baselines. Our code is available at https://github.com/kthrn22/TempGNTK.
Keywords
Graph Kernel · Temporal Graph · Neural Tangent Kernel

Reviews and Discussion

Review (Rating: 5)

This is a theoretical paper that studies graph neural tangent kernels in the context of temporal graphs. The authors propose a temporal GNTK model and prove rigorous error bounds. This work also links the convergence of the temporal GNTK to the graphon GNTK.

Strengths

To the best of my knowledge, this is the first work that attempts to apply the NTK theory to continuous-time dynamic graphs. Despite the issues I raise below, I enjoyed the paper, which provides interesting insights into how NTK can be used to analyze temporal graph learning. However, this approach may not be SOTA.

Weaknesses

  1. The theory is an incremental modification of [7] to temporal graphs by attaching the time index $t$ to each graph.

  2. It is unclear how this approach compares with traditional temporal graph learning methods that capture temporal relationships.

  3. The assumptions made are quite strong, and some claims are dubious.

Questions

  1. Equation (1) is confusing. Are node features not given as the input in your discussion of CTDG in Section 2? How do you deal with the node features in (1)?

  2. To obtain (6), you have not mentioned that the weights $\mathbf{W}$ are initialized as i.i.d. standard normals. Furthermore

  3. To obtain (6) and (13), aren't you actually training a different GNN for each time $t$?

    • Your notation should be $f_{kernel}^{(t)}$ and $f_{temp}^{(t)}$, otherwise it misleads the reader. I.e., $f_{temp}^{(t)}$ (possibly based on $f_{temp}^{(t-1)}, \ldots, f_{temp}^{(1)}$) is trained using $\{G_i^{(t)}\}_{i=1}^n$.
    • This is not the usual way temporal GNNs are trained. It also requires the strong assumption that all $G_i$ are synchronized (in terms of the inference data and task) in snapshot times $t$.
  4. The kernel (12) does not capture the temporal relationships between different time instances. I am not sure how this models or compares with typical temporal graph learning approaches that may use LSTM, RNN or transformers to capture temporal interactions.

  5. In relation to the above points, why do you not consider the kernel that takes the whole temporal graph sequences $G$ and $G'$ as inputs? I can understand that this is more sophisticated and complex but seems like a more fruitful approach.

  6. Table 1: I fail to see why Temp-G3NTK is computationally cheaper than GraphMixer. (Also, you have not defined $K$ for GraphMixer.) For graphs with large $|V|$, Temp-G3NTK is more expensive according to your table.

  7. GraphMixer is not included in Figure 1 because it is "designed to directly process continuous-time dynamic graphs, which require observing the entire length of the input temporal graph instead of processing snapshots". I don't think this claim is accurate. You can easily implement GraphMixer by restricting it to the snapshots.

Minor:

  1. Theorem 5.1: should be $G_i$ has $T$ timestamps.

  2. Can you explain the meaning of the model name? What is "Graphon-Guaranteed"? It also does not sound grammatically correct.

Limitations

I did not find a discussion on this.

Author Response

Thanks very much for the review! We greatly appreciate that you enjoyed our paper's insights, theoretical analysis, and experiments. We address your questions in the form of Q&A below.

Moreover, due to the length limit, we distill our answers into brief but essential points; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.

Q1. How do you deal with the node features in Equation (1)?

The reason we omit node features in Equation (1) is that prevalent benchmarks only support time-aware edge features.

But if node features are given, our framework still works. Theoretically, Equation (1) can be extended as follows: $h_v(t) = c \cdot \sum_{(u, \bar{t})}[\mathbf{t}_{enc}(t - \bar{t}) \| \mathbf{e}_{uv}(t) \| \mathbf{x}_{u}(\bar{t})]$, where $\mathbf{x}_u(\bar{t})$ is the node feature of $u$ at time $\bar{t}$.

Adding node features will not affect our conclusions and framework, as the construction of our kernel depends on $h_v(t)$ as a general term, instead of its components such as node / edge features. Moreover, Theorem 5.1 and Theorem 5.2 still hold if node features are given.

Q2. How the weights $\mathbf{W}$ in Equation (6) are initialized

For Equation (6), the weights $\mathbf{W}$ are initialized i.i.d. from the standard normal distribution $\mathcal{N}(0, 1)$.

Q3. To obtain (6) and (13), aren't you actually training a different GNN for each time $t$?

We want to clarify that our method does not involve training GNNs.

The kernel function in Equation (6) is only a starting point and is illustrated with a temporal graph neural network for ease of interpretation, since temporal graph neural networks predate the proposed temporal graph neural tangent kernel (i.e., ours).

The derivation of our kernel function is given in Sec. 4, especially in Equation (12), which relies only on the covariance matrix computation in Equation (11), which in turn builds on Equations (10), (9), (8), and finally (7).

A typo was found during the rebuttal: in the last term of Equation (11), $l-1$ should be $l$.

We will strengthen the above derivation in the updated paper.

Q4. The kernel (12) does not capture the temporal relationships between different time instances. I am not sure how this models or compares with typical temporal graph learning approaches ...

Temporal dependencies are captured in Equation (1), where we construct the temporal node representation for a node $v$ at time $t$ by aggregating information (node features, edge features, and time differences) from its previous temporal neighbors $(u, \bar{t})$ with $\bar{t} < t$. Intuitively, this models the current state of $v$'s neighborhood at time $t$ based on its own neighborhood at the previous time $\bar{t}$. Since we want a vector representation for $v$, we perform vector summation over all the information from the temporal neighbors $(u, \bar{t})$.

It is true that there is an established line of work in temporal graph learning that uses recurrent architectures and attention to learn temporal dependencies, such as TGN and TGAT, but there are also works employing simpler architectures, such as GraphMixer, which aggregates information from recent temporal neighbors and processes it with MLP-Mixer layers.

Q5. Why do you not consider the kernel that takes the whole temporal graph sequence and as inputs? I can understand that this is more sophisticated ...

If we understand your question correctly, your concern is that in Equation (13) the prediction at the current time only depends on other samples at the current time.

We agree that taking the historical behavior into account as you suggested is a promising approach, and this is actually what we have done in the paper. Instead of taking each past snapshot individually, recall that in our Equation (1), temporal dependencies are captured: we construct the temporal node representation for a node $v$ at time $t$ by aggregating information (node features, edge features, and time differences) from its previous temporal neighbors $(u, \bar{t})$ with $\bar{t} < t$. Hence, the current representation is the accumulation of the past, and from this angle the whole sequence of the temporal graph is observed by our method.

Q6. Table 1: I fail to see why Temp-G3NTK is computationally cheaper than GraphMixer. (Also, you have not defined $K$ for GraphMixer.) ...

First, $K$ for GraphMixer is the padding length, i.e., the maximum number of recent interactions a node can sample.

The theoretical complexity comparison is not straightforward if one only looks at the exponent of the number of nodes. As we stated in the paper, GraphMixer needs complex neural computation like gradient descent and backpropagation, but our method does not. That is why we claim our method is faster.

To verify our claim in depth, we added an empirical running time comparison of all baselines across all datasets. The results are reported as Table 1 in the 1-page PDF in the "general rebuttal". According to the results, our method is roughly 10x-20x faster than GraphMixer, and also faster than most baselines.

We will include these new results in our paper to further verify this claim.

Q7. GraphMixer is not included in Figure 1 because it is "designed to directly process continuous-time dynamic graphs, which require observing the entire length of the input temporal graph instead of processing snapshots". I don't think this claim is accurate. You can easily implement GraphMixer by restricting it to snapshots.

Technically, GraphMixer can process discrete-time dynamic graphs by treating the past snapshots as a continuous-time dynamic graph, although it was originally designed for continuous-time graphs.

Since we can use the snapshot timestamp as the edge timestamp for each edge in that snapshot, we are now running GraphMixer and will add its plot during the author-reviewer discussion phase.

Comment

Thank you for the clarifications. Can you address the weakness on "It is unclear how this approach compares with traditional temporal graph learning methods that capture temporal relationships."? E.g., ROLAND, DGNN, SSGNN, EvolveGCN, etc.

Comment

Thanks for your follow-up question! We are very glad to answer this question.

Below, we first give a brief answer that highlights the important differences between our method and the methods you recommended.

Second, we give a detailed one-by-one illustration (in chronological order) of each recommended work to support the first point.

Third, we base our analysis on a recent survey’s taxonomy and point out where our method stands.

Answering this question helps expand the scope of our paper's literature review, and we will add all the answers to our paper.


1. Brief but important difference

1.1 Neural representation learning vs. Neural tangent kernel

The traditional temporal graph learning methods you recommended, like DGNN, EvolveGCN, ROLAND, and SSGNN, rely on neural network training (e.g., stacking neural layers, gradient descent, and backpropagation) to obtain neural representations to support corresponding graph downstream tasks.

Our temporal graph neural tangent kernel method does not rely on neural network structure but can achieve the expressive power of graph neural networks, as our Theorems and experiments demonstrate.

1.2. Complexity and neural architecture

This point extends the previous one.

To be specific, DGNN, EvolveGCN, ROLAND, and SSGNN belong to the category of recurrent graph neural architectures that handle temporal information (e.g., at the learnable-weight level like EvolveGCN or at the hidden-representation level like ROLAND). This is indeed an effective direction, but it incurs heavy time complexity.

To address this problem, an emerging direction has appeared, i.e., MLP-Mixer on static graphs [1] and GraphMixer on temporal graphs [2]. In particular, GraphMixer aggregates information from recent temporal neighbors and processes it with MLP-Mixer layers. Motivated by this direction, we propose our temporal graph neural tangent kernel. We would also like to note that, even without recurrent neural architectures, temporal information is still preserved in our method.

To be more specific, in our method, temporal dependencies are captured in Equation (1), where we construct the temporal node representation for a node $v$ at time $t$ by aggregating information (e.g., node features, edge features, and time differences) from its previous temporal neighbors $(u, \bar{t})$. The entire process does not involve neural training but depends only on mathematical time kernel functions. In other words, this process records the current state of $v$ based on its own neighborhood at the previous time $\bar{t}$ and can be retrieved for future state computation. Besides the theoretical derivation, Figure 1 in particular visualizes our method's effectiveness.

To support our above statement, we list the detailed illustration below.


2. Detailed illustration of each recommended work in chronological order

2.1. DGNN: SIGIR 2020

  • Task:

    • Targets link prediction (ours is mainly for temporal graph classification and can be easily adapted to temporal node classification)
  • Details in Methodology:

    • There are two types of nodes: interacting nodes and influenced nodes
    • If two nodes are involved in a (directed) interaction (u, v, t), then u and v are interacting nodes, and nodes near this interaction are referred to as "influenced nodes".
    • If two nodes $u$ and $v$ interact at a certain time, DGNN updates their temporal representations. First, each of $u$ and $v$ is processed separately by applying a recurrent architecture to its previous temporal representation, combined with a time encoding of the difference between the current time and that node's last interaction time. Then, the two representations are merged to obtain new representations for $u$ and $v$.
    • If two nodes interact, nearby nodes ("influenced nodes") are also affected. DGNN updates the influenced nodes as well, i.e., it applies a recurrent architecture to their previous representations, combined with the two interacting nodes' representations and the time encoding of the difference between the current time and the last interaction time.

2.2. EvolveGCN: AAAI 2020

  • Task:
    • Mainly on link prediction and node classification
  • Details in Methodology:
    • Operates on graph snapshots. Uses a recurrent architecture (e.g., LSTM) to update the weights of each neural layer across time.
    • At time $t$, node representations are obtained from the current snapshot's adjacency matrix, the learnable weights, and the representation from the previous recurrent layer.
Comment

2.3. ROLAND: KDD 2022

  • Task:
    • Link prediction
  • Details in Methodology:
    • Different from EvolveGCN, in ROLAND the recurrent architectures are applied to the hidden representation vectors across timestamps rather than to the learnable weights. The recurrent components are simplified in ROLAND and can be based mainly on MLPs and simple GNNs.

2.4. SSGNN: AAAI 2023

  • Task:
    • Time Series Forecasting
  • Details in Methodology:
    • SSGNN uses a deep randomized recurrent neural network to encode the history of each node's encodings into high-dimensional vector embeddings, and then uses powers of the graph adjacency matrix to build informative node representations of the spatiotemporal dynamics at different scales.
    • The decoder maps the node representations into the desired output, e.g., future values of the time series.

3. The position of our method in temporal graph learning

A recent temporal graph learning survey [3] reviewed many temporal graph learning methods, including EvolveGCN, SSGNN, and DGNN, in their taxonomy, as plotted in Figure 2.

In that taxonomy, to the best of our knowledge, our method belongs to the "Event-based" category and is a child node of "Temporal Embedding" and "Temporal Neighborhood"; its position is close to TGL [4].

However, different from TGL, which is a framework for large-scale temporal GNN training, our method is, to the best of our knowledge, the first temporal graph neural tangent kernel method.


Reference

[1] A Generalization of ViT/MLP-Mixer to Graphs, ICML 2023

[2] Do We Really Need Complicated Model Architectures For Temporal Networks, ICLR 2023

[3] Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities, TMLR 2023

[4] TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs, VLDB 2022

Comment

I agree that your proposed approach using NTK is attractive and has lower time complexity than some of the recurrent models out there. What I meant was what is the performance comparison? Is there a trade-off somewhere? I haven't seen any theoretical or empirical comparisons on this.

Comment

We are happy to report that the additional experiments running GraphMixer alongside the remaining 7 baselines are now finished.

Due to the rebuttal policy, we cannot add an external link to direct you to the figure. However, we describe the overall pattern and provide the important data points in table form below. The new figure and corresponding analysis will be added to our paper.

  • After including GraphMixer, the general pattern in Figure 1 does not change, i.e., our method still outperforms the baselines, and the gap is clear.

  • To provide specific data points, we extract the most important comparison, i.e., GraphMixer vs. ours, in the table below. According to the table, our method outperforms GraphMixer in most cases.


Table. Including GraphMixer in Figure 1 of the paper: temporal graph classification accuracy at different stages of the temporal graphs.

| Dataset | Method | 1/5 | 2/5 | 3/5 | 4/5 | 5/5 |
| --- | --- | --- | --- | --- | --- | --- |
| Infectious | Temp-G$^3$NTK (Ours) | 0.640 | 0.620 | 0.650 | 0.740 | 0.590 |
| Infectious | GraphMixer | 0.525 | 0.545 | 0.480 | 0.495 | 0.500 |
| Facebook | Temp-G$^3$NTK (Ours) | 0.600 | 0.700 | 0.564 | 0.620 | 0.530 |
| Facebook | GraphMixer | 0.539 | 0.546 | 0.580 | 0.564 | 0.561 |
Comment

Thanks for your appreciation, and we are glad to answer your follow-up question!

Performance of complex neural architectures and simple neural architectures

According to the most recent temporal graph learning works, e.g., MLP-Mixer on graphs [1] and GraphMixer [2], there is a trend of replacing complex neural architectures in the representation learning process with simplified architectures while maintaining or even boosting performance.

According to GraphMixer [2], learned positional encodings plus simple MLPs can achieve very competitive performance. For example, in the link prediction task on the Reddit dataset, GraphMixer achieves 99.93% precision, outperforming the 99.30% of JODIE [3] (which uses a recurrent architecture) and the 99.80% of TGN [4] (also recurrent).

Motivated by this direction, we designed our temporal graph neural tangent kernel. On the one hand, we simplify even further than SOTA temporal graph neural networks like GraphMixer [2], TGN [4] (recurrent architecture), and DyRep [5] (recurrent architecture), and outperform them, whether

  • in the theoretical complexity (as shown in Table 1 of the paper);
  • or in the empirical complexity (as shown in Table 1 of the 1-page PDF file in “general rebuttal”);
  • or in the theoretical effectiveness (as shown in Theorem 5.1 and entire Appendix B for proof);
  • or in the empirical effectiveness (as shown in Table 2 of the paper for graph-level tasks and in Table 3 of the paper for node-level tasks).

Moreover, besides our two existing baselines that already use recurrent architectures (TGN [4] and DyRep [5]), we also added your recommended work to our comparison, and the results are shown below.

Table. Temporal Graph Classification Accuracy on Infectious Dataset

| Method | Accuracy |
| --- | --- |
| WL-Subtree | 0.600 ± 0.044 |
| Shortest Path | 0.670 ± 0.075 |
| Random Walk | 0.670 ± 0.073 |
| Graph2Vec | 0.565 ± 0.081 |
| NetLSD | 0.625 ± 0.061 |
| GL2Vec | 0.545 ± 0.051 |
| GraphMixer | 0.500 ± 0.000 |
| TGN | 0.520 ± 0.019 |
| EvolveGCN | 0.521 ± 0.093 |
| Temp-G$^3$NTK (Ours) | 0.740 ± 0.058 |

From the above table, we can see that EvolveGCN performs similarly to another recurrent-architecture baseline, TGN, which aligns with the analyses [1,2] that complex neural architectures can be replaced by simpler ones in the temporal graph learning community. Given this replaceability, we next expand on the possible trade-off.

Possible Trade-off

Here, we would like to share the trade-off of this evolution trend in the temporal graph learning community. Simplifying the neural architecture while, most importantly, maintaining (or even boosting) performance requires researchers to devote time and effort to proposing new paradigms, as we did in the paper by replacing neural computation with exact mathematical expressions while ensuring theoretical and empirical effectiveness.

On the other hand, our method also points to future directions for the temporal graph learning community. Even though we have achieved theoretical and empirical effectiveness and efficiency, we recognize that there is still room to do better in future research. For example, in our Equation (13), making a prediction for test data usually requires a large number of observed training samples. Even though our computation is simple and does not involve neural computation like gradient descent and backpropagation, we still wish to decrease the number of training samples needed for cases where training data is inadequate.

Here, we would like to share a possible solution. We can incorporate prototype learning in a non-neural way, such that a group of training samples can be represented by a single prototype representation. Hence, the number of terms in Equation (13) can be decreased and efficiency can be greatly increased. This is an interesting topic, and we would like to explore it as a future research direction.


Reference

[1] A Generalization of ViT/MLP-Mixer to Graphs, ICML 2023

[2] Do We Really Need Complicated Model Architectures For Temporal Networks, ICLR 2023

[3] Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks, KDD 2019

[4] Temporal graph networks for deep learning on dynamic graphs, arXiv 2020

[5] Dyrep: Learning representations over dynamic graphs, ICLR 2019

Comment

Thanks for your new results, which have clarified my doubts. I am happy to increase the score.

Comment

Thanks again for your appreciation! Answering your insightful questions helps us improve the quality of the paper. We will include the above answer in our paper. Thanks very much for your time and review!

Review (Rating: 7)

This paper proposes a graph neural tangent kernel method that extends the advantages from static graphs to temporal graphs. Theoretical analyses are conducted to prove the transferability and robustness of the model. Experiments are performed on graph-level tasks as well as node-level tasks.

Strengths

1: The motivation for the paper is strong. Temporal graphs are quite a hot topic these days, and transferring the graph tangent kernel method to temporal graphs shows clear intuition.

2: The theoretical analysis regarding the generalization bound, convergence, and time complexity is comprehensive.

3: Experiments show competitive results, which demonstrate the potential and effectiveness of the proposed method.

Weaknesses

The related work section is a bit condensed, which may decrease the persuasiveness of the paper.

The explanation regarding the simplicity and interpretation ability of the proposed method is unclear. I wonder if the authors can further clarify.

Questions

This is good work. Is it possible to use more graph neural architecture-based methods in the experiments for both levels of tasks? Also, the dataset selection could be more inclusive.

Limitations

N/A

Author Response

Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's motivation, theoretical analysis, and experiments. We address your questions in the form of Q&A below.

Moreover, due to the length limit, we distill our answers into brief but essential points; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.

W1. Expanding the related work section

On the one hand, we expand the detailed introduction to graphons, graphon neural networks, and the graphon neural tangent kernel.

For example, the following details will be added to Section C.1 of our Appendix.

Graphons. A graphon is a symmetric, bounded, and measurable function $W : [0, 1]^2 \rightarrow [0, 1]$ that acts as the limit object of dense graph sequences and defines a family of similar graphs. Specifically, given an undirected graph $\mathbf{F}$ and a graph sequence $\{G_n\}$ whose $i$-th graph $G_i$ has $i$ nodes, $\{G_n\}$ converges to $W$ if $\lim_{n \rightarrow \infty} t(\mathbf{F}, G_n) = t(\mathbf{F}, W)$, where $t(\mathbf{F}, G_n)$ denotes the density of homomorphisms from $\mathbf{F}$ into $G_n$, and $t(\mathbf{F}, W)$ is defined similarly. In this way, $W$ can also be regarded as a generative model for a certain graph family.

Graphon Neural Networks (WNNs) [26]. Similar to how GNNs perform graph convolutions on a graph given the adjacency matrix $A$ and node / edge features $H$, Graphon Neural Networks (WNNs) perform graphon convolutions using $W$ and $X$, where $X: [0, 1] \rightarrow \mathbb{R}$ (node-level) or $X: [0, 1]^2 \rightarrow \mathbb{R}$ (edge-level) is a graphon signal. Intuitively, we can consider WNNs as the counterpart of GNNs that operates on graphons. As proven in [26], as the size of the graphs in the sequence $\{(G_n, H_n)\}$ generated by $W, X$ grows, i.e., $n \rightarrow \infty$, the output of GNNs operating on $\{(G_n, H_n)\}$ converges to the output of WNNs operating on $W, X$. More details can be found in Section C.1 of our Appendix.
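For reference, the homomorphism densities $t(\mathbf{F}, G_n)$ and $t(\mathbf{F}, W)$ in the Graphons paragraph above are standardly defined as follows (stated here under the usual graphon-theory conventions; the paper's appendix may phrase them slightly differently):

$$t(\mathbf{F}, G) = \frac{\hom(\mathbf{F}, G)}{|V(G)|^{|V(\mathbf{F})|}}, \qquad t(\mathbf{F}, W) = \int_{[0,1]^{|V(\mathbf{F})|}} \prod_{(i,j) \in E(\mathbf{F})} W(x_i, x_j) \prod_{i \in V(\mathbf{F})} \mathrm{d}x_i ,$$

where $\hom(\mathbf{F}, G)$ counts the adjacency-preserving maps from $V(\mathbf{F})$ to $V(G)$.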

On the other hand, we include discussion of more temporal graph representation learning methods in our paper, such as CAWN [1] and DySAT [2].

For example, the following details will be added to the Related Work section of our paper.

* CAWN obtains causal walks for each node, anonymizes the node indices on each walk, processes these walks with recurrent architectures, and determines the final temporal node representation by aggregating the walk encodings.

* DySAT first transforms the given CTDG into discrete temporal graph snapshots, then applies a Graph Attention Network (GAT) on each snapshot to obtain the spatial information $\mathbf{X}(t)$, where $\mathbf{X}(t)$ corresponds to the graph snapshot at time $t$, and the spatial information for node $u$ in the $t$-th graph snapshot is denoted $[\mathbf{X}(t)]_u$. Then $u$'s temporal representation is obtained by applying a Transformer on its spatial information at different timestamps: $h_u(t_k), h_u(t_{k - 1}), \dots, h_u(t_0) = \text{Transformer}([\mathbf{X}(t_k)]_u, [\mathbf{X}(t_{k - 1})]_u, \dots, [\mathbf{X}(t_0)]_u)$.

Finally, we would be more than willing to discuss any related work the reviewer recommends during the upcoming author-reviewer discussion phase.

Reference:

[1] Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks, ICLR 2021

[2] DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks, WSDM 2020

W2. More explanation regarding the simplicity and interpretation of the proposed method

Regarding simplicity, as stated in the paper, our kernel method does not involve complex neural architectures or training algorithms. Moreover, all terms appearing in the method can be computed in an iterative manner, as shown in Equations (7) through (11).

Additionally, we added empirical running times comparing the efficiency of our method and 8 baselines across 4 datasets, as shown in Table 1 of the 1-page PDF in the "general rebuttal". Overall, the empirical running time aligns with our theoretical analysis of time complexity in Table 1 of the paper. That is, our method belongs to the graph kernel category, where node-wise comparison is usually inevitable, yet our time complexity is lower; compared to the neural baselines, since our method does not rely on complex neural training like backpropagation, it is still efficient.

In brief, the interpretation of our graph neural tangent kernel relates to the training dynamics of the temporal graph neural architecture defined in Section 3.1. Specifically, in the limit where the layer width of the MLPs goes to infinity, graph prediction using the architecture defined in Section 3.1 is equivalent to prediction via kernel regression, as stated in Equation (13).
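To make the kernel-regression view concrete, below is a minimal NumPy sketch of prediction from a precomputed Temp-G$^3$NTK Gram matrix; the exact form of Equation (13) (e.g., whether a ridge term is used) is an assumption here, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def kernel_regression_predict(K_train, y_train, K_test_train, ridge=1e-6):
    # Kernel (ridge) regression: y_hat = K_test,train @ (K_train + ridge*I)^{-1} y_train.
    # K_train:      n x n Gram matrix of kernel values among training temporal graphs.
    # K_test_train: m x n matrix of kernel values between test and training graphs.
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ alpha

# Toy usage with a random positive semi-definite Gram matrix and +/-1 labels.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
print(kernel_regression_predict(K[:4, :4], y[:4], K[4:5, :4]))
```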

Q1. This is good work. Is it possible to use more graph neural architecture-based methods in experiments for two-level tasks? Also the datasets selection can be more inclusive.

As stated in Section 6.1, for graph-level datasets we cover a variety of domains: DBLP (citation graph), Facebook (social graph), Infectious (human-contact graph), and Tumblr (social graph). Moreover, in Appendix D.4 we also show how our method scales to larger datasets (Wiki, Reddit, MOOC, LastFM) in terms of the number of nodes and edges.

Our baselines include static graph kernel methods, static GNN methods, and temporal GNN methods, while ours represents the temporal graph kernel category.

During the rebuttal, we also evaluated all baselines on a new graph dataset from https://chrsmrrs.github.io/datasets/docs/datasets/; the performance is shown in Table 2 of the 1-page PDF in the "general rebuttal". According to the new results, our method still achieves the best performance.

Comment

I would like to thank the authors for the hard work addressing my concerns and considering my opinion. I've increased my score. Good luck with the submission.

Comment

Thanks very much for your insightful review! We enjoyed answering your questions and addressing your concerns! The above answers will be added to our paper. Thanks again!

Review (Rating: 7)

The proposed Temporal Graph Neural Tangent Kernel with Graphon-Guaranteed, or Temp-G$^3$NTK, method is a novel temporal graph learning method with provable error bounds. It extends the simplicity and interpretation ability of the Graph Neural Tangent Kernel to the temporal graph setting, leading to rigorous error bounds for the temporal graph classification task. Furthermore, in the case of growing networks, the proposed method converges in the limit to the graphon NTK value. The Temp-G3NTK method achieves strong performance on both temporal graph classification and the node property prediction task.

Strengths

  • novel method with theoretical bounds: the proposed method is one of the first methods in temporal graph learning with provable error bounds. This is novel and opens up research for more future methods with provable bounds.

  • strong performance. The Temp-G3NTK method shows strong performance on the temporal graph classification task and competitive performance for the node property prediction task from TGB. Additional ablation studies are also provided in the appendix.

  • clear presentation: overall the paper is clearly presented and also with discussion on the time complexity of the method.

Weaknesses

The main concern I have is the time complexity. The $O(n^2 L |V|^2 + n|E|)$ complexity is quite high, thus limiting the scalability of the method to networks with millions of nodes. The node-pairwise covariance matrix would be expensive to compute.

Additionally, there is no limitation and societal impact discussion in the paper.

Questions

  • might be good to add plots or table showing the compute time for each dataset.
  • Is it possible to adapt the method to continuous time dynamic graphs (CTDGs)?
  • How would the model convergence behave if the network both grows and shrinks over time? This is quite common in real-world temporal graphs.

Limitations

Missing discussion on limitations for current work as well as discussions on societal impact. Note that it is possible to lead to negative societal impact when temporal graph learning methods are deployed in applications such as anomaly detection, brain activity classification and more.

Author Response

Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's novelty, presentation, and model performance. We address your questions in the form of Q&A below.

Moreover, due to the length limit, we distill our answers into brief but essential points; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.

Q1. Might be good to add plots or tables showing the compute time for each dataset

We have added the empirical running time of our method and 8 other baselines on 4 benchmark datasets. The table is plotted as Table 1 in the 1-page pdf file in the “general rebuttal”.

Overall, the empirical running time aligns with our theoretical analysis of time complexity in Table 1 in the paper. That is, our method belongs to the graph kernel category, where the node-wise comparison is usually inevitable, and our time complexity is lower. Compared to the neural baselines, since our method does not rely on complex neural training like backpropagation, our method is still efficient.

Given that our method achieves the best classification accuracy (Table 2 in the paper), according to the corresponding running times reported in Table 1 of the 1-page PDF uploaded during the rebuttal, our method is (1) more than 10x-20x faster than complex temporal graph neural network methods like GraphMixer [6] and TGN [24]; (2) similarly efficient to simple kernel methods like WL-Subtree [27] and Shortest Path [4] and embedding methods like NetLSD [30] and GL2Vec [5]; and (3) only Graph2Vec [21] runs faster than our method, but our performance is roughly 1.4x better.

Q2. Is it possible to adapt our method to continuous time dynamic graphs (CTDGs)?

The answer is yes.

In the paper, we choose discrete time just for a better illustration of the proposed method. The transformation between discrete time and continuous time is possible, as described in lines 63-65 of the paper.

In other words, given two CTDGs $G, G'$, suppose that we have already computed the Temp-G$^3$NTK value between $G, G'$ at time $t$. If new links occurring at time $t' > t$ are added to $G, G'$, then we update the neighborhood of each node in $G, G'$ and re-compute the Temp-G$^3$NTK value for $(G, G')$ at time $t'$.

An extreme case needs more caution, i.e., a sparse continuous-time dynamic graph where, for example, only one edge is updated per timestamp. This scenario makes our method run more slowly because more snapshots emerge. However, multiple edges updated per timestamp can be handled seamlessly, and one edge update per timestamp is very rare in real-world settings.
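For illustration only, a minimal sketch of this incremental procedure with hypothetical data structures (per-node temporal neighbor lists) and a stand-in `temp_g3ntk` routine; this is not the authors' implementation.

```python
from collections import defaultdict

def ingest_events(neighbors, events):
    # neighbors: dict mapping node -> list of (other_node, timestamp, edge_feat).
    # events: iterable of new links (u, v, t, edge_feat) arriving after the last kernel value.
    for u, v, t, feat in events:
        neighbors[u].append((v, t, feat))
        neighbors[v].append((u, t, feat))

def kernel_at(G_nbrs, Gp_nbrs, new_G, new_Gp, t_prime, temp_g3ntk):
    # Fold the new links into each graph's temporal neighborhoods, then re-evaluate
    # the kernel value at t'; no gradient-based training is involved.
    ingest_events(G_nbrs, new_G)
    ingest_events(Gp_nbrs, new_Gp)
    return temp_g3ntk(G_nbrs, Gp_nbrs, t_prime)

# Toy usage with a stand-in kernel that just counts the stored events.
G, Gp = defaultdict(list), defaultdict(list)
count_kernel = lambda a, b, t: sum(map(len, a.values())) + sum(map(len, b.values()))
print(kernel_at(G, Gp, [(0, 1, 3.0, None)], [(2, 3, 3.5, None)], 4.0, count_kernel))  # 4
```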

Q3. How would the model convergence behave if the network both grows and shrinks over time? This is quite common in real-world temporal graphs.

In brief, our method handles the case where the graph shrinks at intermediate timestamps, and our theoretical findings hold as long as the lifelong pattern is growing.

In other words, for a graph that has some growing stages and some shrinking stages over its life, if its number of nodes follows an overall growing trend, then Theorem 5.2 holds.

To be more specific, let the number of nodes of the temporal graph snapshot at time $t$, $G^{(t)}$, be $n(t)$; then Theorem 5.2 holds if $n(t) \rightarrow \infty$ as $t \rightarrow \infty$. Intuitively, the term $\|K_W(W^{(t)}, W'^{(t)}) - K_W(W, W')\|$ can be decomposed and bounded by terms that depend on $\|W^{(t)} - W\|$, $\|W'^{(t)} - W'\|$ and $\|X^{(t)} - X\|$, $\|X'^{(t)} - X'\|$ (where $X$ is $W$'s signal function, and $X^{(t)}, W^{(t)}$ denote the induced graphon signal and induced graphon corresponding to the temporal snapshot $G^{(t)}$). Then, $\|W^{(t)} - W\|$ and $\|X^{(t)} - X\|$ are bounded by terms that depend on $1/n(t)$, so $\|W^{(t)} - W\|, \|X^{(t)} - X\| \rightarrow 0$ as $n(t) \rightarrow \infty$.

Moreover, according to real-world temporal graph benchmarks (e.g., https://snap.stanford.edu/data/index.html), especially social and citation graphs, a growing pattern is usually more common.

Q4. Possible negative societal impact

In general, graph deep learning often involves analyzing complex networks, which can include sensitive personal information. If not managed properly, this can lead to privacy violations, especially when dealing with social networks, financial systems, or medical data. Also, if the data used to train graph models contains biases, these biases can be propagated and even amplified by the algorithms.

To the best of the authors’ knowledge, we do not discern specific negative societal impacts caused by this proposed method. However, we will add the above general discussions regarding the risks of graph deep learning in our updated paper.

Comment

I thank the authors for their detailed discussion and addressing my concerns. I will retain my score and believe this work will be of interest to the wider temporal graph learning community.

Comment

We sincerely thank you for your helpful review and your appreciation of our work. We are glad that our answers addressed your concerns. The above answers will be added to our paper. Thanks!

Review (Rating: 5)

This paper introduces the graph tangent kernel for temporal graphs, thus enabling learning tasks on time-evolving graph structures. The authors generalize the concepts of the graph tangent kernel to incorporate time-domain information. They derive a generalization bound for the corresponding kernel predictor and show that their kernel approximates the graphon neural tangent kernel.

Strengths

The authors generalize the concepts of the graph tangent kernel to incorporate time-domain information. They derive a generalization bound for the corresponding kernel predictor and show that their kernel approximates the graphon neural tangent kernel.

Weaknesses

See the questions.

Questions

  1. Formula (1) defines the node feature. In this definition, the time instance $t$ is encoded by the function $t_{\text{enc}}$. From the construction it seems that $t_{\text{enc}}$ aims to turn the time difference $t - \bar{t}$ into a vector. However, $t - \bar{t}$ is already simple enough as a scalar. Can you explain the motivation for using $t_{\text{enc}}$ instead of $t - \bar{t}$ directly? What does $\cos(\Delta t \mathbf{w})$ indicate?
  2. As the node feature is defined by formula (1), the authors suppose that no node features are given at the beginning (from the dataset) and only edge features are accessible. Does the whole framework still work well if node features are given by the dataset? In other words, do the conclusions in this paper rely on the specific construction method of the node features (i.e., formula (1))?
  3. Formula (12) provides the definition of the proposed "kernel", but the authors did not prove that this "kernel" satisfies the usual definition of a kernel, i.e., that it is symmetric and positive semi-definite.
  4. Theorems 5.1 and 5.2 assume that we indeed use the designed kernel and formula (13) to learn the task. However, there should be theoretical proofs that the proposed methods (1) and (2) indeed converge to the Gaussian process or kernel method, as in [12].

Limitations

N.A.

Author Response

Thanks very much for the review! We greatly appreciate your acknowledgment of our paper's theoretical analysis. We address your questions in the form of Q&A below.

Moreover, due to the length limit, we distill our answers into brief but essential points; more detailed intermediate steps, reasoning, and derivations can be provided during the upcoming author-reviewer discussion phase. All detailed replies and supplementary experiments will be added to our updated paper.

Q1. Why use $\mathbf{t}_{enc}$ instead of $t - \bar{t}$ directly? What does $\cos(\Delta t \mathbf{w})$ indicate?

Theoretically, mapping a discrete scalar to a continuous high-dimensional vector space has two unique advantages. First, the embedding vector can capture more complex information such as non-linear relationships. Moreover, the embedding does not hurt the original scalar representation, i.e., close time scalars still have similar embedding vectors, and the embedding vectors can also capture periodic behaviors [1]. Second, the embedding vector can work jointly with deep learning models, i.e., time embedding vectors can potentially be used within any neural architecture's representation learning process, which is not possible with raw time scalars.

In our case, $\mathbf{t}_{enc}(\Delta t) = \cos(\Delta t \mathbf{w})$ maps a scalar $\Delta t$ to a $d_t$-dimensional vector whose $i$-th entry is $\cos(\Delta t \cdot \alpha^{-(i-1)/\beta})$. Given two positive real numbers $p, q > 0$ with $p > q$, as $i \rightarrow d_t$, the $i$-th entries of $\mathbf{t}_{enc}(p)$ and $\mathbf{t}_{enc}(q)$ go towards $0$. Since $p > q$, it is likely that $\mathbf{t}_{enc}(p)$ contains more $0$ entries than $\mathbf{t}_{enc}(q)$.

Moreover, there are many choices for the time encoding. However, most functions require trainable parameters that add extra computational burden, whereas our $\mathbf{t}_{enc}(\cdot)$ relies on mathematical properties and only needs fixed parameters ($\alpha, \beta$), which is suitable for kernel-based methods.

Empirically, as shown in Table 4 of our paper, using the relative difference $t - \bar{t}$ directly (3rd line of Table 4) yields lower accuracy than using $\mathbf{t}_{enc}(t - \bar{t})$, justifying the effectiveness of $\mathbf{t}_{enc}(\cdot)$.

Reference: [1] Time2Vec: Learning a Vector Representation of Time

Q2. Does the whole framework still work well if the node features are given by the dataset?

The reason we omit node features in formula (1) is that prevalent benchmarks only support time-aware edge features.

But if node features are given, our framework still works. Theoretically, formula (1) can be extended as follows: $h_v(t) = c \cdot \sum_{(u, \bar{t})}[\mathbf{t}_{enc}(t - \bar{t}) \| \mathbf{e}_{uv}(t) \| \mathbf{x}_{u}(\bar{t})]$, where $\mathbf{x}_u(\bar{t})$ is the node feature of $u$ at time $\bar{t}$.

Adding node features will not affect our conclusions and framework, as the construction of our kernel depends on $h_v(t)$ as a general term, instead of its components such as node / edge features. Moreover, Theorem 5.1 and Theorem 5.2 still hold if node features are given.
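To make the two answers above concrete, here is a minimal NumPy sketch of the fixed cosine time encoding and the extended formula (1); the parameter defaults and the data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def t_enc(delta_t, d_t=8, alpha=10.0, beta=8.0):
    # Fixed cosine time encoding: the i-th entry is cos(delta_t * alpha**(-(i-1)/beta));
    # alpha and beta are fixed (non-trainable) constants.
    i = np.arange(d_t)  # i = 0..d_t-1 plays the role of (i-1) for 1-indexed entries
    return np.cos(delta_t * alpha ** (-i / beta))

def temporal_node_repr(events, t, c=1.0, d_t=8):
    # Extended formula (1): h_v(t) = c * sum over temporal neighbors (u, t_bar) with t_bar < t
    # of the concatenation [ t_enc(t - t_bar) || e_uv(t_bar) || x_u(t_bar) ].
    parts = [np.concatenate([t_enc(t - t_bar, d_t), e_uv, x_u])
             for (t_bar, e_uv, x_u) in events if t_bar < t]
    return c * np.sum(parts, axis=0)

# Toy usage: two past interactions with 2-d edge features and 3-d node features.
events = [(0.5, np.array([1.0, 0.0]), np.array([0.2, 0.1, 0.0])),
          (1.5, np.array([0.0, 1.0]), np.array([0.3, 0.0, 0.4]))]
print(temporal_node_repr(events, t=2.0).shape)  # (8 + 2 + 3,) = (13,)
```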

Q3. Proof of symmetric and semi-definite properties.

We have proved the symmetric and semi-definite properties of our method. Due to the rebuttal length limit, the detailed proof will be uploaded during the author-reviewer discussion phase via the “official comment” channel.

Q4. However, there should be theoretical proof that the proposed method (1) and (2) indeed converge to the Gaussian process or kernel method, like in [12].

In short, our proposed method naturally enjoys the convergence in [12].

First, we provide some preliminary.

Let $f_{nn}(\mathbf{x}, \theta)$ be the output of a fully connected neural network with parameters $\theta \in \mathbb{R}^{p}$ and input $\mathbf{x} \in \mathbb{R}^d$.

Then the Neural Tangent Kernel (NTK) corresponding to $f_{nn}$, evaluated at arbitrary inputs $\mathbf{x}, \mathbf{x}'$, is $\nabla_{\theta}f_{nn}(\mathbf{x}, \theta)^\top \nabla_{\theta}f_{nn}(\mathbf{x}', \theta)$.

Suppose that $f_{nn}$ has $L$ layers and the $l$-th layer's width is denoted $d_l$. Theorem 1 of [12] states that in the limit $d_1, \dots, d_L \rightarrow \infty$, the NTK converges in probability to a deterministic limiting kernel.

Next, we analyze these terms in our paper's context.

If we apply Theorem 1 from [12], then the NTK (as defined in [12]) associated with a deep neural network consisting of $L$ layers applied to the temporal node representations of $v, v'$ converges, with high probability, to the deterministic limiting kernel $\boldsymbol{\Theta}^{(L)}(G^{(t)}, G^{(t)})_{vv'} \otimes \mathrm{Id}_{d_L}$ (where $\mathrm{Id}_{d_L}$ is the identity matrix of dimension $d_L$ and $d_L$ is the width of the $L$-layer MLP), as $d_1, \dots, d_L \rightarrow \infty$, with $\boldsymbol{\Theta}^{(L)}$ denoting a scalar kernel. To be clear, the kernel value between two graphs in Equation (13) is defined as the summation of the limiting scalar kernel $\boldsymbol{\Theta}^{(L)}$ taken over all node pairs $v, v'$.

Regarding the Gaussian process, based on Proposition 1 from [12], the output of the deep neural network consisting of $L$ layers operating on the temporal node representations defined in Section 3.1 converges to a Gaussian process with covariance $\boldsymbol{\Sigma}^{(L)}$ (as defined in Equation (9)), as $d_1, \dots, d_{L-1} \rightarrow \infty$.

To sum up, our method is constructed from the limiting objects of these NTK convergence theories.

We will add the above derivation to our updated paper.
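As a quick numerical illustration of the width-limit concentration cited from [12] (a generic one-hidden-layer ReLU network on plain vectors, not the temporal graph setting), the parameter-gradient inner product stabilizes across random initializations as the width grows:

```python
import numpy as np

def empirical_ntk(x, xp, m, rng):
    # f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x), with a_j, w_j ~ N(0, 1) i.i.d.
    # The finite-width NTK is the inner product of the parameter gradients at x and x'.
    W = rng.normal(size=(m, x.size))
    a = rng.normal(size=m)
    pre, pre_p = W @ x, W @ xp
    relu, relu_p = np.maximum(pre, 0.0), np.maximum(pre_p, 0.0)      # grads w.r.t. a_j
    ind, ind_p = (pre > 0).astype(float), (pre_p > 0).astype(float)  # relu'(pre)
    return (relu @ relu_p + (a**2 * ind * ind_p).sum() * (x @ xp)) / m

rng = np.random.default_rng(0)
x, xp = np.array([0.6, 0.8]), np.array([1.0, 0.0])
for m in [100, 10_000, 1_000_000]:
    vals = [empirical_ntk(x, xp, m, rng) for _ in range(3)]
    print(m, np.round(vals, 3))  # the spread across random draws shrinks as m grows
```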

Comment

[Symmetric proof]


We need to prove $K(G^{(t)}, G'^{(t)}) = K(G'^{(t)}, G^{(t)})$.

Given our proposed kernel function, $K(G^{(t)}, G'^{(t)}) = \sum_{v \in V(t)} \sum_{v' \in V'(t)} \boldsymbol{\Theta}^{(L)}(G^{(t)}, G'^{(t)})_{vv'}$, we first write down the same quantity with the argument order flipped, i.e., $K(G'^{(t)}, G^{(t)}) = \sum_{v' \in V'(t)} \sum_{v \in V(t)} \boldsymbol{\Theta}^{(L)}(G'^{(t)}, G^{(t)})_{v'v}$.


We first prove that $\boldsymbol{\Theta}^{(l)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(l)}(G'^{(t)}, G^{(t)})_{v'v}$ for all $l$, $1 \leq l \leq L$.

For $l = 1$, we have

$\boldsymbol{\Theta}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(0)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \dot{\boldsymbol{\Sigma}}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} + \boldsymbol{\Sigma}^{(1)}(G^{(t)}, G'^{(t)})_{vv'}$

$= (h_v(t)^\top h_{v'}(t)) \cdot \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\sqrt{1 - \boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'}^2}}{2\pi}$

$= (h_v(t)^\top h_{v'}(t)) \cdot \frac{\pi - \arccos(h_v(t)^\top h_{v'}(t))}{2\pi} + \frac{\pi - \arccos(h_v(t)^\top h_{v'}(t))}{2\pi} + \frac{\sqrt{1 - (h_v(t)^\top h_{v'}(t))^2}}{2\pi}$

$= (h_{v'}(t)^\top h_{v}(t)) \cdot \frac{\pi - \arccos(h_{v'}(t)^\top h_{v}(t))}{2\pi} + \frac{\pi - \arccos(h_{v'}(t)^\top h_{v}(t))}{2\pi} + \frac{\sqrt{1 - (h_{v'}(t)^\top h_{v}(t))^2}}{2\pi}$

$= (h_{v'}(t)^\top h_{v}(t)) \cdot \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G'^{(t)}, G^{(t)})_{v'v})}{2\pi} + \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G'^{(t)}, G^{(t)})_{v'v})}{2\pi} + \frac{\sqrt{1 - \boldsymbol{\Sigma}^{(0)}(G'^{(t)}, G^{(t)})_{v'v}^2}}{2\pi}$

$= \boldsymbol{\Theta}^{(0)}(G'^{(t)}, G^{(t)})_{v'v} \cdot \dot{\boldsymbol{\Sigma}}^{(1)}(G'^{(t)}, G^{(t)})_{v'v} + \boldsymbol{\Sigma}^{(1)}(G'^{(t)}, G^{(t)})_{v'v}$

$= \boldsymbol{\Theta}^{(1)}(G'^{(t)}, G^{(t)})_{v'v}$

Then, suppose that for some $k$, $1 \leq k \leq L$, we have $\boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(k)}(G'^{(t)}, G^{(t)})_{v'v}$.

Therefore,

$\boldsymbol{\Theta}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \dot{\boldsymbol{\Sigma}}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} + \boldsymbol{\Sigma}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'}$

$= \boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(k)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(k)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\sqrt{1 - \boldsymbol{\Sigma}^{(k)}(G^{(t)}, G'^{(t)})_{vv'}^2}}{2\pi}$

$= \boldsymbol{\Theta}^{(k)}(G'^{(t)}, G^{(t)})_{v'v} \cdot \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(k)}(G'^{(t)}, G^{(t)})_{v'v})}{2\pi} + \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(k)}(G'^{(t)}, G^{(t)})_{v'v})}{2\pi} + \frac{\sqrt{1 - \boldsymbol{\Sigma}^{(k)}(G'^{(t)}, G^{(t)})_{v'v}^2}}{2\pi}$

$= \boldsymbol{\Theta}^{(k+1)}(G'^{(t)}, G^{(t)})_{v'v}$

Hence, if $\boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(k)}(G'^{(t)}, G^{(t)})_{v'v}$, then $\boldsymbol{\Theta}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(k+1)}(G'^{(t)}, G^{(t)})_{v'v}$.

Moreover, we have proven that $\boldsymbol{\Theta}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(1)}(G'^{(t)}, G^{(t)})_{v'v}$.

Thus, by induction, we have $\boldsymbol{\Theta}^{(l)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(l)}(G'^{(t)}, G^{(t)})_{v'v}$ for all $l$, $1 \leq l \leq L$.


Finally, we have $K(G^{(t)}, G'^{(t)}) = \sum_{v \in V(t)} \sum_{v' \in V'(t)} \boldsymbol{\Theta}^{(L)}(G^{(t)}, G'^{(t)})_{vv'} = \sum_{v' \in V'(t)} \sum_{v \in V(t)} \boldsymbol{\Theta}^{(L)}(G'^{(t)}, G^{(t)})_{v'v} = K(G'^{(t)}, G^{(t)})$.

The proof is completed.
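For illustration, a minimal NumPy sketch that mirrors the layer-wise update rule used in this proof ($\boldsymbol{\Theta}^{(l)} = \boldsymbol{\Theta}^{(l-1)} \cdot \dot{\boldsymbol{\Sigma}}^{(l)} + \boldsymbol{\Sigma}^{(l)}$, summed over node pairs) and checks the symmetry numerically; the exact $\boldsymbol{\Sigma}$ update in the paper's Equations (7)-(11) may include additional aggregation or normalization steps that are omitted here, and all names are illustrative.

```python
import numpy as np

def temp_gntk_value(H, Hp, L=2):
    # H:  |V| x d matrix of temporal node representations h_v(t) for G^{(t)};
    # Hp: |V'| x d matrix for G'^{(t)}. Rows are normalized so inner products lie in [-1, 1].
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    Hp = Hp / np.linalg.norm(Hp, axis=1, keepdims=True)
    sigma = np.clip(H @ Hp.T, -1.0, 1.0)       # Sigma^{(0)}_{vv'} = h_v(t)^T h_{v'}(t)
    theta = sigma.copy()                       # Theta^{(0)} = Sigma^{(0)}
    for _ in range(L):
        sigma_dot = (np.pi - np.arccos(sigma)) / (2 * np.pi)
        sigma = sigma_dot + np.sqrt(1.0 - sigma**2) / (2 * np.pi)
        theta = theta * sigma_dot + sigma      # Theta^{(l)} = Theta^{(l-1)} * dot-Sigma^{(l)} + Sigma^{(l)}
    return theta.sum()                         # K(G^{(t)}, G'^{(t)}): sum over all node pairs (v, v')

rng = np.random.default_rng(1)
H, Hp = rng.normal(size=(4, 6)), rng.normal(size=(3, 6))
print(np.isclose(temp_gntk_value(H, Hp), temp_gntk_value(Hp, H)))  # True: K(G, G') = K(G', G)
```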

Comment

[Semi-definite proof]


We need to prove $K(G^{(t)}, G'^{(t)}) \geq 0$.

Given our proposed kernel function $K(G^{(t)}, G'^{(t)}) = \sum_{v \in V(t)} \sum_{v' \in V'(t)} \boldsymbol{\Theta}^{(L)}(G^{(t)}, G'^{(t)})_{vv'}$, we first prove that $\boldsymbol{\Theta}^{(l)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$ for all $l$, $1 \leq l \leq L$.


For l=1l = 1,

$\boldsymbol{\Theta}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(0)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \dot{\boldsymbol{\Sigma}}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} + \boldsymbol{\Sigma}^{(1)}(G^{(t)}, G'^{(t)})_{vv'}$

$= (h_v(t)^\top h_{v'}(t)) \cdot \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\pi - \arccos(\boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'})}{2\pi} + \frac{\sqrt{1 - \boldsymbol{\Sigma}^{(0)}(G^{(t)}, G'^{(t)})_{vv'}^2}}{2\pi}$

$= (h_v(t)^\top h_{v'}(t)) \cdot \frac{\pi - \arccos(h_v(t)^\top h_{v'}(t))}{2\pi} + \frac{\pi - \arccos(h_v(t)^\top h_{v'}(t))}{2\pi} + \frac{\sqrt{1 - (h_v(t)^\top h_{v'}(t))^2}}{2\pi}$

$= x \cdot \frac{\pi - \arccos(x)}{2\pi} + \frac{\pi - \arccos(x)}{2\pi} + \frac{\sqrt{1 - x^2}}{2\pi} \quad (\text{let } x = h_v(t)^\top h_{v'}(t))$

As visualized in Figure 1 of the uploaded 1-page PDF file in the "general rebuttal", we know that $x \cdot \frac{\pi - \arccos(x)}{2\pi} + \frac{\pi - \arccos(x)}{2\pi} + \frac{\sqrt{1 - x^2}}{2\pi} \geq 0$ for all $x \in [-1, 1]$.

Since $x$ is the input of the $\arccos(\cdot)$ function, it must be constrained to the interval $[-1, 1]$. We can achieve the constraint $x \in [-1, 1]$ by normalizing $h_v(t), h_{v'}(t)$.

Therefore, $\boldsymbol{\Theta}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(0)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \dot{\boldsymbol{\Sigma}}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} + \boldsymbol{\Sigma}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$ for all $v, v'$.


Suppose that for some $k$, $1 \leq k \leq L$, we have $\boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$.

Then, we can derive the three following equations: $\boldsymbol{\Theta}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} = \boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} \cdot \dot{\boldsymbol{\Sigma}}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} + \boldsymbol{\Sigma}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'}$,

$\dot{\boldsymbol{\Sigma}}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} = \mathbb{E}_{(a, b) \sim \mathcal{N}(0, \boldsymbol{\Lambda}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'})} [\dot{\sigma}(a) \cdot \dot{\sigma}(b)] \geq 0$, as $\dot{\sigma}(x) \geq 0$ for all $x$,

and $\boldsymbol{\Sigma}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} = \mathbb{E}_{(a, b) \sim \mathcal{N}(0, \boldsymbol{\Lambda}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'})} [\sigma(a) \cdot \sigma(b)] \geq 0$, as $\sigma(x) \geq 0$ for all $x$.

Moreover, as $\boldsymbol{\Theta}^{(k)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$, we now have $\boldsymbol{\Theta}^{(k+1)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$.

We also have $\boldsymbol{\Theta}^{(1)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$, so by induction, $\boldsymbol{\Theta}^{(l)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$ for all $l$, $1 \leq l \leq L$.


Finally, we have $K(G^{(t)}, G'^{(t)}) = \sum_{v \in V(t)} \sum_{v' \in V'(t)} \boldsymbol{\Theta}^{(L)}(G^{(t)}, G'^{(t)})_{vv'} \geq 0$.

The proof is completed.
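Since the figure referenced above cannot be reproduced in this text, a short numerical check of the base-case inequality (a sanity check only, not part of the proof):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100_001)
f = (x * (np.pi - np.arccos(x)) + (np.pi - np.arccos(x)) + np.sqrt(np.clip(1 - x**2, 0, None))) / (2 * np.pi)
print(f.min())  # 0.0 (attained at x = -1), so the expression is non-negative on [-1, 1]
```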

Author Response

First of all, we want to sincerely thank the time and review of all reviewers and chairs, and we are very glad to learn the reviewers' appreciation for this paper.

Also, the questions raised by the reviewers are very actionable and helpful for improving our paper. We have uploaded the answers individually to each reviewer.

Here, we upload a 1-page PDF file containing the figures and tables requested by the reviewers for additional theoretical analysis and experiments.

Thanks again!

Final Decision

This paper introduces the Temporal Graph Neural Tangent Kernel (Temp-G3NTK) for learning on time-evolving graph structures. The method extends the Graph Neural Tangent Kernel (GNTK) to temporal graphs, incorporating time-domain information and deriving generalization bounds. Additionally, the work establishes a connection between the convergence of the temporal GNTK and the graphon GNTK. While none of the reviewers opposed the acceptance of the paper, two expressed reservations. Their concerns centered on (1) the method being seen as an incremental extension of GNTK, and (2) doubts about the effectiveness of how temporal information is represented, despite the authors' detailed explanations.

I believe these concerns are not sufficient grounds for rejection. Most reviewers acknowledged that this is one of the first attempts to generalize GNTK to temporal graphs, and they recognized the theoretical rigor, solid results, and competitive performance of the method, which were further supported by additional experiments provided during the discussion.

Overall, I recommend this work for acceptance.