PaperHub
Overall rating: 5.5 / 10 · Poster · 3 reviewers (scores: 3, 2, 4; min 2, max 4, std 0.8)
ICML 2025

Learnable Spatial-Temporal Positional Encoding for Link Prediction

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Positional Encoding · Link Prediction · Transformer

Reviews and Discussion

Official Review
Rating: 3

This paper introduces a framework L-STEP for learning representations on discrete temporal dynamic graphs, which comprises two key components: 1) LPE (Learnable Positional Encoding): a learnable spatial-temporal positional encoding designed to capture evolving graph topology from a spectral viewpoint. Concretely, LPE approximates positional encodings by applying the Discrete Fourier Transform (DFT) over past positional encodings. 2) A Node-Link-Positional Encoder for computing temporal node representations. Experiments are done to validate the effectiveness of L-STEP.

Questions for Authors

Please kindly refer to Other Strengths And Weaknesses.

Claims and Evidence

Line 241: the binary cross-entropy loss in Equation (11) seems incorrect. The second term should be $\log(1 - \hat{y})$, not $(1 - \log(\hat{y}))$.
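For concreteness, the standard binary cross-entropy over predicted link probabilities $\hat{y}_i$ with ground-truth labels $y_i$ (generic notation, not copied from the paper) is $\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$, i.e., the second term carries $\log(1 - \hat{y}_i)$.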

Methods and Evaluation Criteria

Please kindly refer to Other Strengths And Weaknesses.

Theoretical Claims

Please kindly refer to Other Strengths And Weaknesses.

Experimental Design and Analyses

Please kindly refer to Other Strengths And Weaknesses.

Supplementary Material

I went through most of the appendix sections.

Relation to Broader Literature

None.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strengths:

  1. The design of LPE is clear. The authors provide a rationale for using the DFT in positional-encoding computation: positional encodings can be regarded as signals on graphs, and the DFT can be applied to filter out noise caused by randomness or entity activities. Experimental results show that LPE significantly improves performance on downstream tasks.

  2. L-STEP is efficient. The authors show in Section 4 that, in certain scenarios, the proposed encoder is more efficient than methods that rely on neighborhood co-occurrence for node-embedding computation. Besides, L-STEP contains a novel loss function for learning representations on discrete temporal dynamic graphs, as outlined in Equation (13).

  3. The paper is well written and easy to follow.

Weaknesses:

L-STEP is thoroughly evaluated on recent and widely used datasets collected by [1], using evaluation metrics that are commonly adopted in prior studies [1, 2]. However, I do have some concerns regarding the use of negative links in the training and evaluation process. Specifically, in Section 3.3, the authors seem to employ three different negative sampling strategies—random, historical, and inductive—during the loss computation. In contrast, the code implementation from [2] trains models using randomly selected negative edges and tests them on random, historical, and inductive negative edges, respectively.

The experiment settings seem to be different. Since the results for the historical and inductive negative edge settings, where L-STEP significantly outperforms the previous state-of-the-art [2], are directly taken from the result tables in [2], the evaluation and subsequent conclusions could potentially be affected.

[1]: Poursafaei, Farimah, et al. "Towards better evaluation for dynamic link prediction." Advances in Neural Information Processing Systems (2022).

[2]: Yu, Le, et al. "Towards better dynamic graph learning: New architecture and unified library." Advances in Neural Information Processing Systems (2023).

If this concern can be addressed with thorough analysis and concrete evidence, my rating can be raised.

Other Comments or Suggestions

Please kindly refer to Other Strengths And Weaknesses.

Author Response

Thanks very much for your review!

We are glad that you appreciate our model's design, effectiveness, and efficiency, as well as the paper's writing.

Your suggestions are actionable and helpful.

First of all, we have corrected the typo in Eq. (11) and will update it in our camera-ready version.

Then, we would like to further discuss your concern about the experimental setting.

[Data Source]

The datasets used in our paper were, as noted in [2], first collected by [1].

However, as a library, [2] provides all dataset-processing details and baseline implementations (https://github.com/yule-BUAA/DyGLib), and we used all the datasets from [2].

[Implementation]

We strictly followed the implementation of [2] and reported the performance.

For further clarification, we follow the training and evaluation settings of [2] as listed below.

  • Specifically, during training, we compute the loss function and perform backpropagation using only randomly sampled negative links.
  • Then, with this trained model, we test the model on random, historical, and inductive negative edges.
  • The corresponding configurations are listed in Appendix I of the paper.

[Evidence from Other Sources]

Beyond the 13 classic datasets in [2], we also tested our method on the open-source, worldwide, large-scale benchmark leaderboard TGB [3] (https://tgb.complexdatalab.com/docs/leader_linkprop/), which also provides a standard dataset split and training paradigm.

In TGB, our method also achieves very competitive performance on temporal link prediction tasks, as shown in Table 2 of the paper.

[Code Release]

We promise to release our code upon publication.


Reference

[1]: Poursafaei, Farimah, et al. "Towards better evaluation for dynamic link prediction." Advances in Neural Information Processing Systems (2022)

[2]: Yu, Le, et al. "Towards better dynamic graph learning: New architecture and unified library." Advances in Neural Information Processing Systems (2023)

[3]: Gastinger, J., Huang, S., Galkin, M., Loghmani, E., Parviz, A., Poursafaei, F., Danovitch, J., Rossi, E., Koutis, I., Stuckenschmidt, H., Rabbany, R., and Rabusseau, G. TGB 2.0: A benchmark for learning on temporal knowledge graphs and heterogeneous graphs, Advances in Neural Information Processing Systems (2024)

Reviewer Comment

Thanks for the rebuttal. Please share your anonymous code link.

Author Comment

Dear Reviewer uGHA,

Thank you very much for your reply and for engaging with our rebuttal.

Regarding your remaining request for the code, we have prepared a detailed README along with the source code: https://anonymous.4open.science/r/L-STEP-9ED2. Again, we promise to attach the code repository to our camera-ready version.

For your additional information: to further verify our performance, Reviewer U1pJ asked for an extra baseline comparison with HTGN (Yang et al., Hyperbolic temporal network embedding, TKDE 2022). The experimental results are now ready, as follows.

We adopt the official implementation of HTGN and evaluate it on CanParl. We report link prediction performance, i.e., Average Precision (AP) and Area Under the Receiver Operating Characteristic Curve (AUC-ROC), for HTGN and our L-STEP, along with other SOTA methods such as FreeDyG, DyGFormer, and GraphMixer, under different settings.

| Dataset | Method | Transductive (AP) | Transductive (AUC-ROC) | Inductive (AP) | Inductive (AUC-ROC) |
|---|---|---|---|---|---|
| CanParl | HTGN | 64.22 ± 1.40 | 59.95 ± 3.52 | 71.01 ± 1.74 | 67.03 ± 2.82 |
| | FreeDyG | 72.22 ± 2.47 | 81.09 ± 2.2 | 52.96 ± 1.05 | 52.89 ± 1.61 |
| | DyGFormer | 97.36 ± 0.45 | 97.76 ± 0.41 | 87.74 ± 0.71 | 89.33 ± 0.48 |
| | GraphMixer | 77.04 ± 0.46 | 83.17 ± 0.53 | 55.91 ± 0.82 | 58.32 ± 1.08 |
| | L-STEP (Ours) | 98.24 ± 0.11 | 98.97 ± 0.06 | 92.25 ± 0.24 | 95.06 ± 0.11 |

Thanks again for your review, please feel free to let us know if you have any further questions.

Best,

#8546 Authors

Official Review
Rating: 2

This work explores the problem of temporal link prediction and proposes a semi-non-graph model that shows quite good performance across a variety of datasets. Their model, L-STEP, works by introducing a learnable time-dependent positional encoding, where the time dependence is parameterized by a learnable Fourier kernel. This idea is simple but seems to provide reasonable performance.

Questions for Authors

  1. What is the spectrum of filtered noise in your experience? Do you
  2. By using the DFT to model temporal propagation, you're essentially assuming wave-like dynamics, because the wave equation's time-propagator can be expanded in terms of the Fourier basis. Have you explored other possible bases? For example, cosh/sinh to model diffusion?
  3. How does this model work in an inductive setting? Specifically, how is p0 computed for a new vertex?
  4. How does the DFT presented here relate to the traditional graph Fourier transform that is defined by the eigensystem of the Laplacian? Is the filter that is learned in L-STEP expected to be in any way similar to the filters learned by traditional MPNNs? If not, could you comment on why this DFT appears to perform better?
  5. How did you implement the complex-valued gradient of W? This can be quite a nuanced topic, so I'm curious as to how it works.

Claims and Evidence

The claims are reasonably supported, with good performance across many datasets. I appreciate the ablation experiments that compare LPE with the full model as well as the exploration of parameter sensitivity.

It's interesting that L-STEP performs comparatively worse on both TGBL datasets, and that in most settings the lifts aren't statistically significant. Would you care to comment?

Methods and Evaluation Criteria

Yes, the experiments absolutely make sense, and the benchmark datasets can be viewed as being as close to exhaustive as is likely possible.

Theoretical Claims

The theoretical claims in Section 4 seem ancillary to the point of the paper and don't seem to meaningfully explain why these positional encodings work. Furthermore, the theoretical analysis makes an adiabaticity assumption, which is probably sufficient for the setting where you have a relatively mature graph at time t1 that then evolves without changing the community structure too much. I wonder how this theorem holds during the initial growth phase of a preferential-attachment-style generative process. It would seem that the adiabaticity of the graph isn't there, and thus the model wouldn't really hold, but I'd love the authors' thoughts on the matter.

Experimental Design and Analyses

I reviewed the experiments presented in section 5 as well as the appendices. As presented, nothing seemed unexpected but I did not reimplement the model.

Supplementary Material

Yes. Appendices C, E, H

Relation to Broader Literature

As mentioned above, the ideas in this paper aren't particularly novel. For example, the use of a DFT to compute evolving PEs isn't new. While the theoretical and methodological contributions seem to be limited, they are combined in a way that leads to a simple to understand model with relatively significant performance gains. Because ML is an empirical science, I believe it's important to make space for papers which don't maximize novelty if they refine these ideas in a way that leads to significant empirical gains.

Essential References Not Discussed

Not to my knowledge

Other Strengths and Weaknesses

Other Weaknesses

  • The paper is poorly written, which can make it hard to follow in spots. For example, the first paragraph of Section 3.1 is a large run-on sentence. I would recommend that the authors give the paper a thorough grammatical review. Language-modeling tools like ChatGPT or Claude can be quite effective for this.

Other Comments or Suggestions

See other strengths and weaknesses

Author Response

Thanks for your appreciation of our method's soundness and experimental design! Addressing your concerns has improved the quality of our paper, and we have prepared our answers below.

How learnable positional encodings work in a dynamic environment.

Due to the response length limit, we sincerely invite you to check our first answer to Reviewer U1pJ.

W1: First paragraph of section 3.1 is a large run-on sentence, use LLM to refine.

Thanks, we will.

Q1: What is the spectrum of filtered noise in your experience?

We would like to elaborate on what kind of information our learnable filter (used in Eq (1)) learns.

Theoretically, we first apply the DFT to node $u$'s sequence of learned positional encodings at previous timestamps, $\{\mathbf{p}^{t'}_u\}$; then the learnable filter removes high-frequency noise across the dimensions of the learned positional encodings in the spectral domain; finally, the inverse DFT transforms the filtered positional encodings back to the original domain.

Practically, the filtered-out noise is the part of the signal that does not contribute to link prediction decisions.
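As an illustration of how the learned filter can be inspected, below is a minimal sketch (not our training code; the sizes are illustrative, and the freshly initialized weight merely stands in for the trained $\mathbf{W}_{filter}$ described in our answer to Q5):

    import torch
    import torch.nn as nn

    L, d_P = 32, 64                                            # illustrative sizes only
    filt = nn.Linear(d_P, L, bias=False).to(torch.complex64)   # stand-in for the learned W_filter

    # Rows of filt.weight index the L frequency bins of the DFT over past timestamps;
    # small magnitudes mark the frequency components that the filter attenuates.
    per_frequency_gain = filt.weight.detach().abs().mean(dim=1)   # shape: (L,)
    print(per_frequency_gain)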

Q2: By using the DFT to model ...

First, to our understanding, a sequence does not need to be wave-like to be decomposed by the DFT, so we do not constrain our positional encodings to follow any wave-like dynamics. In addition, we did not consider separate cosine or sine bases, since the DFT's bases already contain cosine and sine components.

Q3: How is p0 computed for a new vertex?

If a new vertex emerges, we first initialize this vertex’s positional encoding to a zero-valued vector, then proceed with the model's computations to make link predictions involving this vertex.
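A minimal sketch of this initialization (shapes and the stand-in tensor are illustrative only, not our actual configuration):

    import torch

    d_P = 64                       # PE dimension (illustrative)
    pe = torch.randn(1000, d_P)    # stand-in for the existing nodes' current PEs

    # A newly emerging vertex gets a zero-valued positional encoding, which is then
    # refined by the model's subsequent computations when predicting its links.
    new_vertex_pe = torch.zeros(1, d_P)
    pe = torch.cat([pe, new_vertex_pe], dim=0)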

Q4: How does the DFT relate to ...

We clarify the difference between the DFT and the traditional graph Fourier transform, then elaborate on why employing the DFT is more suitable for the temporal link prediction problem, as follows.

The DFT is used for processing time-series-like data, while the graph Fourier transform analyzes signals associated with the nodes of a graph. Intuitively, time-series signals are defined along a timeline, i.e., each timestamp produces a signal, whereas the graph Fourier transform processes signals defined on the nodes of a graph, i.e., each node is associated with a signal, for example, a node feature vector.

  • In our framework, we aim to "approximate" node $u$'s positional encoding at time $t$ based on its positional encodings at the $L$ most recent distinct timestamps, $\{\mathbf{p}_u^{t'_j}\}_{j=1}^{L}$ ($t'_j < t, \forall j$). Through the lens of the DFT, we can consider $\{\mathbf{p}_u^{t'_j}\}_{j=1}^{L}$ as a sequence of signals defined on a timeline, where each timestamp $t'_j$ produces the signal $\mathbf{p}_u^{t'_j}$.
  • The graph Fourier transform (GFT) acts as a convolution on the graph, i.e., it computes a node's representation based on other nodes' information, and it is defined for static graphs. In contrast, we only want to obtain a node's positional information from its own positional information at previous timestamps, so we think the DFT is a better fit for our goal.
  • Due to the difference in the definition and setting between DFT (for sequences, temporal data) and GFT (for static graphs), we do not expect the filter of L-STEP to be similar to the filters learned by traditional MPNNs.

In short, we aim to analyze the temporal dependencies in the sequence of node $u$'s positional encodings at different timestamps, so we employ the DFT, which is a better fit for sequences and time-series-like data, while the GFT performs convolution on static graphs.
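To make the contrast concrete, here is a toy sketch (purely illustrative, not the paper's implementation): the DFT acts on one node's PE sequence along the time axis, while the GFT acts on a signal defined over the nodes of a single static snapshot via the Laplacian eigenbasis.

    import torch

    L, d_P, num_nodes = 8, 4, 5                          # toy sizes

    # DFT (our use case): transform one node's PE sequence over its L past timestamps.
    pe_sequence = torch.randn(L, d_P)                    # {p_u^{t'_1}, ..., p_u^{t'_L}}
    pe_spectrum = torch.fft.fft(pe_sequence, dim=0)      # frequencies over the timeline

    # GFT: transform a signal defined on the nodes of one static graph snapshot.
    A = torch.bernoulli(torch.full((num_nodes, num_nodes), 0.5))
    A = ((A + A.t()) > 0).float().fill_diagonal_(0)      # symmetric adjacency, no self-loops
    Lap = torch.diag(A.sum(dim=1)) - A                   # combinatorial graph Laplacian
    _, U = torch.linalg.eigh(Lap)                        # GFT basis = Laplacian eigenvectors
    node_signal = torch.randn(num_nodes, d_P)            # one feature vector per node
    gft_spectrum = U.t() @ node_signal                   # frequencies over the graph structure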

Q5: How did you implement the complex-valued gradient ...

In our implementation, we first initialize $\mathbf{W}_{filter}$ using the nn.Linear() module of the PyTorch library and then convert the parameter to the torch.complex64 data type, i.e., turning $\mathbf{W}_{filter}$ into a complex-valued tensor. We now present our implementation of Eq. (1) as follows. Suppose batch_pe is $\{\mathbf{p}_u^{t'_j}\}_{j=1}^{L}$ and filter is $\mathbf{W}_{filter}$.

The pseudocode for implementing Eq. (1) can be described as follows (the steps are numbered in the comments):

    import torch
    import torch.nn as nn
    import torch.fft as fft

    filter = nn.Linear(d_P, L, bias=False).to(torch.complex64)   # 1. initialization of W_filter as a complex-valued parameter

    for epoch in range(num_epochs):  # model training phase
        batch_pe = batch_pe.to(torch.complex64)             # 2. convert batch_pe ([batch, L, d_P]) to a complex-valued tensor
        batch_pe = fft.fftn(batch_pe, dim=1)                 # 3. apply the FFT along the timestamp dimension
        batch_pe = filter.weight.unsqueeze(0) * batch_pe     # 4. element-wise filtering in the spectral domain
        batch_pe = fft.ifftn(batch_pe, dim=1)                # 5. apply the inverse FFT
        batch_pe = batch_pe.real.to(torch.float32)           # 6. take the real part and convert back to float32

Here, step 4 of the pseudocode represents $\mathbf{W}_{filter} \odot \mathcal{F}(\{\mathbf{p}_u^{t'_j}\}_{j=1}^{L})$, where $\mathcal{F}$ denotes the DFT, and step 5 is equivalent to applying the inverse DFT. Finally, we convert batch_pe back to a real-valued tensor.

Official Review
Rating: 4

The paper introduces a new method for predicting connections in networks that change over time. Instead of using fixed rules or complex models that require a lot of computing power, L-STEP learns how positions in the network change over time using the Fourier Transform. The authors show that their approach keeps important information about the network’s structure and performs as well as more complicated models. They test it on 13 datasets and two large benchmark datasets and find that it predicts connections better than existing methods.

Questions for Authors

  • Why are FreeDyG results missing in Table 11?
  • An interesting assumption is that you try to learn a function to estimate the positional encoding of u at time t. However, predicting an exact positional encoding assumes a high degree of regularity in graph evolution, which may not hold in real-world networks. This aspect is not quantified, e.g., with synthetic graphs, and you also do not try to quantify it in the datasets and compare your performance as a function of that rate of change.

Claims and Evidence

  • Yes, the results are supported by experiments on datasets.

Methods and Evaluation Criteria

Yes, the methods and criteria are appropriate.

Theoretical Claims

I checked the theorem in the appendix.

Experimental Design and Analyses

I checked Theorem 4.1, and it depends on the graph changing slowly. For this claim to be dependable, the authors should show that slow change is a phenomenon observed in real graphs.

Supplementary Material

Yes, I checked the additional experimental results and the sensitivity analysis.

Relation to Broader Literature

  • The article studies the long and well-studied problem of predicting edges that have already appeared in the past. In this sense, I do not see much novelty. However, they also propose a learnable positional encoder, and this is novel.

Essential References Not Discussed

  • I would like to see an HTGN comparison (not our article) to see how it compares to hyperbolic methods, but two recent, A+-venue articles are already compared against, and they seem enough.

Yang, M., Zhou, M., Xiong, H. and King, I., 2022. Hyperbolic temporal network embedding. IEEE Transactions on Knowledge and Data Engineering, 35(11), pp.11489-11502.

Other Strengths and Weaknesses

Strengths:

  • The article is quite detailed and has covered almost all potential analyses in the appendix. The baselines are strong and many datasets are studied. I appreciate the sensitivity analysis in the appendix.

Weaknesses:

  • The article has quite a few components that all need to work as expected. The slow-change assumption needs to be tested: many models break down in sparse graphs or highly dynamic environments, but the authors do not analyze cases where the graph changes too quickly for their positional-encoding estimation to remain accurate.
  • The code is not shared. I would raise a bigger issue about this, but the detailed results in the appendix are the saving grace, and I will assume that the authors will share the code soon.
  • The hyperparameter analysis suggests that L-STEP performs well within a certain range, but does not discuss how sensitive the model is to bad choices.

Other Comments or Suggestions

  • The main article does not contain dataset descriptions and most of the reported results.
  • L-STEP results are bolded in the right panel of Table 2, but these are not the best results. It is misleading to bold them.

Author Response

Thanks very much for your review! We are glad that you appreciate our extensive experiments and the corresponding strong performance.

Your suggestions are quite actionable, and we have carefully prepared answers in Q&A format below.

W1: The slow-change assumption needs to be tested.

First, the strong performance of our method across all 15 real-world datasets already indicates, to some extent, that our theoretical design is valid.

Of course, we understand your concern, and we have prepared an illustration of the slowly changing pattern in real-world graphs below, taking the UNvote dataset as an example. This further supports the assumption, and we promise to add this new content to our camera-ready version.

[Positional Encoding Patterns with Time]

  • In the 1st plot for UNvote, the x-axis represents timestamps, and the y-axis represents the distance between each consecutive pair of positional encodings, i.e., $\mathbf{p}^t$ and $\mathbf{p}^{t+1}$, $\forall t$, where $\mathbf{p} \in \mathbb{R}^{|V| \times d_P}$ is a matrix whose $i$-th row is the positional encoding of node $i$.
    • We compute the distance between $\mathbf{p}^t$ and $\mathbf{p}^{t+1}$ by first computing the Euclidean distance between $\mathbf{p}_u^t$ and $\mathbf{p}_u^{t+1}$ for each node $u$, and then taking the mean over all nodes, i.e., $\frac{1}{|V|} \sum_{u \in V} d(\mathbf{p}_u^{t}, \mathbf{p}_u^{t+1})$ (a minimal code sketch of this computation is given right after this list).
    • We discover that the distance between each consecutive pair of positional encodings, $\mathbf{p}^{t}$ and $\mathbf{p}^{t+1}$, is consistently small across all timestamps.
  • In the 2nd plot, we illustrate the distance between our learned positional encoding and the true positional encoding for each graph snapshot.
    • Overall, the distance between our learned positional encoding and the true positional encoding of each snapshot is consistently small across all timestamps.
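For clarity, here is a minimal sketch of the quantity plotted in the 1st plot (the shapes and random values are illustrative placeholders, not the actual UNvote encodings):

    import torch

    T, num_nodes, d_P = 50, 100, 64
    P = torch.randn(T, num_nodes, d_P)   # P[t] is the PE matrix p^t (one row per node)

    # Mean Euclidean distance between consecutive PE matrices, one value per timestep:
    # (1/|V|) * sum_{u in V} || p_u^t - p_u^{t+1} ||_2
    consecutive_dist = (P[1:] - P[:-1]).norm(dim=-1).mean(dim=-1)   # shape: (T - 1,)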

[Theoretical Explanation]

To further address your concern, we would like to go through Theorem 1 intuitively. In short, it shows that when the temporal graph evolves slowly over time, the learned positional encodings at a previous timestamp can well approximate the positional encodings at the next, nearby future timestamp.

Theoretically, the significance of Theorem 1 is to demonstrate the effectiveness of our proposed LPEs in the inductive setting, where future graph structural information is hard to observe, so the "slowly changing" assumption for the coming snapshots is needed for this theoretical derivation. Moreover, defining a degree of change over temporal snapshots is an interesting but largely open question, which may involve random matrix theory and related tools [1]; to the best of our knowledge, there is no clear definition so far.

Therefore, assuming temporal smoothness [1], i.e., that the next snapshot does not change dramatically from the previous one, is a practical choice for the theoretical derivation, and we are more than happy to dive into this topic as a future research direction.

[1] Chi et al., Evolutionary Spectral Clustering by Incorporating Temporal Smoothness, KDD 2007

W2: Available code.

We promise to publish our code upon the paper’s publication. Also, we are preparing an anonymous link, so we can share the implementation shortly (before the deadline for final response).

W3: Hyperparameter sensitivity.

Based on our parameter analysis, our model is comparatively robust to bad choices, i.e., performance drops while varying the hyperparameters, but the drops are not significant. For more details and intuition on choosing the hyperparameters, please refer to Appendix H.5.

Q1: Why are FreeDyG results missing in Table 11?

The original FreeDyG work did not run on the UNTrade dataset. We followed FreeDyG's official implementation, but the corresponding performance is only around 0.500. We will add this explanation to our camera-ready version.

Comment 1: The main article does not contain dataset descriptions and most of the reported results.

Due to the page limit, the dataset descriptions are in Appendix G. Results for the other experimental settings (historical and inductive negative sampling), the ablation study, and the parameter analysis are in Appendix H.

Comment 2: L-STEP results are bolded in the right panel of Table 2, but these are not the best results.

Table 2 is ranked from top to bottom with a ranking index; we bold our method simply to highlight its position.

Suggestion: I would like to see an HTGN comparison to see how it compares to hyperbolic methods, but two recent, A+-venue articles are already compared against, and they seem enough.

We understand, and we are adapting HTGN to this paper's setting. If finished in time, we will post HTGN's results before the final response deadline; if not, we will add them to our camera-ready version.

Final Decision

The paper introduces L-STEP, a novel learnable spatial-temporal positional encoding framework for temporal link prediction. The method leverages the Discrete Fourier Transform (DFT) to learn dynamic positional encodings efficiently, avoiding reliance on costly attention mechanisms. The authors demonstrate strong empirical performance across 13 datasets and the TGB benchmark, with theoretical analysis supporting the model's effectiveness under a slow-graph-evolution assumption. The authors should: 1) clarify the scope of the slow-evolution assumption; 2) improve the writing; 3) add comparisons on the full set of TGB datasets, and properly cite and discuss the leaderboard baselines; 4) add wall-clock efficiency experiments. Conditioned on the above, I would like to recommend acceptance.