We greatly appreciate the reviewer's efforts in reviewing our paper. We thank the reviewer for recognizing our novel problem, well-developed method, and rigorous experiments. Our responses to the valuable feedback are presented below:

W1: Given various efficiency improvements in the proposed method, the trade-off to effectiveness is still unclear.

Reply: Our approximation techniques (the greedy selection algorithm, the partitioning strategy, and the simplification of the regularization set) are crucial for making the training and selection phases computationally feasible. Without them, it would be practically impossible to run the experiments under limited resources.

Greedy selection avoids exhaustive search, which has factorial time complexity and is intractable for real-world graphs.
Partitioning, as analyzed in the paper, significantly reduces memory requirements. Without it, subset selection could demand over 100 GB of RAM, which is prohibitive in most settings.
During training, we simplify the regularization dataset from the full old-class data to a selected subset of it, greatly reducing the complexity of the regularization loss. This simplification is necessary to prevent GPU memory overflow and failed experiments.
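
To make the first two points concrete, the following is a minimal sketch of greedy selection run independently inside partitions. It is an illustration only, not our actual selection criterion: the farthest-point objective, the contiguous even-sized partitions, the even per-partition budget, and the names `greedy_select` and `partitioned_select` are all assumptions introduced for this example.

```python
import math
import random

def greedy_select(points, k):
    """Greedy farthest-point selection of up to k indices from `points`.

    The coverage objective is a stand-in (an assumption, not the paper's
    criterion). It needs O(k * n) distance evaluations instead of
    enumerating candidate subsets.
    """
    if not points or k <= 0:
        return []
    selected = [0]  # start from an arbitrary point
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=nearest.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

def partitioned_select(points, k, num_parts):
    """Run greedy selection on one contiguous chunk at a time.

    Hypothetical partitioning: equal-sized chunks with an even budget split.
    """
    n = len(points)
    chunk = math.ceil(n / num_parts)
    per_part = k // num_parts
    chosen = []
    for start in range(0, n, chunk):
        part = points[start:start + chunk]
        local = greedy_select(part, per_part)
        chosen.extend(start + i for i in local)  # map back to global indices
    return chosen

random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]
subset = partitioned_select(data, k=40, num_parts=4)
print(len(subset))
```

Each greedy run only ever sees one chunk, which is the property that keeps the working set small. The cost is that points in different partitions are never compared with each other.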

Despite these constraints, we conducted further analysis to empirically characterize the trade-off between effectiveness and computational efficiency by varying the sizes of the two selected subsets: the subset that carries old-class knowledge and the subset used for regularization. The backbone model is DyGFormer, and the dataset is Yelp.

Effect of varying the size of the selected subset: This subset is the major carrier of old-class knowledge. A larger subset improves the quality of that knowledge but takes more training time. The experiments below support this claim: increasing the subset size improves average precision (AP), while training time grows approximately linearly.

Selected subset size	AP	Time (s)
250	0.0434	34.99
500	0.0681	54.30
750	0.713	72.31

Effect of varying the size of the regularization subset: This subset approximates the distribution of the full old-class data, which helps the learnt old-class knowledge generalize. A larger subset gives a better approximation of that distribution. The experiments below support this claim: larger sizes improve AP at the cost of longer training time, which again grows approximately linearly because of the complexity of the regularization loss.

Regularization subset size	AP	Time (s)
Best baseline	0.0601	14.24
0 (No Regularization)	0.0618	18.51
250	0.0624	36.70
500	0.0681	54.30
750	0.0693	71.88
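
As a quick check of the "approximately linear" trend, the short script below computes the extra training time per additional sample between consecutive rows of the two tables above. The times are copied from the tables. The helper name `marginal_cost` is introduced only for this illustration, and the "Best baseline" row is left out because it does not correspond to a subset size.

```python
# (subset size, training time in seconds), copied from the two tables above.
selected_subset = [(250, 34.99), (500, 54.30), (750, 72.31)]
regularization_subset = [(0, 18.51), (250, 36.70), (500, 54.30), (750, 71.88)]

def marginal_cost(rows):
    """Extra seconds of training per additional sample, between consecutive rows."""
    return [
        (t1 - t0) / (n1 - n0)
        for (n0, t0), (n1, t1) in zip(rows, rows[1:])
    ]

sel_cost = marginal_cost(selected_subset)
reg_cost = marginal_cost(regularization_subset)
for name, costs in [("selected", sel_cost), ("regularization", reg_cost)]:
    print(name, [round(c, 4) for c in costs])
```

In both tables, every increment comes out at roughly 0.07 seconds per additional sample, which is consistent with a near-constant per-sample cost.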

It is also worth noting that although our method achieves the best results among all baselines, this comes with an additional time cost due to the regularization term. However, since our selection and regularization modules are independent, users with stricter efficiency requirements can disable the regularization component. Our selection-only setup still achieves state-of-the-art performance while maintaining comparable time costs to other replay-based baselines.

Q1: Why use MSE instead of Cross-entropy for classification loss?

Reply: We employ MSE loss because it aligns well with domain adaptation theory. The key quantity in this framework, the divergence term, measures disagreement between hypotheses [1]. MSE captures this disagreement directly by penalizing squared distances between model outputs. In contrast, cross-entropy focuses on prediction confidence (log-probabilities), making it less sensitive to inter-hypothesis disagreement. Additionally, MSE's bounded gradients lead to more stable optimization and better adherence to generalization bounds in domain adaptation.
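
The difference can be illustrated with a small numerical sketch. It assumes that each hypothesis outputs a softmax probability vector. The logits and the function names are hypothetical and chosen only for this example. Between two hypotheses, the squared distance is zero exactly when their outputs agree, is symmetric, and is at most 2 for probability vectors, whereas cross-entropy is asymmetric and stays positive even when the two outputs are identical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse_disagreement(p, q):
    """Squared Euclidean distance between two output distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(a * math.log(b) for a, b in zip(p, q))

h1 = softmax([2.0, 0.5, -1.0])  # hypothetical outputs of one hypothesis
h2 = softmax([0.2, 1.5, 0.0])   # hypothetical outputs of another hypothesis

# Identical outputs: squared distance is exactly zero, cross-entropy is not.
print(mse_disagreement(h1, h1), cross_entropy(h1, h1))

# Different outputs: squared distance is symmetric, cross-entropy is not.
print(mse_disagreement(h1, h2), mse_disagreement(h2, h1))
print(cross_entropy(h1, h2), cross_entropy(h2, h1))
```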

Q2: What are the limitations of the proposed method?

Reply: Our method has two main limitations:

Task Scope: Our current work focuses on the node classification task within temporal graphs. While this is an important and under-explored area, it is equally critical to address the continual learning problem on other key temporal graph tasks such as link prediction, which may pose different challenges to our approach. We discussed this limitation in Appendix M.
Class Dynamics Assumption: We assume that each node is associated with a single, static class label throughout its lifetime. However, node labels can also change over time in real life. For example, in epidemic transmission networks, an individual may transition between different states (e.g., infected to recovered), which our current formulation does not accommodate. Capturing such class-switching behavior is also important.

Q3: More repeated experiments are needed.

Reply: We agree with the reviewer that more repeated experiments are needed to address concerns about the influence of randomness in the results. We will conduct additional experiments and report the results in the final version of the paper.