Overall rating: 5.5/10 · Poster · 4 reviewers (scores 5, 6, 6, 5; min 5, max 6, std 0.5)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

A Topology-aware Graph Coarsening Framework for Continual Graph Learning

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Continual Graph Learning, Catastrophic Forgetting, Graph Coarsening

Reviews and Discussion

Review (Rating: 5)

This paper proposes a novel rehearsal-based continual graph learning method, TACO, which stores the information of previous tasks as a reduced graph. TACO performs graph coarsening based on node representation proximities to reduce a graph while preserving essential topological information. TACO shows significant performance improvements, as demonstrated by experiments on three graph datasets.

Strengths

The proposed method TACO coarsens the previous graphs into the memory buffer for replaying while preserving the graph topology.

The memory buffer constructed by TACO can maintain a stable size when sequentially learning new tasks.

The proposed method's effectiveness is supported by rich experimental results.

Weaknesses

The paper is not well structured, and some statements are confusing.

The authors are encouraged to have a more complete comparison to better validate their proposed method.

Some important existing works are missing.

Questions

  1. In the introduction, the authors are encouraged to discuss different types of methods for continual graph learning to highlight the motivation and contribution of the proposed method.

  2. The paper aims to reduce the current graph into a compressed form that maximally preserves its properties. This idea is similar to existing works [1][2] which employ graph condensation to capture the holistic semantics of original graphs. However, these works are missing.

  3. When calculating the node similarities, the features used are the output of the first layer of the GNN. Is there a particular reason to use this feature?

  4. Section 4.4.3 is very confusing. The authors should rephrase this part to make it easier to understand.

  5. The authors should further include an ablation study to investigate the overall performance of TACO with different reduction rates.

  6. The authors are encouraged to include descriptions of other graph coarsening methods used in the paper.

  7. The information for some references is incomplete, such as [47] and [54] in the reference section.

[1] Liu, Yilun, Qiu, Ruihong, & Huang, Zi. (2023). CaT: Balanced Continual Graph Learning with Graph Condensation. In: Proc. of ICDM.

[2] Niu, Chaoxi, Guansong Pang, and Ling Chen. "Graph Continual Learning with Debiased Lossless Memory Replay." arXiv preprint arXiv:2404.10984 (2024).

Limitations

The authors have discussed the limitations and societal impacts of the proposed method, which are adequate.

Author Response

Thank you for your insightful feedback! We are happy to address your concerns and questions. Detailed responses to your comments are provided below, along with new experimental results.


W1. The paper is not well structured

A: We are committed to making these improvements to enhance the readability and overall quality of the paper. We made modifications based on your specific comments.


W2, W3, Q2. Additional literature review/baselines

A: We have conducted an additional literature review and compared with more baselines. Please refer to our global response.


Q1. Lack of discussion of related work

A: Thank you for the suggestion. We discussed different types of continual graph learning and the limitations of SOTA studies in the related work section, which is included in Appendix A due to the page limit. We will make sure to include this discussion in the main text of our revised version.


Q3. The reasoning behind using GNN embedding for calculating similarity

A: We use the embedding from the first layer of the GNN to calculate the similarity, as we believe it encodes both the node features and the topology information. As we discussed in Section 3.2, two nodes are deemed similar based on three factors: 1) feature similarity; 2) neighbor similarity; and 3) geometric closeness. GNN embeddings capture the first two factors.
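To make this concrete, here is a minimal sketch of the similarity computation (our own illustration, not the authors' code; `gnn_layer1` is assumed to be a PyG-style layer with signature `layer(x, edge_index)`):

```python
import torch
import torch.nn.functional as F

def edge_similarity_from_first_layer(gnn_layer1, x, edge_index):
    """Score each edge by the cosine similarity of the first-layer GNN
    embeddings of its two endpoints.

    gnn_layer1: a single message-passing layer (assumed PyG-style signature)
    x:          [N, F] node feature matrix
    edge_index: [2, E] edge list
    """
    with torch.no_grad():
        h = gnn_layer1(x, edge_index)  # encodes node features + 1-hop neighborhood
    src, dst = edge_index
    return F.cosine_similarity(h[src], h[dst], dim=-1)  # one score per edge
```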


Q4. Section 4.4.3 is confusing

A: Thank you for pointing it out. We will rephrase this section as follows:

“We investigate the performance of the model when more tasks are learned on different CGL frameworks. We visualize the model’s performance (AP-F1) for all previous tasks after training on each new task on the Kindle dataset, as shown in Figure 4. This figure demonstrates that all CGL methods can mitigate the catastrophic forgetting problem on the Kindle dataset to varying degrees. Additionally, we observe that experience replay-based methods (SSM and TACO) maintain more stable performance, whereas the regularization-based method, TWP, experiences greater fluctuations. We deduce that this is because experience replay-based methods are better at preventing the model from drastically forgetting previous tasks, even when the current task has more distinct distributions.”


Q5. Performance of TACO with different reduction rates

A: We agree that investigating different reduction rates would be helpful. We conducted additional experiments on different reduction ratios (see the attached PDF in the global response). When the reduction rate is 1, TACO is equivalent to fine-tuning. The results show that Kindle, DBLP, and ACM exhibit different "saddle points," with smaller reductions having a diminishing marginal effect. However, we observe that DBLP and ACM are less sensitive and can still achieve decent performance at larger reduction rates, while the performance on Kindle drops significantly as the reduction rate increases further.

Besides, in the submitted version, to clearly demonstrate how the reduced graph can approximate the original graph at different reduction rates, we trained the GNN on reduced graphs with different reduction rates and tested it on the original graphs in Section 4.4.2. Our intuition is that if learning on the reduced graphs can approximate learning on the original graphs, then using the reduced graphs as a replay buffer can also effectively consolidate old knowledge.


Q6. Description of graph coarsening methods

A: Thank you for the suggestions. We provide summaries of graph coarsening methods and their limitations in Appendix A.2 due to the page limit.

In addition, we will provide a more detailed description of the baseline graph coarsening methods, as follows:

Alge. JC [8], Aff. GS [31], Var. Neigh [33], and Var. Edges [33] are all spectral-based methods that aim to preserve the graph spectral property (Laplacian consistency) when merging nodes.

  • Alge. JC uses algebraic distance to measure the strength of connection between nodes and merges nodes with smaller algebraic distances to form a coarser graph.
  • Aff. GS uses affinity, a normalized relaxation-based node proximity heuristic, during the aggregation process, merging nodes with higher affinity scores.
  • Var. Neigh and Var. Edges are both local variation algorithms that contract edges (Var. Edges) or subsets of the neighborhood (Var. Neigh) with the smallest local variation.

The above approaches result in high computational complexity and rely solely on graph structure, ignoring node features.

  • FGC [25] takes into account the node features and aims to integrate both graph structure and node features in a unified optimization-based approach to produce a coarser graph that retains the properties of the original graph. However, it models the two objectives as separate optimization terms, which means the efficiency problem from the spectral-based methods remains.

Q7. Incomplete information in paper references

A: We will fix this and make sure all references are correctly formatted in our revised version.

Comment

I appreciate the authors' effort in addressing my questions in their responses. As reflected in my scores, my overall assessment is positive, and I maintain my current scores.

Comment

We greatly appreciate your positive evaluation and the insightful suggestions provided!

Review (Rating: 6)

The paper proposes a continual learning framework for the node classification task. The key idea is to learn a compressed form of the graph from the previous task and to train the model by combining (and further compressing) the reduced graph from the previous stage with the new graph. The graph compression step addresses the gap in previous works, which sample isolated nodes into the buffer and lose the topological structure. The authors also propose a generalized class-IL setting in which tasks are based on time steps rather than classes. Experimental results show improvements with the proposed technique.

Strengths

  1. The paper is well written and easy to read.

  2. The paper introduces a novel method incorporating the graph coarsening method into the continual learning framework.

  3. The experiments demonstrate the effectiveness of the proposed method.

Weaknesses

Please see the questions. Overall, I liked the paper. What is missing is a nuanced discussion of a few important aspects. Another important thing is to show empirically that the replay buffer is not the main booster of the performance. I am happy to raise my score if my concerns are addressed.

Questions

  1. Step 1 (Combine) is a bit unclear. In lines 162-163 it is stated that G_t contains nodes from task t and old tasks. Is this correct? Why is it allowed?
  2. A few related works, specifically on setups with different settings [1,2,3], are missing. Please include them in your discussion and, in particular, state the differences between their proposed settings and yours. For example, [3] specifically focuses on the case where nodes have multiple labels; for that scenario, the settings described in [3] cover overlapping labels even when tasks are divided based on labels. Is your study limited to multi-class classification, or would it also cover multi-label classification?
  3. In realistic settings it could also happen that new labels emerge over time. Would your model perform as well? In that case, the information from the past might not be very useful for the current task.
  4. In the replay buffer, did you keep only the nodes or also their neighborhood structures?
  5. It would be useful to see an ablation study that removes the replay buffer, to verify that it is not the single component driving most of the performance.
  6. I am wondering how large the original subgraphs are at each time step and whether their sizes affect the performance.
  7. What would be the effect of the total number of classes on the performance of your method? In the experiments, all datasets have a small number of classes.

[1] Xikun Zhang, Dongjin Song, and Dacheng Tao. CGLB: Benchmark tasks for continual graph learning. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

[2] Jihoon Ko, Shinhwan Kang, and Kijung Shin. BeGin: Extensive benchmark scenarios and an easy-to-use framework for graph continual learning. arXiv preprint arXiv:2211.14568, 2022.

[3] Tianqi Zhao, Alan Hanjalic, and Megha Khosla. AGALE: A Graph-Aware Continual Learning Evaluation Framework. In Transactions on Machine Learning Research, 2024.

Limitations

The authors might additionally want to clearly state the scope of the study; for example, the current study does not seem to cover the multi-label scenario.

Author Response

Thank you for your positive feedback. We are happy to address your concerns and questions. Detailed responses to your comments are provided below.


Q1. Clarification on the setting of task splitting

A: We understand your concern. To clarify, G_t contains nodes from task t and old tasks, but the node attributes and labels of nodes from old tasks are not available. As we discussed in Section 2.2, "Node attributes and class labels are only available if the nodes belong to the current time period." To justify our setup, imagine a real-life situation in a citation network: when a new paper A is published in 2024, it cites paper B published the previous year. We can extract the title (ID) of paper B from paper A, but no further information, because all raw data of papers from previous years may be unavailable due to limited memory.


Q2, Limitations. Additional related work on benchmarking / different CGL setting

A: We appreciate you bringing the additional related benchmarking studies to our attention. We briefly mentioned the CGLB work [1] in Section 2.1, noting that it focuses on transductive settings, while ours focuses on an inductive setting. We believe BeGin [2] is a valuable study that provides a variety of benchmarks with diverse settings. Although it does not cover the inductive and generalized class-incremental learning (class-IL) settings used in our paper, it is an important related work to discuss, and we will include it in our revised version.

We also appreciate you mentioning AGALE [3]. Our current work focuses solely on multi-class classification. We will include a discussion of this limitation. However, we believe our method can be applied to multi-label classification with modifications, and we will discuss incorporating multiple labels as a potential direction for future work.


Q3, Q7. Small numbers of classes/new classes

A: In our main experiments, we adopt a realistic Generalized Class-Incremental Learning setup, where tasks are split by time. To mimic the scenario where each task has different class sets, we randomly drop some classes from each task, although this still cannot guarantee that each task contains entirely new classes not seen in previous tasks. We acknowledge that the datasets we used have a relatively small number of classes. However, there are fewer options for dynamically expanding graph datasets. Nevertheless, we also evaluated our method in the traditional Class-Incremental Learning setting on three additional datasets in Appendix E.6, where 1) the datasets can have as many as 70 classes, and 2) new tasks contain classes unseen in previous tasks, demonstrating the adaptability of our method to both scenarios.
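As a rough illustration of the per-task class dropping described above (our sketch only; the exact sampling scheme and ratio used in the paper may differ):

```python
import random

def drop_classes_per_task(task_labels, all_classes, keep_ratio=0.8, seed=0):
    """For each task, keep only a random subset of classes; nodes whose
    label was dropped are excluded from that task's supervision.

    task_labels: list of dicts {node_id: class_label}, one dict per task
    all_classes: iterable of all class labels in the dataset
    """
    rng = random.Random(seed)
    classes = sorted(all_classes)
    filtered = []
    for labels in task_labels:
        kept = set(rng.sample(classes, max(1, int(keep_ratio * len(classes)))))
        filtered.append({n: c for n, c in labels.items() if c in kept})
    return filtered
```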


Q4. Replay buffer

A: In the memory buffer, we keep a coarsened version of the previous graph, which contains the node attributes of the supernodes and the edges between them. The edges of the coarsened graph are expected to preserve the topology of the original graphs.


Q5. Ablation study on replay buffer

A: To clarify, we propose TACO as an experience-replay-based method to consolidate knowledge from old tasks; thus, when the experience replay buffer is removed, our method reduces to fine-tuning.

Our major contribution is the introduction of a dynamic graph coarsening framework that can efficiently preserve both node features and topology information in a coarsened graph, which is stored in the memory buffer for replay. We compare our method with other experience-replay methods (e.g., ER-rs) to demonstrate how our approach better preserves knowledge from old tasks and thus more effectively alleviates forgetting.


Q6. Task size / smaller tasks

A: We provide the statistics of the datasets we use in Appendix Table 3, where the average task size can be calculated as the number of items divided by the number of tasks. We also investigated how the model performs when the graphs are split into smaller tasks, where each subgraph contains fewer nodes, and reported the results in Appendix E.2. Note that in such cases, all models tend to forget more because the number of tasks is also doubled. We claim that the size of the subgraphs alone should not have a significant effect on model performance, as it does not influence the model's forgetting behavior. The replay buffer size should also be adjusted accordingly with the size of the subgraphs.

Comment

Dear reviewer AUFc,

We would like to further clarify the task splitting by providing a step-by-step procedure. We use the DBLP dataset as an example; a short code sketch of the procedure follows the steps below. We hope this is helpful.

  1. We collected and selected papers from 1995-2014 from the DBLP database. The papers were then divided into 10 intervals, each spanning 2 years. For each task, we constructed a subgraph consisting of the nodes (with their features and labels) representing the papers published within the given time interval.

  2. For each paper i in interval t, we examined each paper j that it cites:

  • If j is from the same interval, we add an edge between the two nodes.

  • If j is from a previous interval, we add paper j to the subgraph and connect it to paper i. In this case, the ID of paper j is available, but its attributes and labels are not. We only use its ID to align it with supernodes in the coarsened graph in the Combine step. Please note that only the nodes that are connected to nodes in the current time interval appear in G_t.

  • If j is from a future interval (which is unlikely in a citation network), we ignore it.
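The following minimal sketch illustrates the splitting procedure above (our illustration; the data structures `papers` and `cites` are assumptions, not the authors' code):

```python
def build_task_subgraphs(papers, cites, intervals):
    """Build one subgraph (task) per time interval.

    papers:    {paper_id: (year, features, label)}
    cites:     {paper_id: [cited_paper_ids]}
    intervals: list of (start_year, end_year) pairs, one per task
    """
    def interval_of(year):
        for t, (lo, hi) in enumerate(intervals):
            if lo <= year <= hi:
                return t
        return None

    tasks = []
    for t, (lo, hi) in enumerate(intervals):
        labeled = {p for p, (year, _, _) in papers.items() if lo <= year <= hi}
        edges, id_only = [], set()
        for i in labeled:
            for j in cites.get(i, []):
                if j not in papers:
                    continue
                tj = interval_of(papers[j][0])
                if tj == t:                      # same interval: keep the edge
                    edges.append((i, j))
                elif tj is not None and tj < t:  # earlier interval: ID only, no attributes/labels
                    id_only.add(j)
                    edges.append((i, j))
                # citations into future intervals are ignored
        tasks.append({"labeled": labeled, "id_only": id_only, "edges": edges})
    return tasks
```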

Comment

Thanks for clarifying my doubts. I maintain my positive score and hope to further improve it in the discussion period.

Comment

We sincerely value your feedback and are grateful for your recognition of our efforts!

Review (Rating: 6)

This paper presents TACO, a novel framework designed to address the issue of catastrophic forgetting in Graph Neural Networks (GNNs) during continual learning. The framework proposes a method to store information from previous tasks as a reduced graph. This graph expands with new tasks and undergoes a reduction process to maintain a manageable size while preserving topological information. The authors introduce a graph coarsening algorithm, RePro, which uses node representation proximities to effectively reduce graph size while retaining essential features. Extensive experiments demonstrate TACO's superiority over existing methods.

Strengths

The introduction of a topology-aware coarsening approach specifically for continual learning in GNNs is innovative and addresses a significant problem in the field. The paper presents a well-defined methodology with clear steps for combining, reducing, and generating graphs, which enhances the reproducibility of the approach. Extensive experiments and comparisons with state-of-the-art methods demonstrate the efficacy of TACO, providing strong empirical support for the proposed framework. Additionally, by focusing on reducing the graph size, the framework offers a scalable solution that can handle the growing size of real-world graphs in continual learning scenarios.

Weaknesses

The choice of baseline methods for comparison might not cover all recent advances in continual learning and graph coarsening, potentially overlooking some relevant techniques.

Questions

  1. The full name of CGL is not explicitly defined.
  2. In the Combine step, graphs are combined using a hash table. Are the graphs labeled?
  3. Graph pooling is highly related to graph coarsening. Why not add some comparisons with graph pooling methods?
  4. At each timestamp, the model only aggregates information from the last timestamp. What, then, prevents the model from forgetting information from long ago?

Limitations

Yes.

Author Response

Thank you for your positive feedback! We are happy to address your concerns and questions. Detailed responses to your comments are provided below.


W1. Additional related studies

A: Thank you for pointing it out. We have conducted additional literature review and compared with more baselines. Please refer to our response to all reviewers.


Q1. Full name of CGL

A: Thank you for pointing it out. We will ensure the full name of CGL (Continual Graph Learning) is explicitly defined in our revised version.


Q2. Clarification of combine step

A: We would like to further clarify the process in the Combine step. When we combine the graphs, we align their co-existing nodes. A node (with a unique ID) that appeared in previous tasks was assigned to a cluster (supernode) during the coarsening. A hash table keeps track of the mapping between the original node ID and the cluster. When the same node appears again in the next task, the hash table directs it to and aligns it with the cluster it was originally assigned to.
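A minimal sketch of this bookkeeping (our illustration, not the authors' implementation):

```python
class NodeToSupernodeMap:
    """Hash table mapping original node IDs to the supernode (cluster)
    they were merged into during coarsening."""

    def __init__(self):
        self.assignment = {}  # original node_id -> supernode_id

    def record_coarsening(self, clusters):
        """clusters: {supernode_id: [original node_ids merged into it]}"""
        for supernode, members in clusters.items():
            for node in members:
                self.assignment[node] = supernode

    def align(self, node_id):
        """When a node reappears in a new task, route it to the supernode
        it was assigned to; nodes never seen before stay as themselves."""
        return self.assignment.get(node_id, node_id)
```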


Q3. Comparison with graph pooling

A: Thank you for this suggestion. Indeed, graph pooling and graph coarsening are highly related concepts. Similar to graph coarsening, the goal of graph pooling is to reduce the number of nodes in a graph while preserving the semantic information [1]. However, we recognize graph pooling as a much broader concept. According to [1], graph pooling can be roughly divided into readout pooling and hierarchical pooling. The former aims to obtain a single representation for the whole graph (e.g., for graph-level tasks) and does not fit our purpose. On the other hand, hierarchical pooling reduces the graph into a smaller-sized new graph, which is of interest to us.

Two major categories of hierarchical pooling methods are often used: node dropping and node clustering.

  • Node Dropping: This method deletes nodes from a graph, similar to the subgraph sparsifying method (SSM). Deleting most nodes significantly sparsifies the graph and could eventually make it equivalent to the experience replay methods.

  • Node Clustering: This method merges connected nodes into clusters, highly similar to graph coarsening. Both build a cluster assignment matrix, and nodes assigned to the same cluster are merged. **Graph coarsening techniques can be applied to node clustering, and in a broader sense, they are equivalent. The graph coarsening techniques we focus on in this paper are heuristic approaches that do not require additional learning, for efficiency.**

There are also widely used node clustering methods designed specifically for this purpose, such as DiffPool [2], HGP-SL [3], and SAGPool [4], all of which require learning the cluster assignment matrix as a parameter. This could complicate the framework, making it computationally expensive and working against the initial goal of efficient replay.

We believe that comparing graph coarsening and graph pooling can help readers better understand the concepts and the contribution of our work. We will include this discussion in the revised paper.

[1] Zhang, M., & Li, P. (2022). Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities. arXiv preprint arXiv:2204.07321.

[2] Ying, R., You, J., Morris, C., Ren, X., Hamilton, W., & Leskovec, J. (2018). Hierarchical Graph Representation Learning with Differentiable Pooling.

[3] Zhang, Z., Bu, J., Ester, M., Zhang, J., Yao, C., Yu, Z., & Wang, C. (2019). Hierarchical Graph Pooling with Structure Learning. arXiv preprint arXiv:1911.05954.

[4] Yuan, H., Huang, J., Zhang, X., Ji, S., & Xia, Y. (2020). StructPool: Structured Graph Pooling via Conditional Random Fields. arXiv preprint arXiv:2002.06565.


Q4. Preservation of knowledge from long time ago

A: We understand your concern. When we combine the new graph with the coarsened graph, we expect the coarsened graph to preserve knowledge from all previous tasks. Although the "strength" of old tasks may gradually decrease as more tasks are introduced, we argue that this can be somewhat counteracted by the consolidation of old knowledge through multiple runs. The main results show that even with as many as 10 tasks, the overall forgetting is relatively small, demonstrating the method's ability to preserve knowledge from old tasks. Additionally, in Section 4.4.3, we investigate the test performance of the model on each task after more tasks are learned. Figure 4 shows that the performance on the first task does not further drop when more tasks are learned, demonstrating that the model is able to retain knowledge from earlier tasks.

Comment

Thanks to the authors for the response. However, I still have some concerns towards the following statement.

“There are also widely-used Node Clustering methods specifically for such as DiffPool [2], HGP-SL [3], and SAGPool [4], which all require learning the cluster assignment matrix as a parameter. This could complicate the framework, making it computationally expensive and against the initial goal of efficient replay.”

One should note that not all clustering-based graph pooling methods require learning the assignments; e.g., Graclus [1] is also an efficient heuristic approach that is widely used in graph pooling. Furthermore, I am not convinced that efficiency is the major reason for not choosing learning-based clustering. E.g., in MinCutPool [2], clusters are learned by applying an MLP to node features, which I believe is more efficient than your method, since you require computing the cosine similarity of all nodes and sorting edges.

[1] Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. "Weighted graph cuts without eigenvectors: a multilevel approach." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.11 (2007): 1944-1957.

[2] Bianchi, Filippo Maria, Daniele Grattarola, and Cesare Alippi. "Spectral clustering with graph neural networks for graph pooling." International Conference on Machine Learning. PMLR, 2020.

Comment

Thank you for pointing this out. First, we emphasize that the goal of these node clustering algorithms is different from that of graph coarsening. The main goal of node clustering is to partition the graph and find strongly connected communities (where the number of clusters is significantly smaller than the number of nodes). In contrast, our goal with graph coarsening is to approximate the original graph with a smaller graph that preserves the original graph's topological properties at a moderate reduction rate (e.g., 10% to 50%) for efficient computation. Thus, their objectives are fundamentally different. Applying graph clustering to our goal does not guarantee the preservation of topological properties.

Furthermore, we point out that these node clustering methods are less efficient when applied to coarsening. We clarify that our coarsening algorithm RePro only merges connected nodes, so we only need to calculate cosine similarity between node pairs sharing an edge. As a result, it takes $O(E)$ to compute cosine similarity and $O(E \log E)$ to sort, so the overall complexity is $O(E \log E)$. Note that $E$ is much smaller than $N^2$ in practice.
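For concreteness, a minimal sketch of the edge-scoring step whose cost is discussed above (our illustration; `h` is an assumed [N, d] embedding matrix, not a name from the paper):

```python
import torch
import torch.nn.functional as F

def rank_edges_by_similarity(h, edge_index):
    """Compute cosine similarity only for node pairs sharing an edge (O(E))
    and sort the edges by that score (O(E log E)).

    h:          [N, d] node embeddings
    edge_index: [2, E] edge list
    """
    src, dst = edge_index
    sim = F.cosine_similarity(h[src], h[dst], dim=-1)  # O(E)
    order = torch.argsort(sim, descending=True)        # O(E log E)
    return edge_index[:, order], sim[order]
```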

Graclus [1] is a heuristic graph clustering method. It uses a kernel k-means algorithm that requires multiple iterations, and the complexity of each iteration is $O(N^2 K)$, where $N$ is the number of nodes and $K$ is the number of clusters. Applied to our case, $K$ is a constant fraction of $N$, making the time complexity $O(N^3)$. Thus, RePro is more efficient than even one iteration of Graclus.

As for MinCutPool [2], it learns the cluster assignment matrix with an MLP that minimizes a cut loss, and the time complexity of each iteration to calculate the loss, as stated in the paper, is $O(K(E + NK))$, where $N$ is the number of nodes, $E$ is the number of edges, and $K$ is the number of clusters. Applied to our case, $K$ is a constant fraction of $N$, making the time complexity $O(NE + N^3)$. Note that this is just the time complexity per iteration, and it could take a large number of iterations for the MLP to converge. Thus, RePro has advantages over MinCutPool in terms of efficiency.

Similarly, the efficiency problem for other learning-based graph pooling methods also exists when applying them to graph coarsening, as they require learning and updating the models with multiple iterations.

Comment

Thanks for all the authors' efforts to address my concerns. I have no further questions and vote an acceptance.

Comment

Your recognition and constructive feedback are much appreciated. Thank you for your support!

Review (Rating: 5)

This paper studies the continual learning of Graph Neural Networks. Specifically, the de facto rehearsal-based methods fail to adequately capture the topological information. Accordingly, in this work, the authors propose a graph coarsening based method, which stores the topological information in zoomed-out graphs. The authors empirically justify that the proposed method can closely approximate the original graph.

Experiments are conducted on three datasets against multiple SOTA baselines.

Strengths

Continual learning of graph neural networks is practically important and largely overlooked.

This paper proposes a coarsening based method for preserving the topological information while maintaining a tractable memory consumption, which is novel.

Empirical results are promising. The proposed method consistently outperforms baselines.

Weaknesses

It is unclear how the adopted datasets are split into different tasks.

It seems that fine-tuning also works well on the three adopted datasets; maybe this indicates that the adopted datasets and tasks are not challenging enough.

Questions

  1. Graph coarsening will incur extra computation. Will this result in significant extra resource consumption?

  2. Subgraph rehearsal methods also exist; can they also preserve the topology? What is the advantage of the proposed method?

Limitations

As mentioned by the authors, this paper does not cover the scenario in which nodes and edges can be deleted or modified.

Author Response

Thank you for your insightful feedback. We are happy to address your concerns and questions. Detailed responses to your comments are provided below.


W1. How to split datasets into different tasks

A: We split the original graph into different tasks (subgraphs) based on the timestamp of the source nodes. For instance, in a citation network, if paper A was published at timestamp t and cites paper B, which was published at timestamp t-1, the subgraph for task t contains both paper A and paper B. However, the node attributes of paper B are not available in task t since they belong to the previous timestamp. We discuss the details of task splitting in Section 2.2.


W2. Fine-tuning results and challenges of the tasks

A: We agree with you that the forgetting in fine-tuning may not appear as severe as reported in more extreme settings such as the class-incremental setting. This is because we used a Generalized Class-Incremental Learning setup, where tasks may have overlapping classes. We claim this is more realistic than the traditional Class-Incremental Learning setting. Please refer to Section 2.1 for further discussion.

Nevertheless, we consider the nearly 20% forgetting reported in our setting to be significant and worth studying.

  • First, in many critical domains, even a relatively small performance drop can be a significant concern. For instance, in scenarios such as diagnostic prediction [1], a false classification could potentially put a patient's life at risk. The difference between 99% and 80% accuracy is substantial; the latter could render the system's utility and trustworthiness completely questionable.

  • Second, we emphasize that less forgetting is not equal to less challenging; rather, many catastrophic forgetting scenarios can be easily mitigated with simple strategies. For instance, prior work [59] (reference number in the paper) uses a Class-Incremental Learning setting and the model initially experiences around 90% forgetting in the fine-tune setting. However, after applying a lightweight regularization-based method, EWC, the forgetting drops to 30%. In contrast, in our setting, forgetting is not significantly alleviated with regularization methods and simple ER methods, highlighting the challenges of this setting and the necessity of proposing more sophisticated methods to address the problem.

Additionally, to demonstrate the generalizability of our method, we also evaluated it in the traditional Class-Incremental Learning setting and reported the results in Appendix E.6. These results show that forgetting in fine-tuning is more severe in this setting and that our method can effectively alleviate this forgetting.

[1] Luo, X., Huang, X., & Han, L. (2020). Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. Journal of Ambient Intelligence and Humanized Computing, 11(2), 432-450.


Q1. Extra computation of graph coarsening

A: We agree with you that graph coarsening incurs extra computation, so it is important to ensure that the method we used does not take too much time in training and inference. Therefore, we propose RePro, an efficient graph coarsening method with linear runtime complexity. We believe that a small extra time investment is necessary to accurately preserve topology for consolidating old knowledge. To demonstrate that our coarsening method can effectively alleviate forgetting with relatively small extra time, we reported a trade-off metric in Section 4.4.1. This metric is defined as the time required for coarsening the graph divided by the increase in prediction accuracy compared to the fine-tuning method. As shown in Table 2 in the paper, our coarsening method has the smallest trade-off value of extra time versus performance increase.
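In symbols (our shorthand; these symbols are not notation from the paper): $\text{trade-off} = T_{\text{coarsen}} / (\text{Acc}_{\text{method}} - \text{Acc}_{\text{fine-tune}})$, where $T_{\text{coarsen}}$ is the time spent coarsening and the denominator is the accuracy gain over fine-tuning; smaller values indicate a better trade of extra time for performance.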


Q2. Comparison with subgraph rehearsal method

A: Although the subgraph rehearsal method can also preserve topology, we claim it is less efficient. In related work (Appendix A.1), we discuss a previous work [59] (reference number in the paper) that uses sparsified subgraph memory (SSM) to store the L-hop neighborhood of replay nodes. We argue that "storing topology information of nodes through this method is not very efficient, as the information of uncovered nodes is completely lost, and it fails to capture inter-task correlations in our setup." Also, as suggested by [1], sparsified subgraph methods “partially sacrifice topological information, especially when the computation ego-subnetworks are large and a majority of nodes/edges is removed after sparsification”.

We also compare this method, SSM, with ours in Table 1, which shows that it does not perform as well as our method when similar memory is used.

To better demonstrate the advantages of our method, we also provide a table comparing the characteristics of different methods in the attached PDF file.

[1] Zhang, X., Song, D., Chen, Y., & Tao, D. (2024). Topology-aware Embedding Memory for Learning on Expanding Graphs. arXiv preprint arXiv:2401.03077.

Comment

Dear reviewer HprJ,

We think it would be helpful to illustrate how task splitting works with a step-by-step procedure. We use the DBLP dataset as an example.

  1. We collected and selected papers from 1995-2014 from the DBLP database. The papers were then divided into 10 intervals, each spanning 2 years. For each task, we constructed a subgraph consisting of the nodes (with their features and labels) representing the papers published within the given time interval.

  2. For each paper i in interval t, we examined each paper j that it cites:

  • If j is from the same interval, we add an edge between the two nodes.

  • If j is from a previous interval, we add paper j to the subgraph and connect it to paper i. In this case, the ID of paper j is available, but its attributes and labels are not.

  • If j is from a future interval (which is unlikely in a citation network), we ignore it.

Author Response

We appreciate the thorough reviews provided for our paper. We are encouraged by the positive comments. All four reviewers recognize the novelty of our work. They also concur that our extensive experimental evaluation serves as strong evidence to support the effectiveness of our method. Additionally, two reviewers (F1Ar, AUFc) acknowledge that our paper is well-written, with a clear description of the methodology.

We are keen to address the concerns and questions of the reviewers and provide detailed responses to every point raised. Please refer to our responses to individual reviewers.

We summarize the common concerns raised by the reviewers and how we address them as follows. New results are also provided in response to some of the concerns.

  • We have added the two important related studies suggested by Reviewer F1Ar. Additionally, we conducted an extensive literature review and identified some other recent work. To the best of our knowledge, by including these additional methods, our literature review is comprehensive and covers the most advanced experience-replay-based CGL methods. However, we are happy to include any relevant studies suggested by the reviewers.

  • We have implemented two additional related methods which have open-source code and reported the new results (see attached PDF).

  • To better demonstrate our contribution and position our method within the literature, we have provided a table that compares the characteristics of our method with other experience replay-based methods, including the newly added ones (see attached PDF).

In the attached PDF, we include:

  • New results on two new baseline CGL methods (Table 1).

  • A comparison between different experience-replay-based CGL methods (Table 2).

  • Performance of TACO with different reduction rates (Figure 1).


Additional related work on CGL

We appreciate reviewers AYqY and F1Ar for pointing out some missing related work as comparison methods in CGL. Following reviewer F1Ar's suggestion, we have included the two most up-to-date experience replay-based methods [1][2] in our comparison.

  • CaT [1] uses a condensed graph as the memory buffer and employs a "Training in Memory" scheme that directly learns on the memory buffer to alleviate the data imbalance problem. However, the condensed graph contains only sampled nodes with self-loops, resulting in the loss of valuable topology information.

  • DeLoMe [2] stores the learned representations of the sampled nodes as a memory store, which can preserve graph structure information. However, it still cannot handle inter-task edges, and the stored representations are inadequate for dealing with the dynamic receptive field of old nodes when new nodes form connections with them.

In addition, we also found two other recent studies worth mentioning, including SEM-curvative [3] and PGDNN [4]. Both provide unique insights into storing useful information from previous tasks.

We conducted additional experiments to evaluate CaT and DeLoMe, and the results are reported in Table 2 in the attached PDF. We were not able to implement SEM-curvative and PGDNN, as their source code is not available.

We will incorporate the above discussion and the new results in the revised version.

[1] Liu, Yilun, Qiu, Ruihong, & Huang, Zi. (2023). CaT: Balanced Continual Graph Learning with Graph Condensation. In: Proc. of ICDM.

[2] Niu, Chaoxi, Guansong Pang, and Ling Chen. "Graph Continual Learning with Debiased Lossless Memory Replay." arXiv preprint arXiv:2404.10984 (2024).

[3] Zhang, X., Song, D., & Tao, D. (2023). Ricci Curvature-Based Graph Sparsification for Continual Graph Representation Learning. IEEE Transactions on Neural Networks and Learning Systems.

[4] Zhang, X., Song, D., Chen, Y., & Tao, D. (2024). Topology-aware Embedding Memory for Learning on Expanding Graphs. arXiv preprint arXiv:2401.03077.


The Contribution of TACO

To better highlight the advantages of our proposed method, we provide a comparison between our method and these experience-replay-based methods in Table 2 in the attached PDF file. Our method, TACO, is the only one that can preserve graph topology, handle inter-task edges, and consider the growing receptive field of old nodes.

Final Decision

This paper presents a topology-aware graph coarsening framework for continual graph learning. In particular, the proposed framework stores information from previous tasks as a reduced graph, which expands by integrating with a new graph and aligning shared nodes. Promising results are observed on benchmark datasets against various baselines. Reviewers agreed that this paper studies an important problem and the paper is well written. Meanwhile, reviewers provided some comments regarding problem setting, technical details, paper writing and experiments. During post-rebuttal discussion, reviewers found that most of these concerns have been well addressed in the rebuttal. The authors should update their final version accordingly.