Structural Knowledge Informed Continual Learning for Multivariate Time Series Forecasting
Abstract
Reviews and Discussion
This paper proposes a structural-knowledge-informed framework for continual multivariate time series forecasting, which leverages structural knowledge to enhance multivariate time series prediction in a continual learning setting. The framework incorporates a deep forecasting model that combines a graph learner to capture variable dependencies with a regularization scheme that keeps the learned dependencies consistent with the structural knowledge. The authors tackle the challenge of modeling variable dependencies across different regimes while maintaining prediction performance. Experimental results on several real datasets demonstrate the effectiveness of the proposed framework in improving prediction performance and maintaining consistency with the structural knowledge.
Strengths
- This paper proposes a new framework aimed at knowledge transfer learning.
- The proposed model makes good use of knowledge from former tasks.
- The paper is well written.
Weaknesses
- The complexity of the model increases so much relative to the performance gain that I don't see the need for such a complex design.
- The model is not novel enough; as far as I know, graph structures and memory modules are not new concepts.
- Models are not so essential to the development of the field.
Questions
OFA was published in ICML 2022; more recent models should be added as baselines.
Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pp. 27268-27286. PMLR, 2022.
W1: Complexity of model design.
Thank you for your feedback. The proposed design is necessary to effectively leverage the graph structure information, which is crucial for modeling variable dependencies in time series within a continual learning setting. This design utilizes the structural relationships that simpler designs cannot, addressing key challenges unique to this problem domain.
W2&W3: Novelty.
Thank you for your comment. While it is true that graph structures and memory modules are not entirely new concepts, existing graph models fail to effectively reflect the correct variable dependencies in a continual learning setting. As shown in Figure 4, this limitation can significantly deteriorate performance. Our proposed approach addresses this gap by accurately capturing and utilizing variable dependencies, which is a critical improvement. Additionally, regarding memory replay, we introduce a novel sampling scheme designed to systematically select more representative samples, further enhancing the replay mechanism. These contributions collectively demonstrate the novelty and importance of our work in advancing the field.
Q1: OFA was published in ICML 2022; more recent models should be added as baselines.
Thank you for your comments. OFA was published in NeurIPS 2023, and can be considered a SOTA model.
Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One Fits All: Power general time series analysis by pretrained LM. In NeurIPS, 2023.
Existing MTS models suffer from the forgetting problem in the continual learning scenario, which makes it difficult to remember the variable dependencies from each regime. To solve this problem, the authors propose a structural-knowledge-informed continual learning framework to infer the dependency structure across different regimes. In this framework, the authors use a graph structure to explicitly represent the dependencies between regimes and introduce a regularization to facilitate continual learning. In addition, to alleviate the forgetting problem, the authors propose a new memory replay method, which effectively preserves the temporal dynamics and dependency structure of historical regimes by selecting MTS data samples that maximize temporal coverage. Finally, the authors verify the superior performance of the proposed method through extensive experiments.
Strengths
- Modeling dependencies between states in a graph is novel.
- When describing the method, a large number of model diagrams and structure diagrams are used to help readers understand the method.
- The authors compared a large number of baselines, and the experimental workload was large.
Weaknesses
- On the one hand, the authors introduce a graph structure to model dependencies. On the other hand, the authors propose a new replay method to solve the forgetting problem in continual learning. In my opinion, these are two separate points, and it is inappropriate for the authors to mix them together in the abstract and introduction.
- This paper did not conduct ablation experiments and did not demonstrate the significance of each part of the method.
- The performance on the four datasets is not much better than the baselines.
Questions
- How well does the model resist forgetting after many new regimes arrive?
- How should we understand "We emphasize that we don't intend to use structural knowledge as a ground truth"? Isn't structural knowledge still used as a label in the loss of ?
W1: Mixing graph structure and continual learning in the abstract and introduction.
Thank you for your valuable review comments. We would like to emphasize that continual learning for time series with variable dependencies is a critical challenge. Our contribution lies in proposing a novel graph model to effectively leverage structural information in this continual learning setting. Additionally, we introduce a more efficient memory sample selection method for replay, aimed at mitigating the forgetting problem in continual learning.
W2: Ablation study.
Thank you for your comments. We include ablation experiments (different prediction horizon settings) in Appendix Section F, and a hyperparameter analysis in Section 4.5 and Appendix Section E.
W3: Performance compared to baselines.
Thank you for your comments. Our method achieves first or second place consistently across the datasets compared to baseline methods. This demonstrates the robustness and generalizability of our approach across diverse scenarios. Such consistent high-ranking performance highlights the effectiveness of our proposed method in addressing the challenges of continual learning for time series with variable dependencies.
Q1: Evaluation metrics.
Thank you for raising this point. The metrics we report, AP (Average Performance) and AF (Average Forgetting), are widely recognized and commonly used to evaluate a model's resistance to forgetting in continual learning settings. For clarity, we have provided detailed definitions of these metrics in Section 4.1.
Q2: Question regarding structure knowledge.
Thank you for raising this question. First, we incorporate a forecasting objective to guide the learning of the graph structure, ensuring it aligns with the continual learning setting. Second, unlike directly applying structural knowledge for message passing—an approach that, as shown in Tables 7 and 8, does not yield satisfactory results (e.g., STGCN)—we instead leverage structural knowledge to guide the model in identifying and adapting across regimes. This strategic use of structural information effectively reduces performance degradation and enhances the model's robustness in dynamic environments.
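For illustration only, the role of structural knowledge as a guide rather than a ground truth can be sketched as a soft regularizer added to the forecasting objective. This is our own minimal sketch, not the paper's exact formulation; the names `forecast_loss`, `consistency_loss`, and the weight `lambda_reg` are assumptions:

```python
import numpy as np

def forecast_loss(pred, target):
    # Mean squared error over the forecasting horizon.
    return float(np.mean((pred - target) ** 2))

def consistency_loss(learned_adj, prior_adj):
    # Soft penalty: pull the learned dependency structure toward the
    # structural-knowledge prior without treating the prior as ground truth.
    return float(np.mean((learned_adj - prior_adj) ** 2))

def total_loss(pred, target, learned_adj, prior_adj, lambda_reg=0.1):
    # The forecasting objective dominates; structural knowledge only guides.
    return forecast_loss(pred, target) + lambda_reg * consistency_loss(
        learned_adj, prior_adj
    )
```

In this sketch the prior only nudges the learned adjacency; keeping `lambda_reg` small keeps the forecasting objective dominant, which matches the stated intent of not using structural knowledge as a hard label.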
Thank you for the author's response, which has addressed some of my concerns. However, I still believe that the contribution of the paper to the research field is limited, both in terms of methodology and results. Overall, after considering the author's reply and the feedback from other reviewers, I feel that this paper is not yet ready for acceptance at ICLR, and therefore I will maintain my current score.
This paper introduces the Structural Knowledge Informed Continual Learning (SKI-CL) framework for multivariate time series forecasting, which addresses the challenge of catastrophic forgetting when modeling variable dependencies across different regimes. SKI-CL leverages structural knowledge to guide the model in identifying and adapting to regime-specific patterns, and employs a representation-matching memory replay scheme to preserve temporal dynamics and dependency structures. The framework's efficacy is validated through experiments on synthetic and real-world datasets, demonstrating its superiority over state-of-the-art methods in continual MTS forecasting and dependency structure inference.
Strengths
- The motivation is very meaningful. As recent research [1] has pointed out, distribution drift of time series (including dependency structures) may be the core bottleneck in the forecasting process. Therefore, I believe the authors are attempting to conduct a very significant study.
- The writing is good and easy to follow.
Weaknesses
- I am concerned whether the change in dependency structures is indeed the core bottleneck in real-world scenarios.
- 1.1 The authors should consider using analyses based on real data rather than just the schematic diagram in Figure 1.
- 1.2 Is this issue the core bottleneck of the dataset chosen by the authors? Are other more commonly used datasets, such as METR-LA and PEMS04, also applicable to this method?
- 1.3 In the current manuscript, it seems that the effectiveness of modeling the change in dependency structures can only be validated through experimental results. Thus, the authors need to compare against a broader range of stronger baseline methods, such as latent (but static) graph models, dynamic (but predefined-graph-based) models (e.g., DGCRN [2]), and non-graph models (e.g., STID [3], STNorm [4], STAEformer [5]). The baselines currently chosen by the authors are not strong enough. For example, TCN is a conventional temporal model, while PatchTST, DLinear, TimesNet, and iTransformer are long-sequence forecasting models that are not specifically designed for spatiotemporal prediction and do not explicitly model the dependency graph between sequences. Additionally, the code for GTS contains unintentional errors that significantly impact its performance compared to the original paper.
- Lack of sufficient insights.
- 2.1 What are the core challenges and solutions in modeling the dynamic changes of dependency structures for time series forecasting? Currently, I cannot clearly see the connection between the challenges and the proposed techniques.
- 2.2 What is the distinction between dynamically changing dependency structures and dynamic graph learning?
[1] BasicTS: Exploring progress in multivariate time series forecasting: Comprehensive benchmarking and heterogeneity analysis. TKDE 2024.
[2] DGCRN: Dynamic graph convolutional recurrent network for traffic prediction: Benchmark and solution. TKDD 2023.
[3] STID: Spatial-Temporal Identity: A Simple yet Effective Baseline for Multivariate Time Series Forecasting. CIKM 2022.
[4] STNorm: Spatial and Temporal Normalization for Multi-variate Time Series Forecasting. SIGKDD 2022.
[5] STAEformer: Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting. CIKM 2023.
Questions
See Weakness.
W1
(1) Analysis for Figure 1.
Thanks for your comments. We use Figure 1 to demonstrate the high-level concept of sequential training, to ease understanding of continual learning for multivariate time series forecasting; a real-world example is provided in lines 053-061. We have provided sufficient analyses in the experiments in the main text and appendices.
(2) Other datasets.
Other benchmarks are applicable too. As we mentioned in Appendix C.1 from line 825, the Traffic-CL dataset is built upon the PEMSD3 benchmark, which is similar to METR-LA and PEMS04. We choose the PEMSD3 as it contains sufficient traffic data from 2011 to 2017, with sensors expanding across consecutive years, which is more suitable to demonstrate the continual forecasting setting.
(3) Baselines
The full experimental results covering all baselines are presented in Tables 7 and 8. As we mentioned in Appendix C.2 from line 923, our collection of baselines covers latent static graph-based forecasters (e.g., AGCRN, MTGNN), latent dynamic graph-based forecasters (e.g., ESG, StemGNN), a static graph-based forecaster (STGCN), and non-graph-based forecasters.
Moreover, the most recent baselines such as TimesNet, OFA, and iTransformer are general-purpose time series forecasters whose state-of-the-art short-term forecasting performance has been validated in their respective papers. In addition, iTransformer also captures variable dependencies via attention.
W2: Insights of our paper.
Thanks for your comments. As depicted in the introduction (lines 49-61) and Figure 1, the challenge is the catastrophic forgetting of learned dependency structures in multivariate time series forecasting under a sequential training scenario, where MTS are continuously accumulated under different regimes. This challenge typically leads to performance degradation and distorted dependency structures (both validated in the main experiments and visualizations). To address it, we leverage structural knowledge to steer the forecasting model toward identifying and adapting to different regimes, and we select representative MTS samples from each regime for memory replay. As such, the obtained model can maintain accurate forecasts and infer the learned structures from the existing regimes.
The dynamically changing dependency structures refer to the varying underlying characteristics of multivariate time series across different regimes, especially in the context of sequential training/continual learning, where the basic assumption is that the underlying dependency structures are sampled from the same distribution within one regime and from different distributions across regimes. The notion of dynamic graph learning is about capturing variable dependencies at a finer granularity, e.g., per input window.
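To make the window-level notion of dynamic graph learning concrete, here is a minimal sketch (our own illustration, not the paper's implementation) that derives a normalized adjacency from node embeddings computed on a single input window; the helper name `window_adjacency` and the embedding shapes are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_adjacency(node_embeddings):
    # node_embeddings: (num_vars, embed_dim), produced from one input window.
    # Pairwise similarity -> ReLU -> row-wise softmax yields a normalized
    # adjacency, in the spirit of AGCRN-style parameterized graph learning.
    scores = node_embeddings @ node_embeddings.T
    return softmax(np.maximum(scores, 0.0), axis=1)
```

Under this sketch, a regime-level structure would be expected to stay stable across windows drawn from the same regime, whereas this window-level adjacency can change with every input window; that is the distinction drawn above.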
Thank you for the rebuttal, but I am not fully convinced. Additionally, I still suggest that the authors adopt stronger spatial-temporal baselines. The current spatial-temporal baselines are not state-of-the-art (SOTA) and are from 2022 or earlier. Moreover, long-term time series forecasting methods are not suitable for spatial-temporal forecasting tasks. They perform significantly worse than spatial-temporal models on such tasks, and I am very confident in this point. The authors should select more appropriate baselines to validate the effectiveness of the proposed method and proactively discuss the specific problems this method addresses as well as the limitations of the algorithm.
The authors propose a novel Structural Knowledge Informed Continual Learning (SKI-CL) framework with a graph-based forecaster and a novel representation-matching memory replay scheme, which can perform MTS forecasting and infer dependency structures accurately under the continual learning setting. Experiments demonstrate the superiority of the proposed framework in continual MTS forecasting and dependency structure inference.
Strengths
1. The paper proposes an interesting method by combining dynamic graph learning with a novel representation-matching memory replay scheme for MTS forecasting and dependency structure inference under the continual learning setting.
2. The organization of this paper is clear.
Weaknesses
1. The paper just combines continual learning with multivariate time series forecasting, which can be seen as incremental work. The authors should clearly identify the research topic and main contribution, instead of merely adapting continual learning to multivariate time series forecasting.
2. The design of the structural-knowledge-informed graph learning model lacks innovation. The parameterized graph learning is similar to many works, e.g., AGCRN [1]. The authors could further clarify how their graph learning method differs significantly from existing works.
3. The analysis of the representation-matching memory replay scheme is incomplete:
(1) The authors should clarify how their representation-matching memory replay scheme differs from other experience-replay methods.
(2) The efficiency of the scheme is not well discussed in the paper. It would be beneficial to explain the scheme's efficiency from both theoretical and experimental perspectives.
(3) A visualization of the samples selected by the scheme should be included to demonstrate the effectiveness of the model.
4. In Section 3.4, the authors could explain the inference process more clearly.
5. The experiments have some weaknesses and are not convincing enough:
(1) Since one of the main contributions is developing a graph-based forecaster, some recent graph-based time series forecasting models should be mentioned and compared, e.g., CrossGNN [2] and MSGNet [3]. In addition, the continual learning methods applied to the forecasting methods are dated, and some of the latest methods could be compared.
(2) Different datasets use different methods to construct regimes, e.g., by year, state, activity, and adjacency; the authors could further investigate the effect of different construction methods on the performance of SKI-CL. Some construction methods, e.g., by state and activity, are not reasonable and deviate from the intention of continual learning. In addition, the paper misses details regarding the train-test data splits.
6. From a reader's perspective, the authors should enhance the presentation to avoid misunderstanding. For example, in Fig. 6, the horizontal and vertical coordinates of the heat map should start from 1. In Table 1, what do the bolded and underlined results mean? In Equation 3, what does the mean?
7. Strong recommendation to make the code publicly available.
[1] Bai, L., Yao, L., Li, C., Wang, X., & Wang, C. 2020. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in Neural Information Processing Systems, 33, 17804-17815.
[2] Huang, Q., Shen, L., Zhang, R., Ding, S., Wang, B., Zhou, Z., & Wang, Y. 2023. CrossGNN: Confronting noisy multivariate time series via cross interaction refinement. Advances in Neural Information Processing Systems, 36, 46885-46902.
[3] Cai, W., Liang, Y., Liu, X., Feng, J., & Wu, Y. (2024, March). MSGNet: Learning multi-scale inter-series correlations for multivariate time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 11141-11149.
Questions
See Weaknesses.
W1: Research topic.
We have demonstrated the importance of addressing catastrophic forgetting when time series are sequentially collected (please see Figure 1 and our descriptions, as well as the real-world example in lines 053-061). That being said, our major contribution is a unique perspective that effectively addresses continual multivariate time series forecasting.
W2: How our method differs from existing works.
The existing literature learns the structure either via training parameters or via representations. Our novelty lies in the role of dynamic structure learning for regime characterization, rather than in any individual structure learning component. Importantly, we impose a consistency regularization in the dynamic structure learning process, which aligns the learned structure with the universal, task-irrelevant structural knowledge to characterize a specific regime. The joint structure modeling based on both components leads to a capability that existing backbones cannot achieve: during the inference stage, our model can automatically infer a consistent structure solely from the time series data, without knowing the regime or accessing the memory buffer, as shown in Figure 1 (SKI-CL: Testing) and validated in Figure 4 in the experiments.
W3: Analysis of the representation-matching memory replay scheme.
Thanks a lot for the comments.
- We have introduced the experience-replay methods in the experiment section, lines 354-360: the herding method randomly replays training samples, DER++ enforces an L2 knowledge-distillation loss on the previous logits, and MIR selects the samples with the highest loss for experience replay.
- Intuitively, our representation-matching replay scheme is efficient because it selects samples from the most diverse partitions rather than from the whole training set, which reduces the search space. We will provide an efficiency analysis in a future manuscript.
(3) We agree with the reviewer that visualizing the selected samples can help demonstrate our idea. We have provided visualizations of the learned dependency structures, the continual learning performance, and a case study with time series visualizations, and we will add a visualization of the selected samples when we update the manuscript.
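As a rough sketch of the kind of diversity-driven selection described above (a hypothetical helper; this is not the exact SKI-CL coverage-maximization scheme), one can partition sample representations and keep one representation-matched exemplar per partition:

```python
import numpy as np

def select_memory_samples(reps, k, n_iters=10):
    # reps: (num_samples, dim) hidden representations of training windows.
    # Greedy farthest-point initialization stands in for a coverage-
    # maximizing partition of the representation space.
    centers_idx = [0]
    for _ in range(k - 1):
        d = np.linalg.norm(
            reps[:, None] - reps[centers_idx][None], axis=2
        ).min(axis=1)
        centers_idx.append(int(d.argmax()))
    centers = reps[centers_idx].astype(float)
    # A few Lloyd iterations refine the partition.
    for _ in range(n_iters):
        labels = np.linalg.norm(
            reps[:, None] - centers[None], axis=2
        ).argmin(axis=1)
        for j in range(k):
            members = reps[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Representation matching: keep the sample closest to each partition mean.
    chosen = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx):
            dists = np.linalg.norm(reps[idx] - centers[j], axis=1)
            chosen.append(int(idx[dists.argmin()]))
    return chosen
```

The farthest-point initialization is one way to approximate "maximal coverage"; the per-partition nearest-to-mean pick is what makes the selection representation-matched rather than random.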
W4: Inference process.
Our training and inference process follows the continual learning paradigm, where the model is sequentially trained and evaluated for each regime; the pipeline is illustrated in Figure 1. We thank the reviewer for the advice and will consider further elaboration.
W5: Experiments.
(1) We have included the most recent state-of-the-art time series forecasting model, iTransformer, which explicitly captures the variable dependencies of time series for forecasting. We will consider adding more baselines. Continual learning tailored for time series is still underexplored, and we have implemented the most well-known and effective methods, which serve as strong baselines in the continual learning area.
(2) The basic intention of continual learning is maintaining performance across different tasks. We intend to demonstrate the effectiveness of our SKI-CL in terms of different continual learning scenarios regarding the time series dependencies, including the regimes representing different underlying structures. We have introduced the data splits in the training details in line 1013: The data split is 6/2/2 for training/validation/testing.
W6: Clarification of the presentation.
Due to space limitations in the main text, we provide the visualizations of regimes 3 and 4 for the case study; as mentioned in line 513, the full case-study visualizations of all regimes are provided in Appendix G. The bold and underlined results denote the best and second-best performance, respectively. For Equation 3, it denotes the number of samples in the k-th mode after the K-partition, which is jointly defined with K. That is, K and this quantity together represent the sample partition, and they are the parameters that optimize our objective.
W7: Code.
Please check the supplemental materials where we have attached our code.
Thank you for your detailed response; my concerns have been partially addressed. However, I am not fully convinced, and I have decided to maintain my original score. Below are the specific reasons:
1. I still believe this work is incremental, merely combining continual learning with multivariate time series forecasting, after carefully reading the paper and responses.
2. I still think the design of the structural-knowledge-informed graph learning model lacks innovation because parameterized graph learning methods are well researched. The authors claim that "Our novelty lies in the role of dynamic structure learning for regime characterization"; however, I think it just applies existing dynamic structure learning to regime characterization, which lacks innovation.
3. I still do not understand the detailed inference process from the responses. The authors only introduce the regular inference process of continual learning and the benefits of SKI-CL in the paper and responses, but overlook a detailed description of the inference process of SKI-CL.
4. I still advise the authors to add some of the latest graph-based baselines, because most baselines in the paper are from two years ago or earlier and are not graph-based, which does not demonstrate the effectiveness of SKI-CL well. In addition, I still suggest that the authors add an efficiency analysis of the representation-matching memory replay scheme and visualize the samples it selects, to demonstrate the scheme's efficiency and effectiveness, respectively.
This paper proposes Structural Knowledge Informed Continual Learning (SKI-CL), a framework for multivariate time series (MTS) forecasting that addresses challenges posed by variable dependencies across different regimes. By leveraging structural knowledge (such as physical constraints and domain knowledge), SKI-CL aims to mitigate catastrophic forgetting and improve model adaptation to shifting data distributions. The framework utilizes dynamic graph learning with consistency regularization, aligning learned dependencies with structural knowledge for better regime recognition. A representation-matching memory replay scheme is introduced to retain essential temporal dynamics from each regime. Experiments on synthetic and real-world datasets show superior forecasting accuracy and reliable dependency inference compared to state-of-the-art methods.
Strengths
- This paper is well written. The notations are clear.
- It provides up-to-date literature on MTS techniques with regard to regime shift. It underscores the potential of graph-based learning, paving the way for deep regime awareness in MTS.
- Among many lines of work addressing regime discovery in MTS, graph-based learning has been well explored. However, this paper provides a systematic approach that tackles regime shift in a rational and reasonable way.
- The experiments are convincing and support the arguments for the merits of the proposed SKI-CL framework. In particular, Figures 4 and 5 and Figures 7-10 highlight well how the proposed framework differs from competing methods.
Weaknesses
- Overall, the technical documentation is comprehensive. It would be clearer if an algorithmic procedure were provided as a high-level reference for how the different components in Figures 2 and 3 work together.
- While the traceback of nodes is valid and transparent during inference, in domain applications calibrated regimes matter to model owners because they help interpret the results as narratives. The proposed learning and inference of the graph structure rely on Coverage Maximization and Representation Matching Selection, which are unsupervised and therefore not yet suited to regime calibration tasks.
Questions
- Benchmarking regimes in multivariate time series forecasting is a foundational problem in MTS research. Traditional econometric methods, e.g., the Markov regime-switching model, which can be solved with the EM algorithm, can serve as baseline models for regime discovery. Would that help bring the diverging methodologies onto the same ground for a fair and reasonable comparison, instead of only checking numerical metrics?
- The message of Table 5 is to compare the numbers of baseline model parameters. Would it therefore be better to sort by the number of parameters instead of chronological order?
W1
Thank you for your comments, we will adjust it and provide a high-level explanation in our updated version.
W2
Thank you for your insightful analysis of our work. We agree that calibrated regimes are indeed critical for interpretability in domain-specific applications. In our current model design, the Coverage Maximization and Representation Matching Selection processes are unsupervised, aiming to generalize to dynamic environments. In future work, we plan to incorporate supervised signals into these components to better support regime calibration tasks, thereby further enhancing the interpretability and usability of the model in real-world scenarios.
Q1
Thank you for your insightful idea and we will continue our research in this direction.
Q2
Thank you for your suggestion and we will adjust it in our updated version.
I thank the authors for their responses. Most of my concerns are addressed, despite the open questions. Therefore, I maintain my assessment of this paper.
This work tackles the continual learning problem for multivariate time series forecasting, where variables are added and their dependency structure evolves over time. To model the dependency structure, this work adopts a method similar to existing graph convolutional networks, e.g., AGCRN. Overall, novelty is lacking, and in-depth analyses are missing.
Additional Comments From the Reviewer Discussion
The authors left short rebuttal messages, and the reviewers mostly stick to their original reviews.
Reject