ROSE: Register-Assisted General Time Series Forecasting with Decomposed Frequency Learning
Abstract
Reviews & Discussion
This paper addresses the growing demand for general time series forecasting models that can be pre-trained on diverse datasets to facilitate a variety of downstream prediction tasks.
Strengths
- The paper is well-written and easy to understand.
- The paper effectively addresses the pressing need for general time series forecasting models, which are designed to be pre-trained on diverse datasets, enhancing applicability across various prediction tasks.
- The method facilitates the capture of domain-specific representations during pre-training, thereby improving adaptive transfer to downstream tasks.
Weaknesses
- The paper introduces a pre-training method for time series forecasting, but it lacks a clear demonstration of the effectiveness of the pre-trained model in comparison to existing time series representation learning methods.
- The evaluation is limited to only one type of downstream task, which makes it difficult to assess the robustness and generalizability of the proposed method across various applications in time series forecasting.
Questions
- In the pre-training stage, how do you address excessive data differences across domains, given that overly divergent data often leads to training failure?
- Does the dimension of the time series affect the performance?
We would like to sincerely thank Reviewer e6XG for acknowledging our presentation quality and empirical contributions, as well as the helpful comments. We have revised our paper accordingly.
Q1: Lack a clear demonstration of the effectiveness of the pre-trained model in comparison to existing time series representation learning methods.
A1:
-
Compared with representation learning methods. Existing time series representation learning methods, such as SimMTM [1], TS2Vec [2] and TF-C [3], primarily focus on pre-training on a single dataset and achieve in-domain prediction through fine-tuning on the same dataset. In contrast, ROSE is pre-trained on multi-source data and achieves fast adaptation in various out-of-domain scenarios.
The following table compares the performance of ROSE in full-shot and 10% few-shot settings with that of time series representation learning methods in the full-shot setting. ROSE outperforms the representation learning methods in the full-shot setting and achieves competitive performance even in the 10% few-shot setting, which substantiates its effectiveness as a pre-trained model.
| Models | ROSE | ROSE (10%) | SimMTM | TS2Vec | TF-C |
|---|---|---|---|---|---|
| Metric | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE |
| ETTm1 | 0.341/ 0.367 | 0.349/ | 0.341 / 0.377 | 0.691/ 0.547 | 0.732 / 0.652 |
| ETTm2 | 0.246/ 0.305 | / | 0.258 / 0.315 | 0.316 / 0.351 | 1.721 / 0.922 |
| ETTh1 | 0.391/ 0.414 | / | 0.401 / 0.423 | 0.426 / 0.436 | 0.614 / 0.601 |
| ETTh2 | 0.331/ 0.374 | / | 0.342 / 0.384 | 0.423 / 0.459 | 0.387 / 0.374 |
| Weather | 0.217/ 0.251 | / | / 0.262 | 0.231 / 0.264 | 0.286 / 0.349 |
| Electricity | 0.155/ 0.248 | 0.164/ | / 0.254 | 0.203 / 0.283 | 0.355 / 0.389 |
| Traffic | 0.390/ 0.264 | 0.418/ 0.278 | / | 0.450 / 0.330 | 0.702 / 0.443 |
- Compared with other pre-trained models. We compare different pre-trained models in the zero-shot setting, and ROSE demonstrates superior performance.
[1] Dong, J., Wu, H., Zhang, H., Zhang, L., Wang, J., & Long, M. (2024). Simmtm: A simple pre-training framework for masked time-series modeling. Advances in Neural Information Processing Systems, 36.
[2] Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., & Xu, B. (2022, June). Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8980-8987).
[3] Zhang, X., Zhao, Z., Tsiligkaridis, T., & Zitnik, M. (2022). Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems, 35, 3988-4003.
Q2: The evaluation focuses on one type of downstream task, which limits the assessment of the method's robustness and generalizability.
A2: Our paper focuses on general time series forecasting, like existing models such as Moirai [1], Timer [2] and TimesFM [3].
To verify the robustness and generalizability of ROSE in various applications of time series forecasting, we have presented the experimental results in the following tasks in the paper:
- Various downstream fine-tuning settings: including full-shot, few-shot and zero-shot settings.
- Various downstream datasets: including realistic datasets from multiple domains such as Transport, Nature, Prices and Energy, with varying numbers of channels and sampling frequencies. Detailed statistics on the downstream datasets are provided in the paper.
- Various prediction lengths: including both long-term forecasting (with prediction horizons 96, 192, 336 and 720) and short-term forecasting.
Q3: How do you address excessive data differences across domains? Data with too much difference often leads to training failure.
A3: Learning a unified representation from time series data across different domains is challenging. We address this problem and avoid training failure from the following three aspects:
- Scale differences. We address these by normalizing each pre-training dataset individually.
- Frequency distribution differences. We propose decomposed frequency learning to address these: we combine multi-frequency masking with a reconstruction task so that the model understands time series from both low- and high-frequency perspectives, enabling it to learn a unified representation across multi-domain datasets with varying frequency distributions.
- Domain differences. We propose the time series register, which captures and stores domain-specific information.
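As an illustration of the multi-frequency masking idea above, here is a minimal numpy sketch (the function name, the fixed 0.5 cutoff, and the binary low/high split are our assumptions for illustration, not ROSE's exact procedure): each masked view zeroes either the low- or the high-frequency band of the spectrum, and the model would be trained to reconstruct the original series from such views.

```python
import numpy as np

def multi_frequency_mask(x, num_masks=4, cutoff_ratio=0.5, seed=0):
    """Produce `num_masks` views of a 1-D series, each with either its
    low- or high-frequency band zeroed out in the rFFT spectrum."""
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfft(x)                      # (L,) -> (L//2 + 1,) complex bins
    cutoff = int(len(freqs) * cutoff_ratio)     # illustrative fixed threshold
    views = []
    for _ in range(num_masks):
        masked = freqs.copy()
        if rng.random() < 0.5:
            masked[:cutoff] = 0                 # drop low frequencies
        else:
            masked[cutoff:] = 0                 # drop high frequencies
        views.append(np.fft.irfft(masked, n=len(x)))
    return views
```

A reconstruction loss between each view's encoding and the original series would then force the model to recover the masked band from the remaining one.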
Q4: Does the dimension of the time series affect the performance?
A4: We employ a channel-independent strategy that naturally handles time series of varying dimensions. Our experiments include datasets with various dimensions and achieve promising results; the dimensions of the different datasets are reported in the paper.
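As a minimal sketch of the channel-independent strategy (the helper name and the dummy backbone interface are hypothetical): each channel is folded into the batch axis and treated as an independent univariate series, so the backbone never depends on the number of channels.

```python
import numpy as np

def channel_independent_forward(batch, univariate_model):
    """Run a univariate backbone on a multivariate batch of shape (B, C, L):
    fold channels into the batch axis, forecast, then restore (B, C, H)."""
    B, C, L = batch.shape
    flat = batch.reshape(B * C, L)       # every channel becomes its own sample
    out = univariate_model(flat)         # (B * C, H) forecasts
    H = out.shape[-1]
    return out.reshape(B, C, H)          # back to per-channel layout
```

Because the backbone only ever sees shape (N, L), datasets with 7 channels or 321 channels go through the same model unchanged.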
[1] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., & Sahoo, D. (2024). Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592.
[2] Liu, Y., Zhang, H., Li, C., Huang, X., Wang, J., & Long, M. (2024). Timer: Transformers for time series analysis at scale. arXiv preprint arXiv:2402.02368.
[3] Das, A., Kong, W., Sen, R., & Zhou, Y. (2023). A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
Dear Reviewer,
Thank you for your valuable and constructive feedback, which has inspired further improvements to our paper. As a gentle reminder, it has been more than 3 days since we submitted our rebuttal. We would like to know whether our response addressed your concerns. We eagerly await your feedback and are ready to respond to any further questions you may have.
Thank you for your time and consideration.
Best regards
Dear Reviewer e6XG,
Since the End of author/reviewer discussions is coming soon, may we know if our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score. If you have any further concerns, please let us know and we will be more than happy to engage in more discussion and paper improvements.
Once again, thank you for your suggestion and time!
Dear Reviewer e6XG,
We would like to sincerely thank you for your time and efforts in reviewing our paper.
We have made an extensive effort to address your concerns:
- Adding a comparison with representation learning methods;
- Clarifying the diverse scenarios covered in our experiments;
- Explaining the three aspects in which we address the problem of excessive data differences across domains;
- Explaining our implementation of the channel-independent strategy to overcome the effects of time series dimensionality;
- Making revisions to the paper and appendix accordingly.
We hope that our response can address your concerns to your satisfaction. If so, we kindly ask for your reconsideration of the score. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond timely. We kindly remind you that the reviewer-author discussion phase will end soon. After that, we may not have a chance to respond to your comments.
All the best,
Authors
Dear reviewer e6XG,
Thank you for taking the time and effort in providing a valuable review of our work. As the discussion period is coming to a close, we hope that you have had the chance to review our rebuttal.
If our rebuttal has resolved your concerns or improved your understanding of the paper, we would greatly appreciate it if you could reconsider your assessment and update the score accordingly. Your feedback has been incredibly helpful, and we value the opportunity to further improve the work based on your insights.
Thank you again for your thoughtful review and for considering our responses. Please feel free to reach out if you have any additional questions or require further clarifications.
Best regards,
Authors
Dear Reviewer e6XG,
Thank you for taking the time and effort to provide a valuable review of our work. We have responded to each of your insight comments point by point.
Since the end of the discussion period is coming soon, we hope that you have had the chance to read our rebuttal. We eagerly await your feedback and are ready to respond to any remaining concerns you may have. If our rebuttal has resolved your concerns or improved your understanding of the paper, we would also be very grateful if you could reconsider the score, which would give us a greater opportunity to present this work at the conference.
Thank you once again for your time and review.
Best regards,
Authors
Dear Reviewer e6XG,
We sincerely appreciate the time and effort you dedicated to reviewing our paper during this busy period, as well as your recognition of its strengths.
With the author/reviewer discussion phase now concluded and no additional concerns raised, we believe our rebuttal has addressed all of your comments. However, based on the scoring standards of past ICLRs, we find that our current score of 5.75 is at the borderline level. We would be deeply grateful if you could consider raising your score and giving us the opportunity to present our work at the conference.
Thank you once again for your thoughtful review and valuable feedback.
Best regards,
Authors
In this paper the authors introduce ROSE, a pretrained encoder-decoder style time series forecaster. They propose to pretrain this model with frequency decomposition, a cross-domain time series register for nearest-neighbor lookup, and both reconstruction and prediction heads. They complete an empirical study demonstrating excellent zero-shot and few-shot performance of ROSE.
Strengths
The paper presents a solid design of a general time series forecaster in the encoder-decoder style. In particular, its novelty includes (1) a standalone registry for dimension reduction, in contrast to, e.g., vector quantization or codebook learning, and (2) multi-frequency masking for representation learning.
For the empirical section, the ablation study is also comprehensive.
The paper has good writing quality and is easy to follow.
Weaknesses
There are a few theoretical and empirical issues that prevent this work from being absolutely sound. Please see questions.
Questions
Regarding the design:
- What's the justification for creating a general forecaster in an encoder-decoder style, which (1) to some extent locks in its lookback and forecast lengths, thus not handling any context + any horizon, and (2) is better received for time-series embedding (e.g. Moment [1])?
- Given the distinct targets of reconstruction vs. prediction, is the benefit of the reconstruction loss due to better representation learning, or just more ground-truth labels touched (e.g. imagine training only using the prediction loss on (512/720) * 100% more examples)?
- Why use 4 forecasting heads instead of 1 head with losses on 4 horizons? What's the practice when the forecast horizon is covered by multiple heads?
- Will the registry fail when the input time series is much shorter than 512, because of the linear projection into it?
- Is the registry a seasonality lookup eventually? (Not an issue, just curious.)
Regarding the empirical study:
- When ROSE is treated as a zero-shot foundation model, the benchmark on the 7 datasets popular for supervised learning is not sufficient. For example, benchmarks with shorter horizons or shorter look-backs are missing. Something similar to Table 9 but in the zero-shot setup would be helpful.
- Inference times of other zero-shot models seem a bit off. Likely some baseline models are not called as expected (e.g. without compilation or not on accelerators).
[1] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024.
After rebuttal: I appreciate the authors' effort addressing my concerns here. I've revised my rating accordingly.
We would like to sincerely thank Reviewer 8yo6 for providing a detailed review and insightful comments regarding the model design and empirical study. We have revised our paper accordingly.
Q1: The justification for encoder-decoder style, which (1) to some extent locks in its look-back and forecast lengths, (2) is more well received for time-series embedding (e.g. Moment [1]).
A1:
-
The justification for encoder-decoder style:
- We would like to clarify that our model is essentially an encoder-only architecture. The naming of the decoder is merely to distinguish it functionally from the Transformer encoder within the backbone. We adopt this architecture along with a masked reconstruction pre-training task, as it has been demonstrated to be effective for time series representation learning; this is also validated by MOMENT [1] and MOIRAI [2]. Unlike MOMENT's random masking or MOIRAI's last-patch masking strategy, we propose a novel frequency-based masking approach. We do not choose a decoder-only architecture because its auto-regressive prediction nature tends to cause cumulative errors.
- To further enhance the effectiveness of this architecture for time series prediction tasks and enable few-shot and zero-shot capabilities, aiming to build a general time series forecasting model, we add prediction heads to improve prediction accuracy.
- For concern 1: The fixed look-back window and prediction lengths apply only to pre-training; our model still supports multiple look-back windows and prediction lengths in downstream tasks. Specifically, for look-back windows and prediction lengths within the range of the pre-training settings, ROSE supports zero-shot forecasting. Moreover, in all scenarios, the model supports few-shot learning through fine-tuning.
- For concern 2: ROSE enhances prediction performance and learning efficiency on diverse downstream datasets by leveraging multi-source pre-training with both reconstruction and prediction tasks. Unlike MOMENT, which focuses more on general representation learning for various tasks, ROSE emphasizes predictive accuracy and supports both zero-shot and few-shot capabilities, making it particularly effective for forecasting. We leave extending ROSE to broader tasks, similar to those addressed by MOMENT, as future work.
- Experiment: To demonstrate that ROSE excels not only with fixed inputs, we evaluate downstream tasks with an input length significantly shorter than the 512 used in pre-training. The following table, updated in the revised paper, shows the results of ROSE after fine-tuning with a look-back window of 96. Despite shorter input lengths, ROSE still achieves state-of-the-art performance, demonstrating effective transfer of pre-trained knowledge.
| Model | ROSE | iTransformer | PatchTST | TimesNet | Dlinear | GPT4TS | IP-LLM |
|---|---|---|---|---|---|---|---|
| Metric | MSE/MAE | MSE/MAE | MSE/MAE | MSE/MAE | MSE/MAE | MSE/MAE | MSE/MAE |
| ETTm1 | /0.389 | 0.407/0.410 | 0.387/0.400 | 0.400/0.406 | 0.403/0.407 | / | 0.390/0.399 |
| ETTm2 | 0.272/0.321 | 0.288/0.332 | / | 0.291/0.333 | 0.350/0.401 | 0.285/0.331 | 0.278/0.327 |
| ETTh1 | 0.432/0.426 | 0.454/0.447 | 0.469/0.454 | 0.458/0.450 | 0.456/0.452 | 0.447/0.436 | / |
| ETTh2 | 0.376/0.393 | 0.383/0.407 | 0.387/0.407 | 0.414/0.427 | 0.559/0.515 | 0.381/0.408 | / |
| Weather | 0.257/0.276 | / | 0.259/0.281 | 0.259/0.287 | 0.265/0.317 | 0.264/0.284 | 0.266/0.284 |
| Electricity | 0.176/0.268 | / | 0.205/0.290 | 0.192/0.296 | 0.354/0.414 | 0.205/0.290 | 0.195/0.285 |
| Traffic | /0.276 | 0.428/ | 0.481/0.304 | 0.620/0.336 | 0.625/0.383 | 0.488/0.317 | 0.467/0.305 |
We sincerely appreciate your insightful suggestions, which have greatly helped us better clarify the model's characteristics and enhance the quality of our paper.
[1] Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., & Dubrawski, A. (2024). Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885.
[2] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., & Sahoo, D. (2024). Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592.
Q2: Is the benefit of the reconstruction loss due to better representation learning, or just more ground-truth labels touched?
A2:
- The benefit of the reconstruction loss comes from decomposed frequency learning rather than from more ground-truth labels. Decomposed frequency learning disentangles complex temporal patterns and effectively addresses the issue of coupled semantic information, leading to a unified representation across time series from various domains.
- More ground-truth labels do not bring a direct benefit. In the ablation study of masking methods, we compare our multi-frequency masking with two alternative strategies: random frequency masking and patch masking. All three methods use the same ground-truth labels, yet the two alternatives do not improve model performance.
Q3:
Q3.1 Why using 4 prediction heads instead of a single prediction head with losses on 4 prediction lengths?
A3.1:
- Intuition: Training four prediction heads during pre-training allows the model to focus on forecasts of different lengths, enhancing accuracy in specific ranges. Conversely, a single prediction head must handle both short-term and long-term forecasts, forcing the model to balance accuracy across lengths, which could limit its performance on multiple prediction lengths.
- Experiment: As shown in the table below, which is updated in the revised paper, we compare the full-shot performance of four prediction heads versus a single head with losses computed across four prediction lengths during pre-training. Four prediction heads consistently yield better performance across all lengths.
| Setting | pred_len | ETTh1 | ETTh2 | ETTm1 | ETTm2 |
|---|---|---|---|---|---|
| | | MSE/MAE | MSE/MAE | MSE/MAE | MSE/MAE |
| pre-train_w_four_heads | 96 | 0.354/0.385 | 0.265/0.320 | 0.275/0.328 | 0.157/0.243 |
| | 192 | 0.389/0.407 | 0.328/0.369 | 0.324/0.358 | 0.213/0.283 |
| | 336 | 0.406/0.422 | 0.353/0.391 | 0.354/0.377 | 0.266/0.319 |
| | 720 | 0.413/0.443 | 0.376/0.417 | 0.411/0.407 | 0.347/0.373 |
| | avg | 0.391/0.414 | 0.331/0.374 | 0.341/0.367 | 0.246/0.305 |
| pre-train_w_one_head | 96 | 0.357/0.388 | 0.269/0.325 | 0.277/0.330 | 0.158/0.245 |
| | 192 | 0.394/0.411 | 0.333/0.372 | 0.324/0.356 | 0.215/0.287 |
| | 336 | 0.415/0.422 | 0.359/0.395 | 0.360/0.380 | 0.272/0.325 |
| | 720 | 0.420/0.444 | 0.390/0.425 | 0.421/0.410 | 0.355/0.385 |
| | avg | 0.397/0.417 | 0.338/0.379 | 0.346/0.369 | 0.250/0.310 |
Q3.2 What's the practice when the prediction length is covered by multiple heads?
A3.2:
- The practice: During inference, when the prediction length is covered by multiple heads, we select the prediction head whose output length is closest to the prediction length. For example, if the prediction length is 48, we select only the prediction head whose output length is 96, even though the other three heads could also cover a prediction length of 48 by cropping. This is because each head is optimized during pre-training for a specific prediction range and thus focuses on different information.
- Experiment: To demonstrate the effectiveness of this practice, the following tables, updated in the revised paper, display the prediction performance for different prediction lengths that are covered by multiple heads. The strategy of choosing the closest prediction head allows the model to adapt to various prediction lengths and achieve SOTA performance.
ETTh1: (As an example, the first column indicates that using prediction heads of 96, 192, 336, and 720 respectively, to predict with the length of 48.)
| prediction heads\ prediction lengths | 48 | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE |
| head of 96 | 0.325 / 0.364 | 0.354 / 0.385 | - | - | - |
| head of 192 | 0.327 / 0.365 | 0.358 / 0.385 | 0.389 / 0.407 | - | - |
| head of 336 | 0.327 / 0.365 | 0.361 / 0.386 | 0.399 / 0.410 | 0.406 / 0.422 | - |
| head of 720 | 0.329 / 0.365 | 0.359 / 0.389 | 0.401 / 0.411 | 0.419 / 0.425 | 0.413 / 0.443 |
ETTm2:
| prediction heads\ prediction lengths | 48 | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE |
| head of 96 | 0.120 / 0.213 | 0.157 / 0.243 | - | - | - |
| head of 192 | 0.120 / 0.213 | 0.159 / 0.245 | 0.213 / 0.281 | - | - |
| head of 336 | 0.122 / 0.214 | 0.159 / 0.244 | 0.216 / 0.283 | 0.266 / 0.319 | - |
| head of 720 | 0.124 / 0.216 | 0.159 / 0.245 | 0.216 / 0.284 | 0.269 / 0.321 | 0.347 / 0.373 |
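The head-selection rule described above can be sketched as follows (the function name is illustrative; the head lengths follow the pre-training setup): among heads long enough to cover the requested length, pick the closest one, then crop its output.

```python
def select_head(pred_len, head_lens=(96, 192, 336, 720)):
    """Pick the prediction head whose output length is closest to, and at
    least, the requested prediction length; its output is then cropped."""
    candidates = [h for h in head_lens if h >= pred_len]
    if not candidates:
        raise ValueError("prediction length exceeds the longest head")
    return min(candidates, key=lambda h: h - pred_len)
```

For instance, a requested length of 48 maps to the 96-length head, matching the example in the answer above.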
Q4: Will the register fail when the input time series is much shorter than 512?
A4:
- The register can handle input time series shorter than 512 without failing. To adapt to shorter inputs, we create a new linear projection layer for the new input length and update its parameters during fine-tuning.
- Experiment: We validate the effectiveness of the register under an input length of 96, which is shorter than 512. The average results over all prediction lengths in the table below demonstrate that the register remains effective.
| | w register | w/o register |
|---|---|---|
| Metric | MSE / MAE | MSE / MAE |
| ETTm1 | 0.389/0.389 | 0.392/0.390 |
| ETTm2 | 0.272/0.321 | 0.287/0.323 |
| ETTh1 | 0.432/0.426 | 0.436/0.431 |
| ETTh2 | 0.376/0.393 | 0.381/0.398 |
| Traffic | 0.440/0.276 | 0.450/0.288 |
| Weather | 0.257/0.276 | 0.264/0.275 |
| Solar | 0.230/0.255 | 0.249/0.270 |
| Electricity | 0.182/0.268 | 0.189/0.275 |
Q6: Is the register a seasonality lookup eventually?
A6:
- To investigate whether the register serves as a seasonality lookup, we compare the cosine similarity of register vector selections across datasets from different domains and with different periods. The results in the table below indicate that:
- Although ETTh2 and ETTm1 (both from the energy domain) have different periods, their register vector selections are very similar;
- ETTh2 and Traffic, despite sharing the same period, come from different domains and display low similarity in their register vector selections.
- Based on these observations, we conclude that the register is not a seasonality lookup.
| | ETTh2 (period = 24) | ETTm1 (period = 96) | Traffic (period = 24) | Pems08 (period = 288) |
|---|---|---|---|---|
| ETTh2 (period = 24) | 1 | 0.92 | 0.1 | 0.24 |
| ETTm1 (period = 96) | 0.92 | 1 | 0.07 | 0.13 |
| Traffic (period = 24) | 0.1 | 0.07 | 1 | 0.62 |
| Pems08 (period = 288) | 0.24 | 0.13 | 0.62 | 1 |
Q7: Benchmarks with shorter prediction lengths or shorter look-back windows.
A7: Based on your suggestion, we add experiments for short-term prediction in the zero-shot setting, following MOMENT. For inputs in the M4 dataset that are too short, we adapt them with padding. The results are shown in the table below and updated in the revised paper.
| | ROSE | Moment | GPT4TS | TimesNet | N-BEATS |
|---|---|---|---|---|---|
| M4 Yearly | 14.08 | 14.84 | 14.80 | 14.40 | |
| M4 Quarterly | 12.02 | 11.77 | 13.21 | 12.25 | |
| M4 Monthly | 15.80 | 15.36 | 15.67 | 15.24 | |
Q8: Inference times of other zero-shot models seem a bit off.
A8: We appreciate your detailed and rigorous comments. We test the inference time of foundation models using the ProbTS framework [1]. Upon thorough review, we found that Moment and TimesFM did not run on accelerators, which biased their inference times. We have re-evaluated these models and updated the results in the revised paper.
The corrected results continue to affirm that ROSE maintains a clear efficiency advantage while also demonstrating superior zero-shot performance.
[1] Zhang, J., Wen, X., Zheng, S., Li, J., & Bian, J. (2023). ProbTS: A Unified Toolkit to Probe Deep Time-series Forecasting. arXiv preprint arXiv:2310.07446.
Dear Reviewer,
Thank you for your valuable and constructive feedback, which has inspired further improvements to our paper. As a gentle reminder, it has been more than 3 days since we submitted our rebuttal. We would like to know whether our response addressed your concerns. We eagerly await your feedback and are ready to respond to any further questions you may have.
Thank you for your time and consideration.
Best regards
Dear Reviewer 8yo6,
Since the End of author/reviewer discussions is coming soon, may we know if our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score. If you have any further concerns, please let us know and we will be more than happy to engage in more discussion and paper improvements.
Once again, thank you for your suggestion and time!
Dear Reviewer 8yo6,
We would like to sincerely thank you for your time and efforts in reviewing our paper.
We have made an extensive effort to address your concerns:
- Justifying the choice of architecture;
- Experimentally demonstrating that the effectiveness of the reconstruction loss is due to better representation learning rather than more labels;
- Evaluating the model performance in more scenarios (different input/output lengths and short-term predictions in the zero-shot setting);
- Describing the selection strategy for prediction heads;
- Experimentally demonstrating the effectiveness of the register with shorter input lengths;
- Describing the difference between the register and a seasonality lookup;
- Re-evaluating the inference time;
- Making revisions to the paper and appendix accordingly.
We hope that our response can address your concerns to your satisfaction. If so, we kindly ask for your reconsideration of the score. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond timely. We kindly remind you that the reviewer-author discussion phase will end soon. After that, we may not have a chance to respond to your comments.
All the best,
Authors
Thanks to the authors for their work. We would like to ask a question. ROSE has potential as a foundation model, and you also experimented with zero-shot prediction in your paper. But what if the prediction length of the downstream task does not coincide with a length supported by the prediction heads in the model? Or what if it exceeds the prediction lengths supported by the heads? In other words, how much would performance be affected if we used the 96 head for 192/336/720 prediction tasks?
Thank you for your attention and interest in our work. We appreciate the opportunity to provide clarification and highlight key aspects of our model design and contributions.
- General adaptability: ROSE is a general time series forecasting model that leverages multi-source pre-training with reconstruction and prediction tasks. While demonstrating strong zero-shot capabilities, it also focuses on significantly improving prediction performance in few-shot and full-shot settings, making it adaptable to various scenarios.
- Flexible inference: During inference, we select the prediction head closest to the required prediction length, enabling the model to handle various prediction lengths up to 720. This long prediction horizon supports most downstream zero-shot needs, achieving SOTA performance.
- Extending prediction lengths: With a prediction length of 720 already being quite extensive, zero-shot prediction for lengths beyond 720 can still be achieved using an autoregressive approach. However, we recommend fine-tuning as the preferred method for better adaptation.
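The autoregressive extension mentioned in the last point can be sketched as follows (a hedged illustration, not ROSE's implementation; `model` is a hypothetical callable mapping a look-back window to its next `step` values): forecast in chunks, feed each chunk back into the context, and crop to the requested horizon.

```python
import numpy as np

def rollout_forecast(model, context, horizon, step=720):
    """Extend forecasts beyond the longest head by autoregressive rollout:
    predict `step` points, append them to the look-back window, repeat."""
    lookback = len(context)
    ctx = np.asarray(context, dtype=float).copy()
    preds, produced = [], 0
    while produced < horizon:
        nxt = model(ctx)                              # next `step` values
        preds.append(nxt)
        produced += len(nxt)
        ctx = np.concatenate([ctx, nxt])[-lookback:]  # keep fixed look-back
    return np.concatenate(preds)[:horizon]
```

As the rebuttal notes, errors compound across rollout steps, which is why fine-tuning is recommended over this scheme for very long horizons.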
We hope this explanation addresses your concerns and provides a clearer understanding of our work.
Dear reviewer 8yo6,
Thank you for taking the time and effort in providing a valuable review of our work. As the discussion period is coming to a close, we hope that you have had the chance to review our rebuttal.
If our rebuttal has resolved your concerns or improved your understanding of the paper, we would greatly appreciate it if you could reconsider your assessment and update the score accordingly. Your feedback has been incredibly helpful, and we value the opportunity to further improve the work based on your insights.
Thank you again for your thoughtful review and for considering our responses. Please feel free to reach out if you have any additional questions or require further clarifications.
Best regards,
Authors
Dear Reviewer 8yo6,
Thank you for taking the time and effort to provide a valuable review of our work. We have responded to each of your insight comments point by point.
Since the end of the discussion period is coming soon, we hope that you have had the chance to read our rebuttal. We eagerly await your feedback and are ready to respond to any remaining concerns you may have. If our rebuttal has resolved your concerns or improved your understanding of the paper, we would also be very grateful if you could reconsider the score, which would give us a greater opportunity to present this work at the conference.
Thank you once again for your time and review.
Best regards,
Authors
Dear Reviewer 8yo6,
We sincerely appreciate the effort you have put into reviewing our paper during this busy period, as well as the recognition of the strengths of our work. We are truly grateful for the increased score.
We fully respect your current evaluation of our work. However, based on the scoring standards of past ICLRs, we find that our current score of 5.75 is at the borderline level. Therefore, we kindly ask if you could consider raising the score once again, so that we would have the opportunity to present our work at the conference. We would be deeply grateful.
Thank you again for taking the time to review and comment on our paper.
Best regards,
Authors
This paper introduces ROSE, a model designed to achieve a unified representation and effectively capture domain-specific information through two main components: Decomposed Frequency Learning and Time Series Register. Extensive experiments have demonstrated the competitive performance.
Strengths
- This paper is well-written and easy to follow.
- The time series register is interesting.
- Extensive experiments have shown that ROSE achieves competitive performance.
Weaknesses
- The paper should provide further explanations on how Decomposed Frequency Learning contributes to learning a unified representation.
- For ROSE, requiring a different new Register for each dataset during the pre-training stage means that adding a new dataset necessitates starting a new pre-training process. This increases the computational cost, whereas previous methods primarily focus on the fine-tuning stage and typically involve only one pre-training phase.
- Additional analysis experiments for Decomposed Frequency Learning are necessary to clarify why it is beneficial for achieving a unified representation.
- The experimental settings for Figure 1(b) are lacking.
Questions
Please see weaknesses.
We would like to sincerely thank Reviewer Z8Ez for providing a detailed review and insightful comments. We have revised our paper accordingly.
Q1: Why does decomposing frequency learning help achieve unified representations, and what further analytical experiments support this?
A1:
- Design: Decomposed frequency learning helps achieve unified representations through two design aspects: 1) Multi-frequency masking randomly masks the high- or low-frequency components of a time series multiple times, decoupling complex temporal patterns. 2) The reconstruction task enables the model to understand data from multiple frequency perspectives, allowing it to learn unified representations.
- Experiment: To further demonstrate the effectiveness of decomposed frequency learning in capturing unified representations, we pre-train the model using multi-frequency masking, patch masking, and multi-patch masking, and visualize the reconstruction performance of the three methods in out-of-distribution (OOD) scenarios. The visualization results are included in the revision. The model pre-trained with multi-frequency masking exhibits greater robustness to complex temporal patterns, confirming that decomposed frequency learning helps learn unified representations.
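To make the masking idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of multi-frequency masking with NumPy; the mask count, cutoff sampling, and 50/50 high/low choice are illustrative assumptions:

```python
import numpy as np

def multi_frequency_mask(x, num_masks=4, rng=None):
    """Generate masked views of a series by zeroing random frequency bands.

    For each view, a random cutoff splits the real FFT spectrum, and either
    the high band or the low band is zeroed before inverting the FFT, so
    each view exposes the model to a different frequency perspective.
    """
    rng = rng or np.random.default_rng(0)
    spec = np.fft.rfft(x)
    views = []
    for _ in range(num_masks):
        cutoff = int(rng.integers(1, len(spec)))
        masked = spec.copy()
        if rng.random() < 0.5:
            masked[cutoff:] = 0.0   # mask high frequencies, keep low
        else:
            masked[:cutoff] = 0.0   # mask low frequencies, keep high
        views.append(np.fft.irfft(masked, n=len(x)))
    return views

views = multi_frequency_mask(np.sin(np.linspace(0, 8 * np.pi, 96)))
```

Each view is then reconstructed back to the original series during pre-training, which is what forces the model to reason about both frequency bands.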
Q2: The requirement for a different new Register for each dataset during the pre-train stage.
A2: We would like to clarify that during the pre-training stage, ROSE does not require training a separate register for each dataset. Instead, a single register is used to cluster and store the domain-specific information from the multi-source datasets. When a new dataset is added for pre-training, the pre-trained register can continue to be used and updated through incremental training. When a new dataset is added for fine-tuning, the pre-trained register can be fine-tuned with a learnable low-rank matrix A. Both cases are similar to existing pre-trained models [1][2][3] and do not require starting a new pre-training process.
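As an illustration of this mechanism, the following hypothetical sketch (names, dimensions, and the rank are our assumptions, not the paper's code) shows a single shared register, a top-k cosine lookup, and a low-rank update `A = U @ V` applied at fine-tuning time while the register itself stays frozen:

```python
import numpy as np

def select_register_tokens(z, register, k=3):
    """Return the k register vectors most similar (cosine) to embedding z."""
    sims = register @ z / (
        np.linalg.norm(register, axis=1) * np.linalg.norm(z) + 1e-8
    )
    top = np.argsort(sims)[-k:][::-1]
    return register[top]

rng = np.random.default_rng(0)
register = rng.normal(size=(128, 64))   # one shared register for all datasets
z = rng.normal(size=64)                 # sample embedding from a new dataset

# Fine-tuning: the register stays frozen; only the low-rank factors
# U, V (rank 4 here) are learned for the downstream dataset.
U = rng.normal(size=(128, 4)) * 0.01
V = rng.normal(size=(4, 64)) * 0.01
adapted_register = register + U @ V

tokens = select_register_tokens(z, adapted_register, k=3)
```

Because the low-rank update is additive, incremental pre-training on a new dataset can keep reusing the same 128-entry bank rather than allocating a new one per dataset.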
Q3: The description of the setting for Figure 1(b) is missing.
A3: We select three datasets (Pems08, PSRA, Electricity) from the transport, nature, and energy domains, respectively, and compare the differences in hidden representations between direct transfer and adaptive transfer. Specifically, direct transfer refers to the case where domain-specific information is not considered, while adaptive transfer uses the domain-specific information learned by the register tokens. We visualize the encoder's hidden representations using t-SNE. The description of this setting has been added to the revision.
[1] Gao, S., Koker, T., Queen, O., Hartvigsen, T., Tsiligkaridis, T., & Zitnik, M. (2024). Units: Building a unified time series model. arXiv preprint arXiv:2403.00131.
[2] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., & Sahoo, D. (2024). Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592.
[3] Liu, Y., Zhang, H., Li, C., Huang, X., Wang, J., & Long, M. (2024). Timer: Transformers for time series analysis at scale. arXiv preprint arXiv:2402.02368.
Dear Reviewer,
Thank you for your valuable and constructive feedback, which has inspired further improvements to our paper. As a gentle reminder, it has been more than 3 days since we submitted our rebuttal. We would like to know whether our response addressed your concerns. We eagerly await your feedback and are ready to respond to any further questions you may have.
Thank you for your time and consideration.
Best regards,
Dear Reviewer Z8Ez,
Since the End of author/reviewer discussions is coming soon, may we know if our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score. If you have any further concerns, please let us know and we will be more than happy to engage in more discussion and paper improvements.
Once again, thank you for your suggestion and time!
Dear Reviewer Z8Ez,
We would like to sincerely thank you for your time and efforts in reviewing our paper.
We have made an extensive effort to try to successfully address your concerns:
-
Giving further explanations and analysis experiments on the effectiveness of Decomposed Frequency Learning;
-
Clarifying the process of pre-training ROSE with register;
-
Supplementing the experimental settings for Figure 1(b);
-
Making revisions to the paper and appendix accordingly.
We hope that our response can address your concerns to your satisfaction. If so, we kindly ask for your reconsideration of the score. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond timely. We kindly remind you that the reviewer-author discussion phase will end soon. After that, we may not have a chance to respond to your comments.
All the best,
Authors
Dear reviewer Z8Ez,
Thank you for taking the time and effort in providing a valuable review of our work. As the discussion period is coming to a close, we hope that you have had the chance to review our rebuttal.
If our rebuttal has resolved your concerns or improved your understanding of the paper, we would greatly appreciate it if you could reconsider your assessment and update the score accordingly. Your feedback has been incredibly helpful, and we value the opportunity to further improve the work based on your insights.
Thank you again for your thoughtful review and for considering our responses. Please feel free to reach out if you have any additional questions or require further clarifications.
Best regards,
Authors
Dear Reviewer Z8Ez,
Thank you for taking the time and effort to provide a valuable review of our work. We have responded to each of your insightful comments point by point.
Since the end of the discussion period is coming soon, we hope that you have had the chance to read our rebuttal. We eagerly await your feedback and are ready to respond to any remaining concerns you may have. If our rebuttal has resolved your concerns or improved your understanding of the paper, we would also be very grateful if you could reconsider the score, which would give us a greater opportunity to present this work at the conference.
Thank you once again for your time and review.
Best regards,
Authors
Dear Reviewer Z8Ez,
We sincerely appreciate the time and effort you dedicated to reviewing our paper during this busy period, as well as your recognition of its strengths.
With the author/reviewer discussion phase now concluded and no additional concerns raised, we believe our rebuttal has addressed all of your comments. However, based on the scoring standards of past ICLRs, we find that our current score of 5.75 is at the borderline level. We would be deeply grateful if you could consider raising your score and giving us the opportunity to present our work at the conference.
Thank you once again for your thoughtful review and valuable feedback.
Best regards,
Authors
The paper proposes a pre-trained model for time series forecasting. In this approach the pre-training is done through masked reconstruction in the frequency domain. Authors also attempt to capture domain-specific information by learning a register of cluster center vectors and then using top-k most similar vectors during the fine-tuning/prediction stage.
Strengths
The paper is well written and easy to follow. The proposed approach is interesting and innovative. Both masked reconstruction and the domain-specific register can be valuable for developing a foundational time series model. Experiments on real-world datasets show that the proposed approach outperforms leading baselines. The authors conduct extensive ablations; I found the few-shot generalization and register-vector selection results particularly interesting: they indicate that the model is robust in the low-data setting and transfers domain knowledge from similar datasets. Finally, the authors provide both code and benchmark datasets.
Weaknesses
My main concern is that this method is complex: it has multiple components and a multi-term loss. Balancing this loss and the contributions of the different components, while also appropriately tuning the required hyper-parameters, can be a challenging task. I'm also not convinced why the low-rank learnable matrix A needs to be added. Is this the only parameter that is adapted during the fine-tuning phase? How much is the accuracy impacted if A is removed?
Questions
Is low rank learnable matrix A the only parameter that is adapted during the finetuning phase? How much is the accuracy impacted if A is removed?
We would like to sincerely thank Reviewer oZ4H for acknowledging our technical novelty and effectiveness, as well as the insightful comments. We have revised our paper accordingly.
Q1: How to handle the multiple components and losses, and appropriately tune the hyper-parameters?
A1:
-
Multi-components and multi-losses:
- The main components of ROSE include time series register, backbone, as well as prediction heads and reconstruction head.
- We optimize these components with different loss functions. The register loss corresponds to the register; the reconstruction loss and the prediction loss correspond to the reconstruction head and the prediction heads, respectively, and both update the backbone.
-
Multi-losses Balancing:
-
We did not use a hyper-parameter to balance the loss functions because our experiments find that the model is not sensitive to the weights of the multiple losses. In the experiment, we introduce a balancing hyper-parameter α and define the total loss as an α-weighted combination of the prediction and reconstruction losses. Since the register loss only constrains the parameter updates of the register, its gradient does not influence the backbone of the model; therefore, the register loss does not cause an imbalance in training.
-
We vary α's value and report results under the 10% few-shot setting in the table below, which has also been added to the revision. As ROSE is not sensitive to changes of α, balancing the losses of the model is not challenging. Therefore, our final loss function does not contain α.
| Dataset | α = 0.2 | α = 0.4 | α = 0.6 | α = 0.8 | Standard Deviation |
|---------|---------|---------|---------|---------|--------------------|
|         | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE |
| ETTh1 | 0.3973 / 0.4199 | 0.3978 / 0.4193 | 0.3978 / 0.4205 | 0.3996 / 0.4230 | 0.0008 / 0.0014 |
| ETTh2 | 0.3339 / 0.3790 | 0.3347 / 0.3802 | 0.3369 / 0.3822 | 0.3351 / 0.3830 | 0.0011 / 0.0016 |
| ETTm1 | 0.3500 / 0.3733 | 0.3512 / 0.3747 | 0.3492 / 0.3717 | 0.3479 / 0.3719 | 0.0012 / 0.0012 |
| ETTm2 | 0.2538 / 0.3111 | 0.2534 / 0.3095 | 0.2512 / 0.3092 | 0.2505 / 0.3092 | 0.0014 / 0.0007 |
-
Hyper-parameters: We conduct a sensitivity analysis to tune ROSE's key hyper-parameters, including the number of masked series, the thresholds' upper bound, the number of register tokens, the register size, and the number of selections in the Top-K strategy. Through this comprehensive analysis, we determine an optimal hyper-parameter setting, as reported in the revision. As the training cost of ROSE is low, such analysis is not challenging.
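A minimal sketch of the balancing experiment above, under our assumption that α weights the prediction loss against the reconstruction loss (the register loss is omitted here since, per the rebuttal, its gradient only updates the register):

```python
import numpy as np

def combined_loss(pred, future, recon, series, alpha=0.5):
    """Alpha-weighted sum of prediction MSE and reconstruction MSE."""
    l_pred = float(np.mean((pred - future) ** 2))
    l_rec = float(np.mean((recon - series) ** 2))
    return alpha * l_pred + (1.0 - alpha) * l_rec

# Perfect prediction, imperfect reconstruction: only the reconstruction
# term contributes, scaled by (1 - alpha).
future = np.zeros(8)
series = np.ones(8)
loss = combined_loss(np.zeros(8), future, np.zeros(8), series, alpha=0.2)
```

The insensitivity reported in the table would correspond to `loss` varying little as `alpha` sweeps over 0.2 to 0.8.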
Q2: The necessity and effectiveness of the low rank matrix A in time series register.
A2:
- The necessity: During the pre-training stage, domain-specific knowledge is stored in the time series register. However, different datasets exhibit variations even within the same domain. Therefore, to better adapt to a specific downstream dataset during fine-tuning, we introduce a low-rank matrix A to further account for the distinct information of that dataset.
- The effectiveness: In the revision, we compare the results of models designed with and without the low-rank matrix A. We find that adding the low-rank matrix effectively improves performance across different numbers of register tokens.
Q3: The model parameters adjusted during the fine-tuning stage.
A3: In the fine-tuning stage, we freeze the time series register, and fine-tune the learnable low-rank matrix A, the model's encoder-decoder, and the corresponding prediction head.
Dear Reviewer,
Thank you for your valuable and constructive feedback, which has inspired further improvements to our paper. As a gentle reminder, it has been more than 3 days since we submitted our rebuttal. We would like to know whether our response addressed your concerns. We eagerly await your feedback and are ready to respond to any further questions you may have.
Thank you for your time and consideration.
Best regards,
Dear Reviewer oZ4H,
Since the End of author/reviewer discussions is coming soon, may we know if our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score. If you have any further concerns, please let us know and we will be more than happy to engage in more discussion and paper improvements.
Once again, thank you for your suggestion and time!
Dear Reviewer oZ4H,
We would like to sincerely thank you for your time and efforts in reviewing our paper.
We have made an extensive effort to try to successfully address your concerns:
-
Utilizing experiments to illustrate how to balance multi-components, multi-losses, and tune hyper-parameters;
-
Illustrating the role of matrix A;
-
Clarifying the parameters that need to be adapted during the fine-tuning phase;
-
Making revisions to the paper and appendix accordingly.
We hope that our response can address your concerns to your satisfaction. If so, we kindly ask for your reconsideration of the score. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond timely. We kindly remind you that the reviewer-author discussion phase will end soon. After that, we may not have a chance to respond to your comments.
All the best,
Authors
Dear reviewer oZ4H,
Thank you for taking the time and effort in providing a valuable review of our work. As the discussion period is coming to a close, we hope that you have had the chance to review our rebuttal.
If our rebuttal has resolved your concerns or improved your understanding of the paper, we would greatly appreciate it if you could reconsider your assessment and update the score accordingly. Your feedback has been incredibly helpful, and we value the opportunity to further improve the work based on your insights.
Thank you again for your thoughtful review and for considering our responses. Please feel free to reach out if you have any additional questions or require further clarifications.
Best regards,
Authors
Dear Reviewer oZ4H,
Thank you for taking the time and effort to provide a valuable review of our work. We have responded to each of your insightful comments point by point.
Since the end of the discussion period is coming soon, we hope that you have had the chance to read our rebuttal. We eagerly await your feedback and are ready to respond to any remaining concerns you may have. If our rebuttal has resolved your concerns or improved your understanding of the paper, we would also be very grateful if you could reconsider the score, which would give us a greater opportunity to present this work at the conference.
Thank you once again for your time and review.
Best regards,
Authors
Dear Reviewer oZ4H,
We sincerely appreciate the time and effort you dedicated to reviewing our paper during this busy period, as well as your recognition of its strengths.
With the author/reviewer discussion phase now concluded and no additional concerns raised, we believe our rebuttal has addressed all of your comments. However, based on the scoring standards of past ICLRs, we find that our current score of 5.75 is at the borderline level. We would be deeply grateful if you could consider raising your score and giving us the opportunity to present our work at the conference.
Thank you once again for your thoughtful review and valuable feedback.
Best regards,
Authors
We thank the Reviewers for the insightful comments and detailed feedback. We were delighted that reviewers find our paper has the following advantages:
Innovative Findings: Novel pre-train framework with decomposed frequency learning and TS-register. (oZ4H, 8yo6, Z8Ez, e6XG)
Clear Writing: Easily comprehensible content. (oZ4H, 8yo6, Z8Ez, e6XG)
Robust Analysis: Wide-ranging experiments across tasks/datasets. (oZ4H, 8yo6, Z8Ez)
Simplicity & Top-tier Results: Outperforms complex techniques. (oZ4H, Z8Ez)
The reviewers also raised insightful and constructive concerns. We made every effort to address all the concerns by providing sufficient evidence and requested results. Here is the summary of the major revisions:
- Clarification of the experimental setting for Figure 1: We clarify the detailed setting of the t-SNE visualization of the hidden representations under direct transfer versus adaptive transfer.
- Additional baselines and comparison settings: We include recently proposed, competitive time series representation learning models as new baselines. Furthermore, we conduct two additional experiments: (1) full-shot evaluation using a shorter look-back window (L=96), and (2) zero-shot evaluation for short-term forecasting.
- More model analysis and cases: We conduct a multi-loss balancing analysis by varying the value of the balancing hyper-parameter to demonstrate the robustness of ROSE with respect to the multi-loss function. Additionally, we analyze the use of multiple prediction heads, comparing different training strategies and inference practices. Furthermore, we visualize the reconstruction performance of three masking methods under out-of-distribution scenarios.
- Polished writings: We conduct detailed proofreading and revisions with helpful suggestions from the reviewers.
All updates are highlighted in blue. The valuable suggestions from reviewers are very helpful for us to revise the paper to a better shape. We'd be very happy to answer any further questions.
Looking forward to the reviewers' feedback.
The paper proposes ROSE, a pre-trained model for time series forecasting that incorporates some new ideas, including decomposed frequency learning and a register for domain-specific representations. All the reviewers liked the novelty of both these ideas, the strong results on the datasets in the evaluation, and the well-designed ablation studies. There were a couple of concerns around the complexity of balancing multiple losses and around the small number of benchmarks for zero-shot evaluation. In the AC's view, the latter is a genuine concern: particularly for the zero-shot foundation-modeling regime, using 6-7 datasets is simply not extensive enough to properly benchmark ROSE's performance against established foundation models. There are other issues with the evaluation methodology, such as not employing consistent inference-time estimates and not using consistent metrics/baselines in all tables, all of which suggests that the authors should perform a more exhaustive and comprehensive evaluation with more datasets, especially against the well-known zero-shot foundation models. This was truly a borderline paper with some innovative ideas, and the AC urges the authors to resubmit the paper to a future venue after improving the evaluation section.
Additional comments on the reviewer discussion
During the rebuttal period, reviewers had several questions on the encoder-decoder architecture choice, the ability to handle different context lengths and horizons, and the effectiveness of decomposed frequency learning. There were also questions around comparing ROSE with existing time series representation learning methods. The authors made considerable effort to address these questions with experiments, ablation studies, and explanations, but despite these updates, the AC feels that the datasets in the evaluation section, especially in the zero-shot foundation-model setting, need to be much more comprehensive to provide confidence in the zero-shot generalization performance of ROSE.
Reject