AutoTimes: Autoregressive Time Series Forecasters via Large Language Models
We propose AutoTimes to repurpose large language models as autoregressive time series forecasters; it exhibits zero-shot forecasting, in-context learning, and multimodal utility, with state-of-the-art performance and efficiency.
Abstract
Reviews and Discussion
This paper argues that existing LLM based time series models have not fully exploited the inherent autoregressive property and the decoder-only architecture of LLMs. To address this problem, this paper introduces a novel AutoTimes model, which exploits the autoregressive property of LLMs. Experimental results demonstrate that AutoTimes could outperform the baseline methods.
Strengths
- This paper exploits the autoregressive property of the LLMs.
- Comprehensive experiments are conducted to demonstrate the effectiveness of the proposed AutoTimes, including forecasting, zero-shot forecasting and in-context forecasting.
- This paper is well-written, and is easy-to-follow.
Weaknesses
- Autoregression seems a little bit trivial.
- It is not quite clear how much improvements can be brought by auto-regressively generating time series. Either direct comparison or theoretical analysis should be provided.
- It is difficult to directly draw conclusions about the scaling behaviors from Table 5. The authors draw conclusions by comparing results of OPT-x models, rather than comparing different models e.g., compare OPT with LLaMA or GPT-2.
Questions
Please see the weaknesses.
Limitations
N/A
Response to Reviewer k2uw
Many thanks to Reviewer k2uw for providing a valuable review.
Q1: Reclarify the contributions of the proposed method.
The reviewer mentioned that "the paper exploits the autoregressive property of the LLMs". We agree with this summary but would also like to highlight the detailed contributions of our work:
- We delve for the first time into autoregression in LLM4TS methods, addressing the rising concerns about the effectiveness of prevalent non-autoregressive LLM4TS methods (refer to ).
- We propose the one-for-all benchmark that breaks the status quo of training separately for each specific lookback-forecast length (appreciated by ), which is an essential step toward foundation models.
- Our method achieves state-of-the-art forecasting performance and requires significantly fewer tunable parameters than other advanced LLM4TS methods. The method's efficiency is acknowledged by all other reviewers.
- We present the concept of in-context forecasting for the first time, where time series prompts and prompt engineering are closely studied (refer to ).
- Beyond adopting LLMs on end-to-end forecasting, we facilitate the full capabilities of LLMs for time series, such as iterative multistep generation, zero-shot generalization, and scaling behavior.
Q2: How much improvement is brought by autoregressively generating time series?
To address your concern, we provide a comprehensive ablation study: AutoTimes (FlattenHead) replaces the original segment-wise projection layer with the (Flatten + linear head) of PatchTST, the prevalent module in non-autoregressive models. Here are the results:
| ETTh1 (MSE|MAE) | AutoTimes (Original) | AutoTimes (FlattenHead) |
|---|---|---|
| Pred-96 | 0.360|0.400 | 0.385|0.420 |
| Pred-192 | 0.388|0.419 | 0.445|0.463 |
| Pred-336 | 0.401|0.429 | 0.463|0.475 |
| Pred-720 | 0.406|0.440 | 0.574|0.542 |
| ECL (MSE|MAE) | AutoTimes (Original) | AutoTimes (FlattenHead) |
|---|---|---|
| Pred-96 | 0.129|0.225 | 0.142|0.247 |
| Pred-192 | 0.147|0.241 | 0.157|0.259 |
| Pred-336 | 0.162|0.258 | 0.201|0.311 |
| Pred-720 | 0.199|0.288 | 0.232|0.331 |
| Weather (MSE|MAE) | AutoTimes (Original) | AutoTimes (FlattenHead) |
|---|---|---|
| Pred-96 | 0.153|0.203 | 0.155|0.209 |
| Pred-192 | 0.201|0.250 | 0.202|0.251 |
| Pred-336 | 0.256|0.293 | 0.257|0.286 |
| Pred-720 | 0.331|0.345 | 0.333|0.349 |
| Traffic (MSE|MAE) | AutoTimes (Original) | AutoTimes (FlattenHead) |
|---|---|---|
| Pred-96 | 0.343|0.248 | 0.367|0.261 |
| Pred-192 | 0.362|0.257 | 0.391|0.282 |
| Pred-336 | 0.379|0.266 | 0.404|0.287 |
| Pred-720 | 0.413|0.284 | 0.432|0.294 |
As shown in the above tables, non-autoregressive generation performs consistently worse than the original AutoTimes. In addition to the empirical results, we'd like to provide an analysis of the two approaches:
- One model fits all lengths: While most deep forecasters have to be trained and applied to specific length settings, autoregressive models are more feasible in variable-length scenarios. We provide the comparison as follows:
| Comparison | Non-autoregressive | Autoregressive |
|---|---|---|
| Training | Trained with specific lookback-forecast lengths | Trained with the context length with each generated token being supervised |
| One-step Forecasting | Applicable only to the fixed lookback-forecast lengths | Flexible for any lengths within the context length, like large language models |
| Rolling Forecasting | Has to drop the earlier lookback series because of the fixed input length | Can prolong the lookback horizon until the total length exceeds the context length |
- Fewer parameters for training: Suppose $L$ is the lookback length, $F$ is the forecast length, $S$ is the segment (token) length, $D$ is the hidden dimension, and $N = L/S$ is the segment (token) number. We count the parameters for embedding and projection (see the sketch after this list):
  - Non-autoregression: time series segment ($\mathbb{R}^{S}$) -> representation ($\mathbb{R}^{D}$) -> flattened ($\mathbb{R}^{ND}$) -> future time series ($\mathbb{R}^{F}$)
  - Autoregression: time series segment ($\mathbb{R}^{S}$) -> representation ($\mathbb{R}^{D}$) -> next time series segment ($\mathbb{R}^{S}$)

  In non-autoregressive models, all the tokens of the lookback series are flattened and projected to the future time series, which leads to costly parameters of $O(NDF)$. Instead, the projection of AutoTimes is conducted independently on segments, leading to fewer introduced parameters of $O(DS)$.
- Consistent with the utilization of LLMs: the main claim of our paper is that non-autoregressive LLM4TS methods lead to an inconsistency, where inherently GPT-style models are fine-tuned in a BERT style. Instead, we suppose the token transition of LLMs is general-purpose and find it transferable to the transition of time series segments. Consequently, the powerful generation ability of LLMs can be naturally inherited.
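To make the parameter comparison concrete, here is a minimal PyTorch-style sketch assuming hypothetical sizes (lookback L = 672, segment/forecast length S = F = 96, hidden dimension D = 768); the layers are illustrative stand-ins rather than the exact modules of the paper.

```python
import torch.nn as nn

# Hypothetical sizes for illustration only.
L, F, S, D = 672, 96, 96, 768
N = L // S  # number of lookback segments (tokens)

# Non-autoregressive head (PatchTST-style): flatten all N token representations
# and project the flattened vector to the whole forecast horizon at once.
flatten_head = nn.Linear(N * D, F)   # roughly N * D * F parameters

# Autoregressive head (AutoTimes-style): project each token representation
# independently to the next segment; the layer is shared across positions.
segment_head = nn.Linear(D, S)       # roughly D * S parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(flatten_head), count(segment_head))   # 516192 vs. 73824
```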
Q3: Concern about the scaling behaviors of LLM-based forecasters.
Thank you for your feedback regarding the conclusions drawn from Table 5. Please refer to the scaling behavior in , where it is observed not only within the OPT-x models but also across GPT-2, OPT, and LLaMA.
We further eliminate the influence of the parameter count of the trainable layers. As shown in the following table, a larger LLaMA-7B with fewer trainable parameters can still achieve better performance than OPT (1.3B~6.7B), demonstrating that the performance gain stems from the scaling of LLMs, not simply from having more trainable parameters.
| Datasets | GPT-2 | OPT-350M | OPT-1.3B | OPT-2.7B | OPT-6.7B | LLaMA-7B |
|---|---|---|---|---|---|---|
| Hidden Dimension | 768 | 1024 | 2048 | 2560 | 4096 | 4096 |
| Embedding layer | 2-layer MLP | 2-layer MLP | 2-layer MLP | 2-layer MLP | 2-layer MLP | nn.Linear |
| Trainable Parameters (M) | 0.44 | 0.58 | 1.10 | 1.36 | 2.15 | 0.79 |
| Performance (Avg. MSE) | 0.397 | 0.401 | 0.396 | 0.394 | 0.394 | 0.389 |
Thanks for the response.
After reading all the comments and responses, my concerns are mostly addressed. I'm happy to raise my scores.
Thanks again for your response and for raising the score to 7! In the final version, we will elaborate more on clarifying the contributions and novelty, and include the additional experiments in the paper.
Dear Reviewer k2uw,
Thanks again for your valuable and constructive review. Would you please let us know if your concerns about autoregression effectiveness and scaling behavior are resolved? Since the Author-Reviewer discussion period is ending in two days, we eagerly await your response.
As of now, we find that your rating is still "reject". We respectfully remind you that we have comprehensively evaluated the improvement brought by autoregression and validated the scaling effect across different categories of LLMs, which should help you better assess our work. We have also clarified the contributions of the method and the paradigm renovation, and how they differ from previous approaches.
We do hope you can consider our rebuttal and kindly let us know if our rebuttal has addressed your concern.
Sincere thanks for your dedication! We are looking forward to your reply.
Authors
Dear Reviewer k2uw,
Thank you again for your valuable and constructive review.
We kindly remind you that only half a day is left before the Author-Reviewer discussion period ends. Your rating is still "reject", so we are eager to receive your response to our rebuttal. Please let us know if you have any further concerns or questions about our paper; we will be more than happy to engage in further discussion and improvement.
Thank you sincerely for your dedication! We eagerly await your reply.
Authors
The authors present AutoTimes, a method that utilizes large language models (LLMs) for time series forecasting. One of the key underexplored research topics addressed by the authors is the lack of models and pre-training mechanisms that result in foundation models capable of handling lookback and forecasting horizons of arbitrary length. This is achieved by adapting the LLM forecasting framework to autoregressively forecast time series segments. Furthermore, the paper outlines techniques such as in-context learning to further improve prediction performance. The methodology requires only a small number of trainable parameters compared to previous LLM fine-tuning techniques. As far as I am aware, this work is the first in the domain of LLMs and time series to be capable of handling multimodal input and producing an autoregressive forecast.
Strengths
- The paper is well-written and easy to follow. I appreciate how the authors clearly outlined their contributions.
- The observation that non-autoregressive methods may contradict decoder-only LLMs and the shortcomings of prior methods is well-motivated. The solution proposed in the context of LLMs and time series is novel.
- The introduction of a One-for-all benchmark, which involves transfer learning across prediction horizons instead of across datasets, is innovative.
- The authors provide strong motivation and experimental proof to support their claims.
- The flexibility and scalability of their method are demonstrated by successfully swapping out different LLMs with varying numbers of parameters.
- The continuous improvement over previous SOTA methods on multiple datasets, along with their thorough ablations, strengthens their claims and illustrates the method's flexibility.
Weaknesses
- I'm not a fan of your chosen color scheme. While I appreciate that you try to include colors for different entities, it is too vivid (jump from Figure 3 to 4 to 5), making the figures tough to read, especially when skimming through them quickly (which most first readers will do initially). It is not a major objection, but improving this aspect could enhance your manuscript.
- Typo:
- L21: etc [22, 42], missing dot.
- The paper makes certain simplified assumptions, such as treating time series segments independently for embedding. This might overlook complex inter-dependencies present in real-world time series data. Addressing these inter-dependencies could enhance the robustness and applicability of the proposed method.
Questions
- Claim of missing data for foundation models: I am aware that there is not a plethora of datasets available, but did you look into [1] (datasets) or [2] (models)?
- While I see the autoregressive part as a great contribution, I am not entirely convinced that the proposed embedding scheme, which only fine-tunes a small portion, brings a competitive improvement. There are multiple works [4, 5] that do not require any fine-tuning of the LLMs.
- L106: "Unlike previous works, our proposed method regards time series itself as the instructive prompt. It avoids the modality gap caused by concatenating time series and language tokens directly." This statement gives the impression that your approach is superior. I would be interested to see how your model compares to more of these learnable-prompt methods, especially how it compares to TEST [4] performance-wise, which appears to be superior to Time-LLM.
- How does AutoTimes handle missing data or irregular time series intervals? Your claimed improvement over [6], which does not use language directly, has the disadvantage that you cannot simply overcome NaNs in the time series directly.
- "Unlike previous works, our proposed method regards time series itself as the instructive prompt. It avoids the modality gap caused by concatenating time series and language tokens directly." I am a bit confused. What kind of model checkpoint are you using? Instruction tuned? An instructive prompt is a directive where the user provides explicit instructions to guide the response. I thought you were skipping the language level. Could you outline how you embed your time series and text and how exactly this is fed into the model?
[1] Goswami et al. "MOMENT: A Family of Open Time-series Foundation Models." ICML 2024.
[2] Ekambaram et al. "Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series" arXiv:2401.03955
[3] Chang et al. "LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters" 2024
[4] Sun et al. "TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series" ICLR 2024
[5] Wang et al. "Xinglei Wang, Meng Fang, Zichao Zeng, and Tao Cheng. Where Would I Go Next? Large Language Models as Human Mobility Predictors" 2023
[6] Gruver, Nate, et al. “Large Language Models Are Zero-Shot Time Series Forecasters.” Advances in Neural Information Processing Systems, 2023
Limitations
While the paper provides a limitations section, I don't think it is sufficient to bury it in the appendix. During my first read, I thought the authors entirely skipped the limitations until I found it in the checklist.
Here are some specific points that I believe are missing or need improvement:
- Although the paper claims that the embedding and projection layers only account for 0.1% of the total parameters, the scalability of these layers for very large datasets or extremely long time series is not thoroughly discussed. An evaluation of how the method performs with significantly larger datasets would strengthen the paper.
- While the authors mention that they leave out real-world datasets for future work, I think the approach was not tested on datasets with missing values. This is especially important as their tokenization scheme seems to be based on non-overlapping patches, which are known to lose their locality assumption when missing values occur. An analysis of how the model handles missing data would provide a more comprehensive evaluation of its robustness and applicability.
In addition to that, the authors answered broader societal impacts sufficiently.
Response to Reviewer EAH7
Many thanks to Reviewer EAH7 for providing a thorough and insightful review and for recognizing our contributions.
Q1: Suggestion to improve the presentation of the paper.
Thanks for your valuable feedback regarding the color scheme and the mentioned typo. We will use a more subdued scheme and fix the typo in the revision.
Q2: Address inter-dependencies in real-world time series data.
We acknowledge that real-world multivariate time series often exhibit complex inter-dependencies that can significantly impact analysis. In light of this, AutoTimes adopts channel independence like previous methods and further uses timestamps as position embeddings to implicitly align different variates.
As you insightfully point out, it is necessary to explore the complex inter-dependencies, which is a hot topic in current deep time series models. It is also an essential problem for LLM4TS methods since the gap between natural language (1-D discrete token sequence) and time series (multi-dimensional continuous sequence) poses increasing challenges for LLMs to explicitly utilize the relationship between sequences.
We will explore several potential approaches, such as integrating textual descriptions of variates and employing adapters to correlate variates. Your suggestion will guide us in refining our methodology.
Q3: Claim of missing data for foundation models and explore the scalability on larger datasets.
Thanks for the mentioned works; we are excited to see recent works advancing the development of datasets and pre-trained large models in the field of time series. We will cite them in the related work section and polish the claim.
Based on the mentioned works, we also provide an evaluation of AutoTimes on larger datasets (Time-Series Pile[1]) to address your concern about the scalability of the trainable layers:
| Performance of Subset (MSE|MAE) | nn.Linear | 3-layer MLP |
|---|---|---|
| ETTh1 | 0.724|0.586 | 0.363|0.395 |
| Weather | 0.288|0.335 | 0.166|0.211 |
| ECL | 0.856|0.764 | 0.135|0.231 |
| Traffic | 1.393|0.799 | 0.351|0.247 |
As we delve into layer scalability, the importance of designing the embedding scheme to handle heterogeneous time series is highlighted, which provides good guidance for our future research.
Q4: The effectiveness of fine-tuning and comparison with more LLM4TS methods.
We appreciate your concerns about the effectiveness of our fine-tuning approach, especially in light of several works that use LLMs for time series without fine-tuning. We intend to leverage the general token transition of LLMs while tailoring it to the specific characteristics of the dataset, which is achieved by freezing the LLM backbone and training new embedding layers only.
We provided detailed code and scripts to ensure all the results are reproducible. Further, we compare with the mentioned TEST [2], and AutoTimes achieves better performance on the majority of datasets.
| Datasets (MSE|MAE) | AutoTimes | TEST |
|---|---|---|
| ETTh1 | 0.389|0.422 | 0.414|0.431 |
| Weather | 0.235|0.273 | 0.229|0.271 |
| ECL | 0.159|0.253 | 0.162|0.254 |
| Traffic | 0.374|0.264 | 0.430|0.295 |
Q5: How does AutoTimes handle missing data or irregular time series intervals?
Thank you for your insightful question. At this stage, AutoTimes does not specifically address missing data or irregular intervals, in line with current works focusing on regular forecasting scenarios where time series are complete and consistently sampled.
We acknowledge that handling missing values and irregular intervals is a critical aspect of time series analysis, and we will add this as a limitation and conduct evaluations on well-acknowledged datasets in the future.
According to your suggestions, we will also consider moving the limitations section to a more prominent position within the main body of the paper to ensure that readers can easily access and engage with this critical information.
Q6: How does AutoTimes embed and feed time series and texts?
The claim: "our proposed method regards time series itself as the instructive prompt..." refers to the following:
- As depicted in , previous LLM4TS methods feed (language tokens | lookback time series) to handle multimodal input.
- As depicted in , AutoTimes feeds (time series prompt | lookback time series) to enable in-context forecasting, where the time series is self-prompted.
- As shown in , AutoTimes uses textual timestamps as position embeddings and adds them to the corresponding series embedding of each token.
Therefore, AutoTimes prompts time series in two directions. Horizontally, AutoTimes appends the time series prompt in front of the lookback series and regards it as the task demonstration. Vertically (series embedding + timestamp embedding), merging the token-wise embeddings utilizes timestamps in natural language and aligns multiple variates.
[1] Goswami et al. Moment: A Family of Open Time-Series Foundation Models.
[2] Sun et al. TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series.
I thank the authors for their thorough answers. I have to admit that I am, unfortunately, a bit confused. Would you mind double-checking your response and ensuring a consistent numbering of my questions? I asked five questions, but you referred to Q6, likely because you enumerated my outlined weaknesses as questions. Additionally, you seem to have combined Questions 1 and 2 into Q3; however, you appear to have only answered Question 2:
"Based on the mentioned works, we also provide an evaluation of AutoTimes on larger datasets (Time-Series Pile[1])."
Where did you do that? Before I attempt to decipher and remap the questions to their original form, I could be mistaken. To avoid any wrong conclusions, I kindly ask the authors to restructure it for me.
Thank you for your thoughtful feedback and for bringing these points to our attention. We apologize for any confusion caused by the numbering and organization of our responses.
Based on the original context of the rebuttal, the response is restructured as follows:
W1 & W2: Suggestions for improving the presentation of the paper.
Thanks for your valuable feedback regarding the color scheme and the mentioned typo. We will use a more subdued scheme and fix the typo in the revision.
W3: Address inter-dependencies in real-world time series data.
We acknowledge that real-world multivariate time series often exhibit complex inter-dependencies that can significantly impact analysis. In light of this, AutoTimes adopts channel independence like previous methods and further uses timestamps as position embeddings to implicitly align different variates.
As you insightfully point out, it is necessary to explore the complex inter-dependencies, which is a hot topic in current deep time series models. It is also an essential problem for LLM4TS methods since the gap between natural language (1-D discrete token sequence) and time series (multi-dimensional continuous sequence) poses increasing challenges for LLMs to explicitly utilize the relationship between sequences.
We will explore several potential approaches, such as integrating textual descriptions of variates and employing adapters to correlate variates. Your suggestion will guide us in refining our methodology.
Q1: Claim of missing data for foundation models.
Thanks for your mentioned works, we are excited to see recent works advancing the development of datasets and pre-trained large models in the field of time series. We will cite them in the related work section and polish your mentioned claim.
Q2: About the effectiveness of the proposed embedding scheme.
We appreciate your concerns about the effectiveness of our approach, especially in light of several works that use LLMs for time series without fine-tuning. The mentioned works indeed provide an out-of-the-box experience that is free from fine-tuning.
To boost the performance further, AutoTimes intends to leverage the general token transition of LLMs while tailoring it to the specific characteristics of the dataset, which is achieved by freezing the LLM backbone (keeping the token transition) and training new embedding layers (learning the dataset-dependent embeddings of time series).
Please also refer to the detailed code and scripts we provided, by which we ensure all the results of the paper are reproducible.
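For intuition, here is a minimal sketch of this setup, assuming a Hugging Face-style backbone; the model name ("gpt2"), segment length, and single-linear layers are illustrative stand-ins rather than the exact configuration of AutoTimes.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load a decoder-only LLM and freeze it: the general token transition is kept intact.
llm = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder backbone
for p in llm.parameters():
    p.requires_grad = False

seg_len = 96
hidden = llm.config.hidden_size
embed = nn.Linear(seg_len, hidden)   # new, dataset-dependent series embedding
head = nn.Linear(hidden, seg_len)    # new projection back to the next segment

trainable = list(embed.parameters()) + list(head.parameters())
total = sum(p.numel() for p in llm.parameters())
print(sum(p.numel() for p in trainable), "trainable vs.", total, "frozen")
```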
Q3: Comparison with the performance of TEST[1].
We compare with TEST performance-wise. The averaged results over four prediction lengths {96, 192, 336, 720} are reported. AutoTimes achieves better performance on the majority of datasets.
| Datasets (MSE|MAE) | AutoTimes | TEST |
|---|---|---|
| ETTh1 | 0.389|0.422 | 0.414|0.431 |
| Weather | 0.235|0.273 | 0.229|0.271 |
| ECL | 0.159|0.253 | 0.162|0.254 |
| Traffic | 0.374|0.264 | 0.430|0.295 |
Q4: How does AutoTimes handle missing data or irregular time series intervals?
Thank you for your insightful question. At this stage, AutoTimes does not specifically address missing data or irregular intervals, in line with current works focusing on regular forecasting scenarios where time series are complete and consistently sampled.
We will add this as a limitation and conduct evaluations on well-acknowledged datasets in the revision.
Q5: How does AutoTimes embed and feed time series and texts?
The claim: "our proposed method regards time series itself as the instructive prompt..." refers to the following:
- As depicted in , previous LLM4TS methods feed (language tokens | lookback time series) to handle multimodal input.
- As depicted in , AutoTimes feeds (time series prompt | lookback time series) to enable in-context forecasting, where the time series is self-prompted.
- As shown in , AutoTimes uses textual timestamps as position embeddings and adds them to the corresponding series embedding of each token.
Therefore, AutoTimes prompts time series in two directions. Horizontally, AutoTimes appends the time series prompt in front of the lookback series and regards it as the task demonstration. Vertically (series embedding + timestamp embedding), merging the token-wise embeddings utilizes timestamps in natural language and aligns multiple variates.
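As a hedged illustration of the two directions above, here is a minimal sketch with hypothetical module names and sizes; the timestamp embeddings are represented by a placeholder tensor rather than an actual call to the frozen LLM.

```python
import torch
import torch.nn as nn

SEG_LEN, HIDDEN = 96, 768
series_embedder = nn.Linear(SEG_LEN, HIDDEN)   # trainable per-segment embedding

def embed_tokens(segments, timestamp_embeddings):
    """segments: [num_tokens, SEG_LEN] non-overlapping patches of one variate.
    timestamp_embeddings: [num_tokens, HIDDEN], obtained once by encoding the
    textual timestamp of each segment with the frozen LLM (position embedding)."""
    return series_embedder(segments) + timestamp_embeddings

# In-context forecasting simply concatenates [prompt segments | lookback segments]
# along the token dimension before embedding, so the prompt is self-prompted
# time series rather than natural-language text.
prompt = torch.randn(4, SEG_LEN)      # 4 prompt tokens (hypothetical)
lookback = torch.randn(7, SEG_LEN)    # 7 lookback tokens (hypothetical)
segments = torch.cat([prompt, lookback], dim=0)
ts_emb = torch.randn(11, HIDDEN)      # placeholder for LLM timestamp embeddings
tokens = embed_tokens(segments, ts_emb)   # [11, HIDDEN], fed to the frozen LLM
```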
L1: An evaluation of how the method performs with significantly larger datasets.
Based on your suggested works[2], we provide an evaluation of AutoTimes on larger datasets (Time-Series Pile[2]) to address your concern about the scalability of the trainable layers.
Please note that the Time-Series Pile is an aggregation of time series datasets, which also includes ETTh1, Weather, ECL, and Traffic as subsets. We first train the LLM-based forecaster with AutoTimes on the Time-Series Pile, and report the performance on the above well-acknowledged subsets respectively. Therefore, the experiment evaluates the capacity of our embedding/projection layers to accommodate diverse time series.
| Performance of Subset (MSE|MAE) | nn.Linear | 3-layer MLP |
|---|---|---|
| ETTh1 | 0.724|0.586 | 0.363|0.395 |
| Weather | 0.288|0.335 | 0.166|0.211 |
| ECL | 0.856|0.764 | 0.135|0.231 |
| Traffic | 1.393|0.799 | 0.351|0.247 |
As we delve into layer scalability, the importance of designing the embedding scheme for larger time series datasets is highlighted, such as adding more layers or using MoE modules to increase data generality, which provides great guidance for our future research.
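For reference, a sketch of the two embedding variants compared in the table above; the hidden size, activation, and depth of the MLP are assumptions for illustration, not necessarily the configuration used in this experiment.

```python
import torch.nn as nn

def make_embedder(kind: str, seg_len: int = 96, hidden: int = 768) -> nn.Module:
    """Build either a single linear embedding or a deeper 3-layer MLP,
    the latter giving more capacity for heterogeneous, large-scale data."""
    if kind == "linear":
        return nn.Linear(seg_len, hidden)
    if kind == "mlp3":
        return nn.Sequential(
            nn.Linear(seg_len, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, hidden),
        )
    raise ValueError(f"unknown embedder kind: {kind}")
```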
L2: An analysis of how the model handles missing data.
We appreciate your highlighting the importance of this issue, especially about our tokenization scheme based on non-overlapping patches. Currently, our research has not included testing on datasets with missing values, as our focus has been on repurposing LLMs on regular time series.
According to your suggestions, we will consider moving the limitation to a more prominent position within the main body of the paper to ensure that readers can easily access and engage with this critical information.
[1] Sun et al. TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series.
[2] Goswami et al. Moment: A Family of Open Time-Series Foundation Models.
We appreciate your patience as we work to restructure our responses for clarity. Your feedback is invaluable in helping us improve the communication. We'd be very happy to answer any further questions.
I thank the authors again for their response and willingness to restructure their answer. I have read the other reviews and your response, and I am mostly satisfied.
Regarding Q4, while I acknowledge that the authors left it out, I still believe that, given the nature of this work, it could provide an interesting insight (at least the missing data part).
For Q5: Thank you for clarifying this. Based on your response, it might be a good idea to make this clearer in the manuscript, as it wasn't immediately clear to me at first glance.
However, with the newly conducted experiments and the forthcoming manuscript revisions, I believe this work could have a great impact on at least one subfield of time series prediction. As I have no further concerns, I will maintain my initial score. Great work!
Thank you for your prompt response and valuable comments on our paper, which have been of great help to us.
Here is a brief revision plan based on your suggestions: (1) We understand your perspective on the insights that our method could gain from delving into the missing data. We will conduct evaluations to enhance the depth of our paper. (2) We will elaborate on the pipeline of time series and text embedding and make it clearer to improve readability for future readers.
Thanks again for your response and support for our work. We promise to follow your suggestion to improve the quality of our paper.
This paper proposes a model named AutoTimes to repurpose LLMs for time series forecasting. Different from previous methods that use flattening and linear projection to get a prediction, this model repurposes LLMs in an autoregressive way, which is closer to the pre-training process of LLMs. Specifically, the main backbone of LLMs is frozen, and new patch embedding/output layers are added like in previous works. Absolute timestamps are embedded through LLMs to serve as position embeddings. Experiments show that the proposed model achieves SOTA performance and is more efficient.
Strengths
- The main body of the paper is clear and easy to understand.
- Repurposing LLMs in an autoregressive way is intuitive and more reasonable than previous linear-projection approaches.
- Numerous experiments were conducted to demonstrate the effectiveness of the proposed method.
Weaknesses
- My biggest concern is whether this type of method truly leverages the capabilities of the pre-trained LLMs. Please conduct the following ablation study: randomly initialize a large model, freeze its parameters, train it using the proposed method, and compare it with the pre-trained ones.
- The description of multimodality in Table 1 seems to be overselling, as the proposed model only uses texts for timestamp embedding and does not have the capability to leverage natural language.
- The in-context forecasting part is confusing: 1) What are the use cases for such a method? Table 16 shows that it's more effective to extend the lookback window. So in what case can we not get a longer lookback window but can get a window from a long time ago? 2) Such a method cannot even ensure the input series is continuous, so where does the improvement come from?
Questions
- Is an LLM necessary for timestamp embedding? What if it is replaced with an ordinary nn.Embedding as in Informer?
- How does the proposed model reduce the error accumulation of autoregressive models, given that there is no specific design for this?
- Please report the number of learnable parameters in Table 5. Larger models have larger hidden states, so their patch embedding/output layers have more learnable parameters. Therefore, it is uncertain whether the performance improvement comes from the scaling behavior of LLMs or from having more learnable parameters for tuning.
Limitations
No potential negative societal impact.
Response to Reviewer 1PWH
Many thanks to Reviewer 1PWH for providing a detailed and insightful review.
Q1: Whether AutoTimes truly leverages the capabilities of the pre-trained LLMs.
We noticed that a recent work [1] has raised questions about non-autoregressive LLM4TS methods. It is also the main claim of our paper that the inconsistent model structure and generative approach cause insufficient utilization of LLMs for forecasting. We thoroughly conduct all types of ablations from [1] (Random Init is the ablation suggested by the reviewer):
| ETTh1 (MSE|MAE) | AutoTimes | Random Init | w/o LLM | LLM2Attn | LLM2Trsf |
|---|---|---|---|---|---|
| Pred-96 | 0.360|0.400 | 0.373|0.408 | 0.365|0.399 | 0.383|0.404 | 0.377|0.401 |
| Pred-192 | 0.388|0.419 | 0.394|0.421 | 0.405|0.425 | 0.414|0.422 | 0.406|0.420 |
| Pred-336 | 0.401|0.429 | 0.405|0.430 | 0.429|0.441 | 0.431|0.432 | 0.421|0.431 |
| Pred-720 | 0.406|0.440 | 0.418|0.447 | 0.450|0.468 | 0.456|0.454 | 0.449|0.452 |
| ECL (MSE|MAE) | AutoTimes | Random Init | w/o LLM | LLM2Attn | LLM2Trsf |
|---|---|---|---|---|---|
| Pred-96 | 0.129|0.225 | 0.148|0.245 | 0.171|0.263 | 0.156|0.255 | 0.162|0.263 |
| Pred-192 | 0.147|0.241 | 0.163|0.259 | 0.192|0.282 | 0.178|0.276 | 0.189|0.287 |
| Pred-336 | 0.162|0.258 | 0.179|0.274 | 0.216|0.304 | 0.198|0.295 | 0.216|0.309 |
| Pred-720 | 0.199|0.288 | 0.217|0.305 | 0.264|0.342 | 0.230|0.320 | 0.258|0.340 |
The above results highlight that the autoregressive approach of AutoTimes truly utilizes the LLM. The core difference: instead of regarding LLMs as representation extractors in a BERT style, we find that the general-purpose token transition of LLMs is transferable between natural language and time series, such that the generation ability of LLMs can be fully revitalized.
Q2: The description of multimodal in Table 1 seems to be overselling.
Thanks for your suggestion. In the initial version, we demonstrate that AutoTimes can take advantage of textual timestamps, which are the most accessible textual information in real-world applications. Considering the scope of multimodal models, we will remove this point from Table 1 unless the method is evaluated on well-acknowledged multimodal datasets.
Q3: The use cases of in-context forecasting and where its improvement comes from.
The value of the proposed in-context forecasting is to extend the input context of time series forecasting beyond a continuous lookback window. As the reviewer mentioned, shows that extending the lookback window (P.2) and using trivial prompts (P.3) respectively excel on different subsets, but the overall difference is small.
Since the essence of prompts is to incorporate useful domain-specific knowledge, here is one use case of in-context forecasting: consider predicting the weather for one day. One approach is to extend the lookback length from days to weeks; however, this can also introduce noisy information, since non-stationary meteorological conditions change with the seasons. Another practical way is to consider how the weather changed on the same day in the last year (or years). Although the input is not continuous, the input context becomes more relevant based on prior knowledge about the (yearly) periodicity. Therefore, in-context forecasting makes such prior knowledge incorporable and thereby improves performance.
We also provide an exploration of prompt engineering in , in which the usage of discontinuous lookback windows can indeed outperform continuous lookback windows on well-acknowledged datasets.
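To make the use case concrete, here is a hedged sketch of how a period-aligned, discontinuous prompt could be prepended to the lookback window; the function name, the interpretation of the "ahead" offset, and all lengths are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def build_in_context_input(series, t, lookback=288, prompt_len=384, period=24):
    """series: 1-D history; t: forecast start index.
    Returns [prompt | lookback], where the prompt is a window taken one period
    earlier than the lookback so that it covers the same phase of the cycle."""
    lookback_win = series[t - lookback:t]
    prompt_end = t - lookback - period          # jump back one period
    prompt_win = series[prompt_end - prompt_len:prompt_end]
    return np.concatenate([prompt_win, lookback_win])

# Toy usage on an hourly signal with period 24.
series = np.sin(np.arange(10_000) * 2 * np.pi / 24)
context = build_in_context_input(series, t=9_000)
print(context.shape)   # (672,)
```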
Q4: Ablations about the timestamp embedding.
As per your suggestion, we compare the ways of embedding timestamps in AutoTimes. Here are the results:
| ETTh1 (MSE|MAE) | LLM Embedding | nn.Embedding | w/o Embedding |
|---|---|---|---|
| Pred-96 | 0.360|0.400 | 0.370|0.405 | 0.368|0.402 |
| Pred-192 | 0.388|0.419 | 0.396|0.422 | 0.395|0.421 |
| Pred-336 | 0.401|0.429 | 0.408|0.430 | 0.413|0.433 |
| Pred-720 | 0.406|0.440 | 0.422|0.448 | 0.439|0.459 |
| ECL (MSE|MAE) | LLM Embedding | nn.Embedding | w/o Embedding |
|---|---|---|---|
| Pred-96 | 0.129|0.225 | 0.132|0.231 | 0.131|0.227 |
| Pred-192 | 0.147|0.241 | 0.150|0.243 | 0.149|0.243 |
| Pred-336 | 0.162|0.258 | 0.165|0.260 | 0.166|0.261 |
| Pred-720 | 0.199|0.288 | 0.203|0.291 | 0.204|0.293 |
Results show that using timestamp embeddings from LLMs achieves better performance, which indicates better alignment with learned series embeddings in AutoTimes.
Q5: How does the method overcome error accumulation?
It is true that there is no specific design for this in AutoTimes. Actually, the performance degradation during rolling forecasting comes not only from the gap between the ground truth and the prediction (error accumulation) but also from dropping the lookback time series (lookback cut-off).
To be more precise, AutoTimes in the one-for-all scenario mainly copes with the second issue: it predicts the next token at each position so that our LLM-based forecaster remains feasible on prolonged inputs, while non-autoregressive models have a fixed input length. We will rephrase the relevant statements in our paper.
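For intuition, a minimal sketch of this rolling setup, where `predict_next_segment` is a hypothetical stand-in for the LLM-based forecaster and the segment/context lengths are illustrative.

```python
import numpy as np

def rolling_forecast(predict_next_segment, lookback, horizon,
                     seg_len=96, context_len=672):
    """Iteratively generate `horizon` values. Unlike a fixed-input model,
    the context keeps growing with each generated segment and is only
    truncated once it exceeds the context length (no lookback cut-off)."""
    context = list(lookback)
    preds = []
    while len(preds) < horizon:
        window = np.asarray(context[-context_len:])   # prolonged lookback, capped
        next_seg = predict_next_segment(window)       # one segment ahead
        assert len(next_seg) == seg_len
        preds.extend(next_seg)
        context.extend(next_seg)                      # feed predictions back
    return np.asarray(preds[:horizon])

# Toy usage with a naive persistence "model".
naive = lambda w: w[-96:]
print(rolling_forecast(naive, np.arange(672.0), horizon=192).shape)   # (192,)
```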
Q6: Report the number of learnable parameters in Table 5 and confirm the scaling behavior of LLMs.
Thanks for your scientific rigor. We will include the following results in our revision, where a larger LLaMA-7B with fewer trainable parameters can still achieve better performance compared to OPT (1.3B~6.7B).
| Datasets | GPT-2 | OPT-350M | OPT-1.3B | OPT-2.7B | OPT-6.7B | LLaMA-7B |
|---|---|---|---|---|---|---|
| Hidden Dim. | 768 | 1024 | 2048 | 2560 | 4096 | 4096 |
| Embedding layer | 2-layer MLP | 2-layer MLP | 2-layer MLP | 2-layer MLP | 2-layer MLP | nn.Linear |
| Trainable Param. (M) | 0.44 | 0.58 | 1.10 | 1.36 | 2.15 | 0.79 |
| MSE (Avg) | 0.397 | 0.401 | 0.396 | 0.394 | 0.394 | 0.389 |
[1] Tan et al. Are Language Models Actually Useful for Time Series Forecasting?
[2] Woo et al. Unified Training of Universal Time Series Forecasting Transformers.
Thanks for your response. The rebuttal addressed my concerns about the role of LLM and timestamp embedding, as well as the practical application of in-context forecasting. Therefore, I decided to raise my score to 7 and recommend this work to be accepted.
Moreover, I hope the authors can include these experimental results for rebuttal in the final version to make the work more comprehensive, especially for the Q6 Table regarding scaling behavior.
Thank you for your positive feedback and for raising the score to 7. We are glad to hear that our rebuttal addressed your concerns regarding the true utilization of LLMs in our method, the effectiveness of timestamp embeddings, and the practical application of in-context forecasting.
We appreciate your suggestion to include the experimental results in the final version, particularly for the Q6 Table on scaling behavior. We will ensure that these results are incorporated to enhance the comprehensiveness of our work.
Thank you once again for your support and constructive feedback. We look forward to finalizing the manuscript.
The authors present in this paper an interesting approach where LLMs are leveraged as fully fledged time series forecasters. The proposed approach is based on freezing the LLM backbone and updating a small number of parameters to generate suitable time series embeddings which, together with timestamps as positional embeddings, provide suitable forecasts. Moreover, the authors provide an interesting notion of in-context forecasting, where the corresponding model is taught with examples how to forecast without gradient updates. Finally, the authors present numerical evaluations on diverse datasets.
Strengths
The authors provide an interesting view on a current challenge in the field of time series forecasting: how to leverage LLMs for time series forecasting with minimal effort and minimal model modification. The authors provide an approach that basically freezes the LLM backbone to only update parameters related to a suitable embedding of time series, together with the usage of timestamp prompts as positional encoders. Moreover, they emphasise that the number of updated parameters is around 0.1% of those of the LLM backbone.
The authors provide an interesting paradigm that naturally arises from LLMs: how can we do forecaster fine-tuning without gradient updates? This brings in-context forecasting as a contribution of the authors, which seems to be a potentially resourceful idea for the community.
Weaknesses
Perhaps the main weakness comes from the evaluations. The main points that I would invite the authors to aggressively address are the following:
- The amount of baselines used for comparison is very limited. It is completely understandable that the authors miss some baselines because the field is just moving extremely quickly. Nevertheless, the authors should at least cite a relevant fraction of these recent works, for instance:
  - Woo et al., 2024: Unified Training of Universal Time Series Forecasting Transformers
  - Ansari et al., 2024: Chronos: Learning the Language of Time Series
  - Dooley et al., 2023: ForecastPFN: Synthetically-Trained Zero-Shot Forecasting
  - Goswami et al., 2024: MOMENT: A Family of Open Time-series Foundation Models
- The amount of datasets used is limited. It is understandable that the authors provide a limited amount of datasets for evaluations, as one of the main hindrances in the field is the lack of publicly available data (at least in comparison to other fields). Yet, one can see that other contributions have shared datasets and made them freely available. See for instance:
  - Woo et al., 2024: Unified Training of Universal Time Series Forecasting Transformers
  - and the corresponding HF source: https://huggingface.co/datasets/Salesforce/lotsa_data
- Evaluations of in-context forecasting are limited. Following the previous points, I think this is one of the most exciting points of the paper and the reader would appreciate seeing more systematic evaluations and explorations of this notion. Currently there is only one evaluation, which involves the M3 and M4 datasets, perhaps following the approach presented in the One-Fits-All paper.
Questions
- In Section 3 the authors mention that they aim to forecast covariates as well. Is this correct? In general, one can assume that certain kinds of covariates are available in the future, like timestamps or boolean variables indicating that something specific will happen in the future. But in general, I am not sure that the authors truly want to forecast covariates.
Limitations
The authors have devoted a section describing sensible limitations of their work.
Response to Reviewer acJd
Many thanks to Reviewer acJd for providing a detailed review and recognizing our contributions.
Q1: More baseline models for evaluation.
We acknowledge the importance of the recent works and will certainly incorporate the suggested references in our revision. As per your suggestion, we include the mentioned models, based on their official code, to enlarge our set of time series forecasting baselines. Here are the results:
| ETTh1 (MSE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.360 | 0.384 | 0.387 | 0.571 |
| Pred-192 | 0.388 | 0.425 | 0.410 | 0.654 |
| Pred-336 | 0.401 | 0.456 | 0.422 | 0.712 |
| Pred-720 | 0.406 | 0.470 | 0.454 | 0.774 |
| ECL (MSE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.129 | 0.158 | 0.136 | - |
| Pred-192 | 0.147 | 0.174 | 0.152 | - |
| Pred-336 | 0.162 | 0.191 | 0.167 | - |
| Pred-720 | 0.199 | 0.229 | 0.205 | - |
| Traffic (MSE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.343 | - | 0.391 | 0.770 |
| Pred-192 | 0.362 | - | 0.404 | OOM |
| Pred-336 | 0.379 | - | 0.414 | OOM |
| Pred-720 | 0.413 | - | 0.450 | OOM |
As shown above, MOIRAI and Chronos follow the paradigm of pre-training -> zero-shot forecasting ("-" indicates that the test set is included in the pre-training corpus and thus not reported). MOMENT follows pre-training -> fine-tuning on each dataset and length. AutoTimes does not involve pre-training on time series; it adopts a pre-trained LLM and fine-tunes it on each dataset, using one model for all prediction lengths.
In terms of performance, AutoTimes consistently achieves the best results. Still, we appreciate the zero-shot forecasting ability of natively trained large time series models, which provide an out-of-the-box experience free from training/tuning.
Q2: More benchmark datasets for evaluation.
We appreciate your suggested works that have contributed valuable data resources. Thus, we conduct evaluations on several datasets from [1], which come from various domains and applications.
| Australian Electricity Demand (MSE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.150 | 0.163 | 0.153 | 0.167 |
| Pred-192 | 0.203 | 0.216 | 0.214 | 0.211 |
| Pred-336 | 0.236 | 0.255 | 0.244 | 0.237 |
| Pred-720 | 0.264 | 0.289 | 0.267 | 0.269 |
| Bdg-2 Panther (MSE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.537 | 0.565 | 0.546 | 0.581 |
| Pred-192 | 0.663 | 0.707 | 0.694 | 0.693 |
| Pred-336 | 0.741 | 0.807 | 0.774 | 0.781 |
| Pred-720 | 0.802 | 0.911 | 0.832 | 0.829 |
| Oikolab Weather (MSE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.603 | 0.635 | 0.630 | 0.663 |
| Pred-192 | 0.643 | 0.678 | 0.660 | 0.694 |
| Pred-336 | 0.666 | 0.685 | 0.677 | 0.711 |
| Pred-720 | 0.697 | 0.710 | 0.698 | 0.727 |
The above results show that AutoTimes still outperforms state-of-the-art deep models, which further enhances the robustness of our experiments. We will include these in the revision and conduct more complete evaluations on the LOTSA datasets [1].
Q3: Systematic evaluations on the proposed in-context forecasting.
Thanks a lot for your scientific rigor. We adopted the M3 and M4 datasets, consistent with the zero-shot experiment of the One-Fits-All paper, to present the improvement of our in-context paradigm. As per your request, we extend the evaluation to widely recognized datasets. Details of the experiment are as follows:
Using a model checkpoint trained on a source domain (Traffic), we conduct forecasting without gradient updates on the target ETT datasets. We evaluate the Pred-96 performance on the last variate (OT).
- For the zero-shot scenario, the input is Length-288 lookback series.
- For in-context forecasting, the input is (Length-384 series prompt + Length-288 lookback series). Considering the dataset periodicity, the prompt is uniformly selected as the Ahead-24 series of the original lookback series.
- To eliminate the performance boost that comes from extending the input length, we also provide the results of Length-672 lookback series in the zero-shot scenario.
| Dataset (MSE) | In-Context (Prompt-384 + Input-288) | Zero-Shot (Input-288) | Zero-Shot (Input-672) |
|---|---|---|---|
| ETTh1-OT | 0.0645 | 0.0673 | 0.0657 |
| ETTh2-OT | 0.1513 | 0.1637 | 0.1538 |
| ETTm1-OT | 0.0399 | 0.0424 | 0.0415 |
| ETTm2-OT | 0.1629 | 0.1669 | 0.1701 |
Moreover, we further delve into the effect of different strategies to select time series prompts:
| Dataset (MSE) | Ahead-Period | Ahead-Random | Fixed Prompt | Other-Variates | Baseline (Zero-Shot) |
|---|---|---|---|---|---|
| ETTh1-OT | 0.0645 | 0.0666 | 0.0769 | 0.1263 | 0.0657 |
| ETTh2-OT | 0.1513 | 0.1621 | 0.1859 | 0.1780 | 0.1538 |
| ETTm1-OT | 0.0399 | 0.0407 | 0.0512 | 0.0852 | 0.0415 |
| ETTm2-OT | 0.1629 | 0.1719 | 0.2104 | 0.2297 | 0.1701 |
- Ahead-Period: The prompt is uniformly selected as the Ahead-24 series of the original lookback series where 24 is one of the periods of ETT.
- Ahead-Random: The prompt is randomly selected as the previous series of the original lookback series.
- Fixed Prompt: The prompt is fixed as the first piece of the series in the variate-OT.
- Other-Variates: The prompt is uniformly selected as the Ahead-24 series, but comes from another variate of ETT.
The above results demonstrate the effectiveness of using suitable time series prompts and highlight the influence of prompt engineering. Using in-period time series prompts can even outperform extending the lookback window. We also provide a detailed explanation in . Thus, the in-context forecasting paradigm is meaningful for real-world applications.
Q4: Whether we aim to forecast covariates or not?
In Section 3, we use timestamps as covariates to improve forecasting, but we do not predict the covariates. As shown in , we focus on the multivariate scenario, where every time series (variate) needs to be predicted.
[1] Woo et al. Unified Training of Universal Time Series Forecasting Transformers.
Dear Reviewer acJd:
We sincerely appreciate your insightful pre-rebuttal review, which has inspired us to improve our paper further substantially.
According to your suggestions, we have made every effort to complete the evaluations, including more baseline models, more benchmark datasets, and further evaluations/explorations of in-context forecasting. Experimentally, we verify that our method achieves the best performance against the new baselines and on the new benchmarks, and that the in-context forecasting paradigm is a meaningful notion for real-world forecasting.
Due to the word limit of the rebuttal, we provide the complete results of the previous questions here:
1. More baseline models for evaluation.
| ETTh1 (MSE|MAE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.360|0.400 | 0.384|0.402 | 0.387|0.410 | 0.571|0.464 |
| Pred-192 | 0.388|0.419 | 0.425|0.429 | 0.410|0.426 | 0.654|0.504 |
| Pred-336 | 0.401|0.429 | 0.456|0.450 | 0.422|0.437 | 0.712|0.530 |
| Pred-720 | 0.406| 0.440 | 0.470|0.473 | 0.454|0.472 | 0.774|0.570 |
| Average | 0.389|0.422 | 0.434|0.439 | 0.418|0.436 | 0.678|0.517 |
| ECL (MSE|MAE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.129|0.225 | 0.158|0.248 | 0.136|0.233 | - |
| Pred-192 | 0.147|0.241 | 0.174|0.263 | 0.152|0.247 | - |
| Pred-336 | 0.162|0.258 | 0.191|0.278 | 0.167|0.264 | - |
| Pred-720 | 0.199|0.288 | 0.229|0.307 | 0.205|0.295 | - |
| Average | 0.159|0.253 | 0.188|0.274 | 0.165|0.260 | - |
| Traffic (MSE|MAE) | AutoTimes | MOIRAI | MOMENT | Chronos |
|---|---|---|---|---|
| Pred-96 | 0.343| 0.248 | - | 0.391|0.282 | 0.770|0.552 |
| Pred-192 | 0.362|0.257 | - | 0.404|0.287 | OOM |
| Pred-336 | 0.379| 0.266 | - | 0.414|0.292 | OOM |
| Pred-720 | 0.413| 0.284 | - | 0.450|0.310 | OOM |
| Average | 0.374|0.264 | - | 0.415 |0.293 | OOM |
2. More benchmark datasets for evaluation.
| Australian Electricity Demand (MSE|MAE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.150|0.228 | 0.163|0.242 | 0.153|0.233 | 0.167|0.250 |
| Pred-192 | 0.203|0.268 | 0.216|0.284 | 0.214|0.270 | 0.211|0.283 |
| Pred-336 | 0.236|0.293 | 0.255|0.312 | 0.244|0.295 | 0.237|0.302 |
| Pred-720 | 0.264|0.315 | 0.289|0.343 | 0.267|0.318 | 0.269|0.332 |
| Average | 0.213|0.276 | 0.231|0.295 | 0.220|0.279 | 0.221|0.292 |
| Bdg-2 Panther (MSE|MAE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.537|0.458 | 0.565|0.476 | 0.546|0.462 | 0.581|0.499 |
| Pred-192 | 0.663|0.511 | 0.707|0.543 | 0.694|0.524 | 0.693|0.547 |
| Pred-336 | 0.741|0.544 | 0.807|0.584 | 0.774|0.564 | 0.781|0.584 |
| Pred-720 | 0.802|0.575 | 0.911|0.649 | 0.832|0.597 | 0.829|0.615 |
| Average | 0.686|0.522 | 0.748|0.563 | 0.712|0.537 | 0.721|0.561 |
| Oikolab Weather (MSE|MAE) | AutoTimes | PatchTST | iTransformer | DLinear |
|---|---|---|---|---|
| Pred-96 | 0.603|0.577 | 0.635|0.603 | 0.630|0.591 | 0.663|0.611 |
| Pred-192 | 0.643|0.602 | 0.678|0.630 | 0.660|0.609 | 0.694|0.633 |
| Pred-336 | 0.666|0.615 | 0.685|0.634 | 0.677|0.620 | 0.711|0.643 |
| Pred-720 | 0.697|0.632 | 0.710|0.647 | 0.698|0.633 | 0.727|0.654 |
| Average | 0.652|0.607 | 0.677|0.629 | 0.667|0.613 | 0.699|0.635 |
3. Systematic evaluations of in-context forecasting.
| Dataset (MSE|MAE) | In-Context (Prompt-384 + Input-288) | Zero-Shot (Input-288) | Zero-Shot (Input-672) |
|---|---|---|---|
| ETTh1-OT | 0.0645|0.1951 | 0.0673|0.1996 | 0.0657 |0.1969 |
| ETTh2-OT | 0.1513|0.3009 | 0.1637|0.3133 | 0.1538 |0.3026 |
| ETTm1-OT | 0.0399|0.1512 | 0.0424|0.1567 | 0.0415 |0.1534 |
| ETTm2-OT | 0.1629|0.3143 | 0.1669|0.3137 | 0.1701 |0.3197 |
| Dataset (MSE|MAE) | Ahead-Period | Ahead-Random | Fixed Prompt | Other-Variates | Baseline (Zero-Shot) |
|---|---|---|---|---|---|
| ETTh1-OT | 0.0645|0.1951 | 0.0666|0.1988 | 0.0769|0.2109 | 0.1263|0.2796 | 0.0657 |0.1969 |
| ETTh2-OT | 0.1513|0.3009 | 0.1621|0.3141 | 0.1859|0.3346 | 0.1780|0.3338 | 0.1538 |0.3026 |
| ETTm1-OT | 0.0399|0.1512 | 0.0407|0.1529 | 0.0512|0.1733 | 0.0852|0.2284 | 0.0415 |0.1534 |
| ETTm2-OT | 0.1629|0.3143 | 0.1719|0.3216 | 0.2104|0.3649 | 0.2297|0.3738 | 0.1701 |0.3197 |
Given the limited timeframe for author-reviewer discussion, please kindly let us know if our response has addressed your concerns. Your feedback is invaluable in helping us improve the communication. We'd be very happy to answer any further questions.
All the best,
Authors
I would like to thank the authors for putting this significant amount of effort in such a small amount of time.
Given that LOTSA is a fairly large collection of time series datasets, the authors are presenting results on a subset of these datasets. For instance, some of them belong to the Monash repository. It is not expected to conduct experiments on all datasets. Yet, I would like to ask the authors: what were the criteria for choosing these datasets? Was it a limitation on execution time, memory, or compute in general?
Thank you for your prompt response and for acknowledging the effort we put into this work. We appreciate your interest in our dataset selection in the rebuttal.
The criterion for choosing datasets from the LOTSA collection is dataset size. Given the limited time for the rebuttal, we were only able to cover a subset of the datasets in LOTSA, since the collection contains up to 27 billion time points overall.
We aimed to select datasets that provide diversity of evaluation while ensuring that our experiments remained manageable within the given timeframe. Concretely, the selected datasets contain 0.5~2 million time points each, which is comparable to existing benchmarks. If these evaluations are deemed valuable, we will be glad to continue running evaluations on larger datasets and include them in the final revision.
We hope this clarifies our approach, and we appreciate your understanding of the limitations inherent in working with such a large collection and other experimental analyses. Thank you again for your thoughtful feedback!
Thanks for this clarifying answer.
I will increase my score to the next rating value, i.e. 4: Borderline Reject.
The main reason to provide this score is that on one side I believe the idea is interesting and it shows a path of interest in the field of time series. On the other side, the amount of experiments and following results and discussions would correspond to almost a major revision of the paper.
Thank you for your feedback, but we respectfully disagree that the discussion would correspond to almost a major revision of the paper:
1. The original baselines compared in our paper are the most up-to-date and advanced deep-learning approaches.
- For LLM4TS methods, we compared AutoTimes with the state-of-the-art models: Time-LLM (ICLR 2024, 157 citations) and FPT (NeurIPS 2023 Spotlight, 147 citations).
- For deep time series forecasters, we compared with the most prevalent ones, covering various architectures with state-of-the-art performance: iTransformer (ICLR 2024 Spotlight, 189 citations), DLinear (AAAI Oral, 978 citations), PatchTST (ICLR 2023, 615 citations), and TimesNet (ICLR 2023, 466 citations).
2. Our method is evaluated on diverse tasks and extensively analyzed in the original submission.
- Our evaluations include long-term time series forecasting, short-term time series forecasting, zero-shot time series forecasting, and in-context time series forecasting, covering 10 datasets beyond those in previous LLM4TS methods.
- Our analysis covers method generality, scaling behavior of LLMs, method efficiency, and hyperparameter sensitivity, which are hardly explored in previous LLM4TS approaches.
- Our ablations cover almost all components of the proposed method: textual timestamp embedding (Appendix D.5), the LLM backbone (LoRA adaptation in Section 4.4), and autoregressive forecasting (Figure 7 and Table 9).
Even though it is indispensable to include the suggested evaluations (LOTSA datasets, comparison with concurrent large time series models, and exploration of our proposed in-context forecasting), which have not been included in any previous work in the LLM4TS direction, these experiments can easily be added as lines in existing tables and incorporated in the Appendix, so the work does not need a major revision.
We believe that such adjustments can enhance the integrity of the paper without too much impact on the overall structure. We hope you will reconsider this point.
Thank you for your understanding and support.
Summary of Rebuttal
We sincerely thank all the reviewers for their insightful reviews and valuable comments, which are instructive for us to improve our paper further.
In this work, we proposed an effective approach (AutoTimes) to repurpose LLMs as autoregressive forecasters. Unlike previous works that adopt LLMs as non-autoregressive models, we maintain consistency with the training and inference of LLMs. AutoTimes exhibits variable-length feasibility, scalability with larger LLMs, and utilization of textual timestamps, achieving state-of-the-art performance with minimally trainable parameters. Further, we propose in-context forecasting for the first time, extending the conventional forecasting context to discontinuous time series windows as task demonstrations.
The reviewers generally held positive opinions of our paper, in that the proposed method is "well-motivated", "intuitive", "novel", and "more reasonable than previous ones", the paper is "well-written" and "easy to follow", in-context forecasting is "resourceful idea for the community", "one-for-all benchmark is innovative", "numerous experiments were conducted" and "achieve continuous improvement over previous SOTA methods".
The reviewers also raised insightful and constructive concerns. We made every effort to address all the concerns by providing sufficient evidence and requested results. Here is the summary of the major revisions:
- More evaluations (Reviewer acJd, EAH7): We extensively include the mentioned baseline models and datasets. By making great efforts to complete the evaluations, we verify that our method still achieves the best performance and good generality on new baselines and benchmarks.
- Systematic exploration of in-context forecasting (Reviewer acJd, 1PWH): We delve into in-context forecasting, including more evaluated datasets and different strategies to retrieve time series prompts. It highlights the significance of incorporating prior knowledge (such as periodicity) into the prompt engineering of time series.
- Ablation study (Reviewer 1PWH, k2uw): We conduct comprehensive ablations to confirm that AutoTimes truly utilizes the ability of LLMs and to highlight the improvement and significance of autoregression. We also provide ablations on alternative timestamp embeddings and confirm the scaling behavior of our methodology.
- Technical contributions (Reviewer k2uw): We highlight our contribution of being the first to introduce autoregression in LLM4TS, which unlocks the full abilities and efficiency of LLMs. By analyzing autoregressive and non-autoregressive approaches, we illustrate the advantages from both theoretical and experimental aspects.
- Polished writing (Reviewer EAH7): We summarize the revisions and future directions based on the reviewers' helpful suggestions. The characteristics and limitations of our work are stated more clearly.
The valuable suggestions from reviewers are very helpful for us to revise the paper to a better shape. We'd be very happy to answer any further questions.
Looking forward to the reviewer's feedback.
This paper introduces AutoTimes, a novel approach for time series forecasting using Large Language Models (LLMs). It leverages the inherent autoregressive property and decoder-only architecture of LLMs, generating time series segments autoregressively and incorporating techniques like positional embeddings and in-context learning for improved predictions.
The proposed method shows strong performance in experiments. The paper is well-written.
The original reviews raised key concerns regarding the limited baselines, overselling claims, unclear benefits of in-context forecasting, and scaling behaviors. The rebuttal has included an extensive amount of new results and analysis, addressing most of the concerns.
Overall, the strengths of the paper outweigh the weaknesses.