Retrieval Augmented Time Series Forecasting
Abstract
Reviews and Discussion
The paper introduces RAFT, a retrieval-augmented time series forecasting model. The main idea is to use a similarity function between the current time series context (query) and context/forecast-horizon pairs (key/value) to retrieve similar time series patches from the training data for forecasting. A similarity function between the query and key patches is used to compute attention weights for the value patches. These retrieved patches are then concatenated with the input, and the forecast is generated with a linear model. Optionally, the model also downsamples the input to different resolutions and uses multiple downsampled retrieval results for forecasting. The authors compare their model with different baseline models on eight datasets. They also perform synthetic experiments to provide evidence of when retrieval is useful. The authors' experiments suggest that retrieval is useful when keys are similar to the query, rare patterns are present, and the patterns are temporally less correlated.
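To make the mechanism concrete, here is my reading of a single retrieval-and-forecast step as a minimal NumPy sketch (the names, the plain dot-product similarity, and the single output projection are my simplifications; the paper uses Pearson correlation and separate learned projections):

```python
import numpy as np

def retrieve_and_forecast(query, keys, values, W, tau=1.0, m=5):
    """One retrieval-augmented forecast step.

    query:  (L,)       current lookback window
    keys:   (N, L)     candidate lookback windows from the training series
    values: (N, H)     the forecast horizons that followed each key
    W:      (H, L + H) weights of the final linear forecaster
    """
    sims = keys @ query                # similarity of the query to every key
    top = np.argsort(sims)[-m:]        # indices of the m most similar keys
    w = np.exp(sims[top] / tau)
    w = w / w.sum()                    # softmax -> attention-like weights
    retrieved = w @ values[top]        # (H,) weighted average of value patches
    # Concatenate input and retrieval result; forecast with a linear model.
    return W @ np.concatenate([query, retrieved])
```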
Strengths
Originality and Significance
The authors introduce a conceptually interesting model with a straightforward instantiation. As the proposed method is mainly a simplification of existing attention-based forecasting ideas, the significance of this work lies mainly in the experimental evidence of whether such simplified instantiations work well for time series forecasting.
Quality
The paper evaluates the model on eight datasets and against several baselines from the literature. Additionally, the authors support their results with synthetic experiments that demonstrate under which conditions the retrieval module in RAFT brings a benefit to forecasting. However, I have concerns regarding the evaluation, which I elaborate on in the Weaknesses section.
Clarity
The paper is clearly written. The proposed idea and the experiments are easy to follow thanks to the clear notation and supporting illustrations.
Weaknesses
The two main weaknesses of the paper are the presentation of the related work and the empirical evaluation.
Related work
The model uses a similarity function and attention weights to obtain a weighted average of candidate value patches from the training dataset, and an MLP network to forecast. This is conceptually similar to transformer variants for forecasting; the main difference lies in the interaction of learnable weights in the transformer architecture. Such a simplification might be effective, but this is not discussed in the related work. Moreover, there is related work on few-shot forecasting that presents an (arguably more complex) instantiation of the model proposed in this paper (Iwata and Kumagai, 2020; https://arxiv.org/abs/2009.14379). This work should be compared and discussed. In light of this prior work, the novelty of this work is limited.
Another concern I have with the related work is that recent work on pretrained/foundational time series models is not mentioned (Woo et al., ICML 2024: https://arxiv.org/abs/2402.02592; Das et al., ICML 2024: https://openreview.net/forum?id=jn2iTJas6h; Ansari et al., preprint 2024: https://arxiv.org/abs/2403.07815). I would argue that the proposed idea of using RAG for forecasting could benefit exactly this model class when forecasting time series that are not in the pretraining corpus. This should be discussed in the related work.
Evaluation
The main concern I have with the paper is the evaluation in the Experiments section.
- The authors evaluate their method on only eight different datasets. While these datasets have been used widely, many more datasets have become available, which allows for a more thorough evaluation. Several of these datasets (Woo et al., ICML 2024: https://arxiv.org/abs/2402.02592; Ansari et al., preprint 2024: https://arxiv.org/abs/2403.07815) from recent papers are publicly available. Running the evaluation on more datasets would give stronger evidence on the performance of the model.
- I also have concerns about the choice and setup of the baselines. The authors mention that they use the same experimental setup for each baseline, which includes the number of training epochs. However, the number of training epochs is arguably an important hyperparameter for tuning a model. I would argue that real-world performance is best compared by choosing the best possible setting for each baseline, rather than a uniform setting that might result in suboptimal performance of several baselines. Additionally, I think a strong baseline (PatchTST; Nie et al., ICLR 2023: https://arxiv.org/abs/2211.14730) is missing and should be included. When I compare the MSE/MAE of the models in this work with the models in the PatchTST paper, I find that the models in this work perform much worse. This might suggest that the chosen setting leads to suboptimal performance of the baselines, which might affect the conclusion of this paper. For example: D-Linear has a 0.0764 MSE for the univariate setup in this work and a 0.056 MSE for ETTh1 in the PatchTST paper (Nie et al., ICLR 2023). This suggests that there is actually a better setting for running this baseline, and this setting should be used for comparison. I noticed that both papers cite different sources for the datasets, but I checked briefly and at least the ETT datasets seem identical. There might be something different in the setup that I'm not aware of that also explains this difference.
I consider this point critical, and it needs to be addressed for me to consider changing my score. Specifically, I would ask the authors to revise the setup and use settings for the baselines that make them comparable to the results in Nie et al., 2023. I would also like to ask the authors to include the PatchTST baseline.
- I also have concerns about using the win matrix to compare the results. I would argue that in this setup it is not relevant how often RAFT outperforms other baselines, but rather how it compares to the strongest baselines. Thus, the reported average win ratio of 86% is somewhat misleading, and it would be more useful for the reader to report the win-rate against the next best model and the absolute/relative improvement in MSE/MAE when averaged over the datasets. I would also note that in the cited paper (Bahri et al., ICLR 2022), the win matrix is used over 65 datasets, while here it is used over only 8 datasets with different forecast horizons. Hence, this introduces redundancy when aggregating the results in a win matrix. It is also not clear how ties are handled: Bahri et al. (ICLR 2022) explicitly mention that ties are broken by a statistical test. How is this handled in this work?
I would kindly ask the authors to report the win-rate against the next best model and the absolute/relative improvement in MSE/MAE when averaged over the datasets. I think this gives a more complete picture of the performance of the model.
Questions
- It is not clear how ties in the win matrix are handled. Bahri et al. (ICLR 2022) explicitly mention that ties are broken by a statistical test. How is this handled in this work?
- Is the setup/datasets in this work noticeably different from the setup/datasets in the PatchTST paper?
- In several points of the paper the authors mention the inductive bias of RAFT and that it is more suited for forecasting. It is unclear to me what this inductive bias specifically means. In particular, there is one argument that existing models make i.i.d. assumptions and that this is a limitation. How does RAFT overcome this limitation, especially given the empirical result that it is more effective when temporal correlation is lacking? I would kindly ask the authors what the specific inductive bias is that RAFT introduces and how it differs from the inductive biases of existing models.
6. Concerns on performance comparison with win matrix
Thank you for pointing out this issue. We agree with the reviewer’s concern and have moved the average result table (which was in the Appendix of the previous manuscript) to the main manuscript to provide a clearer comparison of forecasting performance across models. Table 1 in the revised manuscript presents the average MSE performance evaluated across different forecasting horizon lengths. We observe that our model consistently outperforms other contemporary baselines, highlighting the effectiveness of retrieval in time-series forecasting.
7. Comparison with PatchTST
Our work uses the same setup and a similar set of datasets as the PatchTST paper. As noted in our response to the previous question, we experimented with all baselines in the same environmental setting, including the learning rate and batch size, which may differ from those in the original paper. Thus, we re-evaluated all models with their optimal parameters searched over the validation set and reported the updated results in the experiment section (Table 1).
8. The meaning of inductive bias and its relationship with retrieval
By incorporating retrieval to provide relevant inputs, we eliminated the need for the forecasting model to learn every short-term pattern during inference. We believe this process introduces a bias or assumption—an inductive bias—that helps prevent overfitting by constraining the model's parameter space search. This bias is particularly beneficial in scenarios with short-term patterns where temporal correlation is minimal, such as a random walk, where the model would otherwise need to memorize all changes to make accurate predictions. Instead, retrieval allows the model to reference relevant information, freeing up learning capacity to focus on other features. This is empirically demonstrated through our analyses over the synthetic datasets in Sections 5.2-5.3.
Thank you for your thoughtful comments, which are valuable for enhancing the quality of our paper. Below, we address your concerns in detail.
1. Difference between RAFT and transformer variants
Thank you for raising this question. Regarding the conceptual difference between transformer variants and ours, transformers learn relationships within a fixed lookback window through attention mechanisms. Even though increasing the lookback window size of a Transformer allows for the use of more past information, using long sequences as input has significant limitations, including a reduction in available data points and a substantial increase in task and computational complexity. In contrast, our model extends beyond the lookback window by retrieving relevant data points from the entire time-series and incorporating them into the input. This also allows for efficient reference to more past information without increasing the input length, distinguishing it from traditional transformer variants. We also showed that our model outperforms other transformer variants across various datasets (see Figure 4). We have added this discussion to the revised manuscript (Line 161).
2. Comparison of related work on few-shot forecasting
Thank you for your recommendation. As noted in our general response, our work assumes a single time-series possibly with multiple channels. We propose performing retrieval directly on the training data of a single time-series for forecasting. The model suggested by the reviewer assumes the existence of multiple types of time-series datasets and retrieves additional information from external data to improve the prediction of the target time-series. These approaches differ in experimental setup, underlying motivation, and the effects achieved through retrieval. For instance, our model aims to reduce the learning burden through training data retrieval, enabling the discovery of more generalizable features in a supervised setting. In contrast, the suggested literature focuses on improving prediction performance in label-scarce scenarios by leveraging external data. Note that a more recent work in a similar setting to the suggested model [1] is already discussed in our related work section (Line 140). As suggested, we have added the recommended literature alongside [1] in the related work section (Line 142).
[1] MQ-ReTCNN: Multi-horizon Time Series Forecasting with Retrieval-Augmentation (2022)
3. Comparison of related work on pretrained/foundational time series models
Since our model performs forecasting on a given time-series in a supervised setting, we did not include comparisons with large foundation models specialized for zero-shot or few-shot scenarios. However, we agree with the reviewer that these models have brought advancements to the time-series forecasting field, and we have added a discussion about them in the related work section (Lines 115-117). The key contribution of our paper is demonstrating that retrieval from the training dataset reduces the learning burden during forecasting. We would like to clarify that this is orthogonal to approaches that retrieve additional information from external datasets, as our focus is on leveraging the training data itself for generalizable training.
4. Limited datasets
Thank you for the comment. We conducted our experiments following the evaluation settings and benchmark datasets used in the latest supervised time-series forecasting studies [2,3,4]. To further demonstrate our model's performance, as suggested, we additionally evaluated it on two benchmark datasets discussed in [2,3,4] and the recommended papers that were not included in our original paper. Accordingly, we updated our results in the experiments section. Our findings confirm that our model continues to exhibit strong performance on these newly added datasets.
[2] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (ICLR 2024)
[3] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
[4] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (ICLR 2023)
5. The choice and setup of the baselines
Thank you for raising this critical point to help improve our paper. We also observed that baselines often vary in performance and settings across different papers. Instead of tailoring settings for each dataset and baseline, we aimed to ensure a fair comparison by evaluating all methods in the same environment. However, we agree with your perspective and have conducted an additional evaluation by comparing our model's best results against the best results reported in the original papers for the baselines, including PatchTST. We updated the results accordingly in the experiment section (Table 1). In most settings, we confirmed that our model outperforms PatchTST.
I would like to thank the authors for their updated paper. I will respond to the individual points below and only touch on points that require further discussion.
1. Difference between RAFT and transformer variants

I still have the concern that the proposed method is conceptually similar to transformer variants (and CVZn shares this concern). A transformer uses learned attention weights to attend to learned patterns (or even memorized examples), using the lookback window as a query. Through this mechanism, a transformer could attend to the training set, and maybe even to an individual time series. I agree with the authors that the lookback window is an important hyperparameter to tune, but I do not agree with the argument that attending to the training set would require increasing the lookback window; rather, the model needs enough capacity to retrieve learned patterns at inference time (encoded in the learned weights). Thus, I keep this concern.
2. Comparison of related work on few-shot forecasting

I understand that this is a different setup (several datasets vs. a single time series). Still, the ideas are conceptually similar, even if applied to a different setting. This needs to be clearly highlighted in the paper.
4. Limited datasets

Thank you for including additional datasets. I agree that these datasets have been used in many papers in the field, but I would still highlight that recent work has raised the bar by evaluating on many more and larger datasets, which provides more empirical evidence of model performance. Compared to these recent studies, the number of datasets/observations evaluated in this study is still limited.
5. The choice and setup of the baselines

Thank you for revising the approach to tuning baselines and including additional ones. I have looked into the results for PatchTST specifically. It appears that the results in the updated manuscript deviate from the original work (https://arxiv.org/abs/2211.14730). For example, for ETTh1 and ETTh2, the averaged MSE for PatchTST is 0.413 (ETTh1) and 0.330 (ETTh2) in the original paper but 0.516 (ETTh1) and 0.391 (ETTh2) in this work. The authors mention that they are "comparing our model's best results against the best results reported in the original papers for the baselines, including PatchTST". From what I understand, the PatchTST results here stem from the tuning of learning rate/batch size that the authors have employed. Thus, the performance improvements that RAFT shows over PatchTST are only valid for the specific setup in this work. I would still argue that the best-performing setting identified in earlier work should be the appropriate reference point for comparing baselines.
I would like to thank the authors again for their revision. I will maintain my score because the points mentioned above are still not sufficiently addressed.
Thank you for your follow-up questions. We further provide the responses to your questions below:
1. Difference between RAFT and transformer variants
We want to clarify that Transformers can only apply attention to time-series frames within the lookback window provided as input. In other words, the retrieval set they use to best explain the given input is restricted to the lookback window and does not encompass the entire training set (i.e., query, key, and value can only be chosen from the lookback window). As a result, Transformers cannot directly attend to the entire training set during inference; their attention mechanism is limited to patterns within the provided lookback window. In contrast, our model extends beyond the lookback window by traversing the training set to include additional relevant frames, thereby broadening the scope of attention. The reason we referenced the lookback window size in the previous response is that for a Transformer to mimic our model's ability to attend to the entire training set, its lookback window would need to be set to the same length as the training set; we mentioned this as an illustrative example.
2. Comparison of related work on few-shot forecasting
We have included a discussion about related works utilizing the retrieval concept for few-shot forecasting, which is highlighted in Lines 140–142 of the revised manuscript.
4. Limited datasets
Thank you for your comment. We acknowledge that some recent works on developing the foundation model for time-series data have evaluated their models on a wide range of datasets, although our approach aligns with the conventions commonly followed in the mainstream time series forecasting domain [1,2,3,4,5,6,7]. Unfortunately, it is not feasible for us to complete evaluations on new datasets within the discussion period due to time constraints. However, we will include evaluations on additional datasets in the camera-ready version of our paper.
[1] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (ICLR 2024)
[2] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
[3] PatchTST: A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (ICLR 2023)
[4] TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis (ICLR 2023)
[5] MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting (ICLR 2023)
[6] Are Transformers Effective for Time Series Forecasting? (AAAI 2023)
[7] FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting (ICML 2022)
5. The choice and setup of the baselines
We excerpted the baseline results from the original TimeMixer paper. According to the TimeMixer paper, a unified hyper-parameter search setting was applied for each baseline, and we followed the same approach for tuning in our work. The difference in reported performance is likely due to the broader parameter search range used for PatchTST in their experiments. We will run experiments within the same search space as PatchTST and include the results in the revised manuscript if possible within the limited time available during the discussion period. Thank you for pointing this out.
Thank you for the update and clarifying the source of the baselines. I would suggest that the paper clearly states where the results from other models are taken from and why exactly this choice has been made. My suggestion would be to take the results from the original paper, whenever possible.
I would like to thank the authors for their revision and have raised my score.
Thank you for your thoughtful comments. They are all valuable for enhancing the quality of our paper.
This paper proposes RAFT (Retrieval-Augmented Forecasting of Time-series), a novel method for time series forecasting that leverages a retrieval module to provide the model with relevant historical patterns. The key idea is to retrieve the most similar historical data to the current input and utilize the future values of these retrieved candidates to enhance predictions. This reduces the burden on the model to memorize all possible patterns, especially rare or complex ones.

The retrieval module operates by finding the most similar key patches to the input query from the entire training time series, and then retrieving the corresponding future value patches. An attention-like mechanism is used to weigh the retrieved value patches based on their similarity to the input. RAFT extends this idea to multiple time series generated by downsampling the original series at different periods, allowing it to capture patterns at various temporal resolutions. The method is built upon a simple MLP architecture, demonstrating that a well-designed retrieval module can provide an effective inductive bias for time series forecasting.

Extensive experiments on eight benchmark datasets show that RAFT consistently outperforms state-of-the-art baselines, achieving an average win ratio of 86% for multivariate forecasting and 80% for univariate forecasting tasks. Further analyses using synthetic datasets reveal that RAFT is particularly beneficial when rare patterns repeat in the time series or when patterns are less temporally correlated. The retrieval module enables the model to directly leverage relevant historical patterns in such scenarios. Overall, the paper presents a novel perspective on enhancing time series forecasting models with retrieval-based methods, opening up new possibilities in this domain.
Strengths
This paper presents several noteworthy strengths across various dimensions:
Originality: The proposed RAFT method offers a novel approach to time series forecasting by incorporating a retrieval module. While retrieval-augmented methods have been explored in other domains like natural language processing, their application to time series forecasting is innovative. By directly leveraging relevant historical patterns, RAFT introduces a new paradigm for handling complex and rare patterns in time series data.

Quality: The paper is well-structured and thoroughly evaluates the proposed method. The authors provide a clear description of the retrieval module architecture and how it is integrated into the overall forecasting model. The experimental setup is comprehensive, considering both multivariate and univariate forecasting tasks across eight diverse benchmark datasets. The results convincingly demonstrate the superiority of RAFT over state-of-the-art baselines.

Clarity: The paper is well-written and easy to follow. The authors provide a clear motivation for their approach and explain the technical details of RAFT in a concise and understandable manner. The use of illustrative figures, such as the retrieval module architecture (Figure 2) and the overall RAFT architecture (Figure 3), enhances the clarity of the proposed method. The experimental results are presented in a readable format, making it easy to compare RAFT's performance against the baselines.

Significance: The paper makes a significant contribution to the field of time series forecasting. By demonstrating the effectiveness of retrieval-augmented methods, RAFT opens up new research directions and possibilities for improving forecasting models. The analyses using synthetic datasets provide valuable insights into the scenarios where retrieval is particularly beneficial, such as handling rare patterns or less temporally correlated data. These findings have important implications for real-world applications where such characteristics are common.
Moreover, the paper's results challenge the current reliance on increasing model capacity to capture complex patterns. RAFT shows that a simpler MLP architecture, when augmented with a well-designed retrieval module, can outperform more sophisticated models. This highlights the potential of retrieval-based methods as a complementary approach to improving time series forecasting. In summary, the paper's originality, quality, clarity, and significance make it a valuable contribution to the field, offering new insights and directions for future research in time series forecasting.
Weaknesses
While the paper presents a novel and effective approach to time series forecasting, there are a few areas that could be improved or require further clarification:
Retrieval module design: The current retrieval module uses a simple similarity measure (Pearson correlation) to find the most relevant historical patterns. However, the authors do not provide a thorough justification for this choice or explore alternative similarity measures. Time series data often exhibit complex, nonlinear, and nonstationary characteristics, which may not be well-captured by linear correlation. Exploring more sophisticated similarity measures or learning adaptive similarity functions could potentially improve the retrieval process and the overall forecasting performance. Computational efficiency: The paper does not provide a detailed analysis of the computational complexity and efficiency of the proposed method. The retrieval process involves comparing the input query with all key patches in the training data, which could be computationally expensive for large datasets. While the authors mention that the stride for the sliding window can be adjusted for computational efficiency (Section 3.2), they do not provide empirical results or discussions on the trade-off between computational cost and forecasting accuracy. A more comprehensive analysis of the method's scalability and efficiency would be valuable for practical applications. Sensitivity to hyperparameters: The performance of RAFT may be sensitive to the choice of hyperparameters, such as the number of retrieved patches (m), the temperature (τ), and the set of periods (P) used for generating multiple time series. While the authors provide the chosen values of m for each experiment setting (Appendix B), they do not discuss how these values were determined or provide insights into the sensitivity of the results to these hyperparameters. A more systematic analysis of the impact of these hyperparameters on the forecasting performance would enhance the robustness and reproducibility of the proposed method. Limited ablation studies: The paper would benefit from more extensive ablation studies to better understand the individual contributions of the proposed components. For example, the authors could evaluate the performance of RAFT without the multi-period extension to assess the impact of capturing patterns at different temporal resolutions. Similarly, comparing the performance of RAFT with and without the attention-like weighting of the retrieved value patches could provide insights into the importance of this mechanism. Evaluation on more diverse datasets: While the paper evaluates RAFT on eight benchmark datasets, these datasets are primarily from the energy, traffic, and weather domains. To demonstrate the generalizability of the proposed method, it would be valuable to include datasets from a wider range of application domains, such as finance, healthcare, or social media. Moreover, the paper could benefit from evaluations on datasets with different characteristics, such as varying lengths, missing values, or irregularly sampled time series.
Addressing these weaknesses would further strengthen the paper's contributions and provide a more comprehensive understanding of the proposed retrieval-augmented forecasting method. However, it is important to note that these weaknesses do not diminish the overall value and novelty of the work, and the authors have already made significant contributions to the field of time series forecasting.
Questions
- Choice of similarity measure: Can you provide more insights into the choice of Pearson correlation as the similarity measure in the retrieval module? Have you considered or experimented with other similarity measures, such as dynamic time warping (DTW), cross-correlation, or learned similarity functions? How do you think the choice of similarity measure affects the retrieval process and the overall forecasting performance?
- Computational efficiency: Can you provide more details on the computational complexity and efficiency of the proposed method, particularly the retrieval process? How does the computational cost scale with the size of the dataset and the length of the time series? Have you considered any techniques to improve the efficiency of the retrieval process, such as indexing or approximate nearest neighbor search?
- Hyperparameter sensitivity: How sensitive are the results to the choice of hyperparameters, such as the number of retrieved patches (m), the temperature (τ), and the set of periods (P)? Can you provide more details on how these hyperparameters were determined for each experiment setting? Have you considered using techniques like cross-validation or Bayesian optimization to tune these hyperparameters?
- Ablation studies: Can you provide more ablation studies to investigate the individual contributions of the proposed components? For example, how does the performance of RAFT change when the multi-period extension is removed? How important is the attention-like weighting of the retrieved value patches compared to using a simple average or the most similar patch?
- Evaluation on diverse datasets: Have you considered evaluating RAFT on datasets from a wider range of application domains beyond energy, traffic, and weather? How do you expect the proposed method to perform on datasets with different characteristics, such as varying lengths, missing values, or irregularly sampled time series? Providing results on more diverse datasets could strengthen the claims of generalizability.
- Handling multiple retrieved patterns: In the current implementation, RAFT retrieves the top-m most similar patterns and aggregates them using an attention-like weighting scheme. Have you considered other approaches to handle multiple retrieved patterns, such as clustering similar patterns or using a more sophisticated aggregation method? How do you think these alternative approaches would impact the forecasting performance?
- Comparison with other retrieval-based methods: While the paper compares RAFT with several state-of-the-art forecasting methods, it would be interesting to see a comparison with other retrieval-based methods, such as those mentioned in the related work. How does RAFT differ from these existing retrieval-based approaches, and how does it compare in terms of performance and efficiency?
- Visualization of retrieved patterns: Can you provide more visualizations of the retrieved patterns and their corresponding future values? It would be helpful to see examples of how the retrieved patterns contribute to the final forecasting results, particularly in cases where RAFT significantly outperforms the baselines. Such visualizations could provide additional insights into the effectiveness of the retrieval process.
7. Comparison with other retrieval-based methods.
We want to emphasize that our method is the first to utilize a retrieval process within the training dataset to reduce the learning burden of the forecasting model, as clarified in the general response. The other retrieval-based algorithms referenced in our related work aim to improve inference performance by transferring knowledge across different types of time-series data within a dataset containing diverse time-series. Specifically, in [4] from our related work, an additional process is employed to identify relevant time-series among various time-series, and only frames from other time-series data with the same timestamp as input are used as the target for retrieval. This differs from our setting, which focuses on forecasting for a single time-series and retrieves patches with different timestamps from the input. Therefore, we did not conduct direct performance comparisons but instead discussed these methods in the related work section.
8. Visualization of retrieved patterns.
Our Appendix Figures 8–10 provide visualizations of the retrieved patterns and their corresponding future values. We included examples from both cases where our model performs well (e.g., ETTh1, Traffic) and cases where it performs less effectively (e.g., Exchange Rate) under the univariate setting. If there are any specific dataset and setting combinations that should be visualized, please let us know, and we would be happy to provide additional results.
Thank you for your thoughtful reviews and for recognizing the technical novelty of our proposed retrieval approach. Below, we address your concerns in detail.
1. Choice of similarity measure.
Thank you for this important question. We provided an analysis of different similarity measure choices in Appendix C, including learnable similarity functions. Among the candidates, Pearson's correlation demonstrated the best performance (see Table 5). We did not include cross-correlation or dynamic time warping (DTW), which account for offset differences between two segments, as our sliding window approach inherently addresses these offsets.
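For reference, a minimal sketch of how such a Pearson similarity between a query window and all candidate keys can be computed in vectorized form (the epsilon guard is an illustrative detail, not taken from our implementation):

```python
import numpy as np

def pearson_similarity(query, keys):
    """Pearson correlation between a query window (L,) and keys (N, L).

    Returns (N,) correlations in [-1, 1]. Being shift- and scale-invariant,
    this matches shapes regardless of local level or amplitude, unlike a
    raw dot product. The epsilon guards against constant windows.
    """
    q = query - query.mean()
    k = keys - keys.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(q) * np.linalg.norm(k, axis=1) + 1e-8
    return (k @ q) / denom
```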
2. Computational efficiency.
As clarified in the general response, we assume a single time-series possibly with multiple channels, and the retrieval process is restricted to the training set of the same time-series. Thus, in our setting, the length of the time-series matters, not the size of the dataset. Our model incorporates a retrieval process to find similar patches in the given data. For efficient training, the retrieval process is pre-computed for the training and validation data, requiring computation only once during training. The pre-computation cost for retrieval in our model is O(N²), where N denotes the length of the time-series in the training data. To reduce this retrieval time, one approach is to increase the stride of the sliding window beyond 1, speeding up the search process. We confirmed the trade-off between the stride of the sliding window and performance in Appendix E. Additionally, methods such as clustering time-series segments and using centroids for search or applying tree-based techniques and approximate nearest neighbor (ANN) algorithms can also be considered. Thank you for the suggestions!
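As an illustration of the pre-computation and the stride trade-off, a sketch under simplifying assumptions (the masking of overlapping self-matches and the helper names are our illustrative choices, not a description of the exact implementation; `pearson_similarity` refers to the sketch above):

```python
import numpy as np

def precompute_retrieval(series, L, H, m=5, stride=1):
    """Precompute top-m retrieval indices for every training query.

    series: 1-D training split of length N. Keys are length-L windows taken
    every `stride` steps whose length-H continuation stays inside the split;
    increasing `stride` shrinks the search set (and cost) roughly linearly.
    """
    N = len(series)
    starts = np.arange(0, N - L - H + 1, stride)
    keys = np.stack([series[s:s + L] for s in starts])            # (K, L)
    values = np.stack([series[s + L:s + L + H] for s in starts])  # (K, H)
    top_idx = {}
    for q in range(N - L - H + 1):                 # every training query
        sims = pearson_similarity(series[q:q + L], keys)
        sims[np.abs(starts - q) < L] = -np.inf     # mask overlapping self-matches
        top_idx[q] = np.argsort(sims)[-m:]
    return keys, values, top_idx
```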
3. Sensitivity to hyperparameters.
For the set of periods (P), we followed the approach proposed in the TimeMixer paper [1], which also introduced the concept of multi-periodicity. We determined other hyper-parameters through grid search on a held-out validation set, with the data split into train, validation, and test sets chronologically in a 7:2:1 ratio. The range of hyper-parameters used for grid search is now provided in Appendix B. As suggested, we added analysis on the number of retrievals (m) and the temperature (τ) in Appendix C.
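A schematic of this selection procedure (the split helper, the stand-in `fit`/`val_mse` functions, and the candidate grids below are illustrative placeholders; the actual search ranges are listed in Appendix B):

```python
from itertools import product

def chrono_split(series, ratios=(0.7, 0.2, 0.1)):
    """Chronological train/validation/test split (7:2:1)."""
    n = len(series)
    i = int(ratios[0] * n)
    j = i + int(ratios[1] * n)
    return series[:i], series[i:j], series[j:]

def select_hparams(train, val, fit, val_mse,
                   ms=(1, 3, 5, 10), taus=(0.01, 0.1, 1.0)):
    """Grid search over the number of retrievals m and temperature tau.

    `fit` trains a model on `train` with the given hyper-parameters and
    `val_mse` scores it on `val`; both are stand-ins for illustration.
    """
    scores = {(m, t): val_mse(fit(train, m=m, tau=t), val)
              for m, t in product(ms, taus)}
    return min(scores, key=scores.get)
```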
4. Limited ablation studies.
As suggested, we conducted an additional ablation study (see Appendix C.2). We observed that each component in our retrieval module contributes to performance improvements.
5. Evaluation on more diverse datasets.
We conducted our experiments following the evaluation settings and benchmark datasets used in the latest supervised time-series forecasting studies [1,2,3]. To further demonstrate our model's performance, as suggested, we additionally evaluated it on two benchmark datasets discussed in [1,2,3] that were not included in our original paper. Accordingly, we updated our results in the experiments section. Our findings confirm that our model continues to exhibit strong performance on these newly added datasets.
[1] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (ICLR 2024)
[2] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
[3] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (ICLR 2023)
6. Handling multiple retrieved patterns.
Since retrieval is a widely used method across various domains, its design can take various forms. To demonstrate the effectiveness of retrieving relevant frames from the training set in time-series data, we employed the simplest form of retrieval. As you expected, there is still room for improvement in retrieval techniques specifically tailored for time-series data, including exploring when, where, and how to apply retrieval. We discussed this interesting direction for future work in the conclusion section.
The paper proposes to incorporate a retrieval module for the time series forecasting task. It proposes to perform retrieval in observation space, selecting the m most similar lookback windows, and calculating a weighted average of their corresponding horizons, where the weights are a softmax of the similarity scores previously computed. Following this, several linear layers are used to project both the original lookback window and retrieved time series into the prediction. A "multiple period" extension is also proposed, which performs the same process on downsampled versions of the time series, which are incorporated into the final forecast.
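To illustrate the downsampling step as I understand it (average pooling is my assumption for the pooling operator; the paper may subsample instead):

```python
import numpy as np

def downsample(series, p):
    """Downsample a 1-D series by period p via average pooling."""
    n = len(series) // p
    return series[:n * p].reshape(n, p).mean(axis=1)

# Toy series pooled to coarser resolutions: retrieval (as described above)
# would run once per resolution, and the final linear layers consume the
# original input together with every retrieval output.
x = np.sin(0.3 * np.arange(64))
coarse = {p: downsample(x, p) for p in (2, 4)}  # lengths 32 and 16
```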
The paper performs experiments on the long sequence forecasting setting and shows improved performance compared to several (older) baselines.
Strengths
Retrieval is an interesting idea to explore in the context of time series forecasting. The paper shows that the proposed model improves over some baselines, and presents some analysis surrounding the proposed method.
Weaknesses
From my reading, it is unclear what set of datapoints the retrieval is performed on. Is it the entire time series, i.e., the whole training + validation set + the test set that becomes available during rolling-window evaluation? Or is there a limit to how much data is searched over, and how is this set?
Related to this is my concern regarding the motivation of the paper: the introduction argues that existing methods memorize the training set, and retrieval is a solution that helps generalization by extracting information relevant to the query. However, it turns out that the proposed approach is still retrieving from the train set. I see no big difference between the proposed approach and a Transformer model, which also attends to the lookback window. Retrieval seems to only make sense in the zero-shot setting, where we are trying to make predictions on a time series from a completely new dataset, and performance can be improved by retrieving related time series from that dataset; critically, this dataset wasn't in the training set, so that we can show that it is not just memorization.
The experimental design for ablating the effects of the different components of the proposed method, especially for isolating the improvements from the retrieval component, can be improved. The current experiments simply take NLinear as "without retrieval", but the design space is much larger, and this is an important set of experiments to better understand the effects of retrieval.
Questions
- Question from the Weaknesses section regarding the set of datapoints used for retrieval.
- The introduction states: "This paper examines a critical open question in time-series forecasting: 'do current models possess the necessary inductive biases and learning capacity to extract generalizable patterns from training data and achieve high accuracy?'" -- what are the conclusions regarding this research question?
Thank you for your thoughtful comments. Below, we address your concerns.
1. Clarification on the retrieval process
Thank you for the clarification question. In our experiment, we assume a single time-series possibly with multiple channels, which is split into train, validation, and test sets chronologically. The retrieval process is restricted to the training set of the same time-series, as shown in Figure 1. Our model has no limitations on the length of the training data that can be searched. For other settings, we used the problem setting and the benchmark datasets of prior works in the forecasting domain [1, 2, 3].
[1] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (ICLR 2024)
[2] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
[3] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (ICLR 2023)
2. Difference between RAFT and transformer-variants. Retrieval seems to only make sense in the zero-shot setting.
As clarified in the first question's response, our problem setting only assumes a single time-series, so retrieval from other external time-series data is not available. We also want to clarify that our retrieval process is not limited to the lookback window but instead operates over the entire training dataset of the time series. Transformers, on the other hand, learn relationships only within a fixed lookback window through attention mechanisms. Our model extends beyond this limitation by retrieving relevant data points from the entire time-series and incorporating them into the input. While increasing the lookback window size of a Transformer can provide access to more past information, this approach comes with significant drawbacks, including a reduction in the number of available data points and a substantial increase in task and computational complexity.
Our retrieval enables the model to directly reference relevant timestamps in the training set, eliminating the need for the model to memorize every pattern in the training data. By doing so, the retrieval module reduces the forecasting model's burden of memorizing all patterns in the training set and helps the model learn more generalizable features. This would be particularly beneficial for long-term time-series forecasting tasks involving multiple channels. In the discussion section (Sec 5.2-5.3), we also showed that when the time series has rare or less-correlated patterns, the retrieval-augmented model can handle these cases more effectively than a model without retrieval, which would need to memorize all such patterns to make accurate predictions.
3. Ablation study is limited.
The current experiments simply take NLinear as "without retrieval", but the design space is much larger and this is an important set of experiments to better understand the effects of retrieval.
Our model becomes identical in structure to NLinear if the retrieval module is removed, leaving only the linear predictor. Therefore, we used NLinear as an ablation baseline. Additionally, by assuming only a single period, we minimized the influence of factors like multi-periodicity, simplifying the setup to isolate and evaluate the sole effect of the retrieval module.
To further demonstrate our retrieval module design, we conducted additional ablation studies (e.g., the number of retrievals or retrieval without attention) and reported results in Appendix C. If the reviewer's suggestion regarding the "design space" refers to testing the retrieval module in an ad-hoc manner across other model structures beyond NLinear, such as transformers or other baselines, we acknowledge that there are numerous design choices for integrating retrieval into different architectures. Instead of exploring all these possibilities, we focused on demonstrating its effect in the simplest form. That said, we also agree that our model can enhance performance when integrated into structures beyond simple models. To verify this, we added the retrieval module to a Transformer-variant model (Autoformer) and observed performance improvements. These results have been included in Appendix D.
4. Conclusion of the research question - “Do current models possess the necessary inductive biases and learning capacity to extract generalizable patterns from training data and achieve high accuracy?”
Thank you for the question. We believe that our answer is “no” for current models, and the proposed retrieval module serves as an effective approach to achieve the goal in the research question. Our retrieval module allows traditional time-series models to overcome the limitation of having to learn every specific pattern in the training data by replacing memorization with retrieval. This enables the forecasting model to focus on generalizable patterns across the entire time series, effectively reducing the learning burden and optimizing the model’s required capacity.
Can you explain what the difference is between increasing the lookback window size vs. retrieval on the whole train set? Why does your proposed method not face the same drawbacks as the Transformer model's attention mechanism? As far as I'm aware, the main difference between the proposed retrieval mechanism and attention is the top-k operation. Transformer models can take arbitrary-length input; extending the lookback length does not necessarily decrease the number of available data points.
Regarding the design space of the ablation study, I'm not referring to other model architectures. It is ambiguous what "the proposed model without the retrieval module" should be. The proposed model has projections f, g, and h. While I am comfortable with saying that g is part of the retrieval module, I am not comfortable saying the same of f. These are additional learnable parameters, which challenges the notion that NLinear is the appropriate ablation.
Could you please explain how the experiments and discussions in the paper answer the research question presented in the introduction? Could you also discuss what you mean by inductive biases and learning capacity? They are typically opposing forces - architectures with high inductive bias typically have lower learning capacity, and vice versa.
Thank you for your follow-up questions. We further provide the responses to your questions below:
1. Can you explain what the difference is between increasing the lookback window size vs. retrieval on the whole train set? Why does your proposed method not face the same drawbacks as the Transformer model's attention mechanism? As far as I'm aware, the main difference between the proposed retrieval mechanism and attention is the top-k operation. Transformer models can take arbitrary-length input; extending the lookback length does not necessarily decrease the number of available data points.
When training a Transformer with a large lookback window size, the sliding window method for constructing the training set results in the number of training data points being reduced to “training set length - lookback window size + 1”. Therefore, as the length of the training set increases, the number of usable input-target pairs for training naturally decreases. To attend to the entire training set with the Transformer, the lookback window size becomes equal to the length of the training set, which reduces the number of usable data points to just one. In contrast, our approach avoids this drawback by retrieving and attaching relevant frames from the training dataset to the input without increasing the lookback window size. This allows us to leverage the entire training set effectively without reducing the number of usable data points.
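As an illustrative calculation (the numbers are hypothetical, not from our experiments): with a training split of length 10,000 and a lookback window of 336, stride-1 slicing yields 10,000 - 336 + 1 = 9,665 input windows, whereas setting the lookback window to the full training length leaves exactly one.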
2. Regarding the design space of the ablation study, I'm not referring to other model architectures. It is ambiguous what "the proposed model without the retrieval module" should be. The proposed model has projections f, g, and h. While I am comfortable with saying that g is part of the retrieval module, I am not comfortable saying the same of f. These are additional learnable parameters, which challenges the notion that NLinear is the appropriate ablation.
As you understood, g is part of the retrieval module, while f and h are not. Therefore, as described in Eq. (6), 'the proposed model without the retrieval module' reduces to h ∘ f. Since both f and h are linear projections, this can be simplified to a single MLP layer. Consequently, we considered this model to be equivalent to NLinear.
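This equivalence can be checked numerically (a self-contained illustration, not code from our implementation):

```python
import numpy as np

# Two stacked linear maps with no nonlinearity collapse into one:
#   h(f(x)) = B(Ax + a) + b = (BA)x + (Ba + b)
rng = np.random.default_rng(0)
A, a = rng.normal(size=(32, 16)), rng.normal(size=32)  # f(x) = Ax + a
B, b = rng.normal(size=(8, 32)), rng.normal(size=8)    # h(z) = Bz + b
x = rng.normal(size=16)

stacked = B @ (A @ x + a) + b
merged = (B @ A) @ x + (B @ a + b)
assert np.allclose(stacked, merged)
```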
3. Could you please explain how the experiments and discussions in the paper answer the research question presented in the introduction? Could you also discuss what you mean by inductive biases and learning capacity? They are typically opposing forces - architectures with high inductive bias typically have lower learning capacity, and vice versa.
Inductive bias represents 'the set of assumptions that the learner uses to predict outputs given inputs.' We consider the retrieval-based structure in our model to provide a form of learning bias. By enabling the forecasting model to access relevant information through retrieval, the model is relieved of the need to memorize all patterns in the training data, as the necessary information is retrieved on demand. The term 'learning capacity' in our manuscript does not refer to the breadth of the model's parameter search space. Instead, it reflects the model's ability to leverage patterns in the training data for inference. Therefore, providing additional information through retrieval does not increase the model's inherent learning capacity but rather reduces its 'learning burden.' The experiments in our discussion demonstrate how the retrieval module assists the forecasting model by supplying patterns that are otherwise challenging to memorize, thereby showcasing how this inductive bias supports learning. To avoid confusion, we will clarify the intended meaning of 'learning capacity' in the revised manuscript.
- I'd like to point out that Transformers (along with RNNs, certain CNN architectures, and others) have flexible input sizes. The idea that increasing the maximum context length leads to fewer training samples, based on "train set length - lookback length + 1", is mainly due to laziness in engineering - the field should move past this quickly. Also, I still do not understand why "retrieval" is not considered increasing the lookback window size, if it is basically considering the past time steps as inputs to the retrieval module.
- While there exists an equivalence, the learning dynamics are different, and the models may learn different weights and have different performance. The number of stored parameters is also different.
- I think the experiments and discussion answer a different question than the RQ posed in the introduction. The RQ is about existing models and whether they are memorizing or generalizing. The experiments and discussions only make claims about the proposed model.
My rating remains unchanged, as my concerns have not been addressed.
- I'd like to point out that Transformers (along with RNNs, certain CNN architectures, and others) have flexible input sizes. The idea that increasing the maximum context length leads to fewer training samples, based on "train set length - lookback length + 1", is mainly due to laziness in engineering - the field should move past this quickly. Also, I still do not understand why "retrieval" is not considered increasing the lookback window size, if it is basically considering the past time steps as inputs to the retrieval module.
Even though Transformers can accommodate longer lookback window sizes, prior time-series research [1, 2] has reported that Transformers perform worse with longer inputs. This suggests that relying solely on attention becomes insufficient for capturing all patterns in the data as the input length increases. Our approach, on the other hand, demonstrates the efficiency of maintaining a shorter input while searching for relevant frames in the training data. We also conducted experiments adding training samples with varied input lengths for Transformer models to assess whether prediction performance increases. However, the results below (MSE across forecast horizons 96-720) showed no improvement on average, while sacrificing computational efficiency. We believe these findings may explain why other time-series works have not adopted this approach.
| Data | Model | 96 | 192 | 336 | 720 | Average |
|---|---|---|---|---|---|---|
| ETTh1 | Autoformer | 0.449 | 0.500 | 0.521 | 0.514 | 0.496 |
| ETTh1 | + Varied input length | 0.539 | 0.588 | 0.526 | 0.519 | 0.543 |
| ETTh2 | Autoformer | 0.346 | 0.456 | 0.482 | 0.515 | 0.450 |
| ETTh2 | + Varied input length | 0.508 | 0.551 | 0.665 | 0.532 | 0.564 |
| ETTh1 | Informer | 0.865 | 1.008 | 1.107 | 1.181 | 1.040 |
| ETTh1 | + Varied input length | 1.606 | 1.536 | 1.522 | 1.204 | 1.467 |
| ETTh2 | Informer | 3.755 | 5.602 | 4.721 | 3.647 | 4.431 |
| ETTh2 | + Varied input length | 3.046 | 5.591 | 5.829 | 3.826 | 4.573 |
[1] Are Transformers Effective for Time Series Forecasting? (AAAI 2023)
[2] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
- While there exists an equivalence, the learning dynamics are different, and the models may learn different weights and have different performance. The number of stored parameters is also different.
We want to clarify that f and h are linear layers in our setting. Thus, simply composing these two functions without a nonlinear activation is architecturally equivalent to a single linear layer. Both the structure and objective of our model are identical to NLinear, and most importantly, running the model in the same setting produces the same performance results. If the reviewer has any suggestions for additional ablation studies, we would be happy to implement them.
- I think the experiments and discussion answer a different question than the RQ posed in the introduction. The RQ is about existing models and whether they are memorizing or generalizing. The experiments and discussions only make claims about the proposed model.
Our proposed research question aimed to emphasize our motivation of addressing trivial patterns through retrieval, allowing the model to focus on learning generalizable features. If this question felt too broad, we will refine and scope it down in the final version to more specifically describe our motivation.
Thank you for providing valuable feedback to improve our paper. Our problem setting assumes forecasting for a single time-series, possibly with multiple channels. We would like to clarify that our problem setting and the benchmark datasets used follow prior works in the forecasting domain [1, 2, 3]. The contribution of our paper lies in enabling the forecasting model to directly reference relevant timestamps in the training set through retrieval, eliminating the need to memorize every pattern in the training data. By doing so, the retrieval module reduces the learning burden of memorizing all patterns and helps the model learn more generalizable features. In response to the reviewers' comments, we made the following major modifications during the rebuttal period (the revised manuscript highlights the changes in red):
- We included additional datasets and baselines. All hyper-parameters for both baselines and our model were tuned based on validation set performance following prior works, and the best performance results were compared (Table 1 in the revised manuscript).
- We performed and included a detailed ablation study and hyper-parameter analysis (Appendix C in the revised manuscript).
- We analyzed the computational complexity of the retrieval process and proposed alternatives to improve efficiency. Related experiments were added to the paper (Appendix E in the revised manuscript).
- We incorporated the suggested related work and discussed how it differs from our model (Lines 115–117, Line 142, Line 161 in the revised manuscript).
- We confirmed that the retrieval module can be extended beyond linear models to transformer architectures and included this analysis (Appendix D in the revised manuscript).
[1] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (ICLR 2024)
[2] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters (ICML 2024)
[3] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (ICLR 2023)
This paper proposes a retrieval-augmented method for time series forecasting. Similar in spirit to nearest neighbor based models, it searches for candidates from historical data that are similar to the current input and then uses them to help to make future prediction.
Major strengths:
- The proposed method is somewhat novel, in the sense that incorporating a retrieval module is not often considered in the context of time series forecasting, although it is similar in spirit to nearest neighbor classification.
- The performance of the proposed method in terms of accuracy seems to be good according to the experiments reported.
Major weaknesses:
- This approach incurs a higher test-time computational demand.
- The simple similarity measure used may not be adequate for time series data with more complex dynamics.
- This approach may not perform well when there exist significant concept drifts in the time series, which are common in the real world.
- The optimal number of candidates to retrieve may depend on the dataset and hence needs to be determined separately.
We appreciate the authors' effort in responding to the comments raised by the reviewers and conducting additional experiments. Although this work has the potential to make an impact, it would be more ready for publication if some outstanding issues, including the weaknesses listed above, were addressed more thoroughly. The authors are encouraged to improve their paper for future submission by considering the comments and suggestions of the reviewers.
Additional Comments from Reviewer Discussion
The authors engaged in discussions with the reviewers and provided more experiment results to revise their paper. However, even with the revisions, some reviewers still do not feel comfortable with accepting it as it is due to the outstanding issues that still need to be addressed more comprehensively. The only reviewer who is inclined to acceptance only puts this work “marginally above the acceptance threshold” and this reviewer did not participate in discussions.
Reject