In-context Fine-tuning for Time-series Foundation Models
We design a time-series foundation model that can be prompted with related examples and forecasts more accurately than supervised models, statistical models, and even fine-tuned time-series foundation models.
Abstract
Reviews and Discussion
The paper introduces a novel in-context fine-tuning methodology for time-series foundation models, specifically implemented on a pretrained TimesFM model. By leveraging multiple related time-series examples as context at inference time, the model adapts effectively to the target domain’s distribution, leading to competitive zero-shot forecasting performance. The results indicate that in-context prompting can serve as a compelling alternative to explicit fine-tuning.
Strengths
- Novelty and Impact: The paper introduces an innovative in-context fine-tuning strategy for time-series foundation models, advancing beyond traditional fine-tuning approaches. This approach is both timely and relevant, given the recent rise of time-series foundation models.
- Comprehensive Evaluation: The authors present a thorough empirical evaluation, comparing their model against various supervised models and competing foundation models, and conducting detailed ablation studies to highlight the efficacy of the proposed in-context adaptation.
Weaknesses
- Clarity in Defining Relevance: While the approach is innovative, a clearer definition or practical guidelines for defining relevance would strengthen the paper.
- Generalizability Across Architectures: The paper would benefit from experiments evaluating the method’s performance with different time-series foundation model architectures.
- Computational Efficiency: The in-context forecasting objective requires multiple examples as context, adding computational overhead. Since Section 6.4.1 suggests that increased context boosts performance, the computational cost could become substantial. A deeper analysis of this trade-off would be beneficial.
Questions
- Could the authors provide practical guidance on determining the relevance of related time-series examples? Practical insights on this aspect would greatly help practitioners in applying this research.
- Have the authors tested the method with alternative foundation models to evaluate cross-model generalizability?
- Could the authors provide a comparison of the computational demands of TimesFM (Base) and TimesFM-ICF?
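To make this comparison concrete, here is a rough back-of-the-envelope sketch of how token counts and attention cost might scale. The patch length of 32, the 512-step context, the 128-step horizon, and the single separator token per example are assumptions for illustration, not figures taken from the paper.

```python
# Rough, illustrative comparison of token counts and self-attention cost.
# Assumptions (not from the paper): patch length 32, 512-step history,
# n in-context examples of 512 steps each, one separator token per example.

PATCH_LEN = 32

def num_tokens(history_len: int, n_examples: int = 0, example_len: int = 512) -> int:
    """Patch-token count of a flattened input with optional in-context examples."""
    base = history_len // PATCH_LEN
    extra = n_examples * (example_len // PATCH_LEN + 1)  # +1 for a separator token
    return base + extra

def attention_cost(tokens: int) -> int:
    """Self-attention cost grows roughly quadratically in the token count."""
    return tokens ** 2

base_tokens = num_tokens(512)                 # TimesFM (Base): 16 tokens
icf_tokens = num_tokens(512, n_examples=50)   # TimesFM-ICF with 50 examples: 866 tokens
print(icf_tokens / base_tokens)                                  # ~54x more tokens
print(attention_cost(icf_tokens) / attention_cost(base_tokens))  # ~2900x attention cost
```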
The paper proposes an in-context fine-tuning method for TimesFM by:
- Pretraining with In-Context Examples: Training TimesFM with related sequences to adapt patterns within its context window, enhancing zero-shot capabilities.
- Model Architecture and Tokenization: The modified TimesFM architecture is decoder-only, leveraging causal attention for cross-series pattern recognition and handling positional encoding issues unique to time-series.
- Separators for Cross-Series Attention: A learned separator token distinguishes between related time-series examples in the input, preventing blending of unrelated patterns.
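As a minimal sketch of how such an input might be assembled, the snippet below concatenates embedded patches of related series with a learned separator embedding before the target history. The function and argument names are hypothetical placeholders; this is illustrative, not the paper's implementation.

```python
import torch

def build_icf_input(example_patches: list[torch.Tensor],
                    target_patches: torch.Tensor,
                    sep_embedding: torch.Tensor) -> torch.Tensor:
    """Concatenate embedded patches of each related series, each followed by a
    learned separator embedding, then append the target series' history patches.

    example_patches: list of (num_patches_i, d_model) tensors, one per related series
    target_patches:  (num_target_patches, d_model) tensor for the target history
    sep_embedding:   (d_model,) learnable separator token
    """
    pieces = []
    for patches in example_patches:
        pieces.append(patches)
        pieces.append(sep_embedding.unsqueeze(0))  # mark the example boundary
    pieces.append(target_patches)  # the forecast continues autoregressively from here
    return torch.cat(pieces, dim=0)  # (total_tokens, d_model) for the decoder-only stack
```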
Strengths
- Novelty in In-Context Fine-Tuning: This paper provides an innovative approach to time-series foundation models, extending in-context learning techniques from NLP to time-series data. The method of using related time-series as contextual inputs could allow the model to leverage additional patterns, improving zero-shot generalization.
- Learned Separator Token: Introducing a learned separator token (sigma) is a creative idea to mark boundaries between time-series examples, which could improve model performance by making it aware of example boundaries. This separator approach is relatively novel and might serve as a building block for other multi-context models.
- Scalability with Decoder-Only Architecture: The choice of a decoder-only stacked transformer for time-series forecasting allows scalability to long horizons and multiple in-context examples. This approach could be particularly beneficial when handling complex temporal dependencies across varied time-series.
Weaknesses
- Separation of Contextual Time-Series Examples: The authors propose using a learnable separator token (sigma) to distinguish between in-context time-series examples. While this can theoretically help prevent pattern blending, it lacks a clear explanation or empirical evidence on how effectively this separation operates within the transformer layers. Specifically:
- Ambiguity in Attention Calculation: It is unclear how the transformer layers use (sigma) to separate information from different time-series examples. The mechanism by which the model differentiates patterns before and after (sigma) is underspecified.
- Potential Overlap in Learned Representations: If the transformer layers rely heavily on self-attention without more sophisticated masking or boundary control, representations of each time-series could still overlap, leading to blending across series. This could be especially problematic when handling time-series with similar trends, potentially confusing the model.
- Attention Mechanism over Non-Homogeneous Data: The model attempts to attend across series with possibly different scales, lengths, or temporal patterns, which introduces challenges:
- Lack of Normalization: Each time-series may have a different scale, potentially confusing the model without explicit normalization between examples (a minimal sketch of per-example standardization follows this list).
- Variable-Length Challenges: The methodology does not address how the attention mechanism handles variable lengths (T_i) of the in-context examples. Time-series with different patterns or temporal granularity could introduce biases unless carefully aligned.
- Positional Encoding and Temporal Order: Although the authors suggest that causal attention in stacked transformers allows the model to “implicitly” capture positional information without explicit encoding (following Haviv et al., 2022), this assumption may lead to inaccuracies:
- Implicit Positional Information: Without positional encodings, the temporal order within each time-series may not be accurately preserved, especially when multiple examples with different temporal ranges are concatenated. This could dilute the temporal structure that is crucial for time-series forecasting.
- Temporal Invariance: Assuming the transformer can infer temporal order without encoding could lead to a loss of precise order-dependent relationships, which may reduce model performance on sequences where exact timing is critical.
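Regarding the normalization point above, here is a minimal sketch of what per-example standardization could look like; this is only an illustration of the concern raised, not a description of what the paper actually does.

```python
import numpy as np

def standardize_per_example(examples: list[np.ndarray], eps: float = 1e-8) -> list[np.ndarray]:
    """Rescale each in-context series with its own mean and standard deviation,
    so series on very different scales are comparable before being concatenated."""
    normalized = []
    for x in examples:
        mu, sigma = x.mean(), x.std()
        normalized.append((x - mu) / (sigma + eps))
    return normalized
```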
Questions
- Separator Token Effectiveness: How does the model empirically distinguish between different in-context examples using the separator token? Could you provide ablation studies or visualization of attention patterns to confirm that information from different examples does not blend?
- Handling of Variable-Length Sequences: When handling in-context examples of varying lengths (T_i), how does the model adapt its attention mechanism? Do you apply any additional normalization or padding strategies to align examples of different lengths and scales?
- Impact of Positional Encoding: Have you tested whether adding explicit positional encodings (or relative positional encodings) improves performance in this setup? If causal attention alone is used, how does the model ensure the correct temporal order within and between examples?
- Effectiveness of Cross-Series Attention: Could the authors clarify if any boundary-sensitive attention masking is applied across examples, or if attention scores can freely flow across all tokens? If the latter, could blending of time-series data lead to potential inconsistencies or biases?
Details of Ethics Concerns
N/A
This paper presents a methodology for in-context fine-tuning of a time series foundation model (TimesFM) to enhance forecasting capabilities. The authors begin with a base TimesFM checkpoint and adapt it to leverage not only the target time series' historical data but also related series examples at inference time. Their findings demonstrate that in-context fine-tuning can significantly boost zero-shot performance on popular forecasting benchmarks, surpassing the base model; notably, it even outperforms a version of the base model that was fine-tuned directly on the target domain. Although this study focuses on one specific foundation model for time series forecasting, exploring in-context fine-tuning with other foundation models could be a promising area for future research. Additionally, investigating improved relative positional encodings (the authors mention NoPE) for continued pre-training, particularly tailored to handling in-context examples and length generalization, would be another valuable direction!
Strengths
(1) The authors were motivated by the strong zero-shot performance of the GPT model in time series forecasting and proposed an adaptation of the TimesFM foundation model to leverage additional information through in-context examples. They initially pre-trained TimesFM in its original form to establish a base checkpoint, then modified the model’s architecture and continued pre-training with in-context examples to develop a new pre-trained foundation model, TimesFM-ICF!
(2) Regarding originality, the paper introduces the concept of in-context fine-tuning specifically adapted for time-series data. While leveraging context in machine learning is not new, the authors skillfully adapt this idea from NLP to address the unique challenges of time-series forecasting. This approach is especially compelling, as it combines established principles from LLMs with the nuances of temporal data, filling a gap in current literature. The authors’ methodology redefines the possibilities for time-series forecasting and opens new research avenues, suggesting that similar examples in time-series data can enhance model performance beyond what traditional methods achieve.
(3) The research quality is evident in the rigorous methodology applied throughout the study. The authors support their claims with extensive empirical evaluations on prominent forecasting benchmarks, showing clear performance improvements over both supervised deep learning methods and existing time-series foundation models. This empirical evidence reinforces the credibility of their approach, and the thorough experimental design—such as careful dataset selection and detailed result analysis—reflects a high standard of quality!
(4) Clarity is another strong point, as the authors clearly communicate their ideas and findings. The paper’s structure is logical and well-organized, leading the reader smoothly through the introduction of the problem, methodology, and results. Clear language and precise terminology help demystify complex concepts, making the paper accessible even to those less familiar with time-series forecasting. Illustrative examples and figures further aid understanding, showing the practical implications of in-context fine-tuning!
(5) The paper’s significance is substantial. By demonstrating that in-context fine-tuning significantly improves forecasting accuracy, the authors offer insights that could influence both research and practical applications. The findings suggest that this approach could be a game-changer for practitioners, enabling improved forecasting accuracy without the need for extensive task-specific training data. Moreover, the potential applications of in-context learning extend beyond time-series forecasting, expanding the relevance of this research to other domains.
Weaknesses
(1) The novelty of this work could be made clearer with a deeper comparative analysis against established methods. The authors claim their in-context fine-tuning approach outperforms traditional supervised learning and other foundation models. However, it would be helpful if they could further clarify what sets this approach apart, especially concerning few-shot learning techniques commonly applied in language models. In particular, it would be insightful for the authors to explain how their method handles the unique temporal dependencies in time series data, a challenge that typical language model techniques might not address! It would also be useful to compare the approach to well-known techniques, such as those used in [1] and [2]. It could provide a stronger context for understanding the distinctiveness and value of the method within the broader landscape of in-context learning for time series data!
(2) The experimental design would benefit from testing on a wider variety of datasets and forecasting scenarios. Currently, the benchmarks don’t seem to fully reflect the range of time-series data seen in real-world applications. It might help to include datasets that capture different characteristics—like high seasonality, irregular sampling, or multivariate setups—since these are common in practice. For instance, datasets like LOTSA (as used in [4]) show different levels of noise, seasonality, and trend, which would provide a more comprehensive picture of the model’s performance. It would also give insight into how well the model generalizes across different types of time series and adapts to diverse patterns!
(3) To strengthen the paper, I would suggest a more detailed analysis of the criteria used to select in-context examples during inference. The authors mention that using related time series examples improves model performance, but it would be helpful to explain how they determine which examples are "related" enough to be effective. An ablation study comparing different selection approaches, such as the use of domain knowledge, could clarify the impact of these choices on model performance and provide useful insights for future applications! (A sketch of one such selection heuristic appears after the reference list below.)
(4) To add balance to the discussion, it would be valuable to include a deeper look at the approach's limitations. Specifically, it could be helpful to examine situations where the method might struggle, such as when it encounters new or unusual time series patterns not represented in the training set or in the in-context examples. I think it would also be beneficial if the authors considered possible strategies for addressing these limitations or outlined some directions for future work. For example, they might explore ways to detect when the model encounters out-of-distribution patterns or consider methods to adapt the model to new data patterns without extensive retraining!
[1] Language Models are Few-shot Learners, Brown et al. (2020)
[2] Improving In-Context Few-Shot Learning via Self-Supervised Training, Chen et al. (2022)
[3] What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, Garg et al. (2022)
[4] Unified Training of Universal Time Series Forecasting Transformers, Woo et al. (2024)
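The selection heuristic referenced in weakness (3) is sketched here as one hypothetical option: rank candidate series by absolute Pearson correlation with the target history and keep the top-k. It is offered only as an example of the kind of criterion an ablation could compare; the paper's own selection procedure may differ.

```python
import numpy as np

def select_related_examples(target_history: np.ndarray,
                            candidates: list[np.ndarray],
                            k: int = 10) -> list[np.ndarray]:
    """Rank candidate series by absolute Pearson correlation with the target
    history over their overlapping suffix, and return the top-k candidates."""
    scores = []
    for c in candidates:
        n = min(len(target_history), len(c))
        corr = np.corrcoef(target_history[-n:], c[-n:])[0, 1]
        scores.append(0.0 if np.isnan(corr) else abs(corr))
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]
```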
Questions
(1) While the authors indicate that related time-series examples can improve forecasting accuracy, they do not clarify the criteria for selecting these examples! A more detailed explanation of the selection process, whether based on similarity metrics, temporal proximity, or other factors, would provide valuable insight into the model's adaptability. Additionally, understanding how the model performs when the in-context examples are less relevant or of lower quality would be useful! This could open a discussion on the model's robustness in real-world scenarios where data may be noisy or incomplete!
(2) Another potential area for improvement is the discussion of limitations regarding scalability and computational efficiency. As the number of in-context examples increases, so may the computational load, which could affect the model's feasibility for large-scale applications. Insight into how the model manages larger context windows, as well as any diminishing returns in performance with an increasing number of examples, would be valuable. This consideration could also connect to the findings from [1], which investigates context efficiency in LLMs. A comparison with the proposed approach might highlight areas for adaptation or further improvement!
(3) A more comprehensive look at the model's performance across different forecasting horizons would strengthen the evaluation. While the authors mention that the model adapts to various contexts, empirical results on how it performs across short and long-term forecasting horizons would be valuable. This could reveal whether in-context fine-tuning is consistently effective or better suited for specific forecasting tasks!
(4) The authors suggest that their approach rivals traditional fine-tuning on the target domain, but a deeper analysis of the trade-offs between in-context and conventional fine-tuning would be beneficial. Specifically, in which scenarios might one approach be preferable? Clarifying these nuances would aid practitioners in choosing the appropriate method for their specific forecasting needs!
(5) The paper could broaden its discussion on the implications for future research. While the authors suggest potential adaptations for other foundation models, a more detailed exploration of how these adaptations might extend to different domains, such as image or text data, would be valuable. Drawing connections to the broader landscape of foundation models, including references like [1], would offer a richer context for the implications of their work!
After the authors address these suggestions and the points raised in the weaknesses section, I would be open to improving my score!
[1] Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting, Rasul et al. (2024)
Details of Ethics Concerns
N/A
The paper presents in-context finetuning of foundation models for time series forecasting with a few-shot foundation model that utilizes the in-context learning ability of the transformer architecture and related time series as in-context examples. The proposed work is an extension of the TimesFM architecture with modifications to handle related time series by learning a separator token to distinguish between different in-context examples. The in-context examples are either time-series level, obtained by splitting a longer history into multiple examples, or dataset level, obtained by sampling examples from other time series within the dataset. The proposed architecture is evaluated on a comprehensive benchmark of datasets.
Strengths
- The paper is well-written and easy to follow.
- The authors propose an interesting direction toward improving forecasting in foundation models without additional parameter updates during inference. The solution to utilize related series as in-context examples can be useful, particularly for domains not seen during pretraining.
- The ablation studies provide meaningful insights into the benefits of in-context examples for forecasting. I find it surprising that having a lot of in-context examples outperforms a long history.
Weaknesses
- Unified Training of Universal Time Series Forecasting Transformers. Although cited, the authors have not discussed this prior work in detail. The Moirai architecture is the most relevant to the problem the authors are trying to solve. Moirai incorporates variable patch sizes and also handles related time series and exogenous variables to improve forecasts. Moirai assigns a variate id to each time series and sequentially concatenates it with the “main” time series, similar to what the authors have proposed. Including a detailed comparison with Moirai would help in understanding the novelty of the proposed solution.
- The main claim of this paper is that in-context finetuning is better than zero-shot forecasting and dataset-specific finetuning of foundation models. The experiment section lacks a comparison with existing foundation models. The authors have mainly compared with TimesFM, but it would be insightful to see how in-context finetuning fares in comparison to various foundation models. The results on the Monash benchmark can be compared with LLM-based solutions, specifically with Chronos, as the history and the forecast horizon are smaller for these datasets, and the ETT datasets can be compared with Moirai. These comparisons would provide stronger evidence for the effectiveness of in-context fine-tuning.
Questions
- Can the authors provide a detailed comparison of their proposed solution with the Moirai architecture? Specifically, how does their method perform better in in-context learning than Moirai?
- Additionally, could the authors compare Moirai with in-context examples alongside their solution on the ETT datasets?
- It would be interesting if the authors could provide insights into why a longer history performs poorly in comparison to having in-context examples with a shorter history.
The paper proposes the concept of in-context fine-tuning for time series forecasting. The approach proposes to leverage an existing autoregressive transformer-based foundation model for univariate time series forecasting, and perform a second stage of training. The input to the model is modified in the second-stage training by appending multiple related time series, which are separated by a "separator token" that is a learnable embedding. The standard autoregressive MSE loss is applied. Two rudimentary approaches to obtaining related examples are introduced, which generally involve randomly sampling from related data. The model can then be used at inference time as per how it was trained in the second stage.
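A compact sketch of the two rudimentary context-generation strategies summarized above, as I understand them; the function names, chunking details, and sampling scheme are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def time_series_level_examples(history: np.ndarray, example_len: int) -> list[np.ndarray]:
    """Split the target series' own (longer) history into consecutive chunks and
    use the earlier chunks as in-context examples."""
    n_chunks = len(history) // example_len
    return [history[i * example_len:(i + 1) * example_len] for i in range(n_chunks)]

def dataset_level_examples(dataset: list[np.ndarray], n_examples: int,
                           example_len: int, rng: np.random.Generator) -> list[np.ndarray]:
    """Randomly sample fixed-length windows from other series within the same dataset."""
    examples = []
    for _ in range(n_examples):
        series = dataset[rng.integers(len(dataset))]
        start = rng.integers(max(1, len(series) - example_len + 1))
        examples.append(series[start:start + example_len])
    return examples
```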
The paper presents several experiments, including an in-domain evaluation on the Monash TSF benchmark, as well as out-of-domain evaluation on some of the Long Sequence Forecasting datasets. The results show that the ICF paradigm produces improved performance, compared both to the base foundation model and to a standard fine-tuned version of the foundation model.
Strengths
The paper presents a timely solution to the key problem of how to more efficiently and effectively adapt time series foundation models for downstream tasks/datasets. The presented approach is intuitive and importantly, shows promising performance. The paper also presents important ablations such as Figure 6, which validates the hypothesis that related examples improve performance, and increasing the number of examples available for in-context learning leads to improved performance.
Weaknesses
While I am leaning towards acceptance, there are some shortfalls in the paper, mostly related to experiments and reproducibility. I would be very happy to increase my score if the experiments section is made more robust.
- Missing details regarding the second-stage training. The following is a non-exhaustive list of details that should be provided: typical hyperparameters such as batch size, learning rate, optimizer, etc. Were optimizer parameters loaded from the base foundation model checkpoint? What about the learning rate scheduler?
- More details for fine-tuning per dataset should be provided. Was a best effort made to ensure suitable hyperparameters were used for this fine-tuning?
- For the full results table, the baseline of TimesFM-ICF (n=1) should be added for all datasets. Two things we want to see: first, how much of the improvement comes from true "in-context learning" versus a second stage of training (this is partially resolved by the first ablation study); second, whether this "in-context fine-tuning" specialization hurts the vanilla zero-shot ability of the model, or whether it remains competitive.
- Table 5 in the appendix seems to show results slightly contradicting Section 6.4.1 - while the MSE does not decrease monotonically, it still shows a general trend of improvement as the number of examples increases. Perhaps the wording in Section 6.4.1 should be rephrased.
- More analysis could be provided into the different context generation approaches.
Questions
- Section 6.4.1 states that the number of examples ablation is only performed for ETT datasets - what about Table 5?
- Can the authors verify that context generation for both the training and evaluation phases ensured that there was no data leakage?
Dear authors,
I find your idea of introducing in-context learning (ICL) to time-series foundation models exciting. Still, I'll greatly appreciate it if you can help me clarify these points:
- What's your motivation for introducing ICL ability to the time-series foundation model? Do you believe current time-series foundation models cannot do ICL inference?
- How do you form in-context examples during the zero-shot evaluation on Monash and ETT? Specifically, are you using the 'time-series level' option and assuming the model has access to all previous histories as context candidates?
- As in Sections 5 and 6.2, you have n=50 in-context examples, each with T=512+128 time steps for ETT. In this case, does the model receive an input sequence containing n*T=32,000 time steps? It would be beneficial if you could release model hyperparameters such as the input look-back window lengths.
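For reference, the arithmetic behind this question, using the stated n=50 and T=512+128; the input patch length of 32 is an assumption about TimesFM, so the resulting token count is only indicative.

```python
n_examples = 50           # as stated in Sections 5 and 6.2
context_len = 512
horizon_len = 128

steps_per_example = context_len + horizon_len       # 640 time steps
total_time_steps = n_examples * steps_per_example   # 32,000 time steps

# If inputs are patched (assumed input patch length of 32, plus one separator
# token per example), the flattened sequence length in tokens would be roughly:
patch_len = 32
approx_tokens = total_time_steps // patch_len + n_examples  # ~1,050 tokens
print(total_time_steps, approx_tokens)
```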
Thanks.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.