TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning
Abstract
Reviews and Discussion
This paper introduces TiRex, a pre-trained time series forecasting model leveraging xLSTM, a modern LSTM variant, to address the limitations of existing zero-shot forecasting approaches. TiRex integrates Contiguous Patch Masking (CPM), a novel training-time masking strategy, and three data augmentation techniques to enhance state-tracking and generalization. Experiments on HuggingFace benchmarks GiftEval and Chronos-ZS demonstrate that TiRex outperforms significantly larger models (e.g., TabPFN-TS, Chronos Bolt) in both short- and long-term forecasting, with fewer parameters (35M) and faster inference speed. The model’s success stems from combining xLSTM’s in-context learning with CPM to mitigate autoregressive error accumulation and augmentations to handle diverse time series patterns.
Strengths and Weaknesses
Strengths
- By integrating xLSTM, TiRex bridges the gap between LSTM's state-tracking and the transformer's in-context learning, enabling robust long-horizon forecasting.
- CPM enhances multi-patch prediction stability, while augmentations (amplitude modulation, censor augmentation, spike injection) improve generalization to rare patterns.
- Sets a new SOTA on zero-shot benchmarks, outperforming larger models with 11× faster inference than TimesFM-2.0.
Weaknesses
- Currently focuses on univariate time series analysis, with multivariate modeling explicitly noted as a direction for future research.
- Omits evaluation on widely adopted benchmark datasets (e.g., ETTh1, ETTh2, ETTm1, ETTm2, Weather, Electricity) that are routinely used in state-of-the-art (SOTA) studies [1]. This gap limits direct comparability with prior works and may obscure the model's generalizability across diverse time series modalities.
- Lacks a publicly available codebase for independent verification, which hinders reproducibility and community validation of the reported findings.
[1] Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., & Sahoo, D. (2024). Unified Training of Universal Time Series Forecasting Transformers. In Proceedings of the Forty-First International Conference on Machine Learning.
Questions
Please respond to these weaknesses.
Limitations
Yes.
Final Justification
Since the experimental results for most of the datasets in question are included in this research, and most of my concerns have been addressed by the authors, I would like to raise the score.
Formatting Issues
NA.
We thank the reviewer for their thorough summary and for acknowledging the strengths of our work, including the novel integration of xLSTM, the benefits of CPM, and the model's state-of-the-art performance and efficiency. In the following, we address the weaknesses noted by the reviewer, as requested in their questions:
- Currently focuses on univariate time series analysis, with multivariate modeling explicitly noted as a direction for future research.
The reviewer is correct that TiRex models each time series variate independently. We identified this as a limitation and an area for future work. However, we want to emphasize that this approach remains remarkably effective. Previous work (e.g., [1]) has shown that univariate models can be strong on multivariate tasks.
Our results on the GiftEval benchmark, which includes multiple multivariate datasets (8 out of 23), confirm this. TiRex outperforms models designed specifically for multivariate forecasting (e.g., Moirai, TTM) on the overall benchmark (Figure 1). To make this point more explicit, the table below shows the performance (Top 3 models by CRPS) on just the multivariate datasets within the GiftEval-ZS benchmark, where TiRex still achieves the top rank. This demonstrates that for many real-world problems, accurately modeling individual series dynamics is the most critical factor.
| Model | CRPS | MASE |
|---|---|---|
| TiRex | 0.40 | 0.61 |
| TabPFN-TS | 0.44 | 0.64 |
| TimesFM 2.0 | 0.45 | 0.63 |
We will add this extended discussion to the paper.
- Omits evaluation on widely adopted benchmark datasets (e.g., ETTH1, ETTH2, ETTM1, ETTM2, Weather, Electricity) that are routinely used in state-of-the-art (SOTA) studies. This gap limits direct comparability with prior works and may obscure the model's generalizability across diverse time series modalities.
We note that our evaluation already includes all the datasets the reviewer is concerned about. As detailed in our appendix, the GiftEval benchmark includes ETTh1, ETTh2, ETTm1, ETTm2, and Electricity. The Chronos-ZS benchmark includes the Weather dataset. The individual results are presented in Tables 9 to 18 in the appendix.
Our choice to use the GiftEval and Chronos-ZS benchmarks instead of the older LFS benchmark (which consists of the datasets mentioned) was deliberate and made to ensure a more rigorous, transparent, and comprehensive evaluation. We will explain this in the final manuscript.
In short, recent work [e.g., 2,3] has criticized the LFS benchmark for issues with inconsistent evaluation protocols and limited diversity, which hinder generalizable comparison. In contrast, GiftEval and Chronos-ZS offer:
- Greater Diversity and Generalizability: GiftEval contains 97 evaluation settings across 23 diverse datasets, while Chronos-ZS contains 27 datasets --- both covering diverse domains. This provides a much broader test of a model's zero-shot capabilities than the 6 datasets in the LFS benchmark.
- Standardization and Reproducibility: Both benchmarks are hosted on public leaderboards with standardized evaluation code, ensuring that all models are compared fairly under the same conditions. This resolves comparability issues present in prior work.
Therefore, by using these comprehensive benchmarks, we are confident that our evaluation provides a more robust and comparable assessment of SOTA performance across a wide range of time series domains.
- Lacks a publicly available codebase for independent verification, which hinders reproducibility and community validation of the reported findings.
We fully agree with the reviewer on the importance of reproducible evaluation results. We will release our model, the respective code, and the evaluation scripts to allow for the full verification of our benchmark results. The reproducibility of our findings will also be ensured by our submissions to the public GiftEval and Chronos-ZS leaderboards, which require the necessary code and model to validate the results.
We thank the reviewer for the opportunity to address their feedback. Given that we have now addressed the primary concerns (e.g., the evaluation results for the noted datasets and the commitment to a public release of our model), we hope the reviewer will consider adjusting the initial score accordingly.
References
[1] Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. ICLR 2023.
[2] Bergmeir, C. Fundamental Limitations of Foundational Forecasting Models: The Need for Multimodality and Rigorous Evaluation. In NeurIPS Workshop on Time Series in the Age of Large Models, 2024.
[3] Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. GIFT-eval: A benchmark for general time series forecasting model evaluation. In NeurIPS Workshop on Time Series in the Age of Large Models, 2024.
I thank the authors for their detailed responses, and I encourage them to incorporate the most significant experimental results into the main text. In addition, some more theoretical analysis should be included in the submission. Since most concerns are addressed, I would like to raise my score. Thanks.
This paper leverages the recently introduced xLSTM architecture to build a time-series foundation model and shows that it outperforms competing models in terms of both efficiency and forecast accuracy.
Strengths and Weaknesses
Strengths:
- The experiments are well designed, well thought out, and extensive, which lends credibility to the results.
- The improvements over existing zero-shot models (Chronos, TimesFM, etc.) are quite substantial, and the underlying architecture of TiRex is quite different from transformer-based models, which will help move the field forward.
- The paper is well written.
Weaknesses:
- There is very little theoretical analysis; it is not clear on a fundamental level why TiRex does better than other models, or what its potential failure modes are.
Questions
- The authors showed that TiRex can predict periodic spikes well, which is a common failure mode for other time-series foundation models. This is not surprising given the specific data augmentation techniques (e.g., spike injection) employed to train TiRex. But how well can TiRex generalize to situations not well covered in the training data? For example, what will TiRex predict when faced with aperiodic spike trains (e.g., from neuroscience)? It would be interesting to test TiRex on synthetic time series where different aspects of the signal can be controlled.
- What are some failure modes of TiRex that the authors have observed? Understanding these failure modes will help further improve the model.
- Did the authors find any interpretable strategies TiRex is using to make zero-shot forecasts?
Limitations
Yes
Final Justification
I appreciate the authors' response, especially concerning the potential theoretical advantages of TiRex. I wish the authors had included new data for some of my other questions, but this does not change the fact that this is a very good paper with a potentially significant contribution to the field of time series foundation models.
Formatting Issues
N/A
We thank the reviewer for their feedback and for positively highlighting the experimental design, the substantial performance improvements, the novelty of the architecture, and the quality of the writing. In the following, we address the reviewer's views on weaknesses and answer the questions by the reviewer, point by point:
- There is very little theoretical analysis; it is not clear on a fundamental level why TiRex does better than other models, or what its potential failure modes are.
We appreciate the reviewer's desire for a deeper theoretical understanding. While our paper focuses on the construction of the approach and its empirical validation with the respective ablations, we believe TiRex's superior performance can be grounded, at least partly, in the established theoretical properties of its backbone: as shown by [1,2,3], recurrent architectures like the sLSTM occupy a higher class in the formal-language hierarchy than transformers. Specifically, their state-tracking ability allows them to solve problems (e.g., parity, counting) that are provably beyond the reach of standard transformers (Figure 7). We hypothesize that this fundamental expressivity advantage allows TiRex to better model temporal dynamics, leading to better performance, especially across long horizons. We alluded to this aspect in the appendix, but the reviewer's comment shows that we did not make this connection clear enough; we will clarify it in the final version. Regarding failure modes, we address these in response to the second question below.
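To make the state-tracking argument concrete, below is a minimal illustrative sketch (not part of our implementation): parity of a bit stream requires only a single recurrent state bit that is updated at every step, independent of sequence length, exactly the kind of running state that recurrent cells such as the sLSTM maintain and that fixed-depth, log-precision transformers provably struggle to track for arbitrary lengths [2,3].

```python
# Minimal illustration of state-tracking: parity needs only one recurrent
# state bit, updated at each step, regardless of sequence length.
def parity(bits):
    state = 0
    for b in bits:
        state ^= b  # flip the one-bit state whenever a 1 is seen
    return state

assert parity([1, 0, 1, 1]) == 1      # three ones -> odd parity
assert parity([1, 0, 1, 1, 1]) == 0   # four ones -> even parity
```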
- The authors showed that TiRex can predict periodic spikes well, which is a common failure mode for other time-series foundation models. This is not surprising given the specific data augmentation techniques (e.g., spike injection) employed to train TiRex. But how well can TiRex generalize to situations not well covered in the training data? For example, what will TiRex predict when faced with aperiodic spike trains (e.g., from neuroscience)? It would be interesting to test TiRex on synthetic time series where different aspects of the signal can be controlled.
While our spike injection augmentation was indeed key, we designed it specifically to encourage generalization rather than memorization:
- Diverse Spike Features: The augmentation is more complex than simply adding a repeating spike. As detailed in Section 3 and Appendix B, we introduce significant diversity by randomizing:
- Spike Shape: We use multiple kernels (Tophat, RBF, Linear) to define the spike's shape.
- Spike Parameters: The width and amplitude of each spike are sampled from distributions.
- Temporal Structure: The periodicity itself is randomized for each sample, and we use a variety of underlying temporal patterns. The goal of this diverse training is to teach the model a general concept of "sharp, transient events" and to interpret them "in-context", rather than having it memorize a simple periodic pattern (a rough sketch of this augmentation is shown below).
We will extend the explanations in the main paper (Section 3) and Appendix B to make this more clear.
- Predictability of Aperiodic Spikes: A truly stochastic spike, with no causal link to the preceding signal, is by definition unpredictable from the context alone. In this case, TiRex would likely provide an estimate of the expectation with high uncertainty. However, if spikes are triggered by the system's underlying dynamics, we hypothesize that TiRex would be well suited to generalize, as the sLSTM's state-tracking might allow it to model the latent state leading up to an event, a potential advantage over architectures that may be limited to pattern matching.
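For concreteness, the following is a minimal sketch of such a spike injection augmentation. It is an illustrative simplification: the function name, kernel definitions, and parameter ranges are placeholders rather than our exact training configuration.

```python
# Illustrative sketch of spike injection: shape, width, amplitude, and
# periodicity are all randomized per sample to avoid memorizing one pattern.
import numpy as np

def inject_spikes(series: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    n = len(series)
    period = int(rng.integers(8, max(9, n // 4 + 1)))   # randomized periodicity
    width = int(rng.integers(1, max(2, period // 4)))   # randomized spike width
    amplitude = rng.uniform(1.0, 5.0) * (series.std() + 1e-8)
    kernel_name = rng.choice(["tophat", "rbf", "linear"])

    t = np.arange(-width, width + 1)
    if kernel_name == "tophat":
        kernel = np.ones_like(t, dtype=float)
    elif kernel_name == "rbf":
        kernel = np.exp(-0.5 * (t / max(width / 2.0, 1e-6)) ** 2)
    else:  # "linear": triangular ramp up and down
        kernel = 1.0 - np.abs(t) / (width + 1)

    augmented = series.astype(float)
    phase = int(rng.integers(0, period))
    for center in range(phase, n, period):
        lo, hi = max(0, center - width), min(n, center + width + 1)
        k_lo, k_hi = lo - (center - width), hi - (center - width)
        augmented[lo:hi] += amplitude * kernel[k_lo:k_hi]
    return augmented

# Example usage on a synthetic base signal:
rng = np.random.default_rng(0)
augmented = inject_spikes(np.sin(np.linspace(0, 20, 400)), rng)
```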
- What are some failure modes of TiRex that the authors have observed? Understanding these failure modes will help further improve the model.
While we did not observe any specific repetitive failure modes in the evaluation, two hypothetical ones are worth noting:
- As discussed in the limitations section, our current model treats multivariate time series as a collection of univariate ones. While this approach is strong, as reflected in the GiftEval results (8 of the 23 datasets are multivariate) and as pointed out in previous work [4], it may fail in scenarios where forecasting is impossible without explicitly modeling interdependencies between variates.
- As a zero-shot, in-context learning model, TiRex's performance depends on having a sufficient context of past values to identify the time series' properties. When faced with a very short context that does not exhibit the characteristic patterns of the series, its forecasts may be unsatisfactory, a natural limitation of this paradigm.
- Did the authors find any interpretable strategies TiRex is using to make zero-shot forecasts?
We did not find any specific interpretable strategies. However, we also did not explicitly search for them. It is likely that for specific forms of time series TiRex would use straightforward strategies to construct its forecast.
References
[1] Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M. K., Klambauer, G., Brandstetter, J., and Hochreiter, S. xLSTM: Extended Long Short-Term Memory. NeurIPS 2024.
[2] Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. TACL 2023
[3] Delétang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the Chomsky hierarchy. ICLR 2023.
[4] Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. ICLR 2023.
Thank you for your response. I especially appreciate your clarification concerning the potential theoretical advantages of TiRex. I do wish the authors had included new data when answering some of my other questions, such as the claim that "if spikes are triggered by the system's underlying dynamics, we hypothesize that TiRex would be well suited to generalize, as the sLSTM's state-tracking might allow it to model the latent state leading up to an event, a potential advantage over architectures that may be limited to pattern matching." However, this does not change the fact that this is a very good paper with a potentially significant contribution to the field of time series foundation models.
There has been lots of recent work on zero-shot time series forecasting with pre-trained time series foundation models. Most of these models are based on transformers, which often struggle with time-series data. On the other hand, LSTMs are excellent at state-tracking, making them a classic choice for time series, but they traditionally lack the strong in-context learning abilities of transformers. This paper aims to bridge this gap. The authors introduce TiRex, a new zero-shot forecasting model that combines the state-tracking strengths of LSTMs with powerful in-context learning. The model is built on xLSTM, an enhanced LSTM with strong in-context learning ability. To help address the quantile-collapsing problem, the model uses contiguous patch masking (CPM) to randomly mask consecutive patches during training. Multiple training data augmentation methods are proposed to improve the model's prediction ability. The model achieves state-of-the-art results, outperforming other foundation models such as TimesFM, Chronos, and Moirai.
Strengths and Weaknesses
The paper leverages a new model structure (xLSTM), a new training method (CPM), and better data augmentation (specifically censor augmentation, amplitude modulation, and spike injection) to improve state-of-the-art results by a significant amount (Figure 4). Among the three points, I believe 1 and 2 are new, and the specific data augmentation methods are also new, although I could be missing something here. In terms of impact, I think CPM is the most impactful for two reasons: 1) it addresses the collapsing-quantiles problem seen in other models such as Chronos Bolt and TimesFM; 2) looking at Table 1, not using CPM seems to lead to the largest performance loss.
Questions
No
Limitations
Yes
Final Justification
There was no discussion during the rebuttal, and I will maintain my recommendation of Accept.
Formatting Issues
No
We thank the reviewer for their careful summary of our work. We are glad they recognized the novelty of leveraging the xLSTM architecture and our proposed Contiguous Patch Masking (CPM) training strategy. We especially appreciate that the reviewer highlighted the impact of CPM in addressing the collapsing quantiles problem and its significant contribution to the model's overall performance.
I would like to thank the authors for the response.
The paper proposes a novel LSTM-based architecture for pre-trained time-series models that leverages the xLSTM architecture. The model performs well on zero-shot and few-shot forecasting tasks with a much smaller size and more efficient inference compared to other foundational time-series models.
Strengths and Weaknesses
Strengths
- The improvement in performance given the efficiency of the model is very significant.
- The architecture is quite novel for time-series models.
- The evaluation is well done, with comparisons against state-of-the-art foundation models and benchmark suites.
- The performance improvement is very significant, showcasing that innovation in the model backbone is important for foundational time-series models.
- Ablations on data augmentation and patching are very useful for practitioners.
Weaknesses
- It is an improvement over the xLSTM architecture, with additional novelty only in input patching and simple data augmentation.
Questions
- Can the model be used for other tasks like imputation, anomaly detection and classification?
- How does the performance vary with horizon length and frequency of time-series?
- Can the hidden states be interpreted to model different dependencies and explanations similar to attention modules?
Limitations
- See Weaknesses
- The model can only do forecasting and not other time-series analysis tasks like imputation, anomaly detection and classification
- It is not as interpretable as some simpler statistical models
Final Justification
The authors have provided acceptable answers to all the questions. The work is very innovative, with the ability to provide SOTA, generalizable performance across domains with low-parameter recurrent models. Therefore, I recommend acceptance.
Formatting Issues
No
We thank the reviewer for their feedback and for highlighting the significance of TiRex's performance improvements, its architectural novelty, the thoroughness of the evaluation, and the practical utility of the ablation studies. In the following, we address the reviewer's views on weaknesses and answer the questions by the reviewer, point by point:
- It is an improvement over the xLSTM architecture, with additional novelty only in input patching and simple data augmentation.
While the xLSTM backbone is indeed a foundational element, we would like to clarify our contributions, as our work introduces two other key components that are crucial for achieving the final performance.
- Contiguous Patch Masking (CPM): CPM combined with our "masked inference" mechanism addresses the challenge of autoregressive error accumulation in long-horizon forecasting, enabling stable and coherent predictions where previous models fail. This is in contrast to a naive multi-patch training approach, which we show harms short-term performance, and standard autoregressive inference, which degrades long-term performance (Table 1). A simplified sketch of this mechanism is shown after this list.
- Data Augmentations: The proposed online augmentations demonstrate their importance in improving the model's robustness and generalization capabilities (Table 1).
The synergistic combination of the xLSTM architecture, the CPM training strategy, and our specialized data augmentations, which together establish the TiRex model, therefore represents our main contribution.
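For concreteness, the sketch below illustrates the core idea of CPM and masked inference. It is a simplified illustration: the function names, mask probability, and block-length range are placeholders rather than our exact configuration.

```python
# Simplified sketch of Contiguous Patch Masking: during training, a randomly
# placed block of consecutive patches is hidden and must be predicted jointly;
# at inference, the entire forecast horizon is presented as one masked block,
# avoiding step-by-step autoregressive rollout and its error accumulation.
import torch

def contiguous_patch_mask(num_patches: int, max_block: int = 8,
                          p_block: float = 0.5) -> torch.Tensor:
    """Boolean mask over patches; True marks patches the model must predict."""
    mask = torch.zeros(num_patches, dtype=torch.bool)
    if torch.rand(1).item() < p_block:
        block_len = int(torch.randint(1, min(max_block, num_patches) + 1, (1,)).item())
        start = int(torch.randint(0, num_patches - block_len + 1, (1,)).item())
        mask[start:start + block_len] = True   # contiguous masked block
    else:
        mask[-1] = True                        # plain next-patch training
    return mask

def inference_mask(num_patches: int, horizon_patches: int) -> torch.Tensor:
    """Masked-inference analogue: the trailing horizon patches are masked."""
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[-horizon_patches:] = True
    return mask
```

During training, masked patches would be replaced by a learned placeholder embedding before being fed to the backbone, and the loss computed only on masked positions; this is what allows a single forward pass to produce multi-patch forecasts at inference time.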
- Can the model be used for other tasks like imputation, anomaly detection and classification?
While we did not analyze these tasks in the paper, we believe the model holds significant potential for them. Similar to how pre-trained models in other domains provide powerful feature extractors, the hidden representations of TiRex could be leveraged for various downstream tasks. The state-tracking capability of the sLSTM modules, in particular, may provide a rich, compressed representation of the time series dynamics, suitable for tasks like anomaly detection or classification. We will add this discussion to our future work section.
- How does the performance vary with horizon length and frequency of time-series?
Our evaluation shows that TiRex excels across various horizon lengths and frequencies.
- Horizon Length: Figures 9, 10, and 11 in the appendix provide a detailed performance breakdown on the GiftEval-ZS benchmark for long-term (600-900 steps), medium-term (100-600 steps), and short-term (< 100 steps) forecasts, respectively. TiRex achieves the top rank in all three categories, with its most significant performance gains observed in long- and medium-term forecasting.
- Frequency: The GiftEval and Chronos-ZS benchmarks were chosen specifically for their diversity, hence also covering different frequencies (as detailed in Tables 7 and 8). TiRex's number one overall rank on both benchmarks (Figures 4 and 5) demonstrates its strong and robust performance across these different time scales. For GiftEval, TiRex is the top-performing model on daily, hourly, minute-level, and second-level datasets, and among the top three on weekly, monthly, and quarterly datasets.
- Can the hidden states be interpreted to model different dependencies and explanations similar to attention modules?
While a detailed interpretability study was beyond the scope of this work, the architecture of TiRex is promising for such analysis. Unlike transformer models, which rely on attention maps for interpretation, TiRex utilizes sLSTM modules with explicit recurrent state vectors. We hypothesize that this state vector acts as a compressed summary of the time series history. Analyzing these state dynamics could offer a powerful and potentially more direct way to understand the model's forecasting strategy in future research.
Dear Reviewers,
Thank you all for your thoughtful reviews.
As we begin the discussion phase, I’d like to kindly ask whether you’ve had a chance to read the authors’ rebuttal. If so, please feel free to add any comments as soon as possible, whether the response addresses your concerns, raises further questions, or leaves issues unresolved.
Engaging in the discussion not only helps ensure a thorough and fair evaluation but also signals to the authors that their rebuttal has been read and considered. Your input at this stage is greatly appreciated and will contribute meaningfully to the final decision.
The paper introduces TiRex, a zero-shot time-series forecasting foundation model that builds on xLSTM and adds two key ingredients: Contiguous Patch Masking (CPM) and a set of online data augmentations (censor augmentation, amplitude modulation, spike injection). The central claims are:
- TiRex (∼35M params) achieves SOTA zero-shot performance on GiftEval-ZS and Chronos-ZS, outperforming substantially larger baselines (e.g., TimesFM, Chronos/Bolt, Moirai) while offering faster inference.
- CPM + masked inference mitigate autoregressive error accumulation / quantile collapse, improving medium/long-horizon forecasts.
- Strong results across horizons and frequencies, and univariate modeling still excels on several multivariate datasets within GiftEval.
- Removing CPM produces the largest degradation, while augmentations and patching also matter.
Strengths:
- Improvements are consistent and sizable across two rigorous, standardized benchmarks.
- CPM is simple and effective, and ablations isolate its contribution to stabilizing multi-patch prediction and combating quantile collapse.
- Horizon-specific and frequency-specific analyses are useful ablations for practitioners.
- Smaller model + faster inference are valuable for deployment.
- The paper is well written, and the problem framing and design choices are easy to follow.
Weaknesses:
- Advances largely combine a new backbone (xLSTM) with CPM and simple augmentations. While the empirical contribution is strong, the methodological novelty is incremental beyond xLSTM.
- The paper lacks formal analyses for why CPM + xLSTM reduce quantile collapse or improve long-horizon stability.
- No systematic interpretability study (e.g., state dynamics inspection) to illuminate failure modes or decision rationales.
- The spike results are compelling, but stress testing on aperiodic/synthetic spikes would strengthen generalization claims.
- Focuses on univariate modeling while broader multivariate modeling is deferred.
Primary reasons:
- Strong, consistent empirical gains over established zero-shot foundation models on standardized, diverse benchmarks, using a compact recurrent architecture with clear speed advantages.
- Practical, reproducible training ideas that others can adopt.
These outweigh the concerns about limited theoretical analysis and incremental novelty beyond xLSTM. However, given those concerns, I recommend Poster. The paper stands out by demonstrating that carefully trained recurrent backbones can outperform much larger transformer-based TFM baselines in zero-shot forecasting, with an ablation-supported training mechanism that appears broadly useful.
Discussions:
- z91z (final: Accept/5): Praised efficiency and performance; initially questioned whether novelty beyond xLSTM reduces to patching + simple augmentations. After rebuttal, maintained Accept.
- ru3x (final: Accept/5): Viewed CPM as the most impactful element. Minor discussion; kept Accept.
- yiaE (final: Accept/5): Asked for theory, unseen-pattern generalization (aperiodic spikes), and failure modes/interpretability. Authors pointed to recurrent expressivity results and the diversity of the spike augmentation. No new data added, but the reviewer still found the paper very good and kept Accept.
- Hidj (final: Weak Accept/raised score): Requested theoretical guarantees and noted missing classic datasets; authors clarified those datasets are included via GiftEval/Chronos-ZS and committed to code release. Reviewer increased score to Weak Accept.
Weighing: Three Accepts and one Weak Accept after rebuttal. The novelty and theory reservations remain, but the empirical strength, ablation evidence, efficiency gains, and clarity justify acceptance.