PaperHub
Overall rating: 6.8/10
Poster · 4 reviewers
Ratings: 6, 6, 8, 7 (min 6, max 8, std 0.8)
Confidence: 3.8
Correctness: 3.0
Contribution: 3.3
Presentation: 2.8
NeurIPS 2024

Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Fast and Light-weight Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series

Abstract

Keywords
time series, foundation models, pretrained models, forecasting, time-series, tsfm, light-weight, forecasters

Reviews and Discussion

Review (Rating: 6)

It presents a lightweight pretrained model for TS with strong performance.

Strengths

Overall, this is a strong paper with extensive experimental work.

Weaknesses

Adaptive patching is a well-conceived design, although I am unclear why it is termed "adaptive" when the design appears to be fixed and pre-set. Perhaps "multiscale patching" would be a more accurate descriptor. Additionally, the writing should more clearly distinguish between the designs used for pretraining and those used solely for finetuning.

For the full-shot head probing, it is important to include comparisons with state-of-the-art end-to-end methods, as readers are likely interested in such comparisons.

Questions

See the weaknesses above.

Limitations

not applicable

Author Response

Thank you, reviewer, for the constructive feedback. Please find our responses below.

Q1: Adaptive patching is a well-conceived design, although I am unclear why it is termed "adaptive" when the design appears to be fixed and pre-set. Perhaps "multiscale patching" would be a more accurate descriptor.

Yes, we agree with the suggestion to rename it to multiscale patching. We initially used the term "adaptive patching" to indicate that the patch length changes across layers. However, since this change is not a form of runtime adaptation, renaming it to multiscale patching is indeed more accurate.
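As an illustration of the idea only (not the paper's actual implementation), the sketch below uses a fixed, layer-dependent patch-length schedule; the function name, schedule, and array shapes are assumptions made for this example:

```python
import numpy as np

def multiscale_patch(x, patch_lengths=(64, 32, 16, 8)):
    """Split a context window into non-overlapping patches whose length
    shrinks at each successive layer (a fixed, pre-set schedule, hence
    'multiscale' rather than runtime-adaptive)."""
    views = []
    for layer_idx, p in enumerate(patch_lengths):
        n_patches = len(x) // p
        # reshape the (possibly truncated) series into (n_patches, p)
        patches = np.asarray(x[: n_patches * p]).reshape(n_patches, p)
        views.append((layer_idx, patches))
    return views

# Example: a context window of 512 time points
context = np.sin(np.linspace(0, 20 * np.pi, 512))
for layer_idx, patches in multiscale_patch(context):
    print(f"layer {layer_idx}: {patches.shape[0]} patches of length {patches.shape[1]}")
```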

Q2: Additionally, the writing should more clearly distinguish between the designs used for pretraining and those used solely for finetuning.

Sure. All the multivariate and exogenous-related components are part of the fine-tuning flow, while the rest of the components are used in both pretraining and fine-tuning. We will clearly distinguish between the designs used for pretraining and those used solely for fine-tuning in the revised manuscript.

Q3: For the full-shot head probing, it is important to include comparisons with state-of-the-art end-to-end methods, as readers are likely interested in such comparisons.

Due to space constraints, we could not include the full set of end-to-end methods in the main results section. However, these results are available in the appendix (Table 14), where TTM with head probing (HP) consistently outperforms other HP benchmarks and is also superior to, or very competitive with, full end-to-end training of popular TS architectures. This demonstrates that TTM with simple head probing is both highly effective and extremely lightweight, as it avoids the need to update the backbone weights. We will move this appendix table to the main results section in the revised manuscript.

Comment

Thank you for the rebuttal. It's an interesting paper; I will maintain my rating.

Review (Rating: 6)

This paper introduces Tiny Time Mixers (TTM), a new compact pre-trained model for efficient zero/few-shot multivariate time series forecasting. TTM is based on the lightweight TSMixer [1] architecture and incorporates several innovations, which enable effective pre-training on varied dataset resolutions with minimal model capacity. Besides, TTM also employs multi-level modeling to capture channel correlations and fuses exogenous data into the forecasting process during fine-tuning. The authors comprehensively evaluate TTM on multiple datasets and compare its performance with other state-of-the-art models. The results highlight TTM's superior accuracy, computational efficiency, and lightweight nature.

Strengths

  1. Enhanced Zero/Few-shot Forecasting Performance: TTMs demonstrate a substantial improvement in zero/few-shot forecasting capabilities, outperforming existing benchmarks by 4-40%. This advancement is particularly notable as it shows that smaller models can achieve high accuracy without the extensive computational resources typically required by larger models.
  2. Specialized Techniques for Pre-training and Fine-tuning: The paper presents novel techniques such as adaptive patching, diverse resolution sampling, and resolution prefix tuning. These innovations enable robust pre-training on heterogeneous datasets and facilitate effective fine-tuning, allowing TTMs to capture channel correlations and integrate exogenous signals, which is critical for accurate multivariate forecasting.
  3. Innovation in Model Architecture: The paper introduces Tiny Time Mixers (TTMs), a compact pre-trained model architecture tailored for multivariate time series forecasting. Starting from just 1 million parameters, TTMs offer a lightweight and efficient alternative to larger, more computationally intensive models. This addresses the need for fast and resource-friendly forecasting tools.

Weaknesses

  1. The base model TSMixer used for pre-training is not newly proposed.
  2. TTM necessitates training different models for different context and forecast lengths, posing some new weaknesses compared to Transformer-based pre-trained models. While this paper presents a forecast length adaptation strategy to handle the fixed forecast length problem, performance loss still emerges when there is a disparity between the model's forecast length and the actual forecast length.
  3. There are not sufficient experiments to establish a scaling law for the pre-trained models. The paper's results show that TTM performs best with 5M parameters. Would TTM perform better with a larger number of parameters? Showing the tradeoff between performance and cost as the parameter count increases would facilitate understanding of the model and help choose the most suitable model size.
  4. Some parts of TTM lack ablation studies to sufficiently prove their effectiveness, such as decomposing the exogenous mixer module and decoder channel-mixing.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Thank you, reviewer, for the constructive feedback. Please find our responses below.

Q1: The base model TSMixer used for pre-training is not newly proposed.

Yes, you are correct that TTM builds on the TSMixer foundation. However, TSMixer doesn't discuss the techniques required to construct a pre-trained model with effective transfer learning capabilities. To achieve this, we have introduced several innovative components on top of TSMixer, such as adaptive patching, diverse resolution sampling, and resolution prefix tuning. These enhancements are crucial for effectively handling pre-training across datasets with varying resolutions, while maintaining minimal model size. Section 4.7 details the positive impact of these techniques through comprehensive ablation studies. With the help of these novel components and the learned pre-trained weights, we outperform TSMixer by 15% in the few-shot setting, as highlighted in Table 4.

Q2: TTM necessitates to train different models for different context length and forecast length, posing some new weaknesses compared to Transformer-based pre-training models. While this paper presents forecast length adaptation strategy to handle the fixed forecast length problem, performance loss still emerges when there is a disparity between the model's forecast length and the actual forecast length.

Recent advances in language and vision have seen a shift towards adopting small, focused foundation models over large, all-purpose ones (SLMs vs. LLMs), given the ease and effectiveness of small focused models for production deployments and their high task-specific accuracy. We have applied a similar strategy to time-series foundation models by designing TTMs that are small, focused, and tailored for specific forecasting contexts. These models are particularly well-suited for production enterprise applications, which often require minimal GPU resources for deployment and enable rapid fine-tuning without the risk of overfitting, challenges that are more pronounced with massive transformer models.

In addition, pretraining a TTM is computationally inexpensive and can be completed in less than a day, notably faster than existing counterparts, which often take several days to weeks. Hence, pre-training multiple TTMs poses no practical challenges and can easily be achieved. We would like to highlight that, as part of this paper, we also plan to release and open-source a few pre-trained TTMs with different forecasting contexts that widely cover most common enterprise use cases.

In addition, we support several forecast length adaptation techniques to adapt a pre-trained model to different forecast lengths with minimal accuracy impact. In particular, the pruning technique has been found to be very effective (e.g., only a 0.8% drop in MSE when more than 50% pruning is applied to reduce the forecast length from 720 to 336). Please see Figure 4 for details.
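As a rough illustration of the adaptation idea (the exact pruning mechanism is described in the paper and Figure 4; this sketch only shows the effect of keeping a shorter prefix of the forecast horizon, and the names and shapes are assumptions):

```python
import numpy as np

def truncate_forecast(full_forecast, target_len):
    """Keep only the first `target_len` steps of a forecast produced by a
    model pre-trained for a longer horizon (e.g., 720 -> 336); the unused
    part of the forecast head could likewise be dropped to shrink the model."""
    return full_forecast[..., :target_len, :]  # assumes (..., horizon, channels)

y_hat = np.zeros((32, 720, 7))               # (batch, horizon, channels)
print(truncate_forecast(y_hat, 336).shape)   # (32, 336, 7)
```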

Q3: There are not sufficient experiments to prove the scaling law on pre-training models. The paper's results show that the TTM performs best with 5M parameters. Will the TTM perform better with a larger number of parameters? Showing the tradeoff between performance and cost along with parameters size increasing will facilitate understanding of the model and help choose the most suitable model size.

The scaling laws of TTMs are influenced by the amount of pretraining data used. When training TTM on the Monash data alone (~250M samples), model accuracy saturates beyond 1M parameters (this represents the TTM-Quick model referred to in the paper). However, when the pretraining data is expanded to 1B samples by integrating Monash with additional data sources, accuracy improvements continue with increased model size, reaching saturation around 5M parameters (the TTM-Advance model referred to in the paper). Further increasing the model size beyond 5M did not provide additional benefits for the 1B pretraining dataset. We will include these findings in a new section on scaling laws covering various model sizes. Thank you for bringing up this important aspect.

Q4: Some parts of TTM lack ablation studies to sufficiently prove their effectiveness, such as decomposing the exogenous mixer module and decoder channel-mixing.

In general, the Exogenous Mixer and Decoder Channel-Mixing components are designed to be used together. The Exogenous Mixer captures channel correlations in the forecasts, while the Decoder Channel-Mixing captures these correlations in the past context. Using both components together allows the model to learn channel correlations from both forecasts and past contexts, providing a comprehensive view. Running one without the other would lose channel-correlation information. Therefore, in the ablation study, we reported them as a single component.

Comment

Thanks for the response addressing my concerns. I have decided to raise my score.

Review (Rating: 8)

The paper introduces Tiny Time Mixers (TTMs), a series of pre-trained models designed for efficient zero/few-shot multivariate time series forecasting. TTMs are built on the lightweight TSMixer architecture and incorporate innovations such as adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle varied dataset resolutions with minimal model capacity. These models outperform existing benchmarks in accuracy while significantly reducing computational requirements. The empirical studies demonstrate superior performance across multiple tasks and datasets.

Strengths

  1. The paper introduces innovative solutions to overcome the limitations of large pre-trained models in time series forecasting, offering techniques that could potentially be adapted to enhance other time series forecasting models as well.
  2. The empirical studies are robust, thoroughly assessing the model's accuracy and efficiency across multiple benchmark datasets.
  3. The experimental results are impressive, demonstrating enhanced accuracy and substantially lower computational demands compared to existing methods.
  4. The ablation studies are thorough, providing a detailed analysis of the impact of different pre-training datasets and the effectiveness of the proposed training techniques.

Weaknesses

I did not see significant weaknesses that need to be addressed in this paper. As a minor suggestion, given that there is another mixer architecture known as TSMixer[1], it would be beneficial to include a footnote or a mention in the appendix to clarify that the TSMixer referenced in this work differs from the other one. The clarification will help avoid potential confusion and ensure the distinctiveness of the models is clearly understood.

[1] Chen, Si-An, et al. "TSMixer: An all-MLP architecture for time series forecasting." arXiv preprint arXiv:2303.06053 (2023).

Questions

  1. Given that the pre-training datasets may include time series with varying input lengths, how do TTMs manage these differences during training? Additionally, how do TTMs handle time series with input lengths not encountered before during inference?
  2. Considering that resolution prefix embeddings are learned discretely, how do TTMs accommodate resolutions that were not seen during training when encountered during inference?
  3. What criteria were used to determine the patch size for lagged exogenous features in the TTMs?
  4. The training and evaluation of TTMs primarily focus on long time series with hundreds of input steps. Does this focus impact the performance of TTMs when applied to shorter time series, such as those with fewer than 30 time steps?

Limitations

The limitations regarding the restricted number of downstream tasks are discussed in the appendix.

Author Response

Thank you, reviewer, for the constructive feedback. Please find our responses below.

Q1: Adding a clarification note on the TSMixer architecture

Thank you for the feedback. Sure, we will add a clarification that the TSMixer used in this paper differs from the other architecture of the same name.

Q2: Given that the pre-training datasets may include time series with varying input lengths, how do TTMs manage these differences during training?

During pretraining, we apply a sliding window to every pretraining dataset to convert it into several windows based on the TTM's context and forecast lengths, so that the model can be trained on all the windows. When even a single window cannot be created because the training data is extremely small, the dataset is skipped during pretraining. Appendix Table 8 lists the datasets that were NOT skipped in this filtering step.
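For illustration, a minimal sketch of such a sliding-window conversion, with hypothetical function and parameter names (not the paper's code):

```python
import numpy as np

def make_windows(series, context_len=512, forecast_len=96, stride=1):
    """Slide a (context, forecast) window over one pretraining series.
    Series too short to yield even a single window produce an empty list
    and would be skipped."""
    series = np.asarray(series)
    total = context_len + forecast_len
    windows = []
    for start in range(0, len(series) - total + 1, stride):
        past = series[start : start + context_len]
        future = series[start + context_len : start + total]
        windows.append((past, future))
    return windows

samples = make_windows(np.arange(1000.0), context_len=512, forecast_len=96)
print(len(samples), "training windows")  # 393
```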

Q3: How do TTMs handle time series with input lengths not encountered before during inference?

During inference, if the user-provided input length is greater than the TTM-configured context length (K), we use the last K time points as the input context. On the other hand, if the user-provided length is shorter than the configured TTM length, we have two options: if only minor adjustments are needed, we can prepend zeros to virtually extend the length; for major adjustments, it is preferable to quickly pre-train another TTM with the required shorter length for enhanced accuracy. Note that pre-training a shorter-context TTM is computationally inexpensive and can be completed in a matter of a few hours. Even pretraining a TTM on very long context lengths can be achieved in less than a day, notably faster than existing counterparts, which often take several days to weeks. As part of this paper, we will release and open-source a few pre-trained TTMs with different forecasting contexts that widely cover most common enterprise use cases across industries.
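A minimal sketch of this length adjustment, assuming a univariate numpy series and hypothetical names:

```python
import numpy as np

def adjust_context(x, model_context_len):
    """Fit a user-provided history to the model's configured context length K:
    keep the last K points if the input is longer, or prepend zeros if it is
    only slightly shorter (for large mismatches, pre-training a shorter-context
    TTM is preferable)."""
    x = np.asarray(x)
    if len(x) >= model_context_len:
        return x[-model_context_len:]            # use the most recent K points
    pad = np.zeros(model_context_len - len(x))   # virtually extend with zeros
    return np.concatenate([pad, x])

print(adjust_context(np.arange(600.0), 512).shape)  # (512,)
print(adjust_context(np.arange(500.0), 512).shape)  # (512,)
```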

Q4: Considering that resolution prefix embeddings are learned discretely, how do TTMs accommodate resolutions that were not seen during training when encountered during inference?

If we encounter unseen resolutions, we recommend that the user either use the Out-of-Vocabulary (OOV) token configured as part of the pre-trained model to accommodate these unseen resolutions, or use the TTM model pre-trained without the Resolution Prefix Tuning (RPT) module.
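A minimal sketch of the OOV fallback for a resolution-prefix lookup; the vocabulary, embedding size, and names here are purely illustrative assumptions:

```python
import numpy as np

# Hypothetical resolution vocabulary; index 0 is reserved for the OOV token.
RESOLUTION_VOCAB = {"<OOV>": 0, "10min": 1, "hourly": 2, "daily": 3, "weekly": 4}
rng = np.random.default_rng(0)
prefix_embeddings = rng.normal(size=(len(RESOLUTION_VOCAB), 16))  # (vocab, d_model)

def resolution_prefix(resolution):
    """Look up the prefix embedding; unseen resolutions fall back to OOV."""
    idx = RESOLUTION_VOCAB.get(resolution, RESOLUTION_VOCAB["<OOV>"])
    return prefix_embeddings[idx]

print(resolution_prefix("hourly").shape)  # seen resolution -> (16,)
print(resolution_prefix("15min").shape)   # unseen -> OOV embedding, (16,)
```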

Q5: What criteria were used to determine the patch size for lagged exogenous features in the TTMs?

The patch size for lagged exogenous features is mostly a hyperparameter and depends on the target dataset's characteristics. Please note that the Exogenous Mixer block is introduced only during the fine-tuning phase (on the target data), where these parameters can easily be configured based on the target data characteristics. In general, a patch length of 3 or more is suggested so that we have at least one forward and one backward lag across all channels for effective inter-channel correlation modelling of exogenous signals.
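For illustration only, a sketch of how patches with at least one backward and one forward lag could be gathered for exogenous channels; the function name, padding choice, and shapes are assumptions, not the paper's implementation:

```python
import numpy as np

def lagged_exog_patches(exog, patch_len=3):
    """For each time step t of the horizon, gather a patch of exogenous values
    centred on t; patch_len >= 3 gives at least one backward and one forward
    lag per channel."""
    T, C = exog.shape
    half = patch_len // 2
    padded = np.pad(exog, ((half, half), (0, 0)), mode="edge")  # pad the time axis
    return np.stack([padded[t : t + patch_len] for t in range(T)])  # (T, patch_len, C)

exog = np.random.default_rng(1).normal(size=(96, 4))  # 96-step horizon, 4 exogenous channels
print(lagged_exog_patches(exog).shape)                # (96, 3, 4)
```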

Q6: The training and evaluation of TTMs primarily focus on long time series with hundreds of input steps. Does this focus impact the performance of TTMs when applied to shorter time series, such as those with fewer than 30 time steps?

In general, TTM is pre-trained with longer context lengths (512, 1024, and 1536), as a longer context naturally gives the model more information to learn from and enables transfer learning. However, TTM also performs well with shorter context lengths. For example, the APP, SER, and CC data in our benchmarks require a shorter context length of 96, and a TTM pretrained with context length 96 works well and outperforms other benchmarks by a good margin (Table 6).

Comment

Thank you for addressing my concerns. I'm satisfied with the answers and would like to keep my score as a strong acceptance.

Review (Rating: 7)

This paper proposes a novel pre-trained time-series model, TTM, which, instead of over-parameterizing the model, under-parameterizes it for better generalization ability. The model architecture is simple, straightforward, and easy to understand, and comes with advanced training strategies such as adaptive patching, diverse resolution sampling, etc.

Empirical evaluations show the benefit of TTM both in state-of-the-art forecasting performance compared with other supervised, zero-shot, and pre-trained methods, and in its high training efficiency.

Strengths

  1. The idea of using an under-parameterized pre-trained model (1M parameters in TTM compared to millions or billions of parameters in other pre-trained time-series models) for better generalization ability seems to be novel and promising.

  2. The training strategies are clearly presented, the evaluations are comprehensive, and the results are impressive.

Weaknesses

  1. While the idea of using far fewer model parameters can work as presented in this paper, it is counter-intuitive, as currently popular pre-trained models tend to be much larger than the proposed TTM. Beyond the empirical results, the paper does not go into the deeper intuition of why this approach generalizes well, making the idea less convincing.

  2. Presentation could be improved; currently the figures and tables are too crowded.

Questions

  1. Following the point about weakness, could the author intuitively explain why such few parameters can work well for time-series forecasting tasks?

  2. As the model is under-parameterized, my worry is that TTM could be a garbage-in, garbage-out model (take it easy, I am not saying this is bad, I am just very curious). Here is my concern:

Let us make a comparison here: the proposed TTM is like a linear regression, very simple, while other over-parameterized models are like a polynomial/kernel regression. We all know that linear regression has less capability than polynomial regression for interpolation. For extrapolation, if the extrapolation data still follows the distribution of the interpolation data, then we can still expect polynomials to work better than linear models. However, the reality is that in most time-series cases the extrapolation data does not follow the interpolation distribution (the non-stationarity problem), so both linear and polynomial models can make mistakes. Yet because a linear model is less 'aggressive' in predicting trend changes due to its limited capacity, it makes fewer wrong predictions than polynomials in many cases.

That is, the datasets used in this work, such as ETT, are highly non-stationary, i.e., the data distribution can change heavily over time. I would want to see two things:

First, I want to see whether TTM can still show trend information in long-term forecasting. I wonder whether the outperformance comes from TTM making 'conservative' and simple predictions that sacrifice trend information because of the under-parameterization, compared to other pre-trained models.

Second, I want to see some evaluations on some very easy stationary signals, such as sin/cos wave, and see that TTM can still work better than other pre-trained models.

  3. Does the training process involve any synthetic data, such as simple sin/cos waves, which is common practice in training other pre-trained models?

Limitations

Please refer to the Weaknesses and Questions

Author Response

Thank you, reviewer, for the constructive feedback. Please find our responses below:

Q1: Could the author intuitively explain why such few parameters can work well for time-series forecasting tasks?

There are three important design choices of TTM that greatly enhance its forecasting accuracy despite its extremely small model capacity:

  1. All existing pre-trained models use a very high volume of pretraining data (TimesFM used ~300B samples and Moirai used 27B), hence they naturally require massive model sizes. However, as shown in Figure 3, Section 4.7, we observe that "limited" pretraining data with "high resolution diversity" greatly helps time-series model generalization, as opposed to simply increasing the pretraining data size. This is an important finding for the community: resolution diversity in pretraining data is crucial for time-series FMs. Based on these findings, we proceed with a much smaller dataset (1B samples) with high resolution diversity, which naturally reduces our model size compared to counterparts that need to pretrain on several hundred billion time-series samples. We introduce high diversity in our data via the Diverse Resolution Sampling (DRS) technique, which our counterparts do not.

  2. Secondly, we opted for TSMixer-based models instead of transformer-based models, which further reduced the model size drastically. The TSMixer architecture has previously established that interleaving simple gated attentions with mixing components across patches, channels, and features greatly enhances forecasting accuracy with very limited model capacity, as the quadratic time complexity of self-attention can be entirely avoided. After TSMixer, several other mixer architectures have been published, reiterating the power of these simple designs. Thus, avoiding complex transformer architectures further reduced our model size significantly.

  3. Finally, we further increased the modeling power of TSMixer without drastically increasing its size by introducing several innovative components, such as adaptive (multiscale) patching, diverse resolution sampling, and resolution prefix tuning. These enhancements are crucial for effectively handling large-scale pre-training across datasets with varying resolutions, all while keeping the model capacity minimal.

Through these three design choices, we managed to keep TTM as small as possible while surpassing state-of-the-art accuracy.

Q2: Does TTM predict trends and seasonality well, or is it just making conservative and simple predictions?

Yes. Thanks for raising this important concern. Kindly refer to the PDF attached in the common rebuttal section, where we have shared several zero-shot forecasting samples of TTM on various datasets, showcasing its ability to model trends and complex seasonal patterns (both real-world and synthetic sin/cos, as requested).

Q3: Does the training process involve using any synthetic data?

In the current pre-trained models, we do not use any synthetic data. However, we augment the data via the Diverse Resolution Sampling (DRS) technique to create different resolutions of existing datasets, which greatly enhances model performance (as shown in Figure 3, Section 4.7).
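A minimal sketch of the resolution-augmentation idea (mean-pooling a series down to coarser resolutions); the function name, factors, and pooling choice are illustrative assumptions rather than the exact DRS procedure:

```python
import numpy as np

def diverse_resolution_sampling(series, factors=(2, 4, 8)):
    """Derive coarser-resolution variants of a series by mean-pooling blocks of
    consecutive points (e.g., minutely -> 2-minutely -> ...), yielding extra
    pretraining series at new resolutions."""
    series = np.asarray(series, dtype=float)
    variants = {1: series}
    for f in factors:
        n = (len(series) // f) * f
        variants[f] = series[:n].reshape(-1, f).mean(axis=1)
    return variants

base = np.sin(np.linspace(0, 50 * np.pi, 10_000))
for factor, s in diverse_resolution_sampling(base).items():
    print(f"downsample x{factor}: {len(s)} points")
```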

Q4: Presentation may be improved, currently the presentation of figures and tables is too crowded.

Thanks for the feedback. We will address this in the final manuscript.

Comment

I appreciate the authors’ additional results addressing my concern.

The showcase plots generally meet my expectations: while the model accurately predicts the overall trend, it misses most residual details. That is, it makes conservative predictions.

Nevertheless, I am impressed by the model's performance in capturing the overall trend (seasonal trend as said by the authors) given that it uses only a few parameters.

Based on the provided results, I have adjusted the scores accordingly.

Author Response

Thank you, reviewers, for your time, effort, and valuable feedback on our paper. We have clarified your queries in the respective sections. We also extend our gratitude to the Area Chairs and all the PC members for investing their valuable time throughout the review process.

Short summary: In 2024, the landscape of time-series forecasting has been dominated by large pre-trained models. Large Transformer-based pre-trained models like TimesFM, Moirai, and Moment, published at ICML 2024, have garnered significant attention within the time-series community. However, these models are massive, often requiring several hundred million parameters, and they face challenges in supporting faster runtime, quick fine-tuning, and integrating exogenous data. In contrast, our study introduces several innovative design modifications and enhancements to traditional model architectures, resulting in an exceptionally small model, starting from just 1 million parameters, that outperforms existing state-of-the-art models by a notable margin. Moreover, our model offers several other practical advantages such as faster inference, fine-tuning, explainability, exogenous data infusion, and compatibility with CPU deployments, features that are highly valued in industrial forecasting but often lacking in current SOTA models. TTM seeds the first effort towards building light-weight pre-trained models for time-series forecasting, and we believe that our model will inspire numerous exciting research endeavours in this area.

We have clarified all the reviewer’s queries below in the respective sections. Thank you.

Final Decision

The paper presents TTMs, small, pre-trained time-series models. The architecture is based on TSMixer but with a number of improvements to enable the models to transfer well (both zero-shot and finetuning). The performance, both in terms of forecasting quality and latency appear impressive. The reviewers unanimously recommend acceptance, and I agree with this assessment. Congratulations to the authors!