PaperHub
7.8/10 · Oral
3 reviewers (ratings: 4, 4, 4; min 4, max 4, std 0.0)
ICML 2025

Sundial: A Family of Highly Capable Time Series Foundation Models

OpenReview · PDF
Submitted: 2025-01-18 · Updated: 2025-07-24
TL;DR

We introduce Sundial, a family of native, flexible, and scalable time series foundation models pre-trained on a trillion time points.

Abstract

Keywords
Time Series, Foundation Models

Reviews and Discussion

Official Review
Rating: 4

The paper presents a collection of foundation models for time series. To this end, the authors propose a loss called TimeFlow for predicting the distribution of the next patch, enabling Transformers to be pre-trained without the need for discrete tokenization. It is argued that this loss function helps prevent mode collapse. The pre-training is conducted on a dataset called TimeBench.

Update after rebuttal

Following the rebuttal and the authors' feedback, I have increased my scores accordingly.

Questions for the Authors

none

Claims and Evidence

Yes, they are clear and convincing.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

There are no theoretical claims.

Experimental Design and Analysis

The experimental designs and the conducted analysis seem to be acceptable.

Supplementary Material

I skimmed through the supplementary material.

Relation to Existing Literature

I find the idea interesting and the results pertinent.

Essential References Not Discussed

The following paper is not referenced in the submitted paper, even though it achieves high-quality image generation without the need for discrete tokenization, thus having some connections to the proposed method: Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.

Other Strengths and Weaknesses

It is not clear why the authors describe the large-scale TimeBench dataset as “unprecedented”. Moreover, it would have been better to clearly describe this dataset and how it was curated; the description in the paper and appendix is limited to half a column, and the paper would have benefited from a clear, in-depth description of the dataset. It is also unclear how much the dataset influences the obtained results: does this dataset give a substantial advantage compared to other methods that used smaller datasets?

Following the rebuttal and the authors' feedback, I have increased my scores accordingly.

Other Comments or Suggestions

none

Author Response

Many thanks to Reviewer 4SCU for providing a valuable review.

Q1: Include the relevant citation (Li et al. 2024).

We referred to it in Section 5.3, where we provided a comparison using different training objectives: MSE loss, Diffusion loss (Li et al., 2024), and TimeFlow loss (ours). Performance using Diffusion loss is notably inferior to TimeFlow loss. The different generative frameworks and target modalities distinguish our work from that previous work.

Q2: Detailed description of the TimeBench dataset.

Thanks for your valuable suggestion regarding the dataset description. We consider TimeBench an unprecedented large-scale dataset since most previous time series datasets remain at the million level; the largest pre-training time series dataset before our work contained 300B time points (Shi et al., 2024). TimeBench is the first to extend the scale to the trillion level, following a streamlined curation:

  • Collection: Unlike other modalities, time series data are highly heterogeneous and confidential; most are unavailable on open websites or repositories. Only a limited number of domains encompass typical and predictable time series, such as weather, traffic, marketing, and energy, leading to slow progress in dataset construction.
  • Preprocessing: Due to common device faults, it is difficult to pre-train large models on raw time series. We conducted extensive preprocessing, including missing-value imputation, outlier exclusion, and normalization.
  • Quality control: We conduct statistical analysis of our collection, examining time series through the lens of intrinsic properties, e.g., non-stationarity, forecastability, and seasonality. This allows us to characterize the data quality inherent to time series, which affects the training stability of next-token prediction.
  • Diversity and generality: We adopt synthetic techniques to improve pattern diversity and the capability of seasonal/trend forecasting. Further, we adopt ERA5, which provides well-defined and systematic real-world temporal observations.

We will provide a clear and detailed description of TimeBench (See also Q4 of Reviewer VkQC) in the final revision.

Q3: Does TimeBench have a great influence on the results compared to other methods that consider smaller datasets?

We compare Sundial with other time series foundation models that are pre-trained with smaller datasets. We also conduct pre-training on Sundial using different scales of datasets (Chronos-94B, LoTSA-230B, TimeBench-1032B). These results highlight the scaling behavior of using larger datasets.

| Zero-Shot (MSE / MAE) | Chronos (94B) | Moirai (230B) | Time-MoE (300B) | Sundial (94B) | Sundial (230B) | Sundial (1032B) |
|---|---|---|---|---|---|---|
| ETTh1 | 0.591 / 0.468 | 0.417 / 0.419 | 0.400 / 0.424 | 0.402 / 0.429 | 0.403 / 0.419 | 0.411 / 0.434 |
| ETTh2 | 0.405 / 0.410 | 0.362 / 0.382 | 0.366 / 0.404 | 0.377 / 0.414 | 0.364 / 0.398 | 0.333 / 0.387 |
| ETTm1 | 0.645 / 0.500 | 0.406 / 0.385 | 0.394 / 0.415 | 0.367 / 0.402 | 0.352 / 0.385 | 0.336 / 0.377 |
| ETTm2 | 0.310 / 0.350 | 0.311 / 0.337 | 0.317 / 0.365 | 0.280 / 0.341 | 0.273 / 0.334 | 0.258 / 0.320 |
| ECL | 0.214 / 0.278 | 0.187 / 0.274 | in distribution | 0.172 / 0.269 | 0.171 / 0.267 | 0.169 / 0.265 |
| Weather | 0.292 / 0.315 | 0.287 / 0.281 | 0.265 / 0.297 | 0.254 / 0.301 | 0.252 / 0.297 | 0.234 / 0.270 |
Reviewer Comment

Dear authors, I thank you for your useful feedback. I have increased my score accordingly.

Author Comment

Thank you for providing the insightful review, which helped us a lot in the rebuttal and paper revision. We will elaborate on the dataset curation and include the scaling analysis in our final version.

Official Review
Rating: 4

The paper introduces Sundial, a novel family of time series foundation models that address fundamental challenges in time series forecasting through a native, flexible, and scalable approach. The work's primary innovation is the proposed TimeFlow Loss, an optimization objective based on flow-matching that enables Transformers to be pre-trained directly on continuous time series data without requiring discrete tokenization. This approach allows the model to generate multiple probable predictions when conditioned on arbitrary-length time series, achieving flexibility in representation learning beyond parametric densities that constrain distribution modelling capacity.

The authors make several significant technical contributions to achieve highly capable time series foundation models. First, they employ continuous tokenization through patch embedding that accommodates variable-length inputs alongside minimal but crucial adaptations to the Transformer architecture, including Pre-LN for stability, RoPE for temporal causality, and optimizations like FlashAttention and KV Cache. Second, they curate TimeBench, an unprecedented corpus containing 1 trillion time points from diverse sources, including real-world datasets and synthetic data across various frequencies. This extensive pre-training dataset enables the model to learn comprehensive temporal dynamics and patterns.

The paper presents a systematic comparison between generative and deterministic forecasting paradigms, revealing that models pre-trained with MSE loss often produce over-smoothed predictions due to mode collapse on heterogeneous data distributions. In contrast, Sundial generates diverse yet coherent temporal patterns that align well with input patterns. The authors validate their design choices through extensive ablation studies that compare TimeFlow Loss with alternative approaches, demonstrating trade-offs between inference speed and prediction quality.
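To make the continuous-tokenization idea concrete, here is a minimal patch-embedding sketch; the patch length, model width, and module names are illustrative assumptions, not Sundial's actual implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Continuous tokenization sketch: split a series into fixed-length
    patches and project each to a latent token (no discrete vocabulary)."""
    def __init__(self, patch_len: int = 16, d_model: int = 256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, length)
        b, n = x.shape
        x = x[:, : n - n % self.patch_len]               # drop the remainder
        patches = x.reshape(b, -1, self.patch_len)       # (batch, tokens, patch)
        return self.proj(patches)                        # (batch, tokens, d_model)

tokens = PatchEmbed()(torch.randn(2, 96))                # -> shape (2, 6, 256)
```

Because the projection operates on raw values, variable-length inputs simply yield a variable number of tokens, which is what allows arbitrary-length conditioning.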

给作者的问题

Here are my questions; I would love to hear back from the authors on them.


(1) Quantification of Prediction Diversity: While you demonstrate TimeFlow Loss mitigates mode collapse through visualizations in Figures 13-14, could you provide quantitative metrics to measure prediction diversity across your generated samples? I think this would strengthen your claims regarding the advantages of flow-matching over MSE-based training and help practitioners better understand when to choose each approach.


(2) Computational Resource Requirements: Your work demonstrates impressive results using a trillion-point dataset but lacks details on computational requirements. Could you provide specific information about training time, hardware configurations, and estimated computational costs? This information would be valuable for assessing scalability and environmental impact considerations that are increasingly important in foundation model research.


(3) Context Length Sensitivity: How sensitive is Sundial's performance to the availability of historical context? Your experiments maintain fixed context lengths, but real-world applications often face varying historical data availability. An analysis showing performance degradation curves as context length decreases would provide important insights into the model's robustness in practical deployment scenarios.


(4) Conservative Prediction Behaviour: You briefly mention conservative prediction behaviour as a limitation. Could you elaborate on the specific scenarios where this manifests most prominently and provide a quantitative analysis of this phenomenon? This information would help users understand when Sundial might underperform on trend forecasting tasks compared to alternative approaches.


(5) Controlled Baseline Comparisons: For comparing against baseline models (Table 1), did all models have access to identical context lengths and inference settings? The architectural differences noted in Table 6 raise questions about whether performance differences might be partially attributable to these variations rather than fundamental modelling approaches.


(6) Multivariate Extension Strategy: Given your univariate pre-training approach, what specific architectural or training methodology modifications would be required to effectively model complex inter-variate correlations in multivariate settings? Would this require fundamental changes to the TimeFlow Loss formulation or primarily architectural adaptations? I think a clearer roadmap for this extension would significantly enhance the paper's impact.


I am happy to change the score if the answers are convincing, and look forward to hearing back from the authors.

论据与证据

The authors make several significant claims regarding their proposed time series foundation models, and overall, most claims are backed by thorough experimental evidence. The central contribution—the TimeFlow Loss based on flow-matching—is well-established through comprehensive mathematical formulations in Sections 3.1 and 4.1.3, providing a clear theoretical foundation for the approach. The claim of state-of-the-art performance is convincingly demonstrated through extensive benchmarking across multiple datasets. Table 1 shows Sundial consistently outperforming other advanced foundation models on point forecasting tasks, with quantitative improvements (7.57% MSE reduction and 4.71% MAE reduction compared to Time-MoE). For probabilistic forecasting, Table 2 demonstrates Sundial achieving first place in MASE and second place in CRPS on the GIFT-Eval benchmark across 23 datasets. These results are particularly impressive given the zero-shot nature of the evaluation.


The authors' claims regarding model scalability are supported by both empirical performance gains across model sizes (as shown in Table 1) and convergence improvements (Figure 7 shows a 15.38% reduction in training objectives for larger models). However, the scaling analysis could have been strengthened with more model size variants to establish clearer scaling laws. The assertion that TimeFlow Loss mitigates mode collapse is partially supported through ablation studies in Table 3 comparing different training objectives and visualizations in Figures 13-14 that contrast Sundial's diverse predictions with the over-smoothed outputs from MSE-trained models. While these comparisons are informative, a more quantitative evaluation of prediction diversity would have strengthened this particular claim.


The paper's ambitious dataset contribution (TimeBench with 1 trillion time points) is well-documented in Table 4, with clear attribution of sources and distributions, lending credibility to the large-scale pre-training claims. The inference speed claims require some nuance—while Figure 6 demonstrates that Sundial achieves an 11.34× speedup compared to Chronos, it's not necessarily faster than all baseline approaches. This represents a reasonable trade-off given the probabilistic capabilities of the model.

方法与评估标准

The methods and evaluation criteria proposed in the paper are fundamentally well-aligned with the challenges of time series foundation modelling. The authors recognize a critical limitation in existing approaches—specifically, the tension between discrete tokenization (which limits the representation of continuous values) and parametric densities (which restrict distribution modelling capacity). Their proposed TimeFlow Loss offers a theoretically sound solution by enabling Transformers to learn flexible predictive distributions without requiring tokenization or prior distribution specification, which is particularly appropriate for the heterogeneous nature of large-scale time series corpora.


The patch-based continuous tokenization approach sensibly balances computational efficiency with representation quality, addressing the unique characteristics of time series data while maintaining compatibility with Transformer architectures. Technical adaptations like Pre-LN for stability and RoPE for temporal causality demonstrate thoughtful consideration of the domain-specific challenges in time series modelling.


Regarding evaluation criteria, the authors employ a comprehensive benchmarking strategy that convincingly validates their claims. The use of both point forecasting metrics (MSE, MAE) and probabilistic metrics (CRPS, MASE, WQL) across three established benchmarks—Time-Series-Library, GIFT-Eval, and FEV—ensures thorough performance assessment. The GIFT-Eval benchmark is particularly appropriate as it encompasses 23 datasets with diverse characteristics, providing a robust measure of generalization capability. The evaluation against both statistical methods and competing foundation models offers the necessary context for interpreting performance gains.


The TimeBench dataset, with its unprecedented scale of 1 trillion time points and diverse sources spanning different frequencies and domains, constitutes an appropriate foundation for pre-training models intended for broad applicability. The careful exclusion of test datasets from pre-training data demonstrates methodological rigour in preventing data leakage that could invalidate zero-shot performance claims.


The ablation studies comparing TimeFlow Loss with alternative approaches (Table 3) and the exploration of inference trade-offs (Figure 8) provide crucial scientific validation of design choices. However, the paper would benefit from more domain-specific evaluations to complement the general-purpose benchmarks and from deeper analysis of prediction diversity beyond visual showcases. Nevertheless, the overall approach and evaluation strategy are well-matched to the foundational goals of developing generalizable, probabilistic time series models capable of effective zero-shot performance.

理论论述

The paper's main theoretical foundation rests on the flow-matching framework originally proposed by Lipman et al. (2022), which the authors adapt to time series forecasting. The primary theoretical contribution—TimeFlow Loss—is formulated in Section 4.1.3, where the authors extend the conditional flow-matching objective to handle sequential time series data. While the mathematical formulation is clearly presented (Equations 4-7), the authors don't provide formal proofs for the new theoretical properties of this adaptation. Instead, they demonstrate its effectiveness through empirical validation across multiple benchmarks.


The theoretical formulation builds directly upon the preliminaries established in Section 3.1, where the standard flow-matching framework is introduced. The authors correctly present the existing theoretical elements including the velocity field ODE (Equation 1), the Flow-Matching objective (Equation 2), and the Conditional Flow-Matching objective with Gaussian formulation (Equation 3). These equations are properly cited to the original work. The adaptation to time series involves conditioning the flow-matching process on a learned representation h_i, which appears mathematically sound, but the paper doesn't provide a rigorous proof of its properties beyond empirical results. The inference procedure (Algorithm 1) is a straightforward application of numerical ODE solving, consistent with standard flow-matching literature.
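For intuition, the following is a minimal sketch of a conditional flow-matching objective with a linear Gaussian path and Euler-solver sampling in the spirit of Equations 1-3 and Algorithm 1; the `VelocityNet` module, its dimensions, and the conditioning interface on a representation `h` are assumptions for exposition, not the paper's implementation:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Illustrative MLP predicting the velocity field v(x_t, t | h)."""
    def __init__(self, patch_len: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_len + hidden_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, patch_len),
        )

    def forward(self, x_t, t, h):
        # Condition on the Transformer representation h of the context.
        return self.net(torch.cat([x_t, h, t], dim=-1))

def flow_matching_loss(v_net, x1, h):
    """Conditional flow-matching loss with the straight path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)              # noise sample
    t = torch.rand(x1.size(0), 1)          # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the path
    target = x1 - x0                       # ground-truth velocity
    return ((v_net(x_t, t, h) - target) ** 2).mean()

@torch.no_grad()
def sample(v_net, h, patch_len, steps=50):
    """Euler ODE solver from noise to a predicted patch (Algorithm 1 style)."""
    x = torch.randn(h.size(0), patch_len)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((h.size(0), 1), i * dt)
        x = x + dt * v_net(x, t, h)
    return x

# toy usage: batch of 8 contexts with 256-dim representations, 16-point patches
v_net = VelocityNet(patch_len=16, hidden_dim=256)
h = torch.randn(8, 256)
x1 = torch.randn(8, 16)                    # next-patch ground truth
loss = flow_matching_loss(v_net, x1, h)
pred = sample(v_net, h, patch_len=16)      # one generated patch per context
```

Drawing multiple noise samples per context yields multiple probable predictions, which is the mechanism behind the non-deterministic forecasts discussed in this review.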


The claim that TimeFlow Loss mitigates mode collapse during pre-training is primarily substantiated through qualitative visual showcases (Figures 13-14) and comparative experiments (Table 3) rather than through theoretical guarantees. This empirical validation approach is reasonable for a systems paper but limits the theoretical depth of the contribution.


From my understanding, while the paper presents a coherent mathematical framework for applying flow-matching to time series forecasting, it does not contain novel theoretical proofs requiring verification. The mathematical soundness of the approach derives from its faithful extension of established flow-matching theory, with innovations focused on architecture and application rather than fundamental theoretical advances.

实验设计与分析

The authors present a comprehensive evaluation framework for their proposed time series foundation models, with experiments that are generally well-designed but that, in my view, contain several methodological limitations that merit consideration. I outline some of them here.

(1) The zero-shot forecasting evaluation methodology demonstrates strong validity through the use of established benchmarks (Time-Series-Library, GIFT-Eval, and FEV) and clear separation between pre-training and evaluation datasets. The authors implement appropriate metrics (MSE/MAE for point forecasting and MASE/CRPS/WQL for probabilistic forecasting) that align with community standards. The careful exclusion of evaluation datasets from the pre-training corpus (explicitly denoted by dashes in Table 1) strengthens the experimental integrity of zero-shot claims. However, the comparison methodology raises concerns regarding controlled evaluation conditions. While Table 6 documents architectural differences between models, the paper doesn't explicitly confirm whether all baseline models had access to identical context lengths during inference, which could significantly impact forecasting performance. Additionally, the reliance on officially reported results from other papers (noted in Table 7) introduces potential inconsistencies in evaluation protocols that could affect fair comparison.


(2) The ablation study for TimeFlow Loss (Table 3) is methodologically sound in maintaining architectural consistency while varying only the training objective. However, this analysis is limited to reporting point forecasting metrics (MSE) without evaluating probabilistic metrics, which undermines a complete assessment of the central claim regarding improved distribution modelling. The qualitative comparisons in Figures 13-14 partially address this gap but lack objective quantification of prediction diversity.


(3) The scaling behaviour analysis in Figure 7 presents valid training curves but would be strengthened by more intermediate model sizes to establish clearer scaling laws. Similarly, the inference speed versus performance trade-off analysis (Figure 8) systematically varies generation parameters but doesn't properly quantify the relationship between these parameters and actual inference time.


(4) Most concerning is the absence of statistical significance testing or confidence intervals across all experimental results, particularly important given the high variance typically observed in time series forecasting performance. This omission limits the robustness of performance comparisons, especially for closely-matched results.

Supplementary Material

Yes, I reviewed the full paper, including all of the supplementary materials.

Relation to Existing Literature

I think this work makes several notable contributions that both build upon and diverge from established research trajectories in time series modelling. The authors' central contribution—TimeFlow Loss—represents a significant advancement in how foundation models handle continuous-valued time series data. This approach extends the flow-matching framework of Lipman et al. (2022) to the autoregressive time series domain, creating a bridge between continuous generative modelling and sequential forecasting that has been largely unexplored in prior literature. The authors' decision to embrace generative modeling rather than discrete tokenization strategically positions their work as an alternative to the language-modeling inspired approach taken by Chronos (Ansari et al., 2024). While Chronos adapted techniques from NLP by discretizing continuous values, Sundial's native approach avoids the information loss inherent in quantization—addressing a fundamental limitation recognized but unresolved in previous work. Similarly, the TimeFlow Loss offers greater flexibility than the parametric mixture distributions employed by Moirai (Woo et al., 2024), which the authors correctly identify as potentially constraining when modelling heterogeneous time series distributions at scale.


The architectural adaptations, while individually derived from existing techniques like RoPE (Su et al., 2024) and Pre-LN (Xiong et al., 2020), represent a thoughtful integration of components that specifically address the challenges of time series forecasting. This integration draws from disparate research streams that have not previously been united for time series foundation models. Particularly, the deliberate incorporation of FlashAttention and KV Cache reflects an understanding of the efficiency challenges faced by practitioners—a consideration often neglected in academic time series research but well-established in large language model literature.


TimeBench's trillion-point scale represents an order-of-magnitude increase over previous datasets like those used in Time-MoE (300B, Shi et al., 2024b) and Timer (231B, Liu et al., 2024a,b). This scaling aligns with the broader foundation model literature's emphasis on dataset size as a critical factor in model capability, while specifically addressing the unique challenges of time series heterogeneity. The authors' careful compilation of diverse frequency data connects to emerging work on scaling laws for time series (Shi et al., 2024a), extending these insights into previously unexplored data volumes.


I think the empirical results position Sundial at the intersection of deterministic and probabilistic forecasting research streams. The comparative analysis against both MSE-trained models and diffusion-based alternatives provides a valuable empirical bridge between these previously separate approaches. This integration of multiple modelling paradigms within a unified foundation model framework represents a meaningful synthesis of previously disparate research directions in time series forecasting.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

This work demonstrates considerable originality in its approach to time series foundation models. The authors identify a fundamental tension in prior work—the trade-off between discrete tokenization and parametric distributions—and present a creative solution through their TimeFlow Loss framework. This contribution is conceptually significant as it bridges flow-matching techniques with autoregressive time series modelling, enabling Transformers to learn flexible distributions directly from continuous values without imposing restrictive prior distributions. This represents a meaningful step forward in the relatively nascent field of time series foundation models, not merely an incremental improvement over existing methods.


The paper's most significant contribution is its comprehensive approach to the generative modelling paradigm for time series. While individual components (flow-matching, Transformer architectures) exist in isolation, their integration into a cohesive framework specifically designed for time series forecasting demonstrates genuine innovation. The TimeBench dataset with 1 trillion time points represents a substantial resource contribution to the research community and enables proper exploration of scaling behaviour in time series foundation models, an important but previously under-explored area.


From a technical writing perspective, the paper generally maintains strong clarity. The mathematical formulations in Sections 3 and 4 are precise and well-structured, providing sufficient detail for implementation. The ablation studies effectively isolate the contribution of the TimeFlow Loss compared to alternatives, though they could benefit from more rigorous quantification of prediction diversity beyond the qualitative showcases in Figures 13-14.


However, several weaknesses merit attention. While the model shows impressive zero-shot performance, the paper inadequately addresses computational efficiency during training. Training on 1 trillion time points likely requires significant computational resources, but the paper provides a limited discussion of training time, hardware requirements, or environmental impact—considerations increasingly important in foundation model research. Additionally, the inference procedure introduces complexity with its sampling parameters, creating practical deployment challenges that aren't thoroughly addressed.


The model's conservative prediction behaviour (noted in the limitations section) represents a substantive weakness that could limit practical utility. Since accurate trend forecasting is critical in many applications, this limitation deserves a more thorough analysis rather than a brief acknowledgment. Furthermore, the univariate pre-training approach, while pragmatic, sidesteps the important challenge of modelling inter-variate correlations in multivariate time series—a limitation that restricts the model's applicability to many real-world scenarios where complex interdependencies exist.


I think despite these limitations, the paper makes a compelling contribution to time series foundation model research by establishing a new paradigm for generative modelling in this domain, demonstrating strong empirical results, and providing a solid foundation for future work in this direction.

Other Comments or Suggestions

N/A

Ethics Review Issues

N/A

Author Response

Many thanks to Reviewer gDT6 for providing a detailed review and recognizing our contributions.

Q1: Quantitative evaluation of prediction diversity.

Thanks for your suggestion. We extend the ablation study of Table 3 by providing the probabilistic metric CRPS to evaluate diversity. Note that MSE-optimized models give deterministic predictions (effectively a single peaked distribution), while generative forecasters can produce non-deterministic samples to estimate the distribution (consistent with the original paper, we use 20 raw predictions):

| Zero-Shot (CRPS) | ETTh1 | ETTh2 | ETTm1 | ETTm2 | ECL | Weather | GIFT-Eval |
|---|---|---|---|---|---|---|---|
| TimeFlow | 0.0059 | 0.0037 | 0.0057 | 0.0029 | 0.0082 | 0.0021 | 0.505 |
| Diffusion | 0.0082 | 0.0053 | 0.0070 | 0.0039 | 0.0095 | 0.0032 | 0.534 |
| MSE | 0.0063 | 0.0040 | 0.0058 | 0.0032 | 0.0080 | 0.0023 | 0.642 |

The predictive distribution modeled by TimeFlow is more coherent and diverse than its counterparts, notably on the diverse GIFT-Eval, validating TimeFlow's advantages in generative modeling.
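For reference, CRPS can be estimated directly from a finite set of generated trajectories via the standard sample-based estimator; the sketch below, with purely hypothetical data, illustrates the computation and is not the authors' evaluation code:

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: np.ndarray) -> float:
    """Sample-based CRPS estimator:
    CRPS ~= E|X - y| - 0.5 * E|X - X'|,
    with X, X' drawn independently from the predictive distribution.

    samples: (m, horizon) raw predictions; y: (horizon,) ground truth.
    """
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None, :] - samples[None, :, :]).mean()
    return term1 - 0.5 * term2

# e.g., 20 raw predictions as in the rebuttal setting
rng = np.random.default_rng(0)
preds = rng.normal(size=(20, 96))   # hypothetical generated trajectories
truth = rng.normal(size=96)         # hypothetical ground truth
print(crps_from_samples(preds, truth))
```

The second term rewards spread that matches the true uncertainty, which is why a deterministic forecaster (all samples identical) cannot score well on diverse data.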

Q2: Computational resource requirements.

Pre-training: We use 32 A100-40G GPUs. Each GPU handles a batch size of 128 and a context length of 2880. Benefiting from patch tokenization, our model finishes pre-training (100k iterations) on TimeBench in around 24 hours, using 768 GPU hours in total.

Inference (context length = 2880; sampling steps = 50; number of samples = 20; a single GPU with memory > 2 GB): it takes 7 ms to generate one prediction (140 ms for 20 raw samples per forward pass). This cost is approximately proportional to the number of sampling steps and samples.

Q3: Context length sensitivity.

We provide performance under different context lengths.

| Zero-Shot (MSE, Averaged) | 480 | 960 | 1440 | 1920 | 2400 | 2880 |
|---|---|---|---|---|---|---|
| ETTh1 | 0.404 | 0.412 | 0.427 | 0.422 | 0.417 | 0.411 |
| ETTh2 | 0.343 | 0.346 | 0.337 | 0.336 | 0.340 | 0.333 |
| ETTm1 | 0.385 | 0.375 | 0.361 | 0.349 | 0.339 | 0.336 |
| ETTm2 | 0.288 | 0.287 | 0.264 | 0.252 | 0.255 | 0.258 |
| ECL | 0.171 | 0.172 | 0.171 | 0.169 | 0.169 | 0.169 |
| Weather | 0.248 | 0.239 | 0.237 | 0.235 | 0.236 | 0.234 |

Note that Sundial handles various context lengths on FEV and GIFT-Eval. Unlike fixed-context models, Sundial offers flexibility to practitioners: the context length can be adjusted dynamically during inference instead of requiring re-training.

Q4: Discussion about the conservative prediction behavior.

We delved into this behavior after the initial submission, focusing on the pre-training distribution: meteorological data (ERA5), which takes a large proportion, is less likely to encompass extreme trends. The conservative behavior is mitigated by reweighting training samples in TimeBench.

| Zero-Shot (FEV) | MASE | WQL |
|---|---|---|
| Original | 0.845 | 0.712 |
| Reweighting | 0.832 | 0.685 |
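A minimal sketch of one way such reweighting could be implemented, assuming per-domain sampling weights that down-weight the over-represented meteorological portion; the weight value and domain labels are purely hypothetical:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def domain_balanced_weights(domains, era5_downweight=0.3):
    """Hypothetical reweighting: shrink the sampling probability of the
    over-represented meteorological portion so that series with stronger
    trends are seen more often during pre-training."""
    weights = [era5_downweight if d == "era5" else 1.0 for d in domains]
    return torch.tensor(weights, dtype=torch.double)

domains = ["era5"] * 7 + ["traffic", "energy", "finance"]   # toy corpus mix
sampler = WeightedRandomSampler(domain_balanced_weights(domains),
                                num_samples=len(domains), replacement=True)
# pass `sampler=sampler` to a DataLoader over the pre-training corpus
```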

Q5: Controlled comparisons of baseline models.

We report official results from public leaderboards and papers. We regard context length as an inherent capability on which a TSFM can continuously improve.

Despite the different context lengths of baseline models, our experiments still validate that the performance improvement is not simply attributable to length variations: on TSLib, Sundial (context length 2880) surpasses the previous state-of-the-art Time-MoE (context length 3072); on FEV and GIFT-Eval, more than half of the time series have fewer than 512 context points, below the maximum context length of most TSFMs. We will also extend the baseline evaluation (as in Q3) in the revised version.

Q6: Extension for multivariate forecasting.

Univariate pre-training accommodates the varying numbers of variates across pre-training datasets. A multivariate extension is crucial for the future improvement of Sundial, including:

  • Architecture: Recent works propose new attention mechanisms (e.g., Moirai, Timer-XL) for intra-/inter-variate modeling, which can be seamlessly incorporated into Sundial. Extending the flow-matching network to the multivariate case (e.g., from an MLP to an iTransformer) is also applicable, merging univariate representations after the backbone.

  • Post-training: Another roadmap is univariate pre-training followed by multivariate fine-tuning. Similar to GPT-3, a univariately pre-trained TSFM serves as a starting-point model. We will explore multivariate prompting (e.g., special variate tokens) to instruct TSFMs on downstream tasks.

W1: Experimental supplements to the issues mentioned above:

  • Quantify the relationship between parameters and inference time (extension of Figure 6):

| FEV Evaluation | Small | Base | Large |
|---|---|---|---|
| Parameters | 32M | 128M | 444M |
| Inference Time | 2.36 ms | 3.13 ms | 5.31 ms |

  • Statistical significance testing (five runs):

| Zero-Shot (MSE) | ETTh1 | ETTh2 | ETTm1 | ETTm2 | ECL | Weather |
|---|---|---|---|---|---|---|
| Mean ± Std. | 0.411 ± 0.001 | 0.333 ± 0.000 | 0.336 ± 0.000 | 0.258 ± 0.001 | 0.169 ± 0.000 | 0.234 ± 0.001 |

In contrast to the high variance typically observed in forecasting performance, generative forecasting aggregates multiple predictions for calibration, achieving small variance.

Reviewer Comment

Good job on providing a detailed response. I really appreciate the effort you put into it. I am happy with your answer to my first question; it was clear, well explained, and totally convincing. For the second question, I appreciate all the details you shared about the training setup - things like GPU specs, batch size, context length, and total training duration. That was helpful. That said, I was actually hoping to get a bit more info on computational costs, like FLOPs, energy consumption, or even the monetary cost. These details are super important when thinking about the scalability of large models. But just to be clear, that does not mean your response was not convincing. For the rest of the questions, you covered everything I asked, and I found the answers convincing as well. I am thankful that you even extended your results - it all makes a lot of sense to me. Thanks again for your hard work. I am happy to bump up your score to a 4, and I truly hope your work gets accepted. Great job to the team.

Author Comment

We’re glad that our responses addressed your concerns, which helped us a lot in improving the quality of this work. We will incorporate the corresponding revisions, including more detailed computational costs, into the final manuscript. Thanks again for your response and for raising the score.

Official Review
Rating: 4

This paper introduces Sundial, a family of native, flexible, and scalable time series foundation models. It proposes a TimeFlow Loss based on flow-matching for model training, which can generate multiple probable predictions. It also makes several crucial adaptations to the Transformer architecture and curates TimeBench, a corpus with 1 trillion time points, to enhance time series foundation models. Sundial shows good generalization in zero-shot forecasting and achieves new state-of-the-art results on both point and probabilistic forecasting benchmarks.

Questions for the Authors

Please see the 'Other Strengths and Weaknesses' and 'Other Comments or Suggestions' parts. Addressing the weaknesses and suggestions there may help improve the paper.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed method and evaluation criteria make sense for the problem.

Theoretical Claims

There are no proofs or theoretical claims.

Experimental Design and Analysis

Experimental designs and analyses are sound and valid.

Supplementary Material

I reviewed the entire appendix.

Relation to Existing Literature

This paper contributes to the literature on time series foundation models. Existing models mainly use parametric densities or language modeling, while this paper introduces a new perspective of generative modeling. It also proposes the large-scale TimeBench corpus for model training and a family of foundation models with SOTA performance on point and probabilistic forecasting.

Essential References Not Discussed

There are no essential references not discussed.

Other Strengths and Weaknesses

Strengths:

(1) This paper is well presented. Most motivations, designs, and contributions are clearly described. The proposed method is easy to follow.

(2) The idea of introducing generative modeling into time series foundation models is interesting. The proposed TimeFlow loss allows Transformers to be trained without discrete tokenization and to make multiple probable predictions.

(3) The proposed Sundial is a family of scalable and efficient time series foundation models, which achieves state-of-the-art zero-shot performance on point forecasting benchmarks and probabilistic forecasting benchmarks, including GIFT-Eval and FEV.

Weaknesses:

(1) Figure 1 needs more detailed explanation, such as of the two tokenization approaches and the three modeling techniques (their meanings, advantages, and disadvantages) and the differences between them.

(2) The proposed model makes some critical adaptations to the Transformer architecture, such as RoPE, Pre-LN, FlashAttention, and KV Cache. More explanation of why these adaptations help the model is needed. It would also be better to conduct experimental analysis of the effects of these adaptations.

(3) This paper mentions mode collapse in foundation models. It needs more description of this term and more explanation of how the proposed model overcomes mode collapse.

Other Comments or Suggestions

(1) Are there any principles in collecting datasets in TimeBench? For example, how can we decide which dataset should or should not be included during collection? How do these different datasets help the pre-training?

(2) With a larger scale of pre-training data, why does Sundial use a smaller model size than Time-MoE and Chronos?

Author Response

Many thanks to Reviewer VkQC for providing thorough and insightful comments.

Q1: Explanation of the different tokenization approaches and modeling techniques.

Thanks for your suggestion; we provide a comparison to enhance clarity:

| Tokenization | Meaning | Advantages | Disadvantages |
|---|---|---|---|
| Discrete | quantize time series into a fixed vocabulary | compatible with language modeling | foreign (discrete precision), compute-intensive, OOV risk |
| Continuous | embed time series into latent representations | native (operates on original values), efficient (patch tokens) | unconstrained output range |

| Modeling | Meaning | Advantages | Disadvantages |
|---|---|---|---|
| Parametric Densities | specify data with prior distributions | competent with a suitable prior, fits well on small-scale data | inflexible, risk of mode collapse |
| Language Modeling | predict categorical distributions over token sequences | flexible, scalable | relies on discrete tokens |
| Generative Modeling | learn the underlying distribution that generates data | flexible, scalable, compatible with continuous tokens | requires sampling |

Q2: Effectiveness of architectural adaptations.

  • RoPE maintains temporal causality, leading to better performance (a minimal RoPE sketch follows these tables):

| TSLib (Zero-Shot, Averaged) | w/o RoPE | with RoPE |
|---|---|---|
| MSE / MAE | 0.302 / 0.360 | 0.290 / 0.342 |

  • Pre-LN improves training stability, leading to stable convergence and better performance:

| TSLib (Zero-Shot, Averaged) | Post-LN (15k iter) | Post-LN (30k iter) | Pre-LN (15k iter) | Pre-LN (30k iter) |
|---|---|---|---|---|
| MSE / MAE | 0.295 / 0.348 | 0.297 / 0.350 | 0.294 / 0.347 | 0.290 / 0.342 |

  • FlashAttention reduces computational costs; KV Cache improves the inference speed of multi-step autoregression:

| Context Length = 2880, Patch Size = 16, Batch Size = 96 | w/o FlashAttention | with FlashAttention |
|---|---|---|
| Training Speed (s/iter) | 1.2723 | 1.2245 |
| Memory Footprint (GB) | 35.41 | 30.18 |

| Prediction Length = 160, Autoregression Steps = 10 | w/o KV Cache | with KV Cache |
|---|---|---|
| Inference Speed (s/iter) | 1.08 | 0.62 |
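As referenced above, here is a minimal rotate-half RoPE sketch applied to query/key tensors; this is one common formulation and not necessarily the exact variant used in Sundial:

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to queries/keys of shape (batch, seq, dim), dim even.
    Rotating each pair of channels by a position-dependent angle injects
    relative temporal order directly into the attention dot products."""
    b, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()            # (seq, dim/2)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rotary_embedding(torch.randn(2, 32, 64))         # toy query tensor
```

Because the rotation depends only on position offsets inside the dot product, attention scores become a function of relative distance, which is what preserves temporal ordering under variable context lengths.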

Q3: Discussion of mode collapse in time series foundation models.

Mode collapse is a failure of representation learning in which a model generates a limited variety of outputs, ignoring the diversity of the training data. For time series foundation models, mode collapse stems from the heterogeneity of time series distributions and sometimes leads to over-smoothed predictions (see showcases in Figures 13-14).

Our work addresses it through the training objective. We adopt generative modeling to learn flexible distributions without probabilistic priors. As an extension of Table 3, we evaluate the distributional metric CRPS to assess the quality of predictions generated with different training objectives.

| Zero-Shot Benchmark (CRPS) | ETTh1 | ETTh2 | ETTm1 | ETTm2 | ECL | Weather | GIFT-Eval |
|---|---|---|---|---|---|---|---|
| TimeFlow | 0.0059 | 0.0037 | 0.0057 | 0.0029 | 0.0082 | 0.0021 | 0.505 |
| Diffusion | 0.0082 | 0.0053 | 0.0070 | 0.0039 | 0.0095 | 0.0032 | 0.534 |
| MSE | 0.0063 | 0.0040 | 0.0058 | 0.0032 | 0.0080 | 0.0023 | 0.642 |

Results show that the predictive distribution modeled by TimeFlow is more coherent and diverse than those of the counterpart training objectives, especially on the highly diverse GIFT-Eval, which validates TimeFlow's effectiveness in addressing mode collapse.

Q4: Principles in collecting datasets in TimeBench.

The curation of trillion-scale TimeBench includes:

| Preprocessing | Quality Control | Composition |
|---|---|---|
| impute missing values via mean values or ARIMA | measure statistics such as the ADF test and predictability; exclude datasets that deviate from normal ranges | collect datasets from typical domains with predictable and applicable data, such as weather, traffic, marketing, and energy |
| replace abnormal values (e.g., 3-sigma rule) | exclude less predictable series based on the performance of simple ML forecasters | adopt synthetic techniques (e.g., KernelSynth), which improve the capability of seasonal/trend prediction |
| conduct normalization to mitigate range discrepancies | use statistics to determine sampling weights during pre-training | use systematic datasets with well-defined temporal dynamics, such as meteorological ERA5, which enhance the understanding of local variations |
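A minimal sketch of the preprocessing column above (imputation, 3-sigma replacement, normalization); the interpolation method and thresholds are assumptions for illustration, not TimeBench's actual pipeline:

```python
import numpy as np
import pandas as pd

def preprocess_series(s: pd.Series) -> pd.Series:
    """Illustrative TimeBench-style cleaning: impute, clip outliers, normalize."""
    s = s.interpolate(limit_direction="both")               # missing-value imputation
    mu, sigma = s.mean(), s.std()
    s = s.clip(lower=mu - 3 * sigma, upper=mu + 3 * sigma)  # 3-sigma replacement
    return (s - s.mean()) / (s.std() + 1e-8)                # normalization

raw = pd.Series([1.0, np.nan, 2.0, 100.0, 3.0, 2.5])        # toy series with a fault
print(preprocess_series(raw))
```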

Q5: About the smaller model size compared to Time-MoE and Chronos.

The difference mainly comes from architectural choices: Time-MoE adopts a mixture-of-experts design, which greatly increases parameter counts. Compared with the encoder-decoder Chronos, Sundial adopts a decoder-only Transformer. To effectively scale parameter counts, it is crucial to enhance domain diversity and dataset completeness in the pre-training corpora, which we leave as important future work.

Final Decision
  1. It proposes a TimeFlow Loss based on flow-matching for model training, which can generate multiple probable predictions, demonstrated on both point forecasting and probabilistic forecasting benchmarks. [Reviewer VkQC]
  2. The idea of introducing generative modeling into time series foundation models is interesting. [Reviewer VkQC]
  3. The proposed TimeFlow loss allows Transformers to be trained without discrete tokenization and make probable predictions. [Reviewer VkQC]
  4. This work demonstrates considerable originality in its approach to time series foundation models. [Reviewer gDT6]
  5. Most significant claims made by the authors are backed by thorough experimental evidence. [Reviewer gDT6]
  6. The central contribution—the TimeFlow Loss based on flow-matching—is well-established through comprehensive mathematical formulations in Sections 3.1 and 4.1.3, providing a clear theoretical foundation for the approach. [Reviewer gDT6]
  7. The paper's ambitious dataset contribution (TimeBench with 1 trillion time points) is well-documented in Table 4, with clear attribution of sources and distributions, lending credibility to the large-scale pre-training claims. [Reviewer gDT6]
  8. The patch-based continuous tokenization approach sensibly balances computational efficiency with representation quality, addressing the unique characteristics of time series data while maintaining compatibility with Transformer architectures. [Reviewer gDT6]
  9. Technical adaptations like Pre-LN for stability and RoPE for temporal causality demonstrate thoughtful consideration of the domain-specific challenges in time series modelling. [Reviewer gDT6]
  10. The paper presents a collection of foundation models for time series. To this end, the authors proposed a loss called TimeFlow for predicting the distribution of the next patch, enabling Transformers to be pre-trained without the need for discrete tokenization. [Reviewer 4SCU]

All reviewers acknowledge the author responses.