ICML 2025 · Poster · 4 reviewers
Overall rating: 6.1 / 10
Reviewer scores: 3, 3, 4, 3 (min 3, max 4, std 0.4)

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose VisionTS, a time series forecasting foundation model built from rich, high-quality natural images.

Abstract

Keywords
time series forecasting, foundation models, transfer learning

Reviews and Discussion

Official Review
Rating: 3

This paper proposes VisionTS, which leverages the strong pre-trained capability of a vision model to help the time series modality. The core idea is the inherent similarity between images and time series, such as trend and seasonality. To align the time series input with the image domain, the authors first convert the time series into a 2D grayscale image, which is then fed to an MAE for prediction. The experiments are solid and the performance of the proposed VisionTS is noteworthy.

Questions For Authors

None

Claims And Evidence

The main claim of this paper is that images and time series share similar properties. The authors support this point with an intuitive explanation. However, it is not very convincing to me given the domain gap between these two modalities. Further empirical evidence may be needed to justify why the pre-trained MAE can be used, even in a zero-shot way, to perform time series forecasting.

Methods And Evaluation Criteria

  • The proposed method mainly focuses on how to align the input spaces of images and time series. The solution of transforming the series into a 2D grayscale image is reasonable.

  • The evaluation metrics are commonly used in existing works, which makes sense.

Theoretical Claims

No theoretical proofs are included in the paper.

Experimental Design And Analyses

  • The experiments are solid. The authors have included various time series tasks, including zero-shot and few-shot, and have compared with both LLM-based and classic time series models.

  • I have noticed that the authors discuss that VisionTS cannot model the interactions between variables in multivariate time series data. Since multivariate data is almost ubiquitous in the real world, this limitation may weaken its practical usage. However, since this paper is a first attempt, this problem may be addressed in the future.

Supplementary Material

I have reviewed the supplementary material.

Relation To Broader Scientific Literature

The idea of using a vision model for time series is novel and can inspire further exploration.

Essential References Not Discussed

The experimental comparison lacks one highly related work, CALF [CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning, AAAI 2025], which is an existing SoTA LLM-based time series forecasting work. How does the performance of the proposed VisionTS compare with CALF?

Other Strengths And Weaknesses

None

Other Comments Or Suggestions

  • LLMs have recently been shown to be not strictly necessary for time series [Are Language Models Actually Useful for Time Series Forecasting? NeurIPS 2024]. I therefore wonder whether vision models suffer from a similar situation?
Author Response

Thank you for your encouraging response. We are delighted that you find our paper novel, the experiments solid, and the performance noteworthy. Below are our responses:

Claims And Evidence: However, it is not very convincing to me given the domain gap between these two modalities. Further empirical evidence may be needed to justify why the pre-trained MAE can be used, even in a zero-shot way, to perform time series forecasting.

  • As you noted, we are the first to leverage a pre-trained vision model for zero-shot forecasting. We understand that ground-breaking ideas often require time to gain community endorsement. As an initial exploration, we've conducted extensive experiments to validate VisionTS's effectiveness. To our knowledge, our evaluation benchmark is the largest among existing TSF foundation models.

  • To explore the domain gap between the modalities further, we visualize their similarities in Fig 7. We find notable heterogeneity within time-series data across domains, with images potentially "bridging" these isolated time-series representations, which might explain why VisionTS outperforms some cross-domain TSF models.

  • We welcome any suggestions that could further strengthen the persuasiveness!

Essential References Not Discussed: The experimental comparison lacks one highly related work CALF [CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning, AAAI2025], which is the existing SoTA LLMs-based time series forecasting works. How is the performance of the proposed VisionTS compared with CALF?

  • We compare the full-shot results reported by CALF with VisionTS's zero-shot results below:
| | VisionTS (zero-shot) | CALF (full-shot) |
|---|---|---|
| ETTh1, MSE | 0.390 | 0.432 |
| ETTh1, MAE | 0.414 | 0.428 |
| ETTh2, MSE | 0.333 | 0.349 |
| ETTh2, MAE | 0.375 | 0.382 |
| ETTm1, MSE | 0.374 | 0.395 |
| ETTm1, MAE | 0.372 | 0.390 |
| ETTm2, MSE | 0.282 | 0.281 |
| ETTm2, MAE | 0.321 | 0.321 |
| Electricity, MSE | 0.207 | 0.175 |
| Electricity, MAE | 0.294 | 0.265 |
| Weather, MSE | 0.269 | 0.250 |
| Weather, MAE | 0.292 | 0.274 |
  • We can observe that VisionTS, even in the zero-shot scenario, is comparable to CALF in the full-shot setting. This highlights the greater transferability of the visual modality to time series compared to the textual modality.

Other Comments: LLMs have recently been shown to be not strictly necessary for time series [Are Language Models Actually Useful for Time Series Forecasting? NeurIPS 2024]. I therefore wonder whether vision models suffer from a similar situation?

  • We agree that the text modality may offer limited benefit to time series. Following that paper's ablation study, we removed or replaced VisionTS's visual backbone with simpler modules. Appendix D.3 shows that these changes degrade performance, indicating that visual knowledge, unlike textual knowledge, is indeed beneficial for TSF.

We hope these responses can fully address your concerns. Thank you once more for your detailed feedback!

Reviewer Comment

Thanks for providing this rebuttal.

I have carefully read this response. My major concerns have been addressed. However, I still have the following suggestions for the authors to further improve their paper.

  1. Move the justification of "why vision models can do TS tasks" to the front of the paper, since this justifies the motivation of this work.

  2. From the comparison with CALF, it seems the proposed method falls behind on the Electricity and Weather datasets. It is suggested to add a related discussion in the revision.

In short, despite the limitation regarding real-world multivariate data and the comparison with the existing SoTA, given that this work is the first exploration of applying vision models to TS, I tend to maintain my previous positive rating.

Author Comment

Thanks for your quick response and positive feedback on our paper!

Move the justification of "why vision models can do TS tasks" to the front of the paper, since this justifies the motivation of this work.

Thank you for your suggestion. In the front of the paper (Introduction section), we have included an illustrative example (Figure 2) and referred to the modality visualization experiment (Lines 106-118) to support our motivation. We will emphasize this further in the final version, possibly by bolding the relevant texts.

From the comparison with CALF, it seems the proposed method falls behind on the Electricity and Weather datasets. It is suggested to add a related discussion in the revision.

Thank you for pointing this out. However, we respectfully clarify that VisionTS operates in zero-shot mode in this comparison (i.e., without training on the Weather and Electricity datasets). Full-shot results for VisionTS are reported in Table 19 (Appendix D.2) and summarized as follows:

| | VisionTS (full-shot) | CALF (full-shot) |
|---|---|---|
| Electricity, MSE | 0.165 | 0.175 |
| Electricity, MAE | 0.259 | 0.265 |
| Weather, MSE | 0.227 | 0.250 |
| Weather, MAE | 0.262 | 0.274 |

Notably, these results were achieved by fine-tuning VisionTS for just one epoch (only fine-tuning layer normalization while freezing other modules). This demonstrates that VisionTS, with minimal fine-tuning, is able to outperform CALF in the full-shot mode.
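
To make this kind of minimal adaptation concrete, below is a small PyTorch sketch of layer-norm-only fine-tuning. It is our illustration under stated assumptions, not the authors' released code: `mae_model` is a placeholder for a pre-trained backbone wrapped to output forecasts, `forecast_loader` is a placeholder DataLoader of (context, target) pairs, and the learning rate is arbitrary.

```python
import torch

def finetune_layernorm_only(mae_model, forecast_loader, lr=1e-4, device="cpu"):
    """Fine-tune only LayerNorm parameters for a single epoch; everything else stays frozen."""
    mae_model.to(device)
    # Freeze all parameters, then unfreeze those belonging to LayerNorm modules.
    for module in mae_model.modules():
        unfreeze = isinstance(module, torch.nn.LayerNorm)
        for p in module.parameters(recurse=False):
            p.requires_grad = unfreeze

    trainable = [p for p in mae_model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = torch.nn.MSELoss()

    mae_model.train()
    for context, target in forecast_loader:   # one pass over the data = one epoch
        context, target = context.to(device), target.to(device)
        optimizer.zero_grad()
        pred = mae_model(context)              # assumed to return the forecast tensor
        loss = loss_fn(pred, target)
        loss.backward()
        optimizer.step()
    return mae_model
```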

If you have any other questions, feel free to reach out for further discussion!

Official Review
Rating: 3

The paper proposes to utilize a pre-trained vision masked autoencoder for time series forecasting. The time series data is processed in a channel-independent manner and stacked depending on the periodicity of the series. A pre-trained vision MAE is applied, and the result is transformed back into the series space, representing the forecast. The authors argue that the intrinsic patterns of vision data are more similar to time series data than those of text; hence, pre-trained vision models might be beneficial as TSF foundation models while LLMs are not. The approach was evaluated on an extensive set of benchmarks and shows promising results.
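
To make the pre-/post-processing described above concrete, here is a minimal numpy sketch of the kind of transformation involved. It is illustrative only and not the authors' implementation: the real pipeline additionally rescales the matrix to the MAE input resolution and applies its own normalization, both omitted here.

```python
import numpy as np

def series_to_grayscale(x: np.ndarray, period: int) -> np.ndarray:
    """Stack a 1D series into a (period x num_segments) matrix and map it to [0, 1]."""
    num_segments = len(x) // period
    # Each column is one full period; row i of column j holds time step j * period + i.
    matrix = x[: num_segments * period].reshape(num_segments, period).T
    lo, hi = matrix.min(), matrix.max()
    return (matrix - lo) / (hi - lo + 1e-8)   # grayscale intensities in [0, 1]

# Example: an hourly series with a daily cycle (period = 24) becomes a 24 x 7 "image".
hourly = np.sin(2 * np.pi * np.arange(24 * 7) / 24)
image = series_to_grayscale(hourly, period=24)
assert image.shape == (24, 7)
```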

Update after rebuttal

I thank the authors for the clarifications; I have already updated my score.

Questions For Authors

  • As far as I understand, VisionTS relies on a specific periodicity that is used for segmentation. How would you use VisionTS to model time series data that exhibits multiple periodicities?

Claims And Evidence

The main claim is that VisionTS, a pre-trained vision autoencoder with time-series-specific pre- and post-processing, is suitable for time series forecasting. The claim is supported by an extensive set of benchmark evaluations. While I think the claim is justified, I have some concerns about the evaluation results (see below).

Methods And Evaluation Criteria

Yes, the paper evaluates on multiple standard benchmark datasets for time series forecasting. The method itself is a creative approach that might not constitute a model that will be actually used in practice but might help to understand why and how TSF foundation models work.

Theoretical Claims

There are no theoretical proofs or claims.

Experimental Design And Analyses

In general, the experimental design is well done as the paper evaluates multiple benchmarks (individual datasets of the three benchmarks overlap, which is fine). I only have one concern and one suggestion:

  1. GIFT-Eval is the most comprehensive of the utilized benchmarks. Unfortunately, its results are discussed only very briefly in Figure 7. Hence, only the MASE metric is reported, but no WQL or average-rank metrics. Although the leaderboard is even linked in the paper, some models outperforming VisionTS (TTM, Chronos-Bolt) are not reported. Although some might be concurrent work, updating the results would be beneficial.

  2. Following the argument from point (1), GIFT-Eval and the Monash repository are likely the much more relevant benchmarks, as they are more comprehensive than the long-term benchmark. Recent work further highlights problems with the long-term benchmark [1,2]. Therefore, I would suggest emphasizing and discussing GIFT-Eval and Monash in more depth instead of mostly highlighting the long-term benchmarks.

[1] L. Brigato, R. Morand, K. Strømmen, M. Panagiotou, M. Schmidt, and S. Mougiakakou, ‘Position: There are no Champions in Long-Term Time Series Forecasting’, Feb. 19, 2025, arXiv:2502.14045. doi: 10.48550/arXiv.2502.14045.

[2] Invited Talk by Christoph Bergmeir - Fundamental limitations of foundational forecasting models: The need for multimodality and rigorous evaluation, Time Series in the Age of Large Models Workshop, NeurIPS 2024.

Supplementary Material

I did not check the supplementary code.

Relation To Broader Scientific Literature

The work is located in the field of pre-trained / foundational time series models. Most closely related is the work that builds upon existing language models, as VisionTS also utilizes a pre-trained model that was trained on non-time-series data. To the best of my knowledge, VisionTS is the first model that utilizes a pre-trained vision model.

Essential References Not Discussed

Some methods that appear in the evaluation (e.g. TTM and Chronos) are not cited.

Other Strengths And Weaknesses

Strengths:

  • Creative approach that can help the understanding of what drives the performance of TSF foundation models.
  • Extensive benchmark scope

Weaknesses:

  • The evaluation reporting might miss certain metrics and models (see Experimental Design & Analysis). I suggest the authors update the mentioned results. I especially want to note that, regardless of whether VisionTS is still the best-performing model, this would improve the paper. The novelty of the idea (which helps to further understand TSF foundation models) together with a thorough and sound evaluation is more important than outperforming SOTA. If the evaluation results are updated accordingly, I would consider increasing my score towards acceptance.
  • The approach seems not to scale with bigger models (see results on different MAE sizes).

Other Comments Or Suggestions

Author Response

Thank you for your positive comments on our work. We are pleased that you find our paper novel and well-experimented, aiding in understanding the workings of TSF foundation models. Here are our responses to your concerns:

E1: Gift-Eval is the most comprehensive benchmark of the utilized benchmarks.

E2: Recent work further highlights problems of the respective long-term benchmark. Therefore, I would suggest emphasizing and discussing the gift-eval and monash in more depth instead of mostly highlighting the long-term benchmarks.

  • We fully agree that LTSF has limitations. Since this viewpoint has only recently emerged and is not yet widely accepted by the research community, we believe it is still necessary to test on widely-used benchmarks. As you can see, other reviewers are still interested in the LTSF datasets.

  • We also agree on the need to improve the presentation of the more solid benchmarks. We will add the GIFT-Eval results to the right side of the teaser (Figure 1) to further emphasize them and expand the discussion of their experimental results. Below is further discussion of the GIFT-Eval leaderboard, which will be included in our final revision:

W1: Evaluation reporting might miss certain metrics and models (see Experimental Design & Analysis). I suggest the author should update the mentioned results.

For GIFT-Eval, only the MASE metric is reported, but no WQL or average-rank metrics.

  • We report MASE but not CRPS (WQL) or average rank, since the latter are probabilistic metrics rather than point metrics (note that the average rank is based on CRPS-sorted results). Due to limitations of the MAE model, VisionTS is not a probabilistic forecasting model, as mentioned in Section 6. Comparing the CRPS of point-forecasting models with that of probabilistic models is unfair, as the former's CRPS tends to be significantly worse. If we only consider the point-forecasting models (e.g., TTM r1/r2 and Timer) on the leaderboard, VisionTS significantly outperforms them in both MASE and CRPS.
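
For reference, the distinction drawn here is between a point-forecast metric and a probabilistic one. Using the standard definitions (notation ours, not from the paper), with forecast horizon H, history length T, and seasonal period m:

```latex
\mathrm{MASE}
  = \frac{\frac{1}{H}\sum_{t=T+1}^{T+H} \lvert y_t - \hat{y}_t \rvert}
         {\frac{1}{T-m}\sum_{t=m+1}^{T} \lvert y_t - y_{t-m} \rvert},
\qquad
\mathrm{CRPS}(F, y)
  = \int_{-\infty}^{\infty} \bigl( F(z) - \mathbf{1}\{ z \ge y \} \bigr)^{2} \, dz
```

For a deterministic forecast, F collapses to a step function at the point prediction and the CRPS reduces to the absolute error, which generally exceeds the CRPS of a well-calibrated probabilistic forecast; this is the sense in which the comparison is called unfair to point forecasters.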

Some methods that appear in the evaluation (e.g. TTM and Chronos) are not cited. Although the leaderboard is even linked in the paper, some models outperforming VisionTS (ttm, chronos bolt) are not reported. Although some might be concurrent work, updating the results would be beneficial.

  • Thank you for your reminder and suggestion. We will cite all the referenced papers and update our results to include those concurrent works, even though some lack published papers, as noted by Reviewer TiNp and mentioned in Line 297. We also highlight that TTM's "superior results over VisionTS" were achieved by fine-tuning on the GIFT-Eval training dataset, not as a zero-shot model. Its zero-shot capability is significantly weaker than VisionTS, as shown in the leaderboard and already referenced in Figure 4 of our paper.

  • We would also like to note that there may be data leakage issues for these concurrent works. For instance, both TimesFM 2.0 and Chronos-Bolt used the M4 dataset, while TimesFM also utilized the Weather dataset for pretraining. In contrast, the visual MAE was trained on ImageNet, long before GIFT-Eval, which ensures no leakage.

W2: Approach seems to not to scale with bigger models (see results on different MAE size)

  • One explanation is that larger vision models tend to memorize image-related details, leading to overfitting that is harmful to time series forecasting. Moirai (ICML 2024 oral) exhibited similar behavior, where larger models perform worse on out-of-distribution data (see Table 6 in [1]), with even more severe degradation than VisionTS. This is understandable given the disparity between the image and time series modalities. We believe future adaptations in the time series domain could alleviate this issue.

  • Additionally, we found that larger models are not without merit. For example, MAE (large) performs well on ETTh1, and MAE (huge) shows good results on Electricity. Exploring the scenarios for different MAE sizes is a promising research direction.

Q1: How would you use VisionTS to model time series data that exhibits multiple periodicities?

  • We would assess potential periodicities based on the sampling frequency and select the optimal P using the validation set (see the sketch below). For time series without clear periodicity or with complex multi-periodicity, we can try P=1, which proved effective in our experiments (Appendix C.5).
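
As an illustration of this kind of selection procedure (our own sketch, not the authors' code; `forecast_fn` and the frequency-to-candidate mapping are hypothetical stand-ins):

```python
import numpy as np

def select_period(val_context, val_target, candidate_periods, forecast_fn):
    """Pick the period P with the lowest validation MSE.

    `forecast_fn(context, period)` is a placeholder for running a VisionTS-style
    zero-shot forecast with a given segmentation period.
    """
    best_period, best_mse = None, np.inf
    for period in candidate_periods:
        pred = forecast_fn(val_context, period)
        mse = float(np.mean((pred - val_target) ** 2))
        if mse < best_mse:
            best_period, best_mse = period, mse
    return best_period

# Hypothetical mapping from sampling frequency to candidate periods
# (hourly data: daily and weekly cycles, plus the P=1 fallback).
candidates_by_freq = {"H": [24, 168, 1], "D": [7, 30, 1], "15min": [96, 1]}
```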

We hope these responses can fully address your concerns. Thank you once more for your detailed feedback, which greatly enhances the robustness of this paper.

[1] Unified Training of Universal Time Series Forecasting Transformers

Reviewer Comment

Thanks for providing this rebuttal and addressing my main concern. I updated my score accordingly.

I want to note, regarding the LTSF benchmark discussion, that it is important to highlight issues when they are present. While I understand your concern that reviewers are still interested in the LTSF benchmark, a change can only happen when top conference papers lead the way. So, while I understand that fully omitting it is problematic, it is still within the authors' remit to highlight what they think is most important and meaningful.

Author Comment

Thank you very much for your prompt response and your appreciation! We believe that the time series research community may need some time to adapt to this potential change. If this paper is fortunate enough to be accepted, we will do our best to improve this situation.

Best,

Authors

Official Review
Rating: 4

In this paper the authors propose to adapt an image masked autoencoder pretrained on ImageNet for time series forecasting. They justify their choice by the similarities between the image and the time series modalities. They empirically show that the proposed method achieves superior performance compared with other state-of-the-art baselines.

Questions For Authors

My main questions are around the additional practices introduced in the paper, e.g., alignment and reconstruction, reshaping into 2D with explicit hints of the periodicity, etc. These practices are relatively agnostic to the backbone foundation model and should be inspected separately. If considered part of VisionTS, they undermine the zero-shot claim of the proposed method, as these practices are very lookback-window specific, and it is not surprising to see that proper handling of the time scale and the periodicity can deliver good models (e.g., [1]).

I'd suggest the following ablation studies to help better understand the proposed method:

  1. An empirical study when other methods are fed with the time series of some good alignment and the results are rescaled.
  2. An empirical study when a universal or an improper P is fed to VisionTS.

[1] SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters, https://arxiv.org/abs/2405.00946

Claims And Evidence

Yes.

Methods And Evaluation Criteria

Yes.

Theoretical Claims

N/A.

Experimental Design And Analyses

No issue.

Supplementary Material

No.

Relation To Broader Scientific Literature

N/A.

Essential References Not Discussed

N/A.

Other Strengths And Weaknesses

Strengths:

  • The paper presents thoughtful designs and evaluation of the transfer learning capability of visual models on time series forecasting. It's a relatively novel and intuitive approach.
  • The authors devise and discuss a few key designs that are specific and exclusive to this visual approach. The empirical results are impressive.

Weaknesses:

  • The paper lacks ablation studies to help understand the contributions from each of the new mechanisms the authors introduce as compared to a time series native pretrained model. See questions.

Other Comments Or Suggestions

Some empirical results are slightly outdated or misinterpreted:

  • Since GIFT-eval is cited, consider explaining why the current leaderboard leaders are not mentioned (e.g., no corresponding publications).
  • Table 3 is misleading, as the speed-up from VisionTS is due to the alignment and reconstruction steps, and it only shows up for >1k forecast lengths, possibly for batches that use the same P within. Moirai and TimesFM are not the best decoder-only references either; e.g., TimesFM does not implement cached decoding properly.
Author Response

Thank you for your invaluable review. We are delighted that you believe that our paper is novel and the experiment results are impressive. Below are our responses to the concerns:

W1: ablation studies for each of the new mechanisms.

Q1: An empirical study when other methods are fed with the time series of some good alignment and the results are rescaled.

  • If "other methods" refers to zero-shot pretrained models, we kindly note that the alignment is applicable only to visual models, as it is meant to convert 1D time series into 2D formats. To the best of our knowledge, no existing TSF foundation models accept direct 2D input, making this alignment step inapplicable to existing models.
  • If "other methods" refers to models trained from scratch, we conducted an ablation study using the same alignment but substituting the vision backbone with various models. Table 20 (Appendix D.3) indicates these substitutions significantly hurt performance and fail to achieve zero-shot forecasting, underscoring the vision backbone's crucial role in VisionTS. The mentioned SparseTSF (or TimesNet using similar alignment) cannot achieve such zero-shot forecasting as well.
  • Beyond the alignment mechanism, we also introduce a new mechanism: using smaller standard deviations (r) during normalization. Thanks for your suggestion; we investigated its contribution. The following table summarizes Moirai's performance with different r (average MSE across the four ETT datasets), indicating that values of r significantly higher or lower than 1 lead to notable degradation. The reason is that the image and time series distributions differ; the former is limited by the color range while the latter is not. Therefore, this mechanism is unnecessary for other TSF foundation models. A sketch of the normalization is given after this list.

| r | 0.4 | 0.6 | 0.8 | 1.0 | 1.2 | 1.5 |
|---|---|---|---|---|---|---|
| Avg. MSE | 0.474 | 0.494 | 0.387 | 0.372 | 0.370 | 0.380 |
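
A minimal sketch of the normalization mechanism being discussed (our illustration, not the authors' code; the default target standard deviation below is illustrative): the context window is standardized and then rescaled so that its standard deviation equals r, keeping most values inside the bounded pixel range before rendering.

```python
import numpy as np

def normalize_for_image(x: np.ndarray, r: float = 0.4) -> np.ndarray:
    """Standardize a context window, then shrink its spread to a target std of r."""
    mean, std = x.mean(), x.std() + 1e-8
    z = (x - mean) / std          # zero mean, unit variance
    return z * r                  # r < 1 keeps values within the limited image value range

# De-normalizing the forecast would invert the same affine map:
#   forecast = pred / r * std + mean
```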

Q2: An empirical study when a universal or an improper P is fed to VisionTS.

  • Thank you for your suggestion. We have compared the results using P=1 on the ETT datasets in Table 7 (Appendix B.2). To further validate this, we tested P=1, 7, and 24 on the Monash dataset:

| Proper P | P=1 | P=7 | P=24 | LLMTime |
|---|---|---|---|---|
| 0.729 | 0.931 | 0.957 | 1.102 | 1.041 |

Results indicate that selecting an appropriate P based on sampling frequency is crucial for zero-shot forecasting.

Q: if considered part of VisionTS, they undermine the zero-shot claim of the proposed method, as these practices are very lookback-window specific.

  • We kindly note that existing zero-shot foundation models also have their own mechanisms to incorporate sampling frequency based on the specific lookback window. For example, Moirai selects an appropriate patch size based on the sampling frequency (see Appendix B.1 of [1]). TimesFM [2] even includes the sampling frequency as model input. We believe that leveraging prior data characteristics (e.g., sampling frequency and periodicity) to further enhance the performance of zero-shot models is also a promising research direction.

Comment 1: consider explaining why the current GIFT-Eval leaders are not mentioned (e.g., no corresponding publications).

  • Thank you for your suggestion. As you noted, the current leaders are concurrent works, some without publications (mentioned in Line 297). However, we plan to include them in the final paper version, as discussed with Reviewer cUEq. We would also like to note that there may be data leakage issues for these concurrent works. For instance, both TimesFM 2.0 and Chronos-Bolt used the M4 dataset, while TimesFM also utilized the Weather dataset for pretraining. In contrast, the visual MAE was trained on ImageNet, which ensures no leakage.

Comment 2: Table 3 is misleading, as the speed-up from VisionTS is due to the alignment and reconstruction steps, and it only shows up for >1k forecast lengths, possibly for batches that use the same P. Moirai and TimesFM are not the best decoder-only references either; e.g., TimesFM does not implement cached decoding properly.

  • Thank you for your suggestion. For efficiency testing, our experimental settings are consistent with Moirai (refer to Table 23 in [1]), using longer forecast lengths. We chose Moirai and TimesFM for comparison since they are our baselines, and we will note in our final revision that VisionTS's evaluation uses the same P. To further address your concerns, we tested shorter lengths, as shown in the table below.
| Context Len | 100 | 100 | 100 | 100 | 200 | 300 | 400 |
|---|---|---|---|---|---|---|---|
| Pred Len | 100 | 200 | 300 | 400 | 100 | 100 | 100 |
| Moirai (base) | 0.03 | 0.03 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| TimesFM | 0.02 | 0.03 | 0.04 | 0.06 | 0.02 | 0.02 | 0.02 |
| VisionTS | 0.04 | 0.03 | 0.03 | 0.03 | 0.04 | 0.05 | 0.05 |

The table shows that VisionTS's runtime is similar to that of these two models.

We hope these responses can fully address your concerns. Thank you again for your detailed feedback!

[1] Unified Training of Universal Time Series Forecasting Transformers

[2] A decoder-only foundation model for time-series forecasting

Official Review
Rating: 3

In this paper, the authors explore a novel direction in applying foundation models to time series forecasting. Given the intrinsic similarities between natural images and time series, such as modality, origin, information density, and features, the authors introduce VisionTS, a TS forecasting model built upon the pretrained CV foundation model MAE. By leveraging segmentation, rendering, and alignment techniques, 1D time series data is transformed into 2D matrices, enabling the reconstruction and prediction of masked horizon sequences.

Questions For Authors

Please refer to the weaknesses.

Claims And Evidence

Yes. Extensive zero-shot and full-shot experiments have been conducted on the long-term TSF benchmark, the GIFT-Eval Leaderboard, and the Monash TSF Benchmark. Additionally, efficiency evaluations have been included. The results highlight the superior performance of the approach in cross-modality forecasting research.

Methods And Evaluation Criteria

Yes. This method maintains the same experimental settings as the previous methods in time series forecasting.

Theoretical Claims

This paper does not contain any theoretical discussions and claims.

Experimental Design And Analyses

The experimental designs are reasonable and comprehensive.

Supplementary Material

This paper has uploaded the code as supplementary material.

Relation To Broader Scientific Literature

This paper focuses on foundation models for time series forecasting, which is basic research and could be widely used in scientific applications such as energy, sales, and finance. Since it introduces a new technique, I do not find a specific relation between the proposed method and scientific literature in other research areas.

Essential References Not Discussed

I think this paper has included all essential references in this research area.

Other Strengths And Weaknesses

Strengths:

  1. The authors investigate foundation models for time series forecasting from a novel view, and provide well-founded motivations for leveraging a pretrained visual model as a numeric series forecaster.
  2. The authors conduct extensive experimental evaluations under both zero-shot and full-shot settings, and achieve promising performance.
  3. This paper is well-written, providing sufficient analysis and key insights into the methodology and experiments.

Weaknesses:

  1. In time series forecasting tasks, it is necessary and important to maintain the temporal order of time points or patches, as in PatchTST. How does VisionTS obtain the complete temporal information of visible patches during the alignment process?
  2. In the zero-shot experiments in Tables 1 and 9, VisionTS underperforms Moirai on half of the datasets (ETTm2, Electricity, and Weather). The authors should provide a detailed discussion and analysis of the underlying reasons.
  3. The zero-shot comparison on the long-term TSF benchmark shown in Table 1 is suggested to include TTM [1] and Time-MoE [2] as baselines as well.

[1] Ekambaram, Vijay, et al. Tiny Time Mixers (TTMs): Fast pre-trained models for enhanced zero/few-shot forecasting of multivariate time series. NeurIPS, 2024.

[2] Shi, Xiaoming, et al. Time-MoE: Billion-scale time series foundation models with mixture of experts. ICLR, 2025.

Other Comments Or Suggestions

Please refer to the weaknesses.

Author Response

Thank you for your invaluable response. We are delighted that you found our paper novel, well-motivated, and with sufficient experiments and insights. Below are our responses to the concerns:

W1: How does VisionTS obtain the complete temporal information of visible patches during the alignment process?

  • During the alignment process, the temporal information is encoded one-to-one into spatial information after the transformation. Specifically, the patch at the i-th row and j-th column corresponds to the time step j × N + i (with the input window scaled to N(N-n) time steps in total). During the ViT processing, each patch receives a unique 2D positional encoding, enabling the model to capture the spatial information and, consequently, the corresponding temporal information.
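
This mapping can be checked with a small numpy snippet (illustrative only; N and n below are arbitrary values, not the ones used in the paper):

```python
import numpy as np

N, n = 14, 5                          # patch rows, and number of masked (forecast) columns
context = np.arange(N * (N - n))      # input window rescaled to N * (N - n) time steps
grid = context.reshape(N - n, N).T    # column-major layout, shape (N, N - n)
# The patch in row i and column j holds time step j * N + i.
assert grid[3, 2] == 2 * N + 3
```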

W2: In the zero-shot experiments in Tables 1 and 9, VisionTS underperforms compared to Moirai on half of the datasets (ETTm2, Electricity, and Weather). The authors should provide a detailed discussion and analysis of the underlying reasons.

  • A possible explanation is that Moirai's pre-training data significantly overlaps with domains similar to Electricity and Weather (e.g., Moirai's pretraining corpus contains 60% energy and 15% climate data), which possibly leads to data leakage. In contrast, VisionTS, using an MAE model trained solely on ImageNet, is free from such potential leakage.
  • Additionally, this experiment is not meant to prove that VisionTS is superior to Moirai; instead, we aim to show that a purely visual model can achieve performance comparable to a natively pretrained time series model. Perhaps we can rephrase the statement as: in the zero-shot experiments, Moirai (trained on time series) underperforms VisionTS (trained on images) on half of the time series datasets. This underscores the promising potential of vision models in TSF.

W3: Table 1 is suggested to include TTM [1] and Time-MoE [2] as baselines as well.

Thank you for your suggestion. We report the base and large model results for Time-MoE since the ultra model weights are unreleased. For TTM, we used the official HuggingFace model for replication. The following table summarizes the performance of various zero-shot foundation models.

| | VisionTS (base) | Time-MoE (base) | Time-MoE (large) | TTM (v1) | Moirai (base) |
|---|---|---|---|---|---|
| ETTh1, MSE | 0.390 | 0.400 | 0.394 | 0.398 | 0.434 |
| ETTh1, MAE | 0.414 | 0.424 | 0.419 | 0.421 | 0.439 |
| ETTh2, MSE | 0.333 | 0.366 | 0.405 | 0.348 | 0.346 |
| ETTh2, MAE | 0.375 | 0.404 | 0.415 | 0.393 | 0.382 |
| ETTm1, MSE | 0.374 | 0.394 | 0.376 | 0.520 | 0.382 |
| ETTm1, MAE | 0.372 | 0.415 | 0.405 | 0.479 | 0.388 |
| ETTm2, MSE | 0.282 | 0.317 | 0.316 | 0.312 | 0.272 |
| ETTm2, MAE | 0.321 | 0.365 | 0.361 | 0.348 | 0.321 |
| Electricity, MSE | 0.207 | (data leakage) | (data leakage) | 0.201 | 0.188 |
| Electricity, MAE | 0.294 | (data leakage) | (data leakage) | 0.293 | 0.274 |
| Weather, MSE | 0.269 | 0.265 | 0.270 | 0.234 | 0.238 |
| Weather, MAE | 0.292 | 0.297 | 0.300 | 0.266 | 0.261 |
| avg, MSE | 0.309 | - | - | 0.335 | 0.310 |
| avg, MAE | 0.345 | - | - | 0.367 | 0.344 |
| 1st count | 7 | 0 | 0 | 1 | 5 |

We hope these responses can fully address your concerns. Thank you once more for your detailed feedback!

Final Decision

This paper presents VisionTS, which explores a very interesting idea: leveraging a vision masked autoencoder pre-trained on ImageNet for time series forecasting. By converting 1D time series data into 2D images, VisionTS reformulates forecasting as an image reconstruction problem. The experiments show that VisionTS delivers strong zero-shot performance and achieves state-of-the-art results with minimal fine-tuning on standard benchmarks.

Overall, all reviewers appreciate the paper’s novelty and creativity, as well as its extensive and rigorous evaluation across multiple benchmarks under both zero-shot and fine-tuning settings. Meanwhile, some concerns were raised, including: 1) more ablations are needed to understand the efficacy of the newly introduced components; 2) some competitive baselines (like TTM) are missing; 3) some important benchmark metrics are missing; 4) its scalability with larger models is questionable; and 5) it is unclear how this method generalizes to time series data that exhibits multiple periodicities.

After considering the authors’ rebuttal, which effectively addressed most of these concerns, two reviewers increased their scores. Consequently, all reviewers now unanimously favor accepting the submission. The AC concurs with this decision.