PaperHub
ICLR 2024 · Poster · 3 reviewers
Overall rating: 6.7 / 10 (individual scores: 6, 6, 8; min 6, max 8, std 0.9)
Confidence: 4.3

Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-04-21
TL;DR

In this paper, we propose Multi-Scale Transformers with Adaptive Pathways (Pathformer) for time series forecasting.

Abstract

Keywords
Time series · Transformer · Multi-scale

Reviews and Discussion

Review (Rating: 6)

This paper proposes multi-scale transformers with adaptive pathways. The proposed method integrates both temporal resolution and temporal distance for multi-scale modeling. It further enriches the multi-scale transformer with adaptive pathways. Experimental results showed the efficacy of the proposed method and state-of-the-art performance.

Strengths

  1. It's novel to propose multi-scale transformers with adaptive pathways.

  2. It's novel to integrate both temporal resolution and temporal distance for multi-scale modeling.

  3. The experiments showed state-of-the-art performance.

Weaknesses

  1. The current time series forecasting datasets are pretty small, and performance may be saturated or over-fitting. Could the method be used for larger datasets?

  2. Scaleformer [1] also uses the multi-scale nature of time series data, but this paper didn't mention or compare the similarities and differences with Scaleformer.

[1] Shabani, Amin, et al. "Scaleformer: iterative multi-scale refining transformers for time series forecasting." ICLR (2023).

Questions

  1. The current time series forecasting datasets are pretty small, and performance may be saturated or over-fitting. Could the method be used for larger datasets?

  2. Scaleformer [1] also uses a hierarchical design and the scales of time series data. Could this paper compare the similarities and differences with Scaleformer?

[1] Shabani, Amin, et al. "Scaleformer: iterative multi-scale refining transformers for time series forecasting." ICLR (2023).

Comment

We would like to sincerely thank Reviewer LNPN for acknowledging our technical novelty and empirical contributions, as well as the comments regarding larger datasets and the related multi-scale baseline method. We have revised our paper accordingly.

Q1: Results on larger datasets.

A1: We seek larger datasets from two perspectives: data volume and the number of variables. We add two datasets, the Wind Power dataset and the PEMS07 dataset, to evaluate the performance of Pathformer on larger datasets. The Wind Power dataset comprises 7,397,147 timestamps, reaching a sample size in the millions, and the PEMS07 dataset includes 883 variables. If the reviewer can suggest other larger datasets, we are also willing to test on them. Pathformer also demonstrates superior predictive performance on these larger datasets compared with state-of-the-art methods such as PatchTST, DLinear, and Scaleformer. We add experiments on these larger datasets in Section A.6 of the revised supplementary.

| Dataset | Horizon | Pathformer (MSE / MAE) | PatchTST (MSE / MAE) | DLinear (MSE / MAE) | Scaleformer (MSE / MAE) |
| --- | --- | --- | --- | --- | --- |
| PEMS07 | 96 | 0.135 / 0.243 | 0.146 / 0.259 | 0.564 / 0.536 | 0.152 / 0.268 |
| PEMS07 | 192 | 0.177 / 0.271 | 0.185 / 0.286 | 0.596 / 0.555 | 0.195 / 0.302 |
| PEMS07 | 336 | 0.188 / 0.278 | 0.205 / 0.289 | 0.475 / 0.482 | 0.276 / 0.394 |
| PEMS07 | 720 | 0.208 / 0.296 | 0.235 / 0.325 | 0.543 / 0.523 | 0.305 / 0.410 |
| Wind Power | 96 | 0.062 / 0.146 | 0.070 / 0.158 | 0.078 / 0.184 | 0.089 / 0.167 |
| Wind Power | 192 | 0.123 / 0.214 | 0.131 / 0.237 | 0.133 / 0.252 | 0.163 / 0.246 |
| Wind Power | 336 | 0.200 / 0.283 | 0.215 / 0.307 | 0.205 / 0.325 | 0.225 / 0.352 |
| Wind Power | 720 | 0.388 / 0.414 | 0.404 / 0.429 | 0.407 / 0.457 | 0.414 / 0.426 |
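For reference, the MSE and MAE reported in this and the following tables are the standard forecasting metrics. A minimal Python sketch of how they are computed (illustrative only, not tied to the authors' evaluation code):

```python
import numpy as np

def mse_mae(pred: np.ndarray, true: np.ndarray) -> tuple[float, float]:
    """Mean squared error and mean absolute error, as reported in the tables."""
    err = pred - true
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))

pred = np.array([0.10, 0.40, 0.35])
true = np.array([0.00, 0.50, 0.30])
print(mse_mae(pred, true))  # (0.0075, 0.0833...)
```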

Q2: Compare with Scaleformer.

A2: We mentioned the method of Scaleformer in the related work of the original submission. In the revision, we add Scaleformer results in Table 1 of the revised paper as an important baseline of multi-scale models. We also provide a more detailed comparison between Scaleformer and our model in the related work of the revised paper and Section A.5.3 of the revised supplementary, as follows:

Scaleformer also utilizes the modeling of multi-scale features for time series forecasting. It differs from our proposed Pathformer in the following aspects:

  • Scaleformer employs fixed sampling rates, while Pathformer can adaptively perform multi-scale modeling based on the differences between time series samples (a minimal sketch of such sample-adaptive routing follows this list).
  • Scaleformer obtains multi-scale features with different temporal resolutions through downsampling. In contrast, Pathformer not only considers time series features of different resolutions but also models from the perspective of temporal distance, taking into account global correlations and local details. This provides a more comprehensive approach to multi-scale modeling through both temporal resolutions and temporal distances.
  • Scaleformer requires the allocation of a predictive model at different temporal resolutions, resulting in higher model complexity than Pathformer.
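To make the contrast with fixed sampling rates concrete, here is a minimal PyTorch-style sketch of sample-adaptive patch-size routing. All names and the scoring function are hypothetical simplifications under our reading of the idea, not Pathformer's actual implementation:

```python
import torch
import torch.nn as nn

class PatchSizeRouter(nn.Module):
    """Hypothetical sketch: score candidate patch sizes per input sample and
    keep the top-k pathways, instead of one fixed sampling rate for all data."""

    def __init__(self, d_model: int, patch_sizes=(8, 16, 32), k: int = 2):
        super().__init__()
        self.patch_sizes = patch_sizes
        self.k = k
        self.score = nn.Linear(d_model, len(patch_sizes))

    def forward(self, x: torch.Tensor):
        # x: (batch, length, d_model); summarize each sample by mean pooling
        weights = torch.softmax(self.score(x.mean(dim=1)), dim=-1)
        topk = weights.topk(self.k, dim=-1)  # per-sample choice of pathways
        return topk.indices, topk.values

router = PatchSizeRouter(d_model=64)
idx, w = router(torch.randn(4, 96, 64))
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```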
Comment

Dear Reviewer LNPN,

We would like to sincerely thank you for your time and efforts in reviewing our paper.

We have made an extensive effort to address your concerns by conducting experiments on larger datasets with more timestamps and variables, comparing our proposed Pathformer with Scaleformer, and revising the paper and appendix accordingly.

We hope our response can effectively address your concerns. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond promptly.

All the best,

Authors

Comment

Dear Reviewer LNPN,

Since the end of the author/reviewer discussion period is just one day away, may we know whether our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score. Should you have any further advice on the paper and/or our rebuttal, please let us know and we will be more than happy to engage in further discussion and paper improvements.

Thank you so much for devoting time to improving our work!

Review (Rating: 6)

This paper presents a new multi-scale Transformer architecture for long-range time series modeling. The Discrete Fourier Transform (DFT) is utilized to determine the patch sizes so as to divide the input time series into patches of different sizes, thus enabling cross-scale information fusion. In the multi-scale Transformer block, intra-patch attention and inter-patch attention mechanisms are utilized to perform attention operations, thus enhancing the processing of temporal information. Experiments show the proposed method achieves state-of-the-art performance among existing models and exhibits superior generalization capabilities across different transfer scenarios.
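As a rough illustration of the DFT-based patch-size selection described above, the sketch below picks candidate patch sizes from the dominant periods of a series. This is a minimal reading of the idea, with a hypothetical function name and selection rule, not the paper's exact procedure:

```python
import numpy as np

def suggest_patch_sizes(x: np.ndarray, k: int = 3) -> list[int]:
    """Hypothetical helper: take the k largest-amplitude DFT frequencies and
    convert each to its period, which serves as a candidate patch size."""
    amps = np.abs(np.fft.rfft(x - x.mean()))
    amps[0] = 0.0                        # ignore the DC component
    top = np.argsort(amps)[-k:]          # indices of the k strongest frequencies
    periods = [len(x) // int(f) for f in top if f > 0]
    return sorted(set(periods))

# A daily-periodic toy series sampled hourly: period 24 should dominate.
series = np.sin(np.arange(960) * 2 * np.pi / 24) + 0.1 * np.random.randn(960)
print(suggest_patch_sizes(series))
```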

Strengths

  1. The paper is well written and well-motivated.

  2. The Multi-Scale Router combines the advantages of patch division and seasonality decomposition.

  3. Dual attention helps to harmonize operations between intra-patch and inter-patch components, allowing the Transformer to efficiently process time series data.

Weaknesses

Time-series Dense Encoder (TiDE) is also a popular long-term time series forecasting baseline, but the paper does not include experiments comparing the proposed method to TiDE.

Questions

Why is there a gap between the benchmark results reported in this paper and those in the original papers?

Comment

We would like to sincerely thank Reviewer Sqch for acknowledging our technical contributions, providing the recent advanced method to compare with, and the suggestion for a more complete benchmark. We have revised our paper accordingly.

Q1: Compare with Time-series Dense Encoder (TiDE).

A1: The results of TiDE presented in its original paper are the best ones selected from different input sequence lengths (48, 96, 192, 336, 720). Considering the time constraints and fairness, we conduct the comparison under a fixed input sequence length of 96 (a sliding-window sketch of this fixed-input-length setup follows the table). We show results on some datasets here; the complete results on other datasets and methods are available in Table 1 of the revised paper.

| Dataset | Horizon | Pathformer (MSE / MAE) | TiDE (MSE / MAE) |
| --- | --- | --- | --- |
| ETTm2 | 96 | 0.170 / 0.248 | 0.182 / 0.264 |
| ETTm2 | 192 | 0.238 / 0.295 | 0.256 / 0.323 |
| ETTm2 | 336 | 0.293 / 0.331 | 0.313 / 0.354 |
| ETTm2 | 720 | 0.390 / 0.389 | 0.419 / 0.410 |
| Electricity | 96 | 0.145 / 0.236 | 0.194 / 0.277 |
| Electricity | 192 | 0.167 / 0.258 | 0.193 / 0.280 |
| Electricity | 336 | 0.186 / 0.275 | 0.206 / 0.296 |
| Electricity | 720 | 0.231 / 0.309 | 0.242 / 0.328 |
| Weather | 96 | 0.156 / 0.192 | 0.202 / 0.261 |
| Weather | 192 | 0.206 / 0.240 | 0.242 / 0.298 |
| Weather | 336 | 0.254 / 0.282 | 0.287 / 0.335 |
| Weather | 720 | 0.340 / 0.336 | 0.351 / 0.386 |
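For clarity, the fixed-input-length protocol above corresponds to standard sliding-window sample construction. A minimal sketch, with a hypothetical function name and details that stand in for the actual data pipeline:

```python
import numpy as np

def make_windows(series: np.ndarray, input_len: int = 96, horizon: int = 192):
    """Build sliding-window (input, target) pairs: a fixed input length
    (here 96, as in the comparison above) and a prediction horizon."""
    xs, ys = [], []
    for t in range(len(series) - input_len - horizon + 1):
        xs.append(series[t : t + input_len])
        ys.append(series[t + input_len : t + input_len + horizon])
    return np.stack(xs), np.stack(ys)

x, y = make_windows(np.random.randn(1000), input_len=96, horizon=192)
print(x.shape, y.shape)  # (713, 96) (713, 192)
```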

Q2: More datasets in the benchmark.

A2: We add the ILI and Traffic datasets to the benchmark commonly used by previous papers such as PatchTST. The complete results of other compared methods can be found in Table 1 of the revised paper.

| Dataset | Horizon | Pathformer (MSE / MAE) | PatchTST (MSE / MAE) |
| --- | --- | --- | --- |
| ILI | 24 | 1.587 / 0.758 | 1.724 / 0.843 |
| ILI | 36 | 1.429 / 0.711 | 1.536 / 0.752 |
| ILI | 48 | 1.505 / 0.742 | 1.821 / 0.832 |
| ILI | 60 | 1.731 / 0.799 | 1.923 / 0.842 |
| Traffic | 96 | 0.479 / 0.283 | 0.492 / 0.324 |
| Traffic | 192 | 0.484 / 0.292 | 0.487 / 0.303 |
| Traffic | 336 | 0.503 / 0.299 | 0.505 / 0.317 |
| Traffic | 720 | 0.537 / 0.322 | 0.542 / 0.337 |

We also evaluate on the Exchange_rate dataset, which is included in the benchmark of the Autoformer paper. Our proposed Pathformer also outperforms Autoformer on this dataset.

| Dataset | Horizon | Pathformer (MSE / MAE) | Autoformer (MSE / MAE) |
| --- | --- | --- | --- |
| Exchange | 96 | 0.084 / 0.203 | 0.197 / 0.323 |
| Exchange | 192 | 0.178 / 0.300 | 0.300 / 0.369 |
| Exchange | 336 | 0.346 / 0.425 | 0.509 / 0.524 |
| Exchange | 720 | 0.889 / 0.704 | 1.447 / 0.941 |
Comment

Dear Reviewer Sqch,

We would like to sincerely thank you for your time and efforts in reviewing our paper.

We have made a significant effort to address your concerns by including experiments comparing the proposed Pathformer to TiDE, adding the ILI, Traffic, and Exchange datasets to the benchmark, and revising the paper and appendix accordingly.

We hope our response can address your concerns. If you have any further concerns or questions, please do not hesitate to let us know, and we will respond promptly.

All the best,

Authors

Review (Rating: 8)

The paper proposes a variation of the PatchTST architecture in the context of long-horizon time series forecasting.

Strengths

  • An interesting study that makes an incremental step towards making transformers effective at the long-horizon forecasting task
  • The paper is clearly written and the topic is important for the ICLR audience

Weaknesses

  • "Recent advances for time series forecasting are mainly based on Transformer architectures". I would say this statement is not aligned with the most recent empirical results and contradicts existing facts. Having read the following papers one could argue that the transformer based models in time series forecasting have been basically a disaster in the recent years, mainly because authors of papers based on transformer-driven models disregarded including some basic baselines in their studies. Please rewrite intro and related work accordingly.
    • Challu et al. N-HiTS: Neural hierarchical interpolation for time series forecasting. AAAI'23
    • Zeng et al. Are transformers effective for time series forecasting? AAAI'23
    • Li et al. Do Simpler Statistical Methods Perform Better in Multivariate Long Sequence Time-Series Forecasting? CIKM'23
  • Not all datasets are present in the study. Please include additional results on ILI and Traffic from PatchTST
  • Please include results from Zeng et al. Are transformers effective for time series forecasting? AAAI'23 in your table and you will see that your results are not state of the art. This makes the results unconvincing, because basically a very complex model is not able to use the same inputs as very simple models in an effective way. Transformer-based papers have consistently failed to include appropriate baselines in their studies, creating a large gap in methodology and undermining the ultimate reliability of these studies. The work can be interesting if the authors show that, with the proposed modifications, a transformer-based model can be more effective than the much simpler and faster models presented in Zeng et al. and Li et al.
  • The model seems to borrow conceptually very heavily from the PatchTST model without explicitly recognizing the source of inspiration. Without a detailed explanation of the actual difference between the two architectures the proposed architecture appears to be a minor perturbation of the original PatchTST.

Questions

  • When talking about multi-scale processing in related work, please discuss the relation to Challu et al. N-HiTS: Neural hierarchical interpolation for time series forecasting, AAAI'23, which seems to be relevant work on multi-scale modelling.
Comment

We also conduct experiments to compare the generalization capabilities of Pathformer and DLinear, where we train the model on one dataset and test the performance on other datasets. The specific experimental setting is the same as in Section 4.2 of the paper.

The results of the generalization on other datasets:

| Dataset | Horizon | Pathformer (MSE / MAE) | DLinear (MSE / MAE) |
| --- | --- | --- | --- |
| ETTh2 | 96 | 0.340 / 0.369 | 0.370 / 0.398 |
| ETTh2 | 192 | 0.411 / 0.406 | 0.502 / 0.498 |
| ETTh2 | 336 | 0.384 / 0.401 | 0.563 / 0.531 |
| ETTh2 | 720 | 0.450 / 0.448 | 0.723 / 0.605 |
| ETTm2 | 96 | 0.220 / 0.294 | 0.272 / 0.325 |
| ETTm2 | 192 | 0.258 / 0.306 | 0.352 / 0.398 |
| ETTm2 | 336 | 0.325 / 0.350 | 0.425 / 0.478 |
| ETTm2 | 720 | 0.422 / 0.408 | 0.553 / 0.517 |
| Cluster-A | 24 | 0.121 / 0.223 | 0.342 / 0.418 |
| Cluster-A | 48 | 0.186 / 0.281 | 0.389 / 0.468 |
| Cluster-A | 96 | 0.249 / 0.334 | 0.392 / 0.473 |
| Cluster-A | 192 | 0.372 / 0.416 | 0.523 / 0.616 |
| Cluster-B | 24 | 0.140 / 0.243 | 0.201 / 0.342 |
| Cluster-B | 48 | 0.202 / 0.298 | 0.256 / 0.387 |
| Cluster-B | 96 | 0.296 / 0.357 | 0.389 / 0.476 |
| Cluster-B | 192 | 0.464 / 0.468 | 0.628 / 0.635 |
| Cluster-C | 24 | 0.069 / 0.173 | 0.145 / 0.242 |
| Cluster-C | 48 | 0.144 / 0.254 | 0.267 / 0.387 |
| Cluster-C | 96 | 0.174 / 0.284 | 0.411 / 0.512 |
| Cluster-C | 192 | 0.327 / 0.386 | 0.522 / 0.532 |

In the generalization experiments, Pathformer still outperforms DLinear. This indicates that a relatively complex Transformer architecture may have better generalization capabilities than a simple linear model.

Q4: Compare with PatchTST

A4: We mention the source of inspiration for patching in Section 3 of our revised paper. We also want to clarify that Pathformer extends patch division to realize adaptive multi-scale modeling, which is a novel design. It is not a perturbation of PatchTST: both how the series is divided and how the correlations are modeled are designed differently. The main differences from PatchTST are as follows:

  • Adaptive Multi-scale Modeling: PatchTST employs a fixed patch size for all data, hindering the grasp of critical patterns in different time series. We are the first to propose adaptive pathways that dynamically select varying patch sizes tailored to the dynamic features of individual samples, enabling adaptive multi-scale modeling.

  • Partitioning with Multiple Patch Sizes: PatchTST employs a single patch size to partition time series, obtaining features with a singular resolution. Pathformer utilizes multiple different patch sizes for partitioning, which captures multi-scale features from various temporal resolutions.

  • Global correlations between patches and local details in each patch: PatchTST performs attention between divided patches, overlooking the internal details in each patch. Pathformer not only considers the correlations between patches but also the detailed information within each patch. It introduces dual attention (inter-patch attention and intra-patch attention) to integrate global correlations and local details, capturing multi-scale features from various temporal distances (a minimal sketch of this dual attention appears after this discussion).

We also add the above discussion in Section A.5.1 of the revised supplementary.
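The dual-attention idea above can be sketched as follows. This is a simplified, hypothetical rendering (mean-pooled patch summaries and additive fusion stand in for the actual design), not Pathformer's real implementation:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Illustrative sketch: attention across patch summaries captures global
    correlations, while attention within each patch captures local details."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.intra = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patches: torch.Tensor):
        # patches: (batch, n_patches, patch_len, d_model)
        b, n, p, d = patches.shape
        summaries = patches.mean(dim=2)            # (b, n, d): one token per patch
        global_ctx, _ = self.inter(summaries, summaries, summaries)
        flat = patches.reshape(b * n, p, d)        # attend within each patch
        local, _ = self.intra(flat, flat, flat)
        local = local.reshape(b, n, p, d)
        return local + global_ctx.unsqueeze(2)     # fuse local and global views

out = DualAttention(d_model=32)(torch.randn(2, 6, 8, 32))
print(out.shape)  # torch.Size([2, 6, 8, 32])
```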

Q5: Compare with NHITS

A5: NHITS also models multi-scale features for time series forecasting, and Pathformer differs from it in the following aspects:

  • NHITS models time series features of different resolutions through multi-rate data sampling and hierarchical interpolation (fixed-rate sampling is sketched after this discussion). Pathformer not only takes into account time series features of different resolutions but also approaches multi-scale modeling from the perspective of temporal distance. Simultaneously considering temporal resolutions and temporal distances enables a more comprehensive approach to multi-scale modeling.
  • NHITS employs fixed sampling rates for multi-rate data sampling, lacking the ability to adaptively perform multi-scale modeling based on differences in time series samples. In contrast, Pathformer has the capability for adaptive multi-scale modeling.
  • NHITS adopts a linear structure to build its model framework, whereas Pathformer enables multi-scale modeling in a Transformer architecture.

We add the above discussion in Section A.5.2 of the revised supplementary. We also compare our performance with NHITS in Table 9 of the revised supplementary.
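For contrast, fixed-rate multi-scale views of the kind NHITS-style models build can be sketched in a few lines. Strided subsampling here stands in for NHITS's pooling-based multi-rate sampling, so treat this as an assumption-laden illustration:

```python
import numpy as np

def multi_rate_views(x: np.ndarray, rates=(1, 2, 4)) -> list[np.ndarray]:
    """Fixed sampling rates: each rate r yields a coarser view of the series.
    Unlike adaptive pathways, the rates do not depend on the input sample."""
    return [x[::r] for r in rates]

x = np.arange(16, dtype=float)
for view in multi_rate_views(x):
    print(len(view), view[:4])
```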

Comment

We would like to sincerely thank Reviewer 5cD6 for providing a detailed review and insightful comments regarding important basic baselines, more datasets, and more detailed comparisons with existing methods. We have revised our paper accordingly.

Q1: More discussion on some basic baselines.

A1: Thanks a lot for raising this valuable comment. We agree with the reviewer that we should also include the basic baselines proposed in the comment, which may put our paper in a more accurate context, give readers a clearer view of the potential and challenges of current Transformer methods, and make our contributions more convincing. We have made the following revisions according to your comment:

  • We add these important basic baselines in the Introduction or Related Work of the revised paper, and rewrite these two parts accordingly.

  • We compare the performance of our method with these baselines, such as DLinear, NLinear and NHITS, to make our empirical improvements more convincing. The detailed experimental results are in the following parts.

Q2: Include additional results on ILI and Traffic from PatchTST

A2: We add the ILI and Traffic datasets in Table 1 of the revised paper. Here, we show some models with superior performance: PatchTST, DLinear, FEDformer, and Autoformer (bold indicates the best). Results of other compared methods are also included in the revised paper.

| Dataset | Horizon | Pathformer (MSE / MAE) | PatchTST (MSE / MAE) | DLinear (MSE / MAE) | FEDformer (MSE / MAE) | Autoformer (MSE / MAE) |
| --- | --- | --- | --- | --- | --- | --- |
| ILI | 24 | **1.587 / 0.758** | 1.724 / 0.843 | 2.573 / 1.073 | 2.624 / 1.095 | 2.906 / 1.182 |
| ILI | 36 | **1.429 / 0.711** | 1.536 / 0.752 | 2.673 / 1.085 | 2.516 / 1.021 | 2.585 / 1.038 |
| ILI | 48 | **1.505 / 0.742** | 1.821 / 0.832 | 2.773 / 1.126 | 2.505 / 1.041 | 3.024 / 1.145 |
| ILI | 60 | **1.731 / 0.799** | 1.923 / 0.842 | 2.827 / 1.152 | 2.742 / 1.122 | 2.761 / 1.114 |
| Traffic | 96 | **0.479 / 0.283** | 0.492 / 0.324 | 0.648 / 0.396 | 0.576 / 0.359 | 0.597 / 0.371 |
| Traffic | 192 | **0.484 / 0.292** | 0.487 / 0.303 | 0.613 / 0.614 | 0.610 / 0.380 | 0.607 / 0.382 |
| Traffic | 336 | **0.503 / 0.299** | 0.505 / 0.317 | 0.614 / 0.383 | 0.608 / 0.375 | 0.623 / 0.387 |
| Traffic | 720 | **0.537 / 0.322** | 0.542 / 0.337 | 0.655 / 0.405 | 0.621 / 0.375 | 0.639 / 0.395 |
Comment

Q3: Include and compare with the results from Zeng et al.

A3: Some results presented in the paper of Zeng et al. (DLinear) are better than ours in Table 1 because DLinear uses a larger input length (H=336) than ours (H=96). To ensure a fair comparison, we conduct separate experiments for input sequence lengths (H) of 96 and 336. Here, we show results for some models and datasets, with the complete results of other models and datasets available in Table 1 of the revised paper and Table 9 of the supplementary.

The results for the input sequence length H=96:

| Dataset | Horizon | Pathformer (MSE / MAE) | DLinear (MSE / MAE) | NLinear (MSE / MAE) |
| --- | --- | --- | --- | --- |
| ETTm1 | 96 | 0.316 / 0.346 | 0.342 / 0.370 | 0.339 / 0.369 |
| ETTm1 | 192 | 0.366 / 0.370 | 0.383 / 0.394 | 0.379 / 0.386 |
| ETTm1 | 336 | 0.386 / 0.394 | 0.413 / 0.414 | 0.411 / 0.407 |
| ETTm1 | 720 | 0.460 / 0.432 | 0.472 / 0.452 | 0.478 / 0.442 |
| Weather | 96 | 0.156 / 0.192 | 0.195 / 0.253 | 0.168 / 0.208 |
| Weather | 192 | 0.206 / 0.240 | 0.239 / 0.299 | 0.217 / 0.255 |
| Weather | 336 | 0.254 / 0.282 | 0.282 / 0.333 | 0.267 / 0.292 |
| Weather | 720 | 0.340 / 0.336 | 0.352 / 0.390 | 0.351 / 0.346 |
| ILI | 24 | 1.587 / 0.758 | 2.573 / 1.073 | 2.725 / 1.069 |
| ILI | 36 | 1.429 / 0.711 | 2.673 / 1.085 | 2.530 / 1.032 |
| ILI | 48 | 1.505 / 0.742 | 2.773 / 1.126 | 2.510 / 1.031 |
| ILI | 60 | 1.731 / 0.799 | 2.827 / 1.152 | 2.492 / 1.026 |

The results for the input sequence length H=336:

| Dataset | Horizon | Pathformer (MSE / MAE) | DLinear (MSE / MAE) | NLinear (MSE / MAE) |
| --- | --- | --- | --- | --- |
| ETTm1 | 96 | 0.285 / 0.336 | 0.299 / 0.353 | 0.306 / 0.348 |
| ETTm1 | 192 | 0.331 / 0.361 | 0.335 / 0.365 | 0.349 / 0.375 |
| ETTm1 | 336 | 0.362 / 0.382 | 0.369 / 0.386 | 0.375 / 0.388 |
| ETTm1 | 720 | 0.412 / 0.414 | 0.425 / 0.421 | 0.433 / 0.422 |
| Weather | 96 | 0.144 / 0.184 | 0.176 / 0.237 | 0.182 / 0.232 |
| Weather | 192 | 0.191 / 0.229 | 0.220 / 0.282 | 0.225 / 0.269 |
| Weather | 336 | 0.234 / 0.268 | 0.265 / 0.319 | 0.271 / 0.301 |
| Weather | 720 | 0.316 / 0.323 | 0.323 / 0.362 | 0.338 / 0.348 |
| ILI | 24 | 1.411 / 0.705 | 2.215 / 1.081 | 1.683 / 0.868 |
| ILI | 36 | 1.365 / 0.727 | 1.963 / 0.963 | 1.703 / 0.859 |
| ILI | 48 | 1.537 / 0.764 | 2.130 / 1.024 | 1.719 / 0.884 |
| ILI | 60 | 1.418 / 0.772 | 2.368 / 1.096 | 1.819 / 0.917 |

The results above reveal that our proposed Pathformer outperforms DLinear with both input sequence lengths of 96 and 336. Zeng et al. point out that previous Transformers cannot extract temporal relations well from longer input sequences, but our proposed Pathformer performs better with a longer input length, indicating that adaptive multi-scale modeling can be an effective way to enhance the relation-extraction ability of Transformers.

Comment

Dear Reviewer 5cD6,

We would like to express our sincere gratitude for your time and efforts in reviewing our paper.

We have made an extensive effort to address your concerns. In our response:

  • We provide more discussions with basic models, such as DLinear, NLinear, and rewrite the introduction and related work according to your suggestions.

  • We add ILI and Traffic datasets from PatchTST.

  • We compare our performance with these basic baselines on diverse datasets to make the results of the proposed Pathformer more convincing.

  • We also compare Pathformer with PatchTST and NHITS to show the novelty and effectiveness of Pathformer and make revisions to the paper and appendix accordingly.

We hope our response can address your concerns. If you have any further concerns or questions, please do not hesitate to inform us, and we will be more than happy to address them promptly.

Comment

Dear Reviewer 5cD6,

Since the end of the author/reviewer discussion period is just one day away, may we know whether our response addresses your main concerns? If so, we kindly ask for your reconsideration of the score.

Should you have any further advice on the paper and/or our rebuttal, please let us know and we will be more than happy to engage in more discussion and paper improvements. We would really appreciate it if our next round of communication could leave time for us to resolve any of your remaining or new questions.

Thank you so much for devoting time to improving our paper!

AC Meta-Review

The paper introduces a novel approach called Pathformer, aiming to enhance time series forecasting using Transformer-based models. Traditional methods often struggle to model time series comprehensively across various scales. In response, the proposed Pathformer leverages multi-scale Transformers with adaptive pathways to address this limitation. This paper represents a noteworthy advancement, introducing novelty by proposing multi-scale transformers with adaptive pathways and integrating both temporal resolution and temporal distance for comprehensive multi-scale modeling, ultimately achieving state-of-the-art results in long-horizon forecasting tasks. The paper's weaknesses include the need for a revised introduction reflecting recent empirical results, insufficient dataset inclusion and baseline comparisons, and an unclear demonstration of the proposed modifications' superiority over simpler models. Conceptual similarities with PatchTST should be addressed, and scalability to larger datasets must be explored. Additionally, comparisons with Scaleformer and TiDE were lacking. After the rebuttal, most of these problems have been addressed. Therefore, I recommend accepting this paper.

Why not a higher score

This paper investigates the advantages of multi-scale features for LSTF (long sequence time-series forecasting) tasks. The concept of multi-scale modeling in LSTF has been explored in prior works, so the novelty of this work is not particularly substantial.

Why not a lower score

Although the application of multi-scale features in the LSTF domain is not novel in this work, the proposed method differs significantly from previous approaches. With well-organized experiments and clear results, consensus among all reviewers has been reached; hence, I recommend acceptance.

Final Decision

Accept (poster)