PaperHub
Average rating: 5.8 / 10 (Rejected; 4 reviewers; min 3, max 8, std 1.8)
Individual ratings: 6, 3, 8, 6
Average confidence: 4.0
ICLR 2024

The Power of Minimalism in Long Sequence Time-series Forecasting

Submitted: 2023-09-23; Updated: 2024-02-11

Abstract

Recently, transformer-based models have been widely applied to time series forecasting tasks due to their remarkable capability to capture complex interactions within sequential data. However, as the sequence length expands, Transformer-based models suffer from increased memory consumption, overfitting, and performance deterioration in capturing long-range dependencies. Recently, several studies have shown that MLP-based models can outperform advanced Transformer-based models for long-term time series forecasting (LTSF) tasks. Unfortunately, linear mappings often struggle to capture intricate dependencies when handling multivariate time series. Although modeling each channel independently can alleviate this issue, it will significantly increase the computational cost. To this end, we introduce a set of simple yet effective depthwise convolution models named LTSF-Conv to perform LTSF. Specifically, we apply unique filters to each channel to achieve channel independence, which plays a pivotal role in enhancing overall forecasting performance. Experimental results show that LTSF-Conv models outperform the state-of-the-art Transformer-based and MLP-based models across seven real-world LTSF benchmarks. Surprisingly, a two-layer non-stacked network can outperform the state-of-the-art Transformer model in 91% of cases with a significant reduction of computing resources. In particular, LTSF-Conv models substantially decrease the average number of trainable parameters (by $\sim$ 12$\times$), maximum memory consumption (by $\sim$ 86$\times$), running time (by $\sim$ 18$\times$), and inference time (by $\sim$ 2$\times$) on the Electricity benchmark.
Keywords
Long-term time series forecasting; Transformers; Efficiency

Reviews and Discussion

Review (Rating: 6)

The paper studies the problem of time series forecasting. The paper investigates the performance of a simple model, namely a one-layer convolutional network applied to every feature independently and combined with a linear layer. The paper shows that such a simple approach could improve upon the existing baselines in most of the cases with significantly reduced computational costs.

Strengths

  1. The paper shows that a simple network structure could be very effective in the widely used time-series forecasting datasets. The paper thus provides a valuable baseline that future methods should all consider when dealing with such kinds of tasks.

  2. The paper conducts extensive experiments and studies to make their results convincing.

Weaknesses

  1. As also mentioned in the paper, time series with multiple periodic intervals could not be captured by a single convolutional layer. I think it might make the paper stronger by generating synthetic data with various periodic behaviors and testing various models on it.

  2. Each univariate series is processed independently in the current convolutional network. Is there a specific reason for doing so, apart from efficiency concerns? How would the performance change if we also included the feature dimension in the convolutional filter?

  3. Why just one layer of convolution? How does the performance change if multiple layers are applied? This may help to capture longer periodic patterns or even help with the multi-period issue.

  4. As also mentioned in the paper, the effectiveness of relatively simple models such as DLinear and the convolutional network may largely depend on the nature of the current tasks. For much more complex time series with more features and periodic complexities, such simpler methods may not be as good as the transformer-based models.

  5. How do we determine the kernel size of the convolution, which should be critical for the forecasting task, especially for the cases where we don't know the periodic interval of the data streams?

Questions

Please check the weaknesses part.

Update after the rebuttal: Thanks for the detailed response. It addresses some of my concerns but some remain. E.g., there is no empirical evidence to support the claim of estimating the period. And the solution for more complex tasks is reasonable but not convincing enough. Overall, I think the work could serve as a solid baseline for the time series forecasting tasks, so I keep my score for weak acceptance.

Comment

We sincerely appreciate your detailed comments and positive ranking. We provide point-wise responses to your concerns below.

Q1 : Every univariate is now processed independently in the current convolutional network. Is there a specific reason for doing so except for the efficiency concerns? How would the performance change if we also include the feature dimension in the convolutional filter?

  • Many thanks for your careful reading and suggestion. In LTSF tasks, maintaining channel independence has been observed to improve prediction performance compared to channel mixing, as reported in prior research [2]. Our model uses group-wise (depthwise) convolution to achieve channel independence while simultaneously reducing model complexity; a minimal illustrative sketch of this design is given after this list. In contrast, a standard CNN mixes channels, which suffers from noise disturbance among the channels and reduces performance. We have conducted additional ablation experiments comparing depth-wise CNNs and general CNNs; the results are in Section C.2, Page 13 of the original manuscript. Channel independence helps reduce information confusion: when each channel focuses on capturing its own long-term patterns, the model can distinguish and understand these patterns without the noise that channel mixing might introduce. If we also include the feature dimension in the convolutional filter, the performance degrades.
  • Moreover, we also applied channel independence and channel dependence to the MLP-based models. The results can be found in Table 5 of the attached PDF file. Overall, the accuracy of the CI-based models was higher than that of the CD-based models.
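To make the depthwise design concrete, below is a minimal, hypothetical PyTorch sketch of a channel-independent convolutional forecaster. The class name, layer sizes, and the linear head are illustrative assumptions, not the exact LTSF-Conv implementation.

```python
import torch
import torch.nn as nn

class DepthwiseConvForecaster(nn.Module):
    """Illustrative sketch only: one depthwise Conv1d (groups = #channels) so that
    each channel is filtered independently, followed by a linear head that maps
    the look-back window to the prediction horizon."""

    def __init__(self, num_channels: int, seq_len: int, pred_len: int, kernel_size: int = 25):
        super().__init__()
        # groups=num_channels -> one filter per channel (channel independence)
        self.dw_conv = nn.Conv1d(num_channels, num_channels, kernel_size,
                                 padding=kernel_size // 2, groups=num_channels)
        self.head = nn.Linear(seq_len, pred_len)  # shared mapping over the time axis

    def forward(self, x):                  # x: (batch, seq_len, num_channels)
        x = x.permute(0, 2, 1)             # -> (batch, num_channels, seq_len)
        x = self.dw_conv(x)                # per-channel filtering, length preserved (odd kernel)
        x = self.head(x)                   # -> (batch, num_channels, pred_len)
        return x.permute(0, 2, 1)          # -> (batch, pred_len, num_channels)

# Quick shape check with assumed sizes (7 channels, look-back 512, horizon 96).
model = DepthwiseConvForecaster(num_channels=7, seq_len=512, pred_len=96)
print(model(torch.randn(4, 512, 7)).shape)   # torch.Size([4, 96, 7])
```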

Q2 : Why just one layer of convolution? How does the performance change if multiple layers are applied? This may help to capture longer periodic patterns or even help with the multi-period issue.

  • We are grateful for your constructive comments. We want to emphasize that this manuscript does not aim solely to showcase SOTA results. Simple yet valuable basic units can serve as foundational building blocks, laying the groundwork for scalable and complex networks. We firmly believe that LTSF-Conv models are straightforward yet competitive basic units with great potential for extension into more complex structures. As the reviewer pointed out, our extended experiments show that performance improves further when more complex network structures are applied. In the future, more valuable research could be explored in this new direction.

Q3: As also mentioned in the paper, the effectiveness of relatively simple models such as DLinear and the convolutional network may largely depend on the nature of the current tasks. For much more complex time series with more features and periodic complexities, such simpler methods may not be as good as the transformer-based models.

  • Thank you for pointing this out. When dealing with more complex datasets, increasing the depth or introducing additional hierarchical structures helps the model adapt better to the diversity of the data. The LTSF-Conv basic unit exhibits great potential for such extension into more complex structures. Whether based on MLPs, simple convolutions, or Transformers, each model has contributed to long-sequence prediction; the development of these models has advanced the LTSF field and provides diverse modeling choices for various tasks.

Q4: How do we determine the kernel size of the convolution, which should be critical for the forecasting task, especially for the cases where we don't know the periodic interval of the data streams?

  • Thank you for pointing this out. In essence, finding the most suitable convolutional kernel size is a dynamic process that combines experimentation with consideration of the specific data. It is important to iterate and evaluate the model's performance to find the optimal kernel size for the forecasting task at hand; a small validation-based selection sketch is given below. In the absence of known periodic intervals, we usually employ a multi-scale analysis strategy, which enables the model to capture features at different time scales without relying on prior knowledge of periodicity. We can also use cross-validation to assess the performance of different kernel sizes.
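As a purely illustrative sketch of this validation-driven selection (reusing the hypothetical DepthwiseConvForecaster class from the earlier sketch; the candidate sizes and training budget are assumptions, not the paper's protocol):

```python
import torch
import torch.nn as nn

def select_kernel_size(train_x, train_y, val_x, val_y,
                       candidates=(9, 25, 49), epochs=20, lr=5e-3):
    """Pick the kernel size with the lowest validation MSE (odd sizes keep the
    sequence length unchanged with symmetric padding)."""
    best_k, best_mse = None, float("inf")
    loss_fn = nn.MSELoss()
    for k in candidates:
        model = DepthwiseConvForecaster(train_x.shape[-1], train_x.shape[1],
                                        train_y.shape[1], kernel_size=k)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(train_x), train_y)
            loss.backward()
            opt.step()
        with torch.no_grad():
            mse = loss_fn(model(val_x), val_y).item()
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k
```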
Comment

Q5: As also mentioned in the paper, time series with multiple periodic intervals could not be captured by a single convolutional layer. I think it might make the paper stronger by generating synthetic data with various periodic behaviors and testing various models on it. The paper conducts extensive experiments and studies to make their results convincing.

  • Many thanks for your careful reading and suggestion. We are sorry for the late submission due to the addition of two new datasets and some new baselines. We supplemented additional experiments to further verify the validity of LTSF-Conv models in LTSF tasks.

  • Solar power prediction is a pivotal concern within the realm of renewable energy, with significant influence across diverse domains. We add two supplementary datasets from real-world applications, namely the Solar-Jinta and Solar-Alabama benchmarks; we believe public datasets are more convincing. Meanwhile, we also add several strong baselines on the Solar-Energy datasets. Solar-Jinta records seven key meteorological factors of solar radiation, collected hourly. Solar-Alabama contains the solar power production of 137 PV plants in the USA, with a data granularity of 10 minutes; the solar power of different PV plants is influenced by varying geographical and weather conditions. The datasets are available at https://github.com/laiguokun/multivariate-time-series-data. Table 7 summarizes the prediction results of several baselines on the Solar-Energy datasets. It can be observed that DConv outperforms the other baselines for most horizons by a large margin. Moreover, compared to the other datasets, the Solar-Jinta benchmark is smaller (8,761 samples); the experimental results indicate that our model performs equally well on small datasets, supporting its generalizability and robustness. We firmly believe that LTSF-Conv models are competitive basic units with great potential for extension into more complex network structures.

| Benchmark | Method | MSE 96 | MSE 192 | MSE 336 | MSE 720 | MSE Avg | MAE 96 | MAE 192 | MAE 336 | MAE 720 | MAE Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Solar-Jinta | DConv | 0.480 | 0.521 | 0.539 | 0.648 | 0.547 | 0.481 | 0.496 | 0.522 | 0.573 | 0.518 |
| Solar-Jinta | PatchTST | 0.491 | 0.602 | 0.617 | 0.710 | 0.605 | 0.464 | 0.523 | 0.533 | 0.601 | 0.530 |
| Solar-Jinta | Flowformer | 0.646 | 0.792 | 0.748 | 1.065 | 0.812 | 0.565 | 0.624 | 0.611 | 0.783 | 0.645 |
| Solar-Jinta | Reformer | 0.954 | 0.934 | 0.993 | 0.995 | 0.969 | 0.773 | 0.757 | 0.742 | 0.730 | 0.751 |
| Solar-Jinta | Informer | 0.718 | 0.821 | 0.847 | 1.102 | 0.872 | 0.616 | 0.656 | 0.737 | 0.855 | 0.716 |
| Solar-Jinta | LogTrans | 0.704 | 0.817 | 0.754 | 1.012 | 0.821 | 0.594 | 0.651 | 0.642 | 0.762 | 0.662 |
| Solar-Jinta | DLinear | 0.523 | 0.588 | 0.638 | 0.734 | 0.620 | 0.496 | 0.531 | 0.562 | 0.616 | 0.551 |
| Solar-Energy | DConv | 0.173 | 0.189 | 0.201 | 0.206 | 0.192 | 0.228 | 0.237 | 0.244 | 0.245 | 0.239 |
| Solar-Energy | PatchTST | 0.178 | 0.192 | 0.203 | 0.218 | 0.197 | 0.248 | 0.265 | 0.267 | 0.274 | 0.263 |
| Solar-Energy | Flowformer | 0.190 | 0.267 | 0.279 | 0.243 | 0.245 | 0.229 | 0.263 | 0.268 | 0.252 | 0.253 |
| Solar-Energy | Reformer | 0.195 | 0.223 | 0.253 | 0.281 | 0.238 | 0.234 | 0.252 | 0.286 | 0.286 | 0.265 |
| Solar-Energy | Informer | 0.195 | 0.220 | 0.259 | 0.246 | 0.230 | 0.241 | 0.241 | 0.274 | 0.266 | 0.256 |
| Solar-Energy | LogTrans | 0.219 | 0.217 | 0.224 | 0.241 | 0.225 | 0.240 | 0.244 | 0.258 | 0.269 | 0.252 |
| Solar-Energy | DLinear | 0.221 | 0.231 | 0.247 | 0.255 | 0.239 | 0.294 | 0.301 | 0.317 | 0.314 | 0.307 |

Table 7: Multivariate long-term time series forecasting results on Solar benchmarks.

We look forward to your further feedback.

Review (Rating: 3)

This paper aims to address the challenge of long-term time series forecasting (LTSF) and introduces LTSF-Conv models as a solution. The authors discuss the limitations of existing methods, such as Transformer-based models and MLP-based models, and highlight the need for a balance between performance and efficiency. The experiments show that the proposed LTSF-Conv models, based on convolutional neural networks (CNNs), consistently outperform complex Transformer-based models and state-of-the-art MLP-based models, while maintaining efficiency. The paper provides some insights into input window sizes, encoder-decoder structures, and handling time series with multiple periods among channels.

Strengths

This paper has the following advantages:

  1. Novel solution: The paper introduces the LTSF-Conv model as a new approach to address long-term time series forecasting. By utilizing convolutional neural networks (CNNs), the model outperforms complex Transformer-based and MLP-based models in most cases while maintaining efficiency.

  2. Empirical research and extensive experiments: The authors conduct comprehensive experimental evaluations on multiple real-world datasets across various domains such as weather, traffic, and electricity. The results consistently demonstrate that LTSF-Conv models outperform other complex models in terms of average performance. The paper provides concrete performance comparison data to support their findings.

  3. Analysis and discussion of existing models' limitations: The paper thoroughly analyzes the limitations of existing Transformer-based and MLP-based solutions, particularly in handling long-term time series and multi-channel data. This analysis helps to understand their constraints and guides future research.

  4. Efficiency and reduced computational resources: Compared to complex models, LTSF-Conv models achieve high performance while significantly reducing computational resource requirements. This is particularly valuable in practical applications with limited computing resources, enhancing the model's practical usability and scalability.

  5. Insights for other aspects in the field: The paper also explores issues related to input window sizes, encoder-decoder structures, and handling multi-channel time series, providing valuable insights for future research in the LTSF domain.

Weaknesses

This paper has the following shortcomings:

  1. Lack of innovation:

    • The model used in the paper consists of only two layers of convolutional networks, along with a decomposition of trend and periodic components. The loss function used is the classical MSE loss. There is a lack of innovation in the model design.
    • The innovation mainly lies in explaining the good performance of the simple convolutional model. However, the paper only provides a simple "proof" that convolutional kernels larger than a certain duration can capture periodic information shorter than that duration. The subsequent heatmaps only qualitatively observe that the model captures some periodic information, without explaining why the simple convolutional model is competitive.
    • Obvious conclusions, such as the smaller memory footprint, shorter training and inference time, and fewer parameters of the simple model, are extensively analyzed and explained in the paper.
  2. Experimental limitations:

    • The paper lacks several important baseline models based on CNN architectures, such as SCINet and TimesNet, in the Conv-based model category (this is particularly severe, as all models in this category are proposed in this paper).
    • MLP-based models lack models like N-Hits, and there is also a lack of references to the aforementioned models.
    • The discussion of "model performance with respect to lookback" in Section C.1 lacks the inclusion of DConv. Considering Table 1 and Table 2, which compare the model results, the "best" model used in the tables is actually the model with a lookback of 1600, which naturally performs better than other baseline models with smaller lookback values that have not reached their optimal states. Moreover, even the "best" model is outperformed by PatchTST, which does not have data with lookback values of 720, 1000, and 1600.
    • The performance of the proposed models in complex datasets like Traffic is poor.
  3. Presentation issues:

    • The quality of the figures illustrating the model is low and overly simplified.
    • There are formatting issues with the caption of Figure 6.

In summary, the identified shortcomings of the paper include a lack of innovation in the model design, experimental limitations in terms of missing baseline models and dataset performance, and presentation issues with figures and captions.

Questions

What I'm concerned about is listed in the weaknesses. I am willing to raise my score if the authors can address my concerns.

Comment

Many thanks for your careful reading and suggestions. We appreciate that you acknowledged several strengths of our work.

Meanwhile, we would like to provide some insights into the contribution of this paper. When the DLinear model [1] was initially introduced, its simple linear structure sparked considerable controversy. However, it ultimately won the Best Paper Award, challenging complex Transformer models in the LTSF field. Subsequently, more works have been dedicated to refining and advancing linear models, including architectural modifications and innovative training methodologies. We want to emphasize that our manuscript does not aim solely to showcase SOTA results. Our model structure is as simple as DLinear's, while extensive experiments have consistently demonstrated that depth-wise convolutional units possess inherent advantages over linear units (such as DLinear and RLinear) in LTSF. Linear mappings often struggle to capture intricate dependencies when handling data with multiple periods among channels; although modeling each channel independently can alleviate this issue, it significantly increases the computational cost. More experimental analyses are described in Section 7, Page 9. Based on our basic unit, more complex networks can be proposed to further improve prediction performance on such tasks. In the future, more valuable research could be explored in this new direction.

We would like to address your concerns point-by-point below:

Q1 : The paper lacks several important baseline models based on CNN architectures, such as SCINet and TimesNet, in the Conv-based model category (this is particularly severe, as all models in this category are proposed in this paper). MLP-based models lack models like N-Hits, and there is also a lack of references to the aforementioned models.

  • Thank you for your comments. Due to space limitations, we compared representative SOTA models from recent years in our original manuscript. Following the reviewer's suggestion, we have added SCINet, TimesNet, MICN, and N-Hits as baseline models. We adopt their official codes and only change the length of the input sequences. The results are summarized in Table 2 of the attached supplementary material, where we conducted experiments within a broader window-size range of {96, 336, 512, 720, 1600} for a fair comparison. The results demonstrate that LTSF-Conv models consistently surpass SCINet, TimesNet, MICN, and N-Hits on the seven LTSF benchmarks. The symbol * denotes results we re-ran after increasing the input length (re-implementation). For N-Hits, the authors adopted an extended step-size hyperparameter search, and we directly adopt their reported results. We will add the related references in the revised paper.

Q2 : Considering Table 1 and Table 2, which compare the model results, the "best" model used in the tables is actually the model with a lookback of 1600, which naturally performs better than other baseline models with smaller lookback values that have not reached their optimal states. Moreover, even the "best" model is outperformed by PatchTST, which does not have data with lookback values of 720, 1000, and 1600.

  • Thank you for pointing this out. Previous research [1] has also shown that Transformer-based baselines do not benefit from a longer look-back window: their performance fluctuates or degrades as the input length increases. We have conducted experiments to analyze this. The reason most Transformer models do not increase the look-back window length is that performance becomes worse. More analysis of the look-back windows is described in Section C.1, Page 12, of the original Appendix.
  • We have confidence in the fairness of our previous comparison. The experimental results include not only Conv-Best and DConv-Best but also Conv and DConv with the default look-back window length of 512, i.e., the same look-back window length as PatchTST.
  • Different from the other Transformer-based baselines, PatchTST can extend the look-back window to 512, but it does not benefit from longer windows. It is important to note that this extension comes at the cost of a substantial increase in the model's computational resource requirements. On our server, PatchTST ran out of GPU memory for look-back window sizes greater than 720.
  • To alleviate your concerns, we switched to a higher-performance server to validate PatchTST with look-back window sizes in {720, 1600}. Since our model only experiments with the extended windows {336, 512, 720, 1600}, we omit the 1000 setting. The results are in Table 3 of the attached supplementary material. As expected, PatchTST did not benefit from longer look-back windows. As the input length increases, the significant memory overhead poses limitations for its practical application.
Comment

Q3 : The discussion of "model performance with respect to lookback" in Section C.1 lacks the inclusion of DConv.

  • Thank you for pointing this out. Because the computational costs of DConv and Conv are similar, we did not add DConv to Figure 5 of the original manuscript. The detailed efficiency comparison is given in Tables 6-8, Pages 16-17, of the original Appendix.

Q4 : There are formatting issues with the caption of Figure 6. The quality of the figures illustrating the model is low and overly simplified.

  • Thank you so much for your careful check. We are sorry that we overlooked the font-size issue on the vertical axis of Figure 6; we have revised it. Following the reviewer's suggestion, we have also modified Section 4.1 of the original manuscript to provide a more detailed explanation of the model.
| Dataset | Horizon | Conv-720 MSE | Conv-720 MAE | DConv-720 MSE | DConv-720 MAE | PatchTST-720 MSE | PatchTST-720 MAE | Conv-1600 MSE | Conv-1600 MAE | DConv-1600 MSE | DConv-1600 MAE | PatchTST-1600 MSE | PatchTST-1600 MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weather | 96 | 0.141 | 0.189 | 0.167 | 0.220 | 0.156 | 0.208 | 0.141 | 0.196 | 0.166 | 0.222 | 0.159 | 0.214 |
| Weather | 192 | 0.183 | 0.231 | 0.212 | 0.259 | 0.197 | 0.245 | 0.183 | 0.239 | 0.209 | 0.260 | 0.202 | 0.253 |
| Weather | 336 | 0.233 | 0.272 | 0.257 | 0.294 | 0.247 | 0.284 | 0.232 | 0.277 | 0.253 | 0.293 | 0.309 | 0.350 |
| Weather | 720 | 0.303 | 0.324 | 0.318 | 0.339 | 0.311 | 0.333 | 0.296 | 0.324 | 0.307 | 0.334 | 0.306 | 0.334 |
| Weather | Avg | 0.215 | 0.254 | 0.239 | 0.278 | 0.228 | 0.268 | 0.213 | 0.259 | 0.234 | 0.277 | 0.244 | 0.288 |
| Electricity | 96 | 0.131 | 0.225 | 0.133 | 0.228 | 0.133 | 0.229 | 0.129 | 0.225 | 0.130 | 0.226 | 0.140 | 0.245 |
| Electricity | 192 | 0.146 | 0.239 | 0.145 | 0.242 | 0.149 | 0.245 | 0.144 | 0.239 | 0.145 | 0.239 | 0.156 | 0.258 |
| Electricity | 336 | 0.161 | 0.257 | 0.162 | 0.258 | 0.168 | 0.266 | 0.159 | 0.255 | 0.161 | 0.256 | 0.173 | 0.277 |
| Electricity | 720 | 0.200 | 0.288 | 0.202 | 0.291 | 0.206 | 0.298 | 0.195 | 0.286 | 0.199 | 0.288 | 0.212 | 0.308 |
| Electricity | Avg | 0.160 | 0.252 | 0.161 | 0.255 | 0.164 | 0.260 | 0.157 | 0.251 | 0.159 | 0.252 | 0.170 | 0.272 |
| ETTm2 | 96 | 0.161 | 0.251 | 0.162 | 0.252 | 0.166 | 0.259 | 0.161 | 0.257 | 0.161 | 0.255 | 0.170 | 0.265 |
| ETTm2 | 192 | 0.218 | 0.291 | 0.216 | 0.290 | 0.221 | 0.297 | 0.213 | 0.296 | 0.214 | 0.292 | 0.226 | 0.305 |
| ETTm2 | 336 | 0.272 | 0.329 | 0.268 | 0.326 | 0.272 | 0.331 | 0.258 | 0.329 | 0.259 | 0.325 | 0.279 | 0.341 |
| ETTm2 | 720 | 0.351 | 0.387 | 0.347 | 0.378 | 0.350 | 0.380 | 0.325 | 0.378 | 0.325 | 0.369 | 0.344 | 0.379 |
| ETTm2 | Avg | 0.251 | 0.315 | 0.248 | 0.312 | 0.252 | 0.317 | 0.239 | 0.315 | 0.240 | 0.310 | 0.255 | 0.323 |
| ETTm1 | 96 | 0.294 | 0.341 | 0.306 | 0.349 | 0.299 | 0.352 | 0.298 | 0.348 | 0.307 | 0.354 | 0.314 | 0.367 |
| ETTm1 | 192 | 0.333 | 0.362 | 0.336 | 0.366 | 0.340 | 0.377 | 0.332 | 0.368 | 0.334 | 0.370 | 0.341 | 0.382 |
| ETTm1 | 336 | 0.363 | 0.382 | 0.365 | 0.384 | 0.376 | 0.398 | 0.356 | 0.384 | 0.356 | 0.384 | 0.369 | 0.399 |
| ETTm1 | 720 | 0.413 | 0.410 | 0.414 | 0.411 | 0.417 | 0.422 | 0.394 | 0.409 | 0.394 | 0.406 | 0.407 | 0.421 |
| ETTm1 | Avg | 0.351 | 0.374 | 0.355 | 0.378 | 0.358 | 0.387 | 0.345 | 0.377 | 0.348 | 0.379 | 0.358 | 0.392 |
| ETTh1 | 96 | 0.377 | 0.404 | 0.375 | 0.398 | 0.379 | 0.411 | 0.391 | 0.416 | 0.387 | 0.410 | 0.380 | 0.413 |
| ETTh1 | 192 | 0.417 | 0.428 | 0.413 | 0.421 | 0.416 | 0.433 | 0.425 | 0.437 | 0.422 | 0.432 | 0.469 | 0.478 |
| ETTh1 | 336 | 0.433 | 0.450 | 0.438 | 0.438 | 0.425 | 0.440 | 0.448 | 0.465 | 0.442 | 0.447 | 0.461 | 0.478 |
| ETTh1 | 720 | 0.481 | 0.485 | 0.453 | 0.466 | 0.448 | 0.469 | 0.464 | 0.477 | 0.456 | 0.470 | 0.495 | 0.497 |
| ETTh1 | Avg | 0.427 | 0.442 | 0.420 | 0.431 | 0.417 | 0.438 | 0.432 | 0.449 | 0.427 | 0.440 | 0.451 | 0.467 |
| ETTh2 | 96 | 0.276 | 0.344 | 0.270 | 0.336 | 0.277 | 0.340 | 0.276 | 0.347 | 0.274 | 0.344 | 0.279 | 0.347 |
| ETTh2 | 192 | 0.336 | 0.389 | 0.331 | 0.376 | 0.341 | 0.382 | 0.332 | 0.396 | 0.333 | 0.386 | 0.340 | 0.390 |
| ETTh2 | 336 | 0.340 | 0.398 | 0.325 | 0.387 | 0.331 | 0.387 | 0.347 | 0.410 | 0.338 | 0.403 | 0.342 | 0.406 |
| ETTh2 | 720 | 0.381 | 0.427 | 0.385 | 0.437 | 0.385 | 0.427 | 0.395 | 0.436 | 0.404 | 0.454 | 0.401 | 0.445 |
| ETTh2 | Avg | 0.333 | 0.390 | 0.328 | 0.384 | 0.334 | 0.384 | 0.338 | 0.397 | 0.337 | 0.397 | 0.341 | 0.397 |

Table 3: The comparison of PatchTST and LTSF-Conv models with look-back window sizes of {720, 1600} on the LTSF benchmarks.

See the supplementary materials for the full results.

Comment

Dear Reviewer,

Greetings! We would like to express our gratitude for your comments during the review process. We have taken great care to address your question points comprehensively and provide responses.

We look forward to your further feedback, and if there are any aspects that you still find unclear or require additional elaboration, we would be more than willing to engage in further discussion. Your feedback is crucial for us to revise the final manuscript.

Review (Rating: 8)

This paper presents an innovative depthwise convolution model to perform long-term time series forecasting. The key idea is to apply unique filters to each channel to achieve channel independence. The experiment results on public benchmark datasets justified the effectiveness of the proposed method.

Strengths

  1. This paper is well written and organized.
  2. The proposed convolution based long-term forecasting technique is well-motivated based on a theoretical insight over the periodicity assumption of the time series.
  3. Applying RevIN on top of the single depthwise convolution operation to handle each channel independently is new (a brief sketch of RevIN is given after this list). Based on that, a simple yet effective family of LTSF-Conv models for long-term forecasting tasks is developed.
  4. The experiment results are comprehensive and quite solid. State-of-the-art Transformer-based methods such as PatchTST and the MLP-based model TiDE are both compared. The proposed convolution-based models significantly outperform the baselines in most cases. In addition, they consume little GPU memory and have fewer trainable parameters.
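For reference, the following is a rough, hypothetical sketch of reversible instance normalization (RevIN) as it is commonly used in LTSF models; details such as the affine parameters may differ from the paper's implementation.

```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Sketch of reversible instance normalization: normalize each series by its
    own per-window statistics, then undo the transform on the model output."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_channels))   # learnable shift

    def normalize(self, x):               # x: (batch, seq_len, channels)
        self.mean = x.mean(dim=1, keepdim=True).detach()
        self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps).detach()
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y):             # y: (batch, pred_len, channels)
        return (y - self.beta) / (self.gamma + self.eps) * self.std + self.mean
```

The forecasting backbone (e.g., the depthwise convolution) would operate on the normalized series, and the output is denormalized channel by channel.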

Weaknesses

  1. Does Conv-LTSF still work for time series that do not exhibit strong periodicity?
  2. I wonder whether explicitly considering the channel dependencies can help further improve the forecasting performance.

Questions

Please see the weaknesses

Comment

We sincerely appreciate your comments and positive ranking. To ensure an unbiased evaluation, we rigorously adhered to the same code structure as the other baseline models, including the same data preprocessing procedures and the data provider module; our contribution was integrated simply by adding our method to the model folder. All experimental results can be reproduced using the provided source code. This simple basic unit serves as a robust starting point, showcasing its potential for handling LTSF tasks.

We provide point-wise responses to your concerns below.

Q1 : Whether the Conv-LTSF still works for time series that does not exhibit strong periodicity?

  • Many thanks for your careful reading and suggestion. Following the reviewer’s suggestion, we compared the proposed model (Conv) with the DLinear and strongest Transformer baseline (PatchTST) on the illness benchmark. The illness benchmark describes the ratio of patients seen with illness and the number of patients from 2002 to 2021. It includes weekly data from the Centers for Disease Control and Prevention of the United States (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html). Table 4 (in the attached PDF file) provides the corresponding forecasting results. It can be found that Conv outperforms DLinear and PatchTST for all horizons.
| Horizon | Conv MSE | Conv MAE | PatchTST MSE | PatchTST MAE | DLinear MSE | DLinear MAE |
|---|---|---|---|---|---|---|
| 24 | 0.451 | 0.552 | 0.637 | 0.552 | 0.725 | 0.681 |
| 36 | 0.508 | 0.574 | 0.765 | 0.634 | 0.792 | 0.744 |
| 48 | 0.597 | 0.636 | 0.756 | 0.692 | 0.886 | 0.815 |
| 60 | 0.675 | 0.680 | 0.776 | 0.741 | 0.960 | 0.859 |
| Avg | 0.558 | 0.611 | 0.734 | 0.655 | 0.841 | 0.775 |

Table 4: Forecasting results on the ILI benchmark, where sl = 104.

Q2 : I wonder whether explicitly considering the channel dependencies can help further improve the forecasting performance.

  • In time series forecasting tasks, one might intuitively expect channel-mixing techniques to be superior, since they allow the model to capture the interdependencies and interactions between different channels: information from the various channels is integrated, facilitating a more comprehensive representation of the complex relationships within the data. However, previous studies [2] have found that channel independence improves performance compared to channel mixing in LTSF.
  • Here CI denotes channel independence and CD denotes channel dependence. Following the reviewer's suggestion, we applied channel independence and channel dependence to the different models. Table 5 (in the attached PDF file) summarizes the results of the different channel strategies on the Weather and ETTm1 datasets. Most of the CI-based models have higher testing accuracy than the CD-based models. For the MLP-based models, the overall accuracy of the CI-based models was approximately 1%–8% higher than that of the CD-based models.
  • For our models, we have conducted additional ablation experiments comparing depth-wise CNN (channel-independent) and general CNN (channel-dependent). The results are in Section C.2, Page 13 of the original manuscript. There too, CI-based models outperform CD-based models.
| Model | Metric | Weather 96 | Weather 192 | Weather 336 | Weather 720 | Weather Avg | ETTm1 96 | ETTm1 192 | ETTm1 336 | ETTm1 720 | ETTm1 Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RLinear_CD | MSE | 0.175 | 0.217 | 0.265 | 0.328 | 0.246 | 0.301 | 0.340 | 0.373 | 0.430 | 0.361 |
| RLinear_CD | MAE | 0.225 | 0.259 | 0.293 | 0.339 | 0.279 | 0.342 | 0.366 | 0.385 | 0.417 | 0.377 |
| RLinear_CI | MSE | 0.146 | 0.189 | 0.241 | 0.313 | 0.222 | 0.289 | 0.332 | 0.368 | 0.426 | 0.353 |
| RLinear_CI | MAE | 0.194 | 0.234 | 0.274 | 0.327 | 0.257 | 0.335 | 0.361 | 0.380 | 0.413 | 0.372 |
| DLinear_CD | MSE | 0.175 | 0.219 | 0.265 | 0.323 | 0.245 | 0.299 | 0.335 | 0.369 | 0.424 | 0.356 |
| DLinear_CD | MAE | 0.237 | 0.282 | 0.318 | 0.361 | 0.299 | 0.343 | 0.365 | 0.386 | 0.420 | 0.378 |
| DLinear_CI | MSE | 0.146 | 0.190 | 0.243 | 0.317 | 0.224 | 0.285 | 0.327 | 0.367 | 0.428 | 0.351 |
| DLinear_CI | MAE | 0.212 | 0.257 | 0.301 | 0.358 | 0.282 | 0.334 | 0.358 | 0.383 | 0.417 | 0.373 |
| NLinear_CD | MSE | 0.181 | 0.225 | 0.270 | 0.339 | 0.253 | 0.305 | 0.348 | 0.375 | 0.433 | 0.365 |
| NLinear_CD | MAE | 0.232 | 0.268 | 0.300 | 0.348 | 0.287 | 0.347 | 0.375 | 0.388 | 0.421 | 0.382 |
| NLinear_CI | MSE | 0.146 | 0.189 | 0.242 | 0.321 | 0.224 | 0.293 | 0.337 | 0.379 | 0.435 | 0.361 |
| NLinear_CI | MAE | 0.196 | 0.238 | 0.280 | 0.335 | 0.262 | 0.341 | 0.367 | 0.390 | 0.422 | 0.380 |

Table 5: Multivariate prediction results on two benchmarks with an input length of 336. CI denotes channel-independence, and CD represents channel-dependence.

Comment

The authors have addressed my concerns.

Review (Rating: 6)

Summary:

The paper aims to address the challenges faced by Transformer-based models in long-term time series forecasting (LTSF) tasks, mainly when dealing with long sequence lengths. The authors propose a new model, LTSF-Conv, which utilizes depthwise convolution models to enhance forecasting performance while significantly reducing computational costs.

Strengths:

From the computational efficiency (memory usage/flops) perspective, this paper reports very promising results on several datasets.

Weaknesses:

From the accuracy perspective, two recent CNN baseline models, TimesNet and MICN [1], are missing from Table 1 and Table 2. Moreover, hyperparameter searching is used, which makes the comparison a bit unfair. For example, in TimesNet and MICN, the model configuration and lookback window remain the same for most of the experiments, and in PatchTST, only two configurations are considered. Based on the current results, it is hard for me to tell whether the performance gain comes from a better model configuration or from the proposed structure.

Moreover, based on my understanding of Section 4.1, the main takeaway message would be that there are two useful structures: depth-wise 1D conv and/or trend/seasonality decomposition. A similar idea (i.e., 1D conv + decomposition) is also mentioned in MICN (e.g., Figure 1 in [1]). One more interesting thing here is the usage of depth-wise CNN instead. As shown in Table 4, depth-wise convolution gives significant performance improvements. I would expect a more in-depth analysis of why it reaches better results than vanilla CNN. Based on the current presentation, it is a little bit difficult for me to understand what inductive bias can only be utilized by depth-wise CNN but not general CNN.

The theoretical analysis is also kind of weak. Theorem 1 and Corollary 1 consider a simple autoregressive state-space structure, and MLP/RNN models can also have the same prediction power: an MLP can be viewed as a CNN with kernel size equal to the sequence length, and RNNs are commonly used to model state-space structures. Theorem 2 considers a sequence with both trend and seasonality. From my understanding, MLP/RNN models may also reach a similar performance guarantee.
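A quick numerical check of the "MLP as a full-width CNN" remark (a hedged, self-contained illustration with assumed sizes, not code from the paper):

```python
import torch
import torch.nn as nn

seq_len, pred_len = 336, 96
linear = nn.Linear(seq_len, pred_len, bias=False)
conv = nn.Conv1d(in_channels=1, out_channels=pred_len, kernel_size=seq_len, bias=False)
conv.weight.data = linear.weight.data.view(pred_len, 1, seq_len)  # same weights, reshaped

x = torch.randn(8, seq_len)                       # a batch of univariate windows
out_linear = linear(x)                            # (8, pred_len)
out_conv = conv(x.unsqueeze(1)).squeeze(-1)       # kernel spans the whole window -> (8, pred_len)
print(torch.allclose(out_linear, out_conv, atol=1e-6))  # True: identical mapping
```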

After reviewing the sample code in the supplementary material, I also have some concerns about the numerical results reported in the paper. When dealing with test samples, the data_provider function sets drop_last = True and shuffle_flag = False. The consequence is that the last several test samples are ignored. Those samples are usually the hardest to predict since they are the farthest from the training set. Moreover, it seems the main results in Table 1 and Table 2 are only run with one fixed random seed (1024). The random control experiment is only reported in Figure 5 in the Appendix.

Questions and Suggestions:

  1. As the title uses the word minimalism, I would conjecture the main advantage of using a simple depth-wise CNN would be its robustness. Time series forecasting usually contains a lot of time-varying noise, especially when using longer inputs. A simpler model has less risk of overfitting that noise, but a potential drawback is that more modeling bias may be introduced due to limited representation power. Therefore, I would expect the analysis in the theoretical part to consider a high-noise system, such as $x(t) = x(t-p) + \epsilon_t$ where $\epsilon_t$ could be on the same order as $x(t)$, and to analyze the generalization ability of depth-wise CNN to show it achieves a better bias-variance trade-off.

  2. Please add TimesNet and MICN as benchmarks in Table 1 and Table 2.

  3. Please fix the dataloader issue in the test part and rerun the relevant experiments. It would be better to also report the random control results in Table 1 and Table 2.

  4. Please provide the detailed experimental configurations for each setting in Table 1 and Table 2 to help the reviewer verify those results.

  5. Could the author elaborate more on the seq_last in ConvNet.py file? It seems not to be discussed in Section 4. Moreover, since Revnorm has been used, the sequence would already be centered, why do we still need to subtract the sequence mean?

Conclusion:

While the paper explores an intriguing concept that simpler models might suffice for certain datasets, the current depth of analysis and the reliability of numerical results do not yet support a strong case for acceptance at a top-tier machine learning conference like ICLR. Despite this, the reviewer is willing to reconsider the decision after the authors' rebuttal.

Reference

[1] Wang, Huiqiang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. "Micn: Multi-scale local and global context modeling for long-term series forecasting." In The Eleventh International Conference on Learning Representations. 2022.

Strengths

Please refer to the Strengths section in Summary.

Weaknesses

Please refer to the Weaknesses section in Summary.

Questions

Please refer to the Questions and Suggestions section in Summary.

Comment

We appreciate that you identified some strengths of our work. Before responding point-by-point, we would like to provide some insights into the contribution of our work. In its initial publication, the DLinear model [1] faced widespread controversy due to its one-layer linear structure. Despite this initial skepticism, the model ultimately achieved success and exerted a significant influence within the LTSF community, and many MLP-based variant models have since emerged. Our model structure is as simple as DLinear's, while extensive experiments have consistently demonstrated that depth-wise convolutional units possess inherent advantages over the linear units of DLinear in LTSF. We firmly believe that LTSF-Conv models serve as straightforward yet competitive basic units, exhibiting great potential for further extension into complex network structures.

Q1 : From the accuracy perspective, two recent CNN baseline models, TimesNet and MICN [1] are missing in Table 1 and Table 2. Moreover, the hyperparameter-searching is used, which makes the comparison a little bit unfair. For example, in TimesNet and MICN, the model configurations and lookback window remain the same for most of the experiments, and in PatchTST, only two configurations are considered. Based on the current results, it is hard for me to tell whether the performance gain is from the better model configuration or the proposed structure.

  • Thank you for pointing this out. We have confidence in the fairness of our previous and future comparisons. First, the experimental results include not only Conv-Best and DConv-Best but also Conv and DConv with the default look-back window length of 512 in Table 1 and Table 2 of the original manuscript, i.e., the same look-back window length as PatchTST (sl=512). Second, Transformer-based baselines do not benefit from a longer look-back window; the reason most Transformer models do not increase the look-back window length is that performance becomes worse. More analysis of the look-back windows is described in Section C.1, Page 12, of the original Appendix. Reference [1] also reports this.
  • PatchTST can extend the look-back window to 512, but does not benefit from longer windows. It is important to note that this extension comes at the cost of a substantial increase in the model's computational resource requirements. On our server, PatchTST ran out of GPU memory for look-back window sizes greater than 720. Reference [3] conducted similar experiments and encountered the same OOM issue. This is also a significant limitation of the PatchTST model.
  • Following the reviewer's suggestion, we switched to a higher-performance server to validate PatchTST with look-back window sizes in {720, 1600}. The results are in Table 3 of the attached supplementary material. Not only does the overall performance not improve, but the training cost is almost unacceptable.
  • Finally, we have added the MICN and TimesNet baseline models. We adopt their official codes and only change the length of the input sequences. The results are summarized in Table 2 of the attached PDF file, where we conducted experiments within a broader window-size range of {96, 336, 512, 720, 1600}. We consistently chose the most optimal results, thus establishing strong baselines. The results demonstrate that LTSF-Conv models consistently surpass both MICN and TimesNet on the seven LTSF benchmarks.

Q2 : I would expect a more in-depth analysis of why it reaches better results than vanilla CNN.

  • Thank you for pointing this out. In LTSF tasks, channel independence can enhance prediction performance compared to channel mixing; previous research [2] has also found this. For MLP-based variants, achieving channel independence typically involves multiple independent MLP sub-models, each responsible for handling a specific channel of the time series, which comes at the cost of a substantial increase in computational resource requirements. By comparison, our model uses group-wise convolution to achieve channel independence while reducing model complexity, because depthwise convolution is more efficient than standard convolution (a rough parameter-count sketch is given below). In group-wise convolution, a critical consideration is aligning the number of groups with the variable dimension and the number of filters, as defined in the initial setup. A standard CNN, in contrast, mixes channels and therefore suffers from noise interference among the channels, which reduces performance. We have conducted additional ablation experiments comparing depth-wise CNN and general CNN; the results are in Section C.2, Page 13 of the original manuscript.
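A rough parameter-count sketch of this efficiency argument (the channel count and kernel size below are assumptions for illustration; e.g., the Electricity benchmark has 321 series):

```python
import torch.nn as nn

C, k = 321, 25   # assumed: 321 channels (Electricity), kernel size 25
depthwise = nn.Conv1d(C, C, k, groups=C, bias=False)   # one filter per channel
standard  = nn.Conv1d(C, C, k, groups=1, bias=False)   # full channel mixing

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise))                     # C * k     = 8,025
print(count(standard))                      # C * C * k = 2,576,025
print(count(standard) // count(depthwise))  # reduction factor = C = 321
```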
Comment

Q3 : Please provide the detailed experimental configurations for each setting in Table 1 and Table 2 to help the reviewer verify those results.

  • Thank you for your comments. We add Table 1 in the attached supplementary material, which provides the detailed experimental configurations of Conv with sl=512 on all datasets. PatchTST also adopted this input length of 512 in its original paper.

Q4 : I also have some concerns about the numerical results reported in the paper. When dealing with test samples, the data_provider function sets the drop_last = True and shuffle_flag = False. The consequence would result in the last several test samples being ignored. Those samples are usually the hardest to predict since they are far away from the training set. Moreover, it seems the main results in Table1 and Table2 are only run with one fixed random seed 1024. The random control experiment is only reported in Figure 5 in the Appendix. Why do we still need to subtract the sequence mean?

  • Thank you for pointing this out. We have confidence in the fairness of our previous comparison. To ensure an objective evaluation, we strictly adhered to the same code structure as the other baseline models, which includes using the same 'data_factory' and 'data_loader' files; we only added the file of the proposed method to the model folder. The loader flags are: test {shuffle_flag: False, drop_last: True}; pred {shuffle_flag: False, drop_last: False}; otherwise {shuffle_flag: True, drop_last: True} (a sketch of this logic is given at the end of this response). The training, validation, and test splits for all datasets remain consistent with previous work. Reviewers can check the source code of PatchTST and related models to verify the fairness of the experiments.

  • Moreover, Table 1 and Table 2 of the original manuscript present the average results obtained from three experiments using different random seeds (Conv-Best, DConv-Best). It can be found in Section B, page 12 of the original manuscript.

  • Figure 5 in the original manuscript shows the robustness experiments. To obtain better visualization quality, each experimental configuration was repeated with five random seeds. Since Revnorm has already been applied, the value of seq_last is 0, and subtracting it has no impact on the input variable X; of course, it does not affect the prediction results. It appears there was an upload error, and the final released code will be updated.
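For clarity, here is a hedged sketch of the flag logic described above, patterned after common LTSF codebases (Informer/Autoformer/PatchTST style); exact argument names may differ from our released code.

```python
from torch.utils.data import DataLoader

def build_loader(dataset, flag, batch_size):
    if flag == 'test':
        shuffle_flag, drop_last = False, True    # last incomplete batch is dropped
    elif flag == 'pred':
        shuffle_flag, drop_last = False, False
        batch_size = 1
    else:  # 'train' / 'val'
        shuffle_flag, drop_last = True, True
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=shuffle_flag, drop_last=drop_last)
```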

Comment

We add the detailed experimental configurations for the univariate results (Table 2 of the original manuscript) below, so reviewers can further verify the experimental results.

| Dataset | Seq_Len | Pred_Len | Batch_Size | Learning_Rate | Kernel_Size | Individual | MSE | MAE |
|---|---|---|---|---|---|---|---|---|
| ETTh1 | 512 | 96 | 16 | 0.005 | 24 | 0 | 0.053 | 0.177 |
| ETTh1 | 512 | 192 | 16 | 0.005 | 55 | 0 | 0.064 | 0.197 |
| ETTh1 | 512 | 336 | 64 | 0.005 | 55 | 0 | 0.075 | 0.217 |
| ETTh1 | 512 | 720 | 128 | 0.005 | 78 | 0 | 0.082 | 0.227 |
| ETTh2 | 512 | 96 | 128 | 0.005 | 24 | 0 | 0.134 | 0.284 |
| ETTh2 | 512 | 192 | 16 | 0.005 | 35 | 0 | 0.172 | 0.328 |
| ETTh2 | 512 | 336 | 128 | 0.005 | 24 | 0 | 0.179 | 0.342 |
| ETTh2 | 512 | 720 | 128 | 0.005 | 24 | 0 | 0.219 | 0.376 |
| ETTm1 | 512 | 96 | 64 | 0.005 | 55 | 0 | 0.026 | 0.122 |
| ETTm1 | 512 | 192 | 64 | 0.005 | 24 | 0 | 0.039 | 0.150 |
| ETTm1 | 512 | 336 | 16 | 0.005 | 35 | 0 | 0.052 | 0.172 |
| ETTm1 | 512 | 720 | 16 | 0.005 | 35 | 0 | 0.071 | 0.203 |
| ETTm2 | 512 | 96 | 16 | 0.005 | 55 | 0 | 0.062 | 0.182 |
| ETTm2 | 512 | 192 | 16 | 0.005 | 24 | 0 | 0.090 | 0.225 |
| ETTm2 | 512 | 336 | 64 | 0.005 | 24 | 0 | 0.118 | 0.261 |
| ETTm2 | 512 | 720 | 16 | 0.005 | 24 | 0 | 0.172 | 0.320 |

Table 6: The hyperparameters of the Conv model (univariate) with look-back window size 512 on the ETT datasets. Note that the default number of training epochs is 100.

| Dataset | Horizon | Conv MSE | Conv MAE | DConv MSE | DConv MAE | TimesNet* MSE | TimesNet* MAE | MICN-regre* MSE | MICN-regre* MAE |
|---|---|---|---|---|---|---|---|---|---|
| Weather | 96 | 0.140 | 0.188 | 0.166 | 0.220 | 0.159 | 0.215 | 0.161 | 0.229 |
| Weather | 192 | 0.182 | 0.230 | 0.209 | 0.259 | 0.219 | 0.261 | 0.220 | 0.281 |
| Weather | 336 | 0.237 | 0.271 | 0.253 | 0.293 | 0.274 | 0.306 | 0.257 | 0.316 |
| Weather | 720 | 0.294 | 0.324 | 0.306 | 0.335 | 0.347 | 0.356 | 0.311 | 0.356 |
| Weather | Avg | 0.213 | 0.253 | 0.234 | 0.277 | 0.250 | 0.285 | 0.237 | 0.296 |
| Electricity | 96 | 0.129 | 0.225 | 0.130 | 0.225 | 0.168 | 0.272 | 0.155 | 0.265 |
| Electricity | 192 | 0.143 | 0.238 | 0.144 | 0.238 | 0.184 | 0.289 | 0.177 | 0.285 |
| Electricity | 336 | 0.159 | 0.255 | 0.160 | 0.256 | 0.196 | 0.299 | 0.180 | 0.292 |
| Electricity | 720 | 0.195 | 0.286 | 0.199 | 0.288 | 0.220 | 0.320 | 0.207 | 0.316 |
| Electricity | Avg | 0.157 | 0.251 | 0.158 | 0.252 | 0.192 | 0.295 | 0.180 | 0.290 |
| ETTm2 | 96 | 0.161 | 0.249 | 0.161 | 0.253 | 0.187 | 0.267 | 0.176 | 0.275 |
| ETTm2 | 192 | 0.217 | 0.287 | 0.213 | 0.292 | 0.249 | 0.309 | 0.254 | 0.334 |
| ETTm2 | 336 | 0.257 | 0.329 | 0.258 | 0.325 | 0.295 | 0.349 | 0.288 | 0.351 |
| ETTm2 | 720 | 0.325 | 0.379 | 0.325 | 0.369 | 0.408 | 0.403 | 0.417 | 0.440 |
| ETTm2 | Avg | 0.240 | 0.311 | 0.239 | 0.310 | 0.285 | 0.332 | 0.284 | 0.350 |
| ETTm1 | 96 | 0.287 | 0.334 | 0.300 | 0.342 | 0.335 | 0.376 | 0.311 | 0.364 |
| ETTm1 | 192 | 0.328 | 0.358 | 0.335 | 0.363 | 0.374 | 0.387 | 0.356 | 0.388 |
| ETTm1 | 336 | 0.356 | 0.384 | 0.356 | 0.384 | 0.410 | 0.411 | 0.407 | 0.422 |
| ETTm1 | 720 | 0.394 | 0.408 | 0.393 | 0.405 | 0.478 | 0.450 | 0.464 | 0.462 |
| ETTm1 | Avg | 0.341 | 0.371 | 0.346 | 0.374 | 0.399 | 0.406 | 0.408 | 0.425 |
| ETTh1 | 96 | 0.365 | 0.393 | 0.366 | 0.382 | 0.384 | 0.402 | 0.389 | 0.424 |
| ETTh1 | 192 | 0.403 | 0.418 | 0.400 | 0.412 | 0.436 | 0.429 | 0.474 | 0.487 |
| ETTh1 | 336 | 0.424 | 0.428 | 0.421 | 0.422 | 0.491 | 0.469 | 0.516 | 0.524 |
| ETTh1 | 720 | 0.450 | 0.460 | 0.429 | 0.446 | 0.521 | 0.500 | 0.743 | 0.664 |
| ETTh1 | Avg | 0.411 | 0.425 | 0.404 | 0.416 | 0.458 | 0.450 | 0.531 | 0.525 |
| ETTh2 | 96 | 0.268 | 0.339 | 0.269 | 0.334 | 0.340 | 0.374 | 0.299 | 0.364 |
| ETTh2 | 192 | 0.327 | 0.382 | 0.326 | 0.375 | 0.402 | 0.414 | 0.441 | 0.454 |
| ETTh2 | 336 | 0.329 | 0.390 | 0.321 | 0.386 | 0.390 | 0.437 | 0.654 | 0.567 |
| ETTh2 | 720 | 0.380 | 0.424 | 0.382 | 0.428 | 0.462 | 0.468 | 0.956 | 0.716 |
| ETTh2 | Avg | 0.326 | 0.384 | 0.325 | 0.381 | 0.399 | 0.423 | 0.588 | 0.525 |
| Traffic | 96 | 0.383 | 0.271 | 0.378 | 0.264 | 0.593 | 0.321 | 0.473 | 0.306 |
| Traffic | 192 | 0.397 | 0.275 | 0.390 | 0.269 | 0.615 | 0.331 | 0.475 | 0.298 |
| Traffic | 336 | 0.411 | 0.282 | 0.404 | 0.275 | 0.629 | 0.336 | 0.493 | 0.307 |
| Traffic | 720 | 0.450 | 0.302 | 0.442 | 0.294 | 0.640 | 0.350 | 0.531 | 0.325 |
| Traffic | Avg | 0.410 | 0.283 | 0.404 | 0.276 | 0.619 | 0.335 | 0.493 | 0.309 |

Table 2: Multivariate prediction results. * denotes re-implementation after increasing the input length.

If any of our explanations are unclear or if you have any other questions, please let us know; we welcome further discussion. We want to ensure that you have all the information you need to make a final decision. Your understanding and satisfaction are crucial for us as we revise the final manuscript. Thanks.

Comment

Dear Reviewers, ACs, and PCs,

With this letter, we would like to express our deep appreciation for your tremendous efforts in reviewing our manuscript entitled "The Power of Minimalism in Long Sequence Time-series Forecasting" at the ICLR conference. Those comments are constructive and helpful for improving the paper. Due to the additional experiments involving multiple models on seven public datasets, our submission for the rebuttal is slightly late. We are confident in the fairness of our previous comparisons. Table 1 provides detailed experimental configurations, and reviewers can verify experiment results. We present the point-to-point responses to the reviewers' comments below. We hope that our responses have addressed your concerns, and we look forward to engaging in further discussions with you. If you have any additional questions or comments, please feel free to share them before the author-reviewer discussion period concludes. If our responses have satisfactorily resolved your concerns, we kindly request that you consider revising the rating of our work. Thank you once again for your valuable time and efforts.

We add some common references in this section.

[1] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2023, June). Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence (Vol. 37, No. 9, pp. 11121-11128).

[2] Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.

[3] Das, A., Kong, W., Leach, A., Sen, R., & Yu, R. (2023). Long-term Forecasting with TiDE: Time-series Dense Encoder. arXiv preprint arXiv:2304.08424.

| Dataset | Seq_Len | Pred_Len | Batch_Size | Learning_Rate | Kernel_Size | Individual | MSE | MAE |
|---|---|---|---|---|---|---|---|---|
| ETTh1 | 512 | 96 | 16 | 0.005 | 55 | 0 | 0.365 | 0.393 |
| ETTh1 | 512 | 192 | 16 | 0.005 | 55 | 0 | 0.401 | 0.416 |
| ETTh1 | 512 | 336 | 128 | 0.005 | 78 | 1 | 0.419 | 0.437 |
| ETTh1 | 512 | 720 | 16 | 0.005 | 55 | 0 | 0.464 | 0.472 |
| ETTh2 | 512 | 96 | 128 | 0.005 | 55 | 0 | 0.269 | 0.339 |
| ETTh2 | 512 | 192 | 128 | 0.005 | 78 | 0 | 0.329 | 0.383 |
| ETTh2 | 512 | 336 | 128 | 0.0005 | 78 | 0 | 0.335 | 0.394 |
| ETTh2 | 512 | 720 | 128 | 0.0001 | 78 | 0 | 0.379 | 0.424 |
| ETTm1 | 512 | 96 | 16 | 0.005 | 78 | 1 | 0.292 | 0.338 |
| ETTm1 | 512 | 192 | 16 | 0.005 | 35 | 0 | 0.332 | 0.361 |
| ETTm1 | 512 | 336 | 16 | 0.005 | 35 | 0 | 0.364 | 0.380 |
| ETTm1 | 512 | 720 | 16 | 0.005 | 35 | 0 | 0.418 | 0.411 |
| ETTm2 | 512 | 96 | 16 | 0.005 | 24 | 1 | 0.161 | 0.249 |
| ETTm2 | 512 | 192 | 16 | 0.005 | 55 | 1 | 0.216 | 0.288 |
| ETTm2 | 512 | 336 | 16 | 0.005 | 35 | 0 | 0.271 | 0.327 |
| ETTm2 | 512 | 720 | 64 | 0.005 | 24 | 0 | 0.361 | 0.387 |
| Weather | 512 | 96 | 16 | 0.005 | 55 | 0 | 0.140 | 0.188 |
| Weather | 512 | 192 | 16 | 0.005 | 78 | 1 | 0.183 | 0.230 |
| Weather | 512 | 336 | 16 | 0.005 | 24 | 1 | 0.234 | 0.271 |
| Weather | 512 | 720 | 128 | 0.005 | 78 | 0 | 0.306 | 0.325 |
| Electricity | 512 | 96 | 16 | 0.005 | 55 | 0 | 0.132 | 0.227 |
| Electricity | 512 | 192 | 128 | 0.005 | 55 | 1 | 0.145 | 0.241 |
| Electricity | 512 | 336 | 64 | 0.005 | 55 | 0 | 0.161 | 0.257 |
| Electricity | 512 | 720 | 16 | 0.005 | 55 | 0 | 0.201 | 0.289 |
| Traffic | 512 | 96 | 16 | 0.005 | 35 | 0 | 0.396 | 0.275 |
| Traffic | 512 | 192 | 16 | 0.005 | 35 | 0 | 0.407 | 0.279 |
| Traffic | 512 | 336 | 16 | 0.005 | 24 | 0 | 0.417 | 0.285 |
| Traffic | 512 | 720 | 16 | 0.005 | 24 | 0 | 0.453 | 0.304 |

Table 1: The hyperparameters of the Conv model with look-back window size 512. Note that the default number of training epochs is 100.

Comment

Dear Reviewers,

Thanks for your time and commitment to the ICLR 2024 review process.

As we approach the conclusion of the author-reviewer discussion period (Wednesday, Nov 22nd, AOE), I kindly urge those who haven't engaged with the authors' dedicated rebuttal to please take a moment to review their response and share your feedback, regardless of whether it alters your opinion of the paper.

Your feedback is essential to a thorough assessment of the submission.

Best regards,

AC

AC Meta-Review

This paper presents a simple convolution-based solution to time series forecasting, which is coined as "minimalism" in the title. The claim of "minimalism" is subjective, as there is no clear account of how simple a model must be to count as "minimalism." The authors performed an extensive rebuttal, which was appreciated by the reviewers, with post-rebuttal scores raised. In general, most of the concerns were addressed, while there was still a score of "3". The AC's take on this paper is that the current paper provides a poor presentation, as can be witnessed by scanning throughout the paper: it is subpar relative to normal ICLR papers, in that the content on page 3 is loosely placed, the numbers in the tables are densely listed, and the figures have low readability. Besides, the technical contribution is not as substantial as argued by the authors. These days, there are many attempts at achieving "simple" models for time series forecasting; however, they were later shown to be inferior to more sophisticated methods such as PatchTST when all models are evaluated fairly and hyperparameter-tuned carefully. Solid evaluation is usually hard to expect in this line of methods, because the more relevant complex tasks are bypassed; otherwise the performance would not be convincing. While this work can be used as a simple baseline, the field does not need yet another one. In summary, this is a decent work but is below the bar of this conference, and it will not draw significant impact from the community.

Why not a higher score

A decent work, but not good enough. Presentation quality is clearly subpar. The technical solution makes sense but does not constitute a worthwhile contribution. The evaluation is not convincing enough; more complex tasks and better-tuned protocols would be needed.

Why not a lower score

N/A

Final Decision

Reject