PaperHub
Average rating: 5.3/10 · Rejected · 6 reviewers (min 5, max 6, std 0.5)
Ratings: 5, 5, 5, 6, 6, 5
Confidence: 3.7 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.7
ICLR 2025

PSformer: Parameter-efficient Transformer with Segment Attention for Time Series Forecasting

Submitted: 2024-09-24 · Updated: 2025-02-05

Abstract

Keywords
Time Series Forecasting · Transformer · Parameter Sharing

Reviews and Discussion

Review (Rating: 5)

This paper proposes a Transformer-based model for time series forecasting (PSformer), which primarily consists of two core components: the Parameter Sharing Technique and the Spatial-Temporal Segment Attention mechanism. The former is designed to reduce the complexity of the model, while the latter is used to simultaneously model temporal dynamics and channel correlations.

Strengths

S1: PSformer experiments involve a wide range of datasets.

S2: PSformer and baselines are compared in terms of training parameters and running time.

Weaknesses

W1: The paper overall lacks innovation. Parameter Sharing is merely a simple module parameter-sharing technique to reduce complexity. Additionally, Spatial-Temporal Segment Attention simply merges the channel dimension and patch size dimension together before modeling it with a Transformer.

W2: What is the purpose of including the PS Block in the Segment Attention? Why not just add the PS Block after the output of the Attention, which would eliminate the need for parameter sharing? Additionally, why can two SegAttn be viewed as one FFN layer?

W3: The integration of temporal and channel information in Spatial-Temporal Segment Attention may lead to incomplete capture of both time and spatial information, potentially resulting in negative effects. Could you further explain how this approach differs from the separated modeling methods like Crossformer and iTransformer, and visualize the advantages of PSformer in modeling time and channel?

W4: Why is there such a significant difference between the results of PatchTST in Table 1 and the original PatchTST? Please explain.

W5: Please add new baselines for comparison, such as PDF (ICLR2024) and Time-LLM (ICLR2024).

W6: There are many writing errors in the paper, such as "Moment()" on line 125 and "GPT4TS uses BERT as the backbone" on line 127.

Questions

See W1-W6.

Comment

W3: How PSformer differs from the separated modeling methods like Crossformer and iTransformer; Visualizing the advantages of PSformer in modeling time and channel.

  • SegAtt vs. Separate Modeling Methods. Unlike methods such as Crossformer[2] and iTransformer[3], which treat temporal and channel information as independent entities for separate modeling, PSformer takes a different approach. Separate modeling not only increases the total number of network parameters but can also degrade predictive performance when aggregating information from both dimensions. In contrast, PSformer introduces Spatial-Temporal Segment Attention (SegAtt), which unifies the treatment of temporal and channel information within spatio-temporal segments. This simplifies the model structure and leads to more structured attention expressions in shallow layers, mitigating the risk of overfitting caused by unnecessary channel mixing. Additionally, our experiments demonstrate that PSformer also performs competitively on single-channel sequences and in scenarios with weak inter-channel correlations (as discussed in our response to Reviewer 5 (MoxF)'s W2).

  • Visualizing PSformer's Advantages. We provided further visualizations of attention weights in the README of the anonymous repository to demonstrate the model's ability to capture local spatio-temporal information. To facilitate understanding, we added auxiliary lines to the attention maps to distinguish cross-channel and cross-temporal components. Additionally, we separated single-channel attention submaps and cross-channel attention submaps, analyzing their correspondence with specific time points in the respective time series segments.

In the Visualization section of the README, Figure 1.1 and Figure 1.2 show that the local spatio-temporal relationships in the attention weight maps are structured and organized. Regions of concentrated attention weights and gradual changes in attention weights across spatio-temporal dimensions can be observed. This suggests that the model captures temporal consistency between variables and local stationary features. Further verification was conducted by comparing the attention submaps with the corresponding time series variates. From Figure 2.1 to Figure 4.2, we visualized the time series positions corresponding to the high attention weights assigned to univariate time series. These visualizations reveal that the model places higher attention on critical time steps along the temporal dimension of the univariate series. From Figure 5.1 to Figure 7.2, we visualized the cross-channel attention weights between two time-series variables and their corresponding time series. These analyses demonstrate that PSformer assigns higher attention to important cross-channel patterns, such as large inverse changes between channels, locally continuous temporal patterns, and shared local temporal troughs.

W4: Why do the PatchTST results in Table 1 differ significantly from the original?

As discussed in Appendix A.1, we referred to the PatchTST[4] results from the iTransformer paper, as our work draws inspiration from SAMformer, which adopted the same PatchTST results. In addition, the original PatchTST experiments inherited a known bug where drop_last=True was set in the test set DataLoader. This bug may notably impact the reported predictive performance on many datasets. To better compare with PatchTST, we corrected the bug in the official PatchTST code by setting drop_last=False (denoted noDL) and re-ran its experiments. The results are presented in our response to W5.
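For clarity, the fix only touches how the test DataLoader is constructed; below is a minimal sketch of the change (the dataset object and batch size are illustrative placeholders, not the exact identifiers in the PatchTST/PDF repositories):

```python
from torch.utils.data import DataLoader

# Inherited setting: the last incomplete batch of the test set is silently
# dropped, so some test samples are never evaluated ("DL" in our tables).
test_loader_dl = DataLoader(test_dataset, batch_size=128, shuffle=False,
                            drop_last=True)

# Corrected setting used in our "noDL" experiments: every test sample is kept,
# so the reported MSE covers the full test split.
test_loader_nodl = DataLoader(test_dataset, batch_size=128, shuffle=False,
                              drop_last=False)
```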

W5: Add PDF and Time-LLM for comparison.

Thank you for suggesting additional baseline methods like PDF[5] and Time-LLM[6] for comparison. We appreciate the recommendation and have the following response:

  • Comparison with PDF and PatchTST. While PDF demonstrates strong predictive performance, as shown in its paper, it also relies on a more complex hyperparameter search process. Additionally, PDF and PatchTST set drop_last=True in the test set dataloader, which results in the last incomplete batch being dropped during evaluation. To ensure a fair comparison, we modified PDF and PatchTST by correcting this DL issue (setting drop_last=False) and set the input length to 512 to align with our experimental setup. Under these adjustments, we have included the updated results in the table below, which provide further insight into PSformer's performance under fair conditions.

    We conducted tests on the following five datasets. From the experimental results, both PSformer and PDF significantly outperform PatchTST in predictive performance, with PSformer achieving better predictions than PDF. However, the average reduction in prediction loss relative to PDF is not substantial. Therefore, we consider PSformer and PDF to exhibit equally excellent predictive performance under the same settings.

To be continued...

Comment

[1] Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

[2] Zhang, Y., & Yan, J. (2023). Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The eleventh international conference on learning representations.

[3] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., & Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.

[4] Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.

[5] Dai, T., Wu, B., Liu, P., Li, N., Bao, J., Jiang, Y., & Xia, S. T. (2024). Periodicity decoupling framework for long-term series forecasting. In The Twelfth International Conference on Learning Representations.

[6] Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., ... & Wen, Q. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. In The Twelfth International Conference on Learning Representations.

[7] Huang, Q., Shen, L., Zhang, R., Ding, S., Wang, B., Zhou, Z., & Wang, Y. CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement. In Thirty-seventh Conference on Neural Information Processing Systems.

Comment
  • Comparison with Time-LLM. For Time-LLM, we list the predictive performance reported in its original paper in the table below. However, we checked its official repository and found that Time-LLM also has the DL issue in the test set dataloader, which may affect its reported results. Besides, due to the computational demands of large-scale models, we decided not to execute its code directly in our experiments. When selecting the baseline large models, we considered both MOMENT and Time-LLM. We ultimately chose MOMENT for two main reasons: the MOMENT paper includes a direct comparison with Time-LLM, and it is relatively lightweight (1 billion parameters). In summary, despite the significant difference in parameter scale and the DL issue present in Time-LLM, PSformer still achieves equal or better average predictive performance than Time-LLM on 3 out of 5 datasets.

  • Comparison with Crossformer. For Crossformer, we used experimental results based on the CrossGNN[7] paper, as there is an inconsistency in prediction lengths between the original Crossformer paper and current mainstream work. Therefore, we aligned the experimental setup with the CrossGNN results to ensure consistency in the comparison. From the results, PSformer performs better than Crossformer.

ETTh1

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (512, noDL) | MSE | 0.352 | 0.385 | 0.411 | 0.440 | 0.397 |
| Crossformer (96, noDL) | MSE | 0.384 | 0.438 | 0.495 | 0.522 | 0.460 |
| PDF (512, noDL) | MSE | 0.361 | 0.391 | 0.415 | 0.468 | 0.409 |
| PatchTST (512, noDL) | MSE | 0.374 | 0.413 | 0.434 | 0.455 | 0.419 |
| TimeLLM (512, DL) | MSE | 0.362 | 0.398 | 0.430 | 0.442 | 0.408 |

ETTh2

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (512, noDL) | MSE | 0.272 | 0.335 | 0.356 | 0.389 | 0.338 |
| Crossformer (96, noDL) | MSE | 0.347 | 0.419 | 0.449 | 0.479 | 0.424 |
| PDF (512, noDL) | MSE | 0.272 | 0.334 | 0.357 | 0.397 | 0.340 |
| PatchTST (512, noDL) | MSE | 0.274 | 0.341 | 0.364 | 0.390 | 0.342 |
| TimeLLM (512, DL) | MSE | 0.268 | 0.329 | 0.368 | 0.372 | 0.334 |

ETTm1

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (512, noDL) | MSE | 0.282 | 0.321 | 0.352 | 0.413 | 0.342 |
| Crossformer (96, noDL) | MSE | 0.349 | 0.405 | 0.432 | 0.487 | 0.418 |
| PDF (512, noDL) | MSE | 0.284 | 0.327 | 0.351 | 0.409 | 0.343 |
| PatchTST (512, noDL) | MSE | 0.290 | 0.333 | 0.370 | 0.416 | 0.352 |
| TimeLLM (512, DL) | MSE | 0.272 | 0.310 | 0.352 | 0.383 | 0.329 |

ETTm2

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (512, noDL) | MSE | 0.167 | 0.219 | 0.269 | 0.347 | 0.251 |
| Crossformer (96, noDL) | MSE | 0.208 | 0.263 | 0.337 | 0.429 | 0.309 |
| PDF (512, noDL) | MSE | 0.162 | 0.224 | 0.277 | 0.354 | 0.254 |
| PatchTST (512, noDL) | MSE | 0.166 | 0.223 | 0.273 | 0.363 | 0.256 |
| TimeLLM (512, DL) | MSE | 0.161 | 0.219 | 0.271 | 0.352 | 0.251 |

Weather

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (512, noDL) | MSE | 0.149 | 0.193 | 0.245 | 0.314 | 0.225 |
| Crossformer (96, noDL) | MSE | 0.191 | 0.219 | 0.287 | 0.368 | 0.266 |
| PDF (512, noDL) | MSE | 0.147 | 0.191 | 0.243 | 0.317 | 0.225 |
| PatchTST (512, noDL) | MSE | 0.152 | 0.196 | 0.247 | 0.315 | 0.228 |
| TimeLLM (512, DL) | MSE | 0.147 | 0.189 | 0.262 | 0.304 | 0.226 |

W6: There are writing errors.

Thank you for pointing out these writing errors. We will correct these errors in the revised version of the manuscript.

To be continued...

Comment

We appreciate the thoughtful and valuable comments from the reviewer. Below, we offer detailed responses to each of your concerns.

W1: The paper overall lacks innovation; Parameter Sharing is merely a simple module parameter-sharing technique to reduce complexity; Spatial-Temporal Segment Attention simply merges the channel dimension and patch size dimension together before modeling it with a Transformer.

There seems to be a misunderstanding regarding our work. Parameter sharing in PSformer is not merely a simple module parameter-sharing technique. As shown in Figure 2 of our paper, it is utilized in three critical parts within the Encoder structure:

Within the Two-Stage SegAtt Mechanism

  • The PS_block module is used to construct the Q, K, and V matrices required for the attention mechanism.

  • These steps primarily focus on applying attention to critical parts of the time series and extracting key features for prediction.

During the Feature Fusion Phase

  • The PS_block integrates and mixes spatial-temporal features across different segments, enhancing the representation of global features and improving overall expressiveness.

In our work, we introduced this technique to the time-series forecasting domain and utilized it as part of PSformer, achieving competitive predictive performance.

SegAtt vs. Transformer Comparison. It appears the reviewer may be equating the PSformer Encoder with the Transformer Encoder, and SegAtt with the self-attention mechanism. However, there are fundamental differences:

Attention to Local Spatial-Temporal Segments

SegAtt is specifically designed to apply attention to local spatial-temporal segments, enhancing the model’s ability to capture localized spatial-temporal relationships.

Key Differences from Standard Self-Attention

  • In standard Transformer self-attention, the Q, K, and V matrices are obtained by multiplying the input x with weights from separate Linear layers.

  • In SegAtt, the Q, K, and V matrices are generated using a three-layer MLP structure with residual connections (PS_block).

Furthermore, the PS_block encapsulates all parameters within the Encoder. This allows PSformer to learn both local and global feature representations more effectively while reducing the encoder's parameter count by a factor of seven (covering Q, K, and V generation in the two SegAtt modules and the final output fusion). To better understand these distinctions, we encourage the reviewer to compare our code with standard Transformer implementations.
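To make the structural difference concrete, the following is a simplified PyTorch sketch of this design. It is illustrative only: the layer widths, residual placement, and scaling do not exactly match the released code, but it shows how a single shared PS block supplies Q, K, and V in both SegAtt stages and the fusion step.

```python
import torch
import torch.nn as nn


class PSBlock(nn.Module):
    # A three-layer MLP acting on the segment dimension N, with residual paths.
    # The exact widths and residual placement here are illustrative guesses.
    def __init__(self, n_segments: int):
        super().__init__()
        self.fc1 = nn.Linear(n_segments, n_segments)
        self.fc2 = nn.Linear(n_segments, n_segments)
        self.fc3 = nn.Linear(n_segments, n_segments)

    def forward(self, x):                      # x: (batch, C, N), C = M * P
        h = torch.relu(self.fc1(x)) + x
        h = torch.relu(self.fc2(h)) + h
        return self.fc3(h)


class SegAtt(nn.Module):
    # One segment-attention stage. The *same* shared PS block produces the
    # attention inputs, so with full sharing the Q/K/V projections coincide.
    def __init__(self, ps_block: PSBlock):
        super().__init__()
        self.ps = ps_block                     # reference to the shared module

    def forward(self, x):                      # x: (batch, C, N)
        qkv = self.ps(x)
        scores = qkv @ qkv.transpose(-2, -1) / (qkv.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ qkv   # attention over the C dimension


class PSformerEncoderSketch(nn.Module):
    # Two SegAtt stages plus a fusion step, all reusing one PS block: the
    # encoder stores a single set of PS-block weights instead of seven.
    def __init__(self, n_segments: int):
        super().__init__()
        self.ps = PSBlock(n_segments)
        self.seg1 = SegAtt(self.ps)
        self.seg2 = SegAtt(self.ps)

    def forward(self, x):                      # x: (batch, C, N)
        h = torch.relu(self.seg1(x)) + x       # first SegAtt stage
        h = torch.relu(self.seg2(h)) + h       # second SegAtt stage
        return self.ps(h) + h                  # feature-fusion stage
```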

W2: What is the purpose of including the PS Block in the Segment Attention? Why not just add the PS Block after the output of the Attention, which would eliminate the need for parameter sharing? Why can two SegAtt be viewed as one FFN layer?

A2.1: As mentioned in our response to W1, the PS Block plays a crucial role within Segment Attention. Additionally, introducing the PS Block into SegAtt enhances the capability of the attention mechanism by leveraging the non-linear transformations of the PS Block. Compared to linear mappings, this significantly improves the construction of the Q, K, and V matrices and enables the model to better capture complex relationships in the input data.

A2.2: Even if the PS Block were not used within the attention mechanism, the parameter count of the Encoder would still increase, because separate networks would still be required to independently construct the Q, K, and V matrices. Moreover, removing the PS Block from SegAtt would negate the additional benefits of parameter sharing, such as reduced model size, improved generalization, and the ability to learn global representations, which are key advantages of our design.

A2.3: To help readers better understand the structure of the PSformer Encoder, we mentioned that two SegAtt mechanisms can be analogous to an FFN layer. Specifically:

Without residual connections, the two-stage Segment Attention process can be described as:

$$\text{two SegAtt}(X) = \text{Attention}(\text{ReLU}(\text{Attention}(X))).$$

In comparison, a standard FFN is often represented as:

$$\text{FFN}(X) = \text{Linear}(\text{ReLU}(\text{Linear}(X))).$$

This follows the Transformer[1] paper, whose Section 3.3 defines $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. The two-stage Segment Attention process can therefore be seen as replacing the Linear parts with Attention, an analogy we use only to aid understanding. However, this might lead to the misunderstanding that SegAtt is no different from the attention mechanism in the Transformer, as discussed in W1. We will further clarify this point in the next version of the paper. For a more detailed explanation, please refer to Section 3.2 of our paper.

To be continued...

Comment

Thank you for your inspiring questions, which are beneficial for clarifying some unclear presentations and improving our work. Additionally, we have uploaded the rebuttal version of the paper, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality. We hope our response has adequately addressed your concerns. We take this as a great opportunity to improve our work and would be grateful for any additional feedback you could give us.

Comment

Thanks to the authors for their rebuttal. Although the authors have addressed these questions, some issues remain unresolved. The main points are as follows:

  • W1: I still believe this paper's novelty is insufficient. The PS block is used to construct the Q, K, and V matrices for segmented attention and to reduce the number of model parameters. Additionally, segmented attention simply merges the channel dimension and patch size dimension to model spatial-temporal features.

  • W5: The authors conducted experiments with PatchTST, PDF, and TimeLLM as baselines. From the results, the prediction performance of PSformer shows only a slight improvement compared to PDF and PatchTST.

Comment

Thank you for your timely feedback. We have summarized your remaining concerns as follows:

Regarding Innovation. We are the first to introduce the idea of parameter sharing into the time series domain and the first to adopt segment attention to simultaneously capture local spatio-temporal information. We developed a transformer-based model, PSformer, and validated its effectiveness in time series forecasting. We believe that simple yet effective innovation is valuable and elegant as long as it can be applied in broader scenarios. For example, PatchTST simply introduced the patch concept to the time series domain, even though patch methods had already been popular in the CV domain for years. Nevertheless, PatchTST remains an important work in the time series domain, with over a thousand citations. Another example is LoRA fine-tuning in LLMs, which is also a simple but impactful application of an existing low-rank approximation method. Therefore, we would like to highlight the significance of the proposed parameter sharing design and segment attention in the time series domain, as verified by extensive experiments.

Regarding Prediction Performance. We have validated the superiority of PSformer's prediction performance by comparing it with a sufficient variety of competitive benchmark models. Based on the current empirical results, PDF’s predictive performance is arguably among the best in mainstream works. However, if you delve into its code, you’ll find that its parameter settings vary significantly for each task. Even so, across five datasets, PSformer still achieves better prediction performance than PDF on four of them.

Regarding the "slight improvement" of PSformer compared to PDF and PatchTST. We have demonstrated PSformer's superior prediction performance compared to a wide range of competitive benchmark models. However, we also acknowledge that with the rapid progress in this field, it becomes harder to obtain significantly better results without extra data or computation. It would be helpful if an example from a top conference or an ICLR submission that demonstrates a clearer improvement over PSformer could be suggested for comparison.

In summary. To the best of our knowledge, the achieved improvement is non-trivial, and we have tried to clarify the concerns over innovation and model predictive performance. We would be grateful if more constructive feedback could be provided to better address the remaining concerns.

Comment

Thank you for your comments. As discussed above, we have reached a consensus on most concerns. For the remaining two, W1 and W5, we provide further responses as follows.

Regarding W5. In addition to the previous explanation, we further provide pairwise comparisons of PSformer with PatchTST and PDF models, using MSE-D to quantify the reduction in loss achieved by PSformer compared to these models. Furthermore, we include comparisons on the Exchange dataset, which is frequently used in time series forecasting studies and exhibits non-stationary characteristics. The detailed comparison results are presented in the table below.
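Here MSE-D is simply the MSE of the compared model minus the MSE of PSformer, so positive values indicate a loss reduction achieved by PSformer. As a quick check against one entry of the table below:

```python
# Exchange dataset, horizon 96, from the PatchTST comparison table below.
mse_patchtst = 0.089
mse_psformer = 0.081
mse_d = mse_patchtst - mse_psformer
print(round(mse_d, 3))   # 0.008, matching the MSE-D row
```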

PSformer vs. PatchTST (MSE)

| Dataset | Model | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| Exchange | PatchTST (512, noDL) | 0.089 | 0.195 | 0.340 | 1.000 |
| Exchange | PSformer (512, noDL) | 0.081 | 0.179 | 0.328 | 0.842 |
| Exchange | MSE-D | 0.008 | 0.016 | 0.012 | 0.158 |
| ETTh1 | PatchTST (512, noDL) | 0.374 | 0.413 | 0.434 | 0.455 |
| ETTh1 | PSformer (512, noDL) | 0.352 | 0.385 | 0.411 | 0.440 |
| ETTh1 | MSE-D | 0.022 | 0.028 | 0.023 | 0.015 |
| ETTh2 | PatchTST (512, noDL) | 0.274 | 0.341 | 0.364 | 0.390 |
| ETTh2 | PSformer (512, noDL) | 0.272 | 0.335 | 0.356 | 0.389 |
| ETTh2 | MSE-D | 0.002 | 0.006 | 0.008 | 0.001 |
| ETTm1 | PatchTST (512, noDL) | 0.290 | 0.333 | 0.370 | 0.416 |
| ETTm1 | PSformer (512, noDL) | 0.282 | 0.321 | 0.352 | 0.413 |
| ETTm1 | MSE-D | 0.008 | 0.012 | 0.018 | 0.003 |
| ETTm2 | PatchTST (512, noDL) | 0.166 | 0.223 | 0.273 | 0.363 |
| ETTm2 | PSformer (512, noDL) | 0.167 | 0.219 | 0.269 | 0.347 |
| ETTm2 | MSE-D | -0.001 | 0.004 | 0.004 | 0.016 |
| Weather | PatchTST (512, noDL) | 0.152 | 0.196 | 0.247 | 0.315 |
| Weather | PSformer (512, noDL) | 0.149 | 0.193 | 0.245 | 0.314 |
| Weather | MSE-D | 0.003 | 0.003 | 0.002 | 0.001 |

PSformer vs. PDF (MSE)

| Dataset | Model | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| Exchange | PDF (512, noDL) | 0.094 | 0.193 | 0.340 | 0.892 |
| Exchange | PSformer (512, noDL) | 0.081 | 0.179 | 0.328 | 0.842 |
| Exchange | MSE-D | 0.013 | 0.014 | 0.012 | 0.050 |
| ETTh1 | PDF (512, noDL) | 0.361 | 0.391 | 0.415 | 0.468 |
| ETTh1 | PSformer (512, noDL) | 0.352 | 0.385 | 0.411 | 0.440 |
| ETTh1 | MSE-D | 0.009 | 0.006 | 0.004 | 0.028 |
| ETTh2 | PDF (512, noDL) | 0.272 | 0.334 | 0.357 | 0.397 |
| ETTh2 | PSformer (512, noDL) | 0.272 | 0.335 | 0.356 | 0.389 |
| ETTh2 | MSE-D | 0.000 | -0.001 | 0.001 | 0.008 |
| ETTm1 | PDF (512, noDL) | 0.284 | 0.327 | 0.351 | 0.409 |
| ETTm1 | PSformer (512, noDL) | 0.282 | 0.321 | 0.352 | 0.413 |
| ETTm1 | MSE-D | 0.002 | 0.006 | -0.001 | -0.004 |
| ETTm2 | PDF (512, noDL) | 0.162 | 0.224 | 0.277 | 0.354 |
| ETTm2 | PSformer (512, noDL) | 0.167 | 0.219 | 0.269 | 0.347 |
| ETTm2 | MSE-D | -0.005 | 0.005 | 0.008 | 0.007 |
| Weather | PDF (512, noDL) | 0.147 | 0.191 | 0.243 | 0.317 |
| Weather | PSformer (512, noDL) | 0.149 | 0.193 | 0.245 | 0.314 |
| Weather | MSE-D | -0.002 | -0.002 | -0.002 | 0.003 |

From the comparisons across tasks on these six datasets, we observe that the MSE-D value is not only positive in most cases but also exceeds 0.005 in 15 tasks compared to PatchTST and in 12 tasks compared to PDF. It is important to note that our comparisons were conducted under unified noDL settings, which avoid inconsistencies in test set length caused by dropping the last batch of the test set. The DL issue can significantly reduce the MSE, especially for small-scale datasets with a large batch size.

Regarding W1. We understand that different perspectives may lead to varying interpretations of innovation. While we have carefully worked to address this, we also remain open to constructive feedback for further clarification or improvement. We have strived to ensure our contributions are conveyed in a clear and concise manner for better understanding. We acknowledge that some aspects might still require further refinement. In the next version, we will incorporate your valuable suggestions to better present this work.

Comment

Thanks to the authors for the rebuttal. Your answers address most of my concerns. Considering the paper's novelty, writing quality, and model performance, I raise my score to 5.

Comment

Thank you for acknowledging our response. We are delighted to hear that most of your concerns have been addressed, and we appreciate your reassessment of our work. During the rebuttal phase, we have learned a lot from your comments, including about the latest works and about ways to better present our experimental results. These suggestions have been instrumental in improving the overall quality of our work, and we will incorporate these enhancements into subsequent versions. Once again, thank you for your valuable feedback on our work.

Review (Rating: 5)

This paper suggests PSFormer, a new transformer-based model for multivariate time-series forecasting. The model utilizes parameter sharing (PS Block) and spatial-temporal segment attention (SegAtt) to decrease computational complexity while effectively modeling time- and feature-wise dependencies in the data. SegAtt is designed to group patches located at the same positions across different variables, enabling efficient capture of local spatio-temporal dependencies. The PS Block allows the model to maintain a linear path (residual connection) while also performing nonlinear transformations. As this block is shared, the model reduces its parameter count across the whole architecture. With these advantages, PSFormer achieves strong parameter efficiency and enhances predictive accuracy across various benchmark datasets, showing greater scalability and forecasting performance than conventional Transformer models.

Strengths

  1. PSFormer reduces the number of parameters through parameter sharing, which allows it to maintain both the size and the representational ability of the model. This design enhances the model’s scalability and helps mitigate overfitting in data-scarce scenarios.
  2. PSFormer demonstrates shorter running time and smaller model size, which empirically shows the efficacy of parameter sharing.
  3. The SegAtt mechanism effectively models spatio-temporal dependencies in multivariate time series by incorporating inter-variable information, boosting prediction accuracy.

Weaknesses

  1. The model’s performance varies with different hyperparameters, making hyperparameter tuning appear crucial.
  2. While SegAtt improves performance, it may underperform in cases of univariate time series or where there is little dependency between variables.
  3. The paper briefly mentions the necessity of positional encoding, but additional experiments are needed to assess its impact on the generalization of sequences that require a strong temporal order.

Questions

  1. Are there any specific settings where PSFormer performs particularly well? For example, does it work better on datasets with strong correlations between variables? When looking at the benchmark datasets, it seems that the Electricity and Exchange datasets exhibited weaker performance compared to others. I am curious if there is any reason for this.
  2. Could the authors provide an ablation study comparing PSFormer’s performance with and without SAM, as well as with other optimization methods such as Adam or SGD? This would clarify the specific impact of SAM on performance and provide a basis for its inclusion in the PSFormer architecture.
  3. The authors briefly explained the influence of positional encoding; however, have you conducted more specific experiments on the effects of this positional encoding on the model? I am curious about additional analysis results for various time series data. Could the authors perform experiments comparing the model’s performance with and without positional encoding on different time series types (e.g., seasonal vs. non-seasonal, stationary vs. non-stationary)? Such an analysis could clarify the benefits of positional encoding in diverse time series forecasting contexts.
  4. In the ablation study, I observed the analysis results regarding the influence of the number of encoder layers and the number of segments. It seems that tuning these hyperparameters has a significant impact on the model’s performance. Could the authors detail their hyperparameter tuning process, or alternatively, discuss recommended tuning approaches (e.g., grid search, random search, Bayesian optimization) tailored to PSFormer?
  5. Could the authors provide a detailed analysis of how parameter sharing in the PS Block affects the model’s temporal pattern capture? For instance, they could compare learned representations across layers with and without parameter sharing, illustrating its influence on time series modeling.
Comment

We thank the reviewer for the efforts in reviewing our paper and the thoughtful comments. Below, we offer detailed responses to each of your concerns.

W2: SegAtt may underperform in cases of univariate time series or with little dependency between variables. We tested the performance of PSformer in single-sequence forecasting. Specifically, we saved the 8 variables from the Exchange dataset into 8 separate single-sequence files, each supplemented with an additional all-zero column, so that each file contains a single target sequence together with one unrelated variable. We compared PSformer with the baseline model PatchTST[1] (which is channel-independent and performs well on the Exchange dataset).
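A simplified sketch of this data preparation (the column and file names are illustrative, not the exact ones in our scripts):

```python
import pandas as pd

df = pd.read_csv("exchange_rate.csv")              # columns: date, 7 variates, OT

for col in [c for c in df.columns if c != "date"]:
    uni = pd.DataFrame({
        "date": df["date"],
        col: df[col],                              # the single target series
        "unrelated": 0.0,                          # extra all-zero column (an unrelated variable)
    })
    uni.to_csv(f"exchange_{col}.csv", index=False) # one single-sequence file per variate
```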

The experimental results are shown below. As can be observed, PSformer consistently outperforms across all univariate time series, demonstrating that the PSformer architecture is not only effective in capturing cross-channel information but also performs well in univariate time series or with little dependency between variables.

During the experiments, PatchTST encountered a NaN loss on the validation set for a prediction length of 720, so the corresponding loss value was not recorded. In PSformer forecasting, we followed its current best configuration by adjusting the RevIN lookback window to 16 (For more about this setting, please refer to our response to Reviewer 2(xcmx)'s W2).

| Variate | Model | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| Variate 1 | PSformer | 0.057 | 0.147 | 0.360 | 0.803 |
| Variate 1 | PatchTST | 0.063 | 0.152 | 0.461 | - |
| Variate 2 | PSformer | 0.042 | 0.094 | 0.152 | 0.216 |
| Variate 2 | PatchTST | 0.052 | 0.140 | 0.181 | - |
| Variate 3 | PSformer | 0.034 | 0.093 | 0.169 | 0.476 |
| Variate 3 | PatchTST | 0.055 | 0.116 | 0.176 | - |
| Variate 4 | PSformer | 0.041 | 0.066 | 0.090 | 0.146 |
| Variate 4 | PatchTST | 0.058 | 0.085 | 0.111 | - |
| Variate 5 | PSformer | 0.007 | 0.009 | 0.012 | 0.073 |
| Variate 5 | PatchTST | 0.016 | 0.025 | 0.022 | - |
| Variate 6 | PSformer | 0.077 | 0.169 | 0.532 | 1.125 |
| Variate 6 | PatchTST | 0.099 | 0.215 | 0.505 | - |
| Variate 7 | PSformer | 0.033 | 0.069 | 0.120 | 0.342 |
| Variate 7 | PatchTST | 0.039 | 0.079 | 0.140 | - |
| OT | PSformer | 0.047 | 0.101 | 0.190 | 0.510 |
| OT | PatchTST | 0.097 | 0.154 | 0.230 | - |

Q1: Are there any specific settings where PSFormer performs particularly well? Do they work better on datasets with strong correlations between variables? Why the Electricity and Exchange datasets exhibited weaker performance compared to others?

Based on our experimental experience, there is no specific configuration that guarantees particularly good model performance. The model's performance does vary with different datasets and hyperparameter settings; however, this influence is complex, and we have not identified a universally optimal configuration. The settings we provide represent a relatively general and robust set of hyperparameters across different datasets. Nevertheless, it is still possible to enhance the model's performance through hyperparameter tuning according to the characteristics of the datasets.

PSformer indeed enhances prediction capability by capturing cross-channel correlations, as evidenced by the temporal relationships observed in the attention maps (further discussed in the README of our anonymous repository). However, we believe it is not possible to conclude, based on the current eight datasets, whether PSformer performs better on strongly correlated variables. This would require further experiments comparing datasets with strong and weak correlations.

The weaker performance on the Exchange dataset can be attributed to its non-stationary nature and random walk characteristics, which prevent RevIN[2] from obtaining stable mean and variance statistics. These statistics are sensitive to the choice of RevIN's lookback window. Further testing revealed that the model performs best when the lookback window for calculating RevIN's statistics is very small (length 16), achieving results superior to all selected baseline models. The specific experimental data are listed in the table below, and more related discussions can be found in our response to Reviewer 2(xcmx)'s W2. For the Electricity dataset, due to time constraints, we did not conduct detailed experiments. However, we believe that for non-stationary data, RevIN's normalization should be used with caution. Adjusting the lookback window length can help identify more stable statistical means and variances, thereby facilitating model training.

To be continued...

Comment
Exchange Rate

| Horizon \ norm window | 16 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| 96 | 0.081 | 0.085 | 0.090 | 0.092 | 0.091 |
| 192 | 0.179 | 0.187 | 0.189 | 0.191 | 0.197 |
| 336 | 0.328 | 0.338 | 0.356 | 0.362 | 0.345 |
| 720 | 0.842 | 0.900 | 0.976 | 1.003 | 1.036 |
| Avg | 0.358 | 0.378 | 0.403 | 0.412 | 0.417 |
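To make the norm-window setting concrete, here is a simplified sketch of instance normalization where the statistics are computed from only the most recent `norm_window` steps of the input; the learnable affine part and the de-normalization step of RevIN are omitted, and variable names are illustrative:

```python
import torch

def short_window_normalize(x, norm_window=16, eps=1e-5):
    """x: (batch, seq_len, n_vars). Normalize each instance using statistics
    from the last `norm_window` time steps instead of the full lookback."""
    stats = x[:, -norm_window:, :]                 # e.g. 16 of the 512 input steps
    mean = stats.mean(dim=1, keepdim=True)
    std = stats.std(dim=1, keepdim=True) + eps
    return (x - mean) / std, mean, std             # keep stats to de-normalize outputs
```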

Q2: Comparing PSFormer’s performance with and without SAM, or with other optimization methods such as Adam or SGD.

Regarding SAM[3], as described in Appendix A.6 and B.1 of our paper, it modifies the typical gradient descent update to seek a flatter optimum, which helps improve generalization. Specifically, SAM achieves this by perturbing the weights in the direction of the steepest gradient ascent within a small neighborhood of the current weights, with the parameter $\rho$ controlling the magnitude of the perturbation. Then, SAM calculates the gradient at this perturbed position and updates the weights accordingly. This process helps the model converge to solutions that are robust to small changes in the parameter space, making it less prone to overfitting.

SAM is nested on top of a base optimizer, such as Adam or SGD. In our experiments, we used Adam as the base optimizer (please see the experiment source code in our anonymous repository). We also compared different base optimizers (e.g., AdamW, SGD) and found the difference in their impact to be minimal. For the performance without SAM, please refer to Figure 3 in Appendix B.1 of the paper, where a $\rho$ value of 0 indicates that no additional perturbation is applied during gradient updates; in that case SAM reduces to the basic Adam optimizer. To minimize the potential influence of SAM, we compared PSformer with other models that also use the SAM optimizer (e.g., SAMformer[4], TSMixer[5]) and verified that PSformer achieved better performance. The specific experimental results are presented in Table 11.
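For reference, a minimal sketch of the two-step SAM update on top of a base optimizer such as Adam. The structure follows the SAM paper; `loss_fn` is a placeholder callable returning the training loss, and the names differ from the implementation in our repository:

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    # Step 1: compute the gradient and perturb the weights towards the
    # steepest ascent direction within an L2 ball of radius rho.
    loss_fn(model, batch).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps_list = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            eps = rho * g / (grad_norm + 1e-12)
            p.add_(eps)
            eps_list.append(eps)
    base_opt.zero_grad()

    # Step 2: compute the gradient at the perturbed weights, restore the
    # original weights, and let the base optimizer (e.g. Adam) take its step.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, eps in zip(params, eps_list):
            p.sub_(eps)
    base_opt.step()
    base_opt.zero_grad()
```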

Q3 and W3: Conducting more experiments on the effects of positional encoding on different time series types (e.g., seasonal vs. non-seasonal, stationary vs. non-stationary).

We added a set of comparative experiments using positional encoding to better illustrate its impact. Specifically, we tested the encoding performance under two different data transformation modes: pos emb (time series), where positional encoding is applied to the original time series before dimension transformation; and pos emb (segment), where positional encoding is applied to the transformed segments. The default case, No pos emb, refers to the absence of positional encoding.

To highlight the differences between seasonal vs. non-seasonal and stationary vs. non-stationary characteristics, we selected the ETTh1 and ETTh2 datasets (relatively seasonal and stable) as well as the Exchange dataset (relatively non-seasonal and non-stationary). The experimental results are shown below.

The degraded performance of pos emb (time series) might be due to the incompatibility of the positional encoding with dimension transformation, as the original temporal order is lost in the segment dimension, making it unsuitable for dot-product attention calculations. On the other hand, pos emb (segment) shows smaller changes compared to the No pos emb case, but the performance still deteriorates slightly. This suggests that the significance of positional encoding in the context of multivariate time series forecasting might need to be re-evaluated, as there are fundamental differences between NLP and time series data when applying attention mechanisms.

| Dataset | Horizon | pos emb (time series) | pos emb (segment) | No pos emb |
|---|---|---|---|---|
| ETTh1 | 96 | 0.378 | 0.353 | 0.352 |
| ETTh1 | 192 | 0.412 | 0.388 | 0.385 |
| ETTh2 | 96 | 0.298 | 0.276 | 0.272 |
| ETTh2 | 192 | 0.352 | 0.338 | 0.335 |
| Exchange Rate | 96 | 0.189 | 0.095 | 0.091 |
| Exchange Rate | 192 | 0.314 | 0.201 | 0.197 |
| Exchange Rate | 336 | 0.525 | 0.370 | 0.345 |
| Exchange Rate | 720 | 1.574 | 1.041 | 1.036 |

To be continued...

Comment

Q4 and W1: Could the authors detail their hyperparameter tuning process, or discuss recommended tuning approaches (e.g., grid search, random search, Bayesian optimization)?

We believe that such fine-tuning relies more on judgment and intuition. For example, when selecting the number of encoder layers, we consider factors such as the dataset size and the level of noise in the data. For smaller datasets or those with low signal-to-noise ratios, it is preferable to use fewer encoder layers to avoid the risk of overfitting. For parameters like the number of segments, which are less intuitive to adjust, we initially followed the settings of PatchTST, dividing the time series into 32 segments. While we also tested other potential segmentations, we found little improvement, so we ultimately standardized the setting to 32 segments.

Additionally, we believe parameter tuning is essential in practical business scenarios. For instance, in quantitative investment, some hyperparameter configurations may significantly enhance overall model performance. Therefore, efficient hyperparameter search methods, such as Optuna[6] or Hyperopt[7], are highly recommended for use in real-world applications. However, defining a good search space is equally important, as it directly impacts the effectiveness and robustness of the parameter search process.
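As an example of the kind of search we have in mind, a small Optuna sketch over a few PSformer hyperparameters. The objective, search ranges, and the `train_and_validate` routine are illustrative placeholders, not the configuration used for the reported results:

```python
import optuna

def objective(trial):
    n_layers = trial.suggest_int("encoder_layers", 1, 3)
    n_segments = trial.suggest_categorical("n_segments", [8, 16, 32, 64])
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    # train_and_validate is a placeholder for the user's training routine;
    # it should return the validation MSE for this configuration.
    return train_and_validate(n_layers=n_layers, n_segments=n_segments, lr=lr)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```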

Q5: how parameter sharing in the PS Block affects the model’s temporal pattern capture? Compare learned representations across layers with and without parameter sharing, and illustrating its influence on time series modeling.

In PSformer, the PS Block is the essential component of the two-stage SegAtt mechanism. It generates Q, K, V matrices to assign higher attention weights to critical temporal features. Within the single-channel temporal attention submaps under parameter sharing, we observe the model's ability to effectively capture temporal patterns. Besides, in the fusion stage, the PS Block integrates the features extracted in the first two stages of SegAtt.

To better illustrate this phenomenon of parameter sharing and its impact on spatio-temporal pattern capturing, we have provided additional visualizations and discussions in the README of our anonymous repository. Specifically, we compared the attention maps with and without parameter sharing, decomposing them into single-channel attention and inter-channel attention submaps. Furthermore, we compared the two-stage attention matrices under the non-parameter-sharing setup to analyze cross-layer variations.

Difference between with Parameter Sharing and without Parameter Sharing:

  • Numerical Range on Attention map: From the attention maps, we observe that attention matrices with parameter sharing generally have values within [-3, 3], whereas without parameter sharing, the range broadens significantly to [-30, 40]. This larger variation in attention weights without parameter sharing might accelerate convergence during training. On the other hand, parameter sharing tends to stabilize the learning process by limiting extreme variations, which can be beneficial for training stability, particularly on smaller datasets.

  • Cross-Channel Relationships: With parameter sharing, the relationships between channels are simpler and more interpretable. Clear, gradual transitions in attention weights across grid cells can be observed, indicating a more structured representation of spatio-temporal dependencies. In contrast, without parameter sharing, while progressive changes are still evident, the temporal relationships become more complex and harder to align with specific temporal positions. This highlights the role of parameter sharing in creating more structured attention patterns.

  • Layer-wise Feature Extraction: Comparing the two layers of attention maps without parameter sharing, the first layer appears to focus on capturing basic temporal patterns, while the second layer refines and combines these patterns into more complex features.

[1] Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.

[2] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J. H., & Choo, J. (2021). Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations.

[3] Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. Sharpness-aware Minimization for Efficiently Improving Generalization. In International Conference on Learning Representations.

[4] Ilbert, R., Odonnat, A., Feofanov, V., Virmaux, A., Paolo, G., Palpanas, T., & Redko, I. SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention. In Forty-first International Conference on Machine Learning.

To be continued...

Comment

[5] Chen, S. A., Li, C. L., Arik, S. O., Yoder, N. C., & Pfister, T. TSMixer: An All-MLP Architecture for Time Series Forecasting. Transactions on Machine Learning Research.

[6] Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International conference on knowledge discovery & data mining.

[7] Bergstra, J., Yamins, D., & Cox, D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning.

Comment

Thank you for the detailed responses and clarifications. Below is a concise summary of what I understood from your answers:

  • SegAtt in Univariate Time Series: Experiments confirmed PSFormer’s robust performance in univariate time series, outperforming PatchTST on the Exchange dataset.
  • Positional Encoding: Positional encoding showed limited effectiveness in multivariate time series, especially when applied before segment transformations. The findings suggest its role requires re-evaluation.
  • SAM Optimizer Impact: SAM improves generalization by finding flatter optima, with minimal differences between other optimizers. PSFormer outperformed SAM-equipped baselines like SAMformer and TSMixer.
  • Hyperparameter Tuning: Intuition is needed (adjustments such as fewer encoder layers help avoid overfitting on smaller datasets). Effective tuning methods like Optuna or Hyperopt can be used.
  • Parameter Sharing in PS Block: Parameter sharing stabilizes training and creates structured, interpretable attention patterns, enhancing spatio-temporal modeling, as shown with additional qualitative visualizations.
  • Weaker Performance on Specific Datasets: For Exchange, non-stationarity required careful RevIN tuning, improving results significantly. Electricity data demands further testing to refine normalization techniques.

Although your responses addressed the concerns, supporting PSFormer’s robustness and adaptability across diverse scenarios, I will maintain my score due to the paper’s novelty, presentation, and quality. Thank you for your effort.

Comment

Thank you for your valuable comments, which have greatly contributed to improving our work. We sincerely learned a lot from your feedback. Additionally, we have uploaded the rebuttal version of the paper, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality. We would be delighted to hear your further feedback.

Review (Rating: 5)

This paper introduces PSformer, a novel transformer-based architecture for time series forecasting that incorporates parameter sharing techniques and a Spatial-Temporal Segment Attention mechanism to capture local spatio-temporal dependencies.

Strengths

(1) The introduction of parameter sharing techniques in transformer-based models demonstrates its effectiveness and potential in the field as validated by experimental results.

(2) Experimental results demonstrate that PSformer achieves state-of-the-art performance on most of the datasets.

Weaknesses

(1) There are several writing errors present in the article.

(2) Although the parameter sharing techniques demonstrated effectiveness according to the ablation study, the authors did not provide detailed analysis or empirical studies to further elucidate this technique, such as comparisons of convergence rates with and without parameter sharing or analysis of how parameter sharing affects model capacity.

Questions

(1) Referring to Weaknesses (2), could you provide a more in-depth analysis or empirical studies to illustrate the effectiveness of parameter sharing in time series forecasting?

(2) According to the authors, attention is applied within each segment to enhance the extraction of local spatio-temporal relationships. However, in PSformer, a token represents a down-sampled sequence along a channel. This might be confusing because it suggests that the attention is applied to capture the global dependencies across channels. Could you provide a more detailed explanation of how the segmentation process preserves local temporal information while allowing for cross-channel interactions?

Ethics Concerns

No ethics concerns.

Comment

We appreciate the reviewer's efforts and comments on our paper. Below, we offer detailed responses to each of your concerns.

W1: Writing errors present in the article.

Thank you for pointing out the writing errors. We will address them in the revised version of the manuscript.

Q1 and W2: Comparisons of convergence rates with and without parameter sharing; Illustrate the effectiveness of parameter sharing in time series forecasting.

Comparisons of convergence rates with and without parameter sharing. We compare the impact of parameter sharing on the convergence rate using the ETTh1 and Exchange datasets, recording the total number of epochs and the MSE loss under the same settings. (For the Exchange dataset, we use the current best-performing setup, specifically with the lookback window length in RevIN set to 16. For more details, please refer to our response to Reviewer 2(xcmx)'s W2.)

The experimental results are shown in the table below, where the values represent epochs/MSE loss. We observe that the number of epochs required with or without parameter sharing on the ETTh1 dataset varies depending on the prediction length. However, for the Exchange dataset, the convergence rate is faster without parameter sharing.

Additionally, in terms of MSE loss, using parameter sharing leads to greater reductions in loss and also results in fewer parameters (as discussed in our response to Reviewer 3(xM2V)'s W2). Therefore, there exists a trade-off between convergence rates, loss reduction, and parameter efficiency.

ETTh1 (epochs / MSE)

| | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| w/ parameter sharing | 83 / 0.352 | 70 / 0.385 | 53 / 0.411 | 35 / 0.440 |
| w/o parameter sharing | 46 / 0.359 | 108 / 0.392 | 39 / 0.423 | 36 / 0.441 |

Exchange rate (epochs / MSE)

| | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| w/ parameter sharing | 71 / 0.081 | 51 / 0.179 | 82 / 0.328 | 32 / 0.842 |
| w/o parameter sharing | 47 / 0.084 | 32 / 0.183 | 43 / 0.333 | 31 / 0.855 |

Analysis of how parameter sharing affects model capacity. In the current encoder with parameter sharing, the PS_block network performs parameter sharing in seven places, including six networks for constructing the Q, K, and V matrices in the two-stage SegAtt, as well as the network for merging the output feature information of the two-stage SegAtt. This results in the encoder's parameter count under the parameter-sharing mode being one-seventh of that in the non-parameter-sharing mode. The total number of parameters in PSformer is the sum of the encoder's parameters and the final linear mapping. This means that as the number of encoders increases and the number of hidden layer nodes in the PS_block increases, the savings in total parameter count from parameter sharing become more significant. We have provided the total model parameter changes for different numbers of encoders; please refer to the response to Reviewer 3 (xM2V) regarding W2.
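The arithmetic behind the one-seventh ratio can be sketched as follows. This toy count covers only the PS-block parameters, with an assumed three-layer N x N MLP; the totals reported to Reviewer 3 (xM2V) also include the final linear mapping and the actual hidden sizes, so the numbers are not expected to match exactly:

```python
def ps_block_params(n_segments):
    # a three-layer N x N MLP (weights + biases); actual widths may differ
    return 3 * (n_segments * n_segments + n_segments)

def encoder_ps_params(n_segments, n_layers, share=True):
    # Each encoder layer applies the PS block in 7 places: Q, K, V in each of
    # the two SegAtt stages (6) plus the feature-fusion step (1).
    per_layer = ps_block_params(n_segments) if share else 7 * ps_block_params(n_segments)
    return n_layers * per_layer

for layers in (1, 3, 12):
    print(layers,
          encoder_ps_params(32, layers, share=True),    # shared:   1x PS block per layer
          encoder_ps_params(32, layers, share=False))   # unshared: 7x PS blocks per layer
```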

Q2: Providing detailed explanation of how the segmentation process preserves local temporal information while allowing for cross-channel interactions.

There may have been a misunderstanding regarding the statement that "a token represents a down-sampled sequence along a channel", so we would like to clarify as follows: The only instance where we use "token" to describe a sequence is in Appendix A.5, where we explain the reason for eliminating positional encoding. In this context, we discuss why positional encoding is not applied to segments, and we use a token to represent a temporal patch of a specific channel within a segment. It is not a down-sampled sequence but rather a local, continuous temporal sequence.

In the "The PSformer Framework" section of the paper, we provide a detailed explanation of the segment definition and the dimensional processing of SegAtt. We believe your confusion might mainly lie in how the PSformer Encoder handles local temporal dimension information and cross-channel information. To address this, we can perhaps clarify your concerns from both mathematical and visualization perspectives.

To be continued...

Comment

From the mathematical perspective: First, for each time series of length $L$, we divide it evenly into $N$ patches of length $P$, so that $L = P \times N$. For $M$ time series, the $i$-th patch from each of the $M$ series is combined into the $i$-th segment, where $i \in \{1, 2, \dots, N\}$. Thus, dot-product attention is applied within each segment, with the attention matrix $QK^T \in \mathbb{R}^{C \times C}$, where $C = M \times P$. This indicates that attention is applied to local spatial-temporal segments.

Additionally, the process of capturing global temporal information across all $N$ segments is further refined. Specifically, this is achieved through the weights of the PS Block, $W^S \in \mathbb{R}^{N \times N}$, which operate on all $N$ segments. More precisely, $W^S$ constructs the $Q$, $K$, and $V$ matrices via matrix multiplication with the input data $X$. This allows the model to assign weights to different segments, enabling the selection of important segments from the global temporal space and extracting their local spatial-temporal information, which is then fused into global information during the final fusion stage of the PS Block. In summary, the overall process encompasses both global (cross-segment PS Block in SegAtt and the fusion-stage PS Block) and local (segment-dimension attention) mechanisms.
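The corresponding tensor layout can be sketched as follows. The reshape and dimension ordering are illustrative (the released code may differ), and the raw segments stand in for the PS-block outputs that actually feed the dot product in PSformer:

```python
import torch

B, M, L = 8, 7, 512                 # batch size, number of series, input length
N = 32                              # number of segments
P = L // N                          # patch length, since L = P * N
C = M * P                           # segment size (here 7 * 16 = 112)

x = torch.randn(B, M, L)

# Split each series into N patches of length P, then group the i-th patch of
# all M series into the i-th segment: (B, M, L) -> (B, M, N, P) -> (B, C, N).
segments = x.reshape(B, M, N, P).permute(0, 2, 1, 3).reshape(B, N, C).transpose(1, 2)

# Attention over the C = M * P dimension: the attention matrix is C x C, so it
# mixes the M channels with the P time steps inside the segments.
scores = segments @ segments.transpose(1, 2)     # (B, C, C)
print(scores.shape)                              # torch.Size([8, 112, 112])
```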

From the visualization perspective: In the anonymous code repository's README, we further visualize and discuss the attention matrix of the first SegAtt, which may help in understanding the capture of local temporal and spatial information. In Figure 1.1 and Figure 1.2 of the README, you can observe the changes in attention weights within SegAtt across both channel and temporal dimensions.

Additionally, we further decompose the local spatio-temporal attention matrix into single-channel attention maps (e.g. Figure 2.1) and inter-channel pairwise attention maps (e.g. Figure 4.1). By comparing these maps with the corresponding temporal segments, we identify the temporal positions associated with high attention weights. This discussion may provide a more intuitive understanding of the mechanism behind the capture of local spatio-temporal information.

Comment

Thank you for your valuable comments, from which we benefit a lot in improving the work. Additionally, we have uploaded a new version of the paper, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality. We would be delighted to hear your further feedback.

Comment

Thank you for your detailed and thoughtful rebuttal. I appreciate the effort you have put into addressing my concerns and clarifying various aspects of your work. However, despite these clarifications, some of my original concerns remain unresolved. While your rebuttal has improved the clarity and depth of certain aspects, it does not fully address the key issues I raised. Therefore, I will maintain my original score.

Comment

Thank you for your feedback. In our previous responses, we primarily addressed two key concerns.

  • The effectiveness of the parameter sharing technique, where we followed your suggestions to provide experiments and analyses from the perspectives of convergence rate comparisons and the impact on model capacity.

  • Explanation about how segment attention captures local spatio-temporal information. Since you mentioned that some of your original concerns remain unresolved, we have attempted to provide further discussions on these two aspects.

For the experiments on the convergence rate of parameter sharing in Q1, the experimental results indicate that under the same settings, models without parameter sharing generally converge faster than those with parameter sharing, as reflected in the number of epochs required. Additionally, we compared the attention maps of models with and without parameter sharing in Sections B.5.1 and B.5.4 of the updated paper. It is worth noting that the range of attention weights is larger in the non-parameter-sharing setting, ranging from [-20, 40], compared to approximately [-3, 3] in the parameter-sharing setting. Larger attention weights correspond to faster gradient updates and quicker convergence. However, the convergence speed under parameter sharing can still be accelerated by increasing the learning rate. Furthermore, the convergence rate is also related to the extent of loss reduction, as achieving lower losses typically requires more epochs. This is also demonstrated by the convergence experiments, where models without parameter sharing require fewer epochs but exhibit higher MSE loss.

For local spatio-temporal relationships. In Figure 7 of the paper, we provide attention maps in time series forecasting. Due to parameter sharing, these maps become more structured, enabling us to more easily correlate multivariate time series with their corresponding attention maps. This allows for intuitive analyses of attention in single-channel (Section B.5.2) and cross-channel (Section B.5.3) settings, demonstrating that cross-spatio-temporal attention has been captured. Moreover, parameter sharing enhances the interpretability of spatio-temporal relationships in time series forecasting.

We have also supplemented the paper with additional experiments and analyses, which may help address your concerns. We would be delighted to hear your further feedback.

Review (Rating: 6)

This article presents a framework for multivariate time series forecasting, integrating parameter sharing and a Spatial-Temporal Segment Attention mechanism. Extensive experiments demonstrate that this approach consistently achieves higher accuracy and efficiency compared to other state-of-the-art models.

Strengths

a) This paper presents an innovative SegAtt mechanism and a parameter-sharing approach, effectively advancing methods in time series forecasting and aligning well with the field’s objectives.

b) The experimental evaluation is notably thorough, offering readers a well-rounded perspective on the framework’s performance and the contributions of its various components.

c) The writing is clear and of high quality, making the paper accessible and easy to understand.

Weaknesses

a) While SegAtt demonstrates notable strengths, it would be advantageous to evaluate the selection of segmentation numbers across a more diverse range of datasets beyond ETTh1 and ETTm1. Providing further guidance on practical approaches for selecting segmentation numbers would also be valuable.

b) The experimental section could be enhanced by further validating the framework’s parameter-saving capacity and examining its implications for pre-trained models.

c) To strengthen the robustness of the findings, incorporating results with multiple random seeds would provide additional confirmation of the framework’s superior performance.

d) A deeper exploration of channel-mixing techniques could enrich the analysis. A comparative discussion, similar to PatchTST A.7, on the benefits of channel independence versus channel-mixing strategies would be particularly insightful.

e) Regarding computational efficiency, could you clarify which specific components within the framework contribute to its computational advantages over other state-of-the- art models?

Questions

In the weakness part.

Comment

We appreciate the thoughtful and helpful comments from the reviewer. Below, we offer detailed responses to each of your concerns.

W1: Evaluate segmentation numbers across more datasets; Providing further guidance on practical approaches for selecting segmentation numbers.

We further conducted tests on the selection of segmentation numbers using the ETTh2 and Exchange Rate datasets. The results are shown in the table below. For prediction lengths of 192 and 336, a segment number of 8 further improves the forecasting performance (compared with the default segment number of 32). This suggests that there is no single fixed value that is optimal for the number of segments. The choice may be influenced by the characteristics of different datasets, the prediction lengths, and the associated increase in the overall number of model parameters.

Therefore, for datasets without clear seasonality and with non-stationary characteristics (e.g., financial asset time series, including exchange rates), it is preferable to choose a larger segmentation number. This allows each segment to capture smaller local spatio-temporal patterns, better addressing unstable variation modes. On the other hand, for datasets with significant seasonality or relatively stationary characteristics (e.g., electricity and traffic), a relatively smaller segmentation number often facilitates model training.

ETTh2

| Segments | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| 8 | 0.273 | 0.333 | 0.353 | 0.387 |
| 16 | 0.273 | 0.333 | 0.357 | 0.387 |
| 32 | 0.272 | 0.335 | 0.356 | 0.389 |
| 64 | 0.275 | 0.337 | 0.357 | 0.394 |

Exchange Rate

| Segments | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| 8 | 0.091 | 0.183 | 0.340 | 1.071 |
| 16 | 0.092 | 0.200 | 0.345 | 1.071 |
| 32 | 0.091 | 0.197 | 0.345 | 1.036 |
| 64 | 0.098 | 0.235 | 0.417 | 1.037 |
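For reference, the sketch below (our own PyTorch illustration, not the exact PSformer code) shows how a length-512 look-back window is split into non-overlapping segments for the segment counts compared above; the per-segment length is simply the input length divided by the segment count.

```python
import torch

def split_into_segments(x: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Split a batch of series (batch, channels, seq_len) into non-overlapping
    segments along the time axis.

    Returns a tensor of shape (batch, channels, num_segments, seg_len),
    where seg_len = seq_len // num_segments.
    """
    batch, channels, seq_len = x.shape
    assert seq_len % num_segments == 0, "the segment count must divide the input length"
    seg_len = seq_len // num_segments
    return x.reshape(batch, channels, num_segments, seg_len)

x = torch.randn(8, 7, 512)        # e.g. an ETT-style batch: 7 channels, look-back 512
for n in (8, 16, 32, 64):         # the segment counts compared in the tables above
    print(n, split_into_segments(x, n).shape)   # seg_len shrinks from 64 to 8 as n grows
```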

W2: Validating the parameter-saving capacity and examining its implications for pre-trained models.

To further validate the framework's parameter-saving capacity, we compared the parameter count of PSformer with and without Parameter Sharing, covering 1-layer and 3-layer encoders as well as depths matching the GPT-2 family (GPT2-small, GPT2-medium, GPT2-large, and GPT2-xl) at 12, 24, 36, and 48 layers. The results are reported in the table below. In addition, if the hidden layer dimension were expanded from 32 to 1024 (as in GPT-2), or if multi-head attention were adopted, the total number of parameters would increase significantly.

| Layers | 1 | 3 | 12 | 24 | 36 | 48 |
|---|---|---|---|---|---|---|
| Parameter Sharing | 52,416 | 58,752 | 87,264 | 125,280 | 163,296 | 201,312 |
| No Parameter Sharing | 71,424 | 115,776 | 315,360 | 581,472 | 847,584 | 1,113,696 |
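To make the effect of sharing concrete, here is a rough PyTorch sketch (a toy block with hypothetical layer sizes, not the authors' actual PS Block) showing that reusing one block instance across all encoder layers keeps the trainable parameter count nearly flat as depth grows, whereas instantiating a separate block per layer grows it linearly.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    # nn.Module.parameters() de-duplicates shared tensors, so a reused block is counted once.
    return sum(p.numel() for p in module.parameters())

def build_encoder(num_layers: int, dim: int = 32, share: bool = True) -> nn.Module:
    """Toy encoder stack: every layer wraps a two-layer MLP block.
    With share=True a single block instance is reused by all layers."""
    shared_block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    layers = []
    for _ in range(num_layers):
        layers.append(shared_block if share
                      else nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)))
    return nn.ModuleList(layers)

for n in (1, 3, 12, 24, 36, 48):
    print(f"{n:2d} layers: shared={count_params(build_encoder(n, share=True)):7d}  "
          f"separate={count_params(build_encoder(n, share=False)):7d}")
```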

W3: Provide multiple random seeds results to strengthen the robustness.

Below are the MSE and MAE results (mean ± variation across 5 random seeds) on 5 datasets. The complete metric variations will be included in the revised version of the paper.

ETTh1

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.352±0.002 | 0.385±0.002 | 0.411±0.001 | 0.440±0.003 |
| MAE | 0.385±0.001 | 0.406±0.001 | 0.424±0.001 | 0.456±0.002 |

ETTh2

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.272±0.002 | 0.335±0.002 | 0.356±0.001 | 0.389±0.001 |
| MAE | 0.337±0.001 | 0.379±0.002 | 0.411±0.001 | 0.431±0.001 |

ETTm1

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.282±0.004 | 0.321±0.003 | 0.352±0.002 | 0.413±0.002 |
| MAE | 0.336±0.003 | 0.360±0.003 | 0.380±0.001 | 0.412±0.001 |

ETTm2

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.167±0.003 | 0.219±0.001 | 0.269±0.001 | 0.347±0.002 |
| MAE | 0.258±0.002 | 0.292±0.001 | 0.325±0.001 | 0.376±0.001 |

To be continued...

Comment
Weather

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.149±0.001 | 0.193±0.001 | 0.245±0.002 | 0.314±0.002 |
| MAE | 0.200±0.001 | 0.243±0.001 | 0.282±0.001 | 0.332±0.001 |

W4: Channel Independence versus Channel-Mixing strategies.

In the discussion of PatchTST A.7, they suggest that predicting unrelated time series relies on different attention patterns, while similar time series produce similar attention maps. However, this does not fully explain why cross-channel temporal information cannot be used to improve forecasting performance. In fact, in many traditional multivariate time series models, it is common for multiple independent variables to determine the temporal patterns of the dependent variable. Additionally, they suggest that in the channel-independent mode, each time series generates its own attention maps, whereas in the channel-mixed mode, all time series share the same attention patterns.

Regarding these issues, in the README of our anonymous code repository we analyze the attention maps from Appendix B.4 of the paper in terms of local temporal and spatial attention. We find that in our channel-mixed mode, the initial attention maps can be decomposed into components that capture single-variable time series information, similar to the channel-independent mode, and additional components that capture cross-channel spatio-temporal patterns.

For example, (Figure 1.1 in ReadMe.md), we observe distinct high-attention regions (e.g., the top-left corner) and high-attention areas between different time steps and variables (e.g., the non-diagonal symmetric grid section in the top-left corner). From this, we can see that the coordinate positions within the spatial attention submatrix (inside the grids) exhibit the same or gradually changing attention weights across different time steps (between the grids). This suggests that the model may capture temporal consistency between variables, local stationary features, and long-term dependencies across time steps. We further separate channel-mixed attention maps into attention maps for individual channels and cross-channel attention maps. More related visual analysis and discussion can be found in the anonymous repository.
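To make this decomposition concrete, the following sketch (our own illustration, assuming the flattened segment axis is ordered channel-major, i.e. all time positions of channel 0 first, then channel 1, and so on; the actual ordering in our code may differ) splits a channel-mixed attention map into within-channel blocks and cross-channel blocks.

```python
import numpy as np

def split_attention(attn: np.ndarray, num_channels: int):
    """attn: (num_channels * seg_len, num_channels * seg_len) attention map over one
    spatio-temporal segment, assumed to be ordered channel-major.

    Returns (within, cross): within[c] is channel c's own seg_len x seg_len attention
    sub-map (the diagonal blocks); cross[i, j] holds the attention from channel i's
    time steps to channel j's time steps, with the diagonal blocks zeroed out."""
    size = attn.shape[0]
    assert size % num_channels == 0
    seg_len = size // num_channels
    blocks = attn.reshape(num_channels, seg_len, num_channels, seg_len).transpose(0, 2, 1, 3)
    within = np.stack([blocks[c, c] for c in range(num_channels)])
    cross = blocks.copy()
    for c in range(num_channels):
        cross[c, c] = 0.0
    return within, cross

attn = np.random.rand(7 * 16, 7 * 16)        # e.g. 7 channels, 16 time steps per segment
within, cross = split_attention(attn, num_channels=7)
print(within.shape, cross.shape)             # (7, 16, 16) and (7, 7, 16, 16)
```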

Besides, we think the challenges associated with multi-channel techniques, such as the risk of overfitting during training and the increased computational cost of attention, should not prevent their use. Instead, these issues can be mitigated through other methods, such as parameter sharing and sparse attention.

W5: Which components contribute to its computational advantages over other state-of-the-art models?

Regarding computational efficiency, there are several components in the framework that contribute to its advantages over other state-of-the-art models:

  • Parameter Sharing Techniques: The use of parameter sharing reduces the overall number of parameters that need to be optimized, leading to improved computational efficiency without compromising performance.

  • Segment Attention Mechanism: By segmenting the input sequence and applying attention at the segment level, the input dimension for the ps_block is reduced to the number of segments, significantly decreasing the parameter count and computational overhead.

  • Robustness Without Dropout or Positional Encoding: The inherent robustness of the model eliminates the need for operations like dropout and positional encoding, which are commonly used in other models. This further reduces computational costs while maintaining performance.

Comment

I have reviewed the authors' rebuttal, and while they have addressed most of my concerns, I will maintain my current score due to the paper's quality, novelty, and presentation.

Comment

Thank you for your valuable comments and acknowledgement of our efforts. We learned a lot from your feedback, and we have also uploaded a new version of the paper, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality. We would be delighted to hear your further feedback.

Official Review (Rating: 6)

This paper presents a Transformer architecture for time series forecasting that highlights parameter sharing and cross-channel patching. Specifically, it applies attention across both channels and patches for spatio-temporal information fusion, and aligns the parameters of linear projections within an encoder block. Experiments show that these two designs contribute to the overall performance improvement on popular benchmarks.

Strengths

  1. The method proposed in the paper has clear motivation and solid intuition.
  2. The idea is presented with clarity and is easy to follow.
  3. Sufficient analysis is done to highlight the efficacy of the proposed updates on existing Transformer architectures.

Weaknesses

  1. The accuracy improvement is relatively marginal, especially in ablation studies, and variances in metrics should be reported to support the significance of the contribution from the proposed ideas.
  2. An analysis about why PSformer does not work well on exchange would be helpful.

Questions

See weakness

Comment

We thank the reviewer for the valuable comments on our paper. Below, we offer detailed responses to each of your concerns.

W1: The accuracy improvement is marginal, especially in ablation studies, and variances in metrics should be reported.
Regarding the performance improvement, it is important to highlight that many of the baselines we selected are recent and highly competitive models, making accuracy improvements both challenging and meaningful. Regarding the relatively marginal improvements observed in the ablation studies, this is because the ablation study covers four aspects:

  • The first two aspects focus on analyzing the impact of the number of segments and encoders in PSformer. From Table 3 and Table 4, the observed changes in these areas are indeed minor (with an average MSE variation of 0.002 on the ETTh1 and ETTm1 datasets), demonstrating that PSformer is robust to these two critical hyperparameters.

  • The latter two aspects involve ablation studies on the model's key innovations: parameter sharing and segment attention. From Table 5 and Table 6, it can be observed that the changes in metrics are substantial. Specifically, parameter sharing achieved an average MSE reduction of 0.016 on ETTm1, while SegAtt achieved an average MSE reduction of 0.017 on ETTh1. These results demonstrate that the two core contributions of PSformer significantly improve its performance, making it more competitive compared to a wide range of baseline models.

Below are the MSE and MAE results (mean ± variation across 5 random seeds) on 5 datasets. The complete metric variations will be included in the revised version of the paper.

ETTh1

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.352±0.002 | 0.385±0.002 | 0.411±0.001 | 0.440±0.003 |
| MAE | 0.385±0.001 | 0.406±0.001 | 0.424±0.001 | 0.456±0.002 |

ETTh2

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.272±0.002 | 0.335±0.002 | 0.356±0.001 | 0.389±0.001 |
| MAE | 0.337±0.001 | 0.379±0.002 | 0.411±0.001 | 0.431±0.001 |

ETTm1

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.282±0.004 | 0.321±0.003 | 0.352±0.002 | 0.413±0.002 |
| MAE | 0.336±0.003 | 0.360±0.003 | 0.380±0.001 | 0.412±0.001 |

ETTm2

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.167±0.003 | 0.219±0.001 | 0.269±0.001 | 0.347±0.002 |
| MAE | 0.258±0.002 | 0.292±0.001 | 0.325±0.001 | 0.376±0.001 |

Weather

| Metric | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| MSE | 0.149±0.001 | 0.193±0.001 | 0.245±0.002 | 0.314±0.002 |
| MAE | 0.200±0.001 | 0.243±0.001 | 0.282±0.001 | 0.332±0.001 |

W2: Why does PSformer not work well on Exchange?

We conducted a relevant analysis based on the work of DLinear[1] and SAN[2] and believe that the weaker performance on the exchange rate dataset is related to the characteristics of the exchange rate data and the application of RevIN[3]. Financial time series, including exchange rates, typically exhibit non-stationarity and behave close to random walks. As a result, it is difficult for RevIN to estimate relatively stable mean and variance statistics from past windows, because the values it computes are highly sensitive to the length of the statistical look-back window. This sensitivity is a key reason why PSformer currently does not work well on the exchange rate dataset.

To be continued...

Comment

To validate this hypothesis, we kept the overall experimental setup for the exchange rate data unchanged (i.e., with an input length of 512) while varying the look-back window length used by RevIN to calculate the mean and variance. The specific window lengths tested were [16, 64, 128, 256, 512], and we compared the final prediction MSE loss, as shown in the table below. From the results, we observed that by simply reducing the look-back window for RevIN's statistical calculations, the prediction loss improved significantly. When the window length was set to 16, the average loss for the exchange rate was 0.3575, which is a 16.5% reduction compared to the current loss. This interesting phenomenon validates our idea that selecting an appropriate look-back window for RevIN based on the characteristics of different datasets helps identify more stable statistical means and variances, thereby facilitating model training. A deeper analysis of this phenomenon is worth further investigation in future research.

Exchange Rate (MSE by RevIN normalization window length)

| Horizon | 16 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| 96 | 0.081 | 0.085 | 0.090 | 0.092 | 0.091 |
| 192 | 0.179 | 0.187 | 0.189 | 0.191 | 0.197 |
| 336 | 0.328 | 0.338 | 0.356 | 0.362 | 0.345 |
| 720 | 0.842 | 0.900 | 0.976 | 1.003 | 1.036 |
| Avg | 0.358 | 0.378 | 0.403 | 0.412 | 0.417 |
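For clarity, the experiment can be sketched as follows (a minimal illustration, not the exact RevIN implementation, which also includes learnable affine parameters): the normalization statistics are computed over only the last `norm_window` steps of the 512-step look-back window.

```python
import torch

def revin_normalize(x: torch.Tensor, norm_window: int, eps: float = 1e-5):
    """x: (batch, seq_len, channels) look-back window.
    Computes per-instance mean/std over only the last `norm_window` steps,
    normalizes the full window with them, and returns the statistics so the
    forecast can later be de-normalized."""
    stats = x[:, -norm_window:, :]
    mean = stats.mean(dim=1, keepdim=True)
    std = stats.std(dim=1, keepdim=True) + eps
    return (x - mean) / std, mean, std

def revin_denormalize(y_hat: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    """y_hat: (batch, pred_len, channels) forecast in normalized space."""
    return y_hat * std + mean

x = torch.randn(32, 512, 8)                              # exchange-rate style: 8 channels, look-back 512
x_norm, mean, std = revin_normalize(x, norm_window=16)   # a window of 16 gave the best average MSE above
```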

[1] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2023). Are transformers effective for time series forecasting?. In Proceedings of the AAAI conference on artificial intelligence.

[2] Liu, Z., Cheng, M., Li, Z., Huang, Z., Liu, Q., Xie, Y., & Chen, E. (2024). Adaptive normalization for non-stationary time series forecasting: A temporal slice perspective. Advances in Neural Information Processing Systems.

[3] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J. H., & Choo, J. (2021). Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations.

Comment

I highly appreciate the author's feedback on my comments. For concern 1, the provided variance is very helpful and the additional introduction on purpose of ablation is necessary. For concern 2, the impact of norm window size for RevIn is an interesting point of view that inspires me a lot. Overall I feel my concerns are well addressed by the rebuttal, and would like to champion this work for acceptance. But I will keep the current rating since the novelty of the ideas is not sufficiently ground-breaking in this domain.

Comment

Thank you for the support of our work. Your comments have greatly helped improve the quality of our work. We have also uploaded the new version of the manuscript, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality. We would be delighted to hear your further feedback.

Official Review (Rating: 5)

The article introduces a novel Transformer architecture, PSformer, for time series forecasting, incorporating parameter sharing (PS) and Spatial-Temporal Segment Attention (SegAtt). The authors conducted experiments on various benchmark datasets, comparing PSformer with baseline methods.

Strengths

  1. The design ideas and motivations behind the model are very interesting.

  2. The way the article is written is very good, making it very easy to follow.

Weaknesses

  1. The article has some minor errors, for example, I couldn't seem to find Table 27.

  2. The article lacks research on important references: for example, the authors focus on iTransformer from ICLR 2024 but do not consider contemporaneous models like TimeMixer[1] and FITS[2]. Furthermore, it still lacks a comparison with GNN-based methods like CrossGNN[3] and FourierGNN[4] from NeurIPS 2023 and other methods such as MICN[5] from ICLR 2023.

[1] Wang, S., Wu, H., Shi, X., Hu, T., Luo, H., Ma, L., ... & ZHOU, J. (2024). TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.

[2] Xu, Z., Zeng, A., & Xu, Q. (2024). FITS: Modeling Time Series with 10k Parameters. In The Twelfth International Conference on Learning Representations.

[3] Huang, Q., Shen, L., Zhang, R., Ding, S., Wang, B., Zhou, Z., & Wang, Y. CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement. In Thirty-seventh Conference on Neural Information Processing Systems.

[4] Yi, K., Zhang, Q., Fan, W., He, H., Hu, L., Wang, P., ... & Niu, Z. FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective. In Thirty-seventh Conference on Neural Information Processing Systems.

[5] Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., & Xiao, Y. (2023). Micn: Multi-scale local and global context modeling for long-term series forecasting. In The Eleventh International Conference on Learning Representations.

  3. The experimental comparisons are insufficient. The methods mentioned in W2 and TimesNet[6] were also not compared by the authors; therefore, it cannot be concluded that PSformer achieves SOTA performance.

[6] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., & Long, M. (2023). TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In The Eleventh International Conference on Learning Representations.

  4. The author seems to have not provided details about the experimental platform. Additionally, it appears that the author did not specify the parameter search space for the compared methods. Has the author completed the work of determining the best model parameters on the validation set? If this work has not been done, it would not be fair to compare the performance of all models, as the specific choices of the experimental platform and model parameters can have a significant impact on the experimental conclusions. It is hoped that the author can clarify this point.

Questions

See the Weaknesses.

Comment

We appreciate the reviewer's thoughtful comments on our paper. Below, we offer detailed responses to each of your concerns.

W1: The article has some minor errors; Table 27 could not be found.

Thank you for pointing out this issue. This is actually a cross-reference compilation problem: "Table 27" should refer to Table 11, which contains the full long-term forecasting results. We apologize for the confusion caused by this and will fix the issue in the next version of the manuscript.

W2: Lack research on important references, such as TimeMixer, FITS, CrossGNN, FourierGNN and MICN.

Thank you for your advice. In recent years, there have been many notable and excellent works in the time-series forecasting field, such as TimeMixer[1], FITS[2], GNN-based methods like CrossGNN[3] and FourierGNN[4], and MICN[5]. When selecting baseline models for our work, we considered three main factors:

  1. First, since our work draws inspiration from SAMformer[6] by applying SAM[7] optimization techniques to time-series training, we wanted to include SAMformer, which also uses SAM optimization, alongside TSMixer[8] as part of our baseline. This allows us to more clearly demonstrate the performance of our work under the same optimization method.

  2. At the same time, we wanted to include as many recent works from various time-series models as possible to present a comprehensive evaluation, including iTransformer[9], SAMformer, MOMENT[10], GPT4TS[11], and ModernTCN[12].

  3. Additionally, some important works should not be overlooked, such as PatchTST[13], FEDformer[14], and Autoformer[15]. RLinear[16] is also crucial as it demonstrates the forecasting performance of pure linear mappings.

Since papers in the time-series field typically use around 10 models as baselines, we selected the current baseline models to allow for a more comprehensive comparison of our work. However, this does not imply that other models are of lesser reference value. In fact, they are equally important and help readers gain a more complete understanding of the latest developments in the time-series forecasting field. Therefore, we will include more of these important models and works in the revised version of our paper.

W3: The PSformer should compare with models mentioned in W2 and with TimesNet.

We have gathered and organized these excellent recent works for a more comprehensive comparison in the tables below:

ETTh1

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.352 | 0.385 | 0.411 | 0.440 | 0.397 |
| TimeMixer (F, noDL) | MSE | 0.375 | 0.429 | 0.484 | 0.498 | 0.447 |
| CrossGNN (F, noDL) | MSE | 0.382 | 0.427 | 0.465 | 0.472 | 0.437 |
| MICN (F, DL) | MSE | \ | \ | \ | \ | \ |
| TimesNet (F, DL) | MSE | 0.384 | 0.436 | 0.491 | 0.521 | 0.458 |
| FITS (M, noDL) | MSE | 0.372 | 0.404 | 0.427 | 0.424 | 0.407 |

ETTh2

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.272 | 0.335 | 0.356 | 0.389 | 0.338 |
| TimeMixer (F, noDL) | MSE | 0.289 | 0.372 | 0.386 | 0.412 | 0.365 |
| CrossGNN (F, noDL) | MSE | 0.309 | 0.390 | 0.426 | 0.445 | 0.393 |
| MICN (F, DL) | MSE | \ | \ | \ | \ | \ |
| TimesNet (F, DL) | MSE | 0.340 | 0.402 | 0.452 | 0.462 | 0.414 |
| FITS (M, noDL) | MSE | 0.271 | 0.331 | 0.354 | 0.377 | 0.333 |

ETTm1

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.282 | 0.321 | 0.352 | 0.413 | 0.342 |
| TimeMixer (F, noDL) | MSE | 0.320 | 0.361 | 0.390 | 0.454 | 0.381 |
| CrossGNN (F, noDL) | MSE | 0.335 | 0.372 | 0.403 | 0.461 | 0.393 |
| MICN (F, DL) | MSE | \ | \ | \ | \ | \ |
| TimesNet (F, DL) | MSE | 0.338 | 0.372 | 0.410 | 0.478 | 0.400 |
| FITS (M, noDL) | MSE | 0.303 | 0.337 | 0.366 | 0.415 | 0.355 |

To be continued...

Comment

ETTm2

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.167 | 0.219 | 0.269 | 0.347 | 0.251 |
| TimeMixer (F, noDL) | MSE | 0.175 | 0.237 | 0.298 | 0.391 | 0.275 |
| CrossGNN (F, noDL) | MSE | 0.176 | 0.240 | 0.304 | 0.406 | 0.282 |
| MICN (F, DL) | MSE | 0.179 | 0.307 | 0.325 | 0.502 | 0.328 |
| TimesNet (F, DL) | MSE | 0.187 | 0.249 | 0.321 | 0.408 | 0.291 |
| FITS (M, noDL) | MSE | 0.162 | 0.216 | 0.268 | 0.348 | 0.249 |

Weather

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.149 | 0.193 | 0.245 | 0.314 | 0.225 |
| TimeMixer (F, noDL) | MSE | 0.163 | 0.208 | 0.251 | 0.339 | 0.240 |
| CrossGNN (F, noDL) | MSE | 0.159 | 0.211 | 0.267 | 0.352 | 0.247 |
| MICN (F, DL) | MSE | \ | \ | \ | \ | \ |
| TimesNet (F, DL) | MSE | 0.172 | 0.219 | 0.280 | 0.365 | 0.259 |
| FITS (M, noDL) | MSE | 0.143 | 0.186 | 0.236 | 0.307 | 0.218 |

Electricity

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.133 | 0.149 | 0.164 | 0.203 | 0.162 |
| TimeMixer (F, noDL) | MSE | 0.153 | 0.166 | 0.185 | 0.225 | 0.182 |
| CrossGNN (F, noDL) | MSE | 0.173 | 0.195 | 0.206 | 0.231 | 0.201 |
| MICN (F, DL) | MSE | 0.164 | 0.177 | 0.193 | 0.212 | 0.187 |
| TimesNet (F, DL) | MSE | 0.168 | 0.184 | 0.198 | 0.220 | 0.192 |
| FITS (M, noDL) | MSE | 0.134 | 0.149 | 0.165 | 0.203 | 0.163 |

Exchange Rate

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.091 | 0.197 | 0.345 | 1.036 | 0.417 |
| TimeMixer (F, noDL) | MSE | 0.090 | 0.187 | 0.353 | 0.934 | 0.391 |
| CrossGNN (F, noDL) | MSE | 0.084 | 0.171 | 0.319 | 0.805 | 0.345 |
| MICN (F, DL) | MSE | 0.102 | 0.172 | 0.272 | 0.714 | 0.315 |
| TimesNet (F, DL) | MSE | 0.107 | 0.226 | 0.367 | 0.964 | 0.416 |
| FITS (M, noDL) | MSE | \ | \ | \ | \ | \ |

Traffic

| Models | Metric | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| PSformer (F, noDL) | MSE | 0.367 | 0.390 | 0.404 | 0.439 | 0.400 |
| TimeMixer (F, noDL) | MSE | 0.462 | 0.473 | 0.498 | 0.506 | 0.485 |
| CrossGNN (F, noDL) | MSE | 0.570 | 0.577 | 0.588 | 0.597 | 0.583 |
| MICN (F, DL) | MSE | 0.519 | 0.537 | 0.534 | 0.577 | 0.542 |
| TimesNet (F, DL) | MSE | 0.593 | 0.617 | 0.629 | 0.640 | 0.620 |
| FITS (M, noDL) | MSE | 0.385 | 0.397 | 0.410 | 0.448 | 0.410 |

To be continued...

Comment

We have collected the main experimental results of the relevant models. We did not include FourierGNN, as it did not report performance on most of the benchmark datasets. Although there are differences in experimental setups across each work, which may affect the results and prevent a completely fair comparison, we have provided some key settings to help better understand the model performance.

For each model, the name is followed by two values in parentheses: the first value represents the look-back window mode (with "F" indicating fixed input length and "M" indicating a grid search was performed across different input lengths), and the second value indicates whether the test set dataloader drops the last batch of data (noDL: does not drop last; DL: drops last).

From the results, it is evident that:

  • Compared to models with fixed windows, PSformer performed best on 7/8 of the prediction tasks. This further highlights PSformer's competitive performance in forecasting.

  • Even when compared to non-fixed window models like FITS, PSformer performed best on 4/7 of the prediction tasks.

  • Regarding the reason why the PSformer model did not achieve the best performance on the Exchange dataset, we provided a discussion in Reviewer 2 (xcmx)'s W2. Additionally, after reconstructing the RevIN[17] window, we obtained up to a 16.5% reduction in loss and reached competitive performance compared with the selected baseline models.

Besides, we suggest that the main purpose of comparing more models is not only to pursue better forecasting performance but also to provide readers with a more comprehensive understanding of the latest research advancements in this field. It is important to draw greater attention to the innovative techniques behind these models.

W4: Providing details about the experimental platform; Specify the parameter search space; Have the best model parameters been determined using the validation set?

Experimental Setup. All experiments were conducted on two servers, each equipped with an 80GB NVIDIA A100 GPU and 4 Intel Xeon Gold 5218 CPUs.

Parameter Search Space. As stated in the paper, PSformer has a parameter-efficient structure, with high robustness and a limited number of hyperparameters. The main hyperparameters of PSformer include: 1. the number of encoders, 2. the number of segments, and 3. the SAM hyperparameter rho. For the number of encoders, we primarily searched within 1-3 layers. For the number of segments, we maintain the same as the number of patches in PatchTST; we also analyzed values that divide the input length evenly (specifically: 2, 4, 8, 16, 32, 64, 128, 256), and ultimately set all prediction tasks to 32 to avoid performance improvements caused by complex hyperparameter tuning. For the SAM hyperparameter rho, we referred to SAMformer and performed a search across 11 parameter points evenly spaced in the range [0, 1] (please refer to PSformer Appendix B.1 and B.2 for details). The number of encoders and segments can be found in the ablation study in Section 4.2. Additionally, for the learning rate, we mainly tested values of 1e-3 and 1e-4, and for the learning rate scheduler, we tested OneCycle[18] and MultiStepLR.
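For concreteness, the search described above can be summarized by the sketch below, where `train_and_validate` is a hypothetical helper standing in for one full PSformer training run that returns the validation MSE.

```python
import itertools
import numpy as np

# Search space as described above. `train_and_validate` is a hypothetical helper that
# trains PSformer with one configuration and returns its validation MSE.
search_space = {
    "num_encoders": [1, 2, 3],
    "num_segments": [2, 4, 8, 16, 32, 64, 128, 256],   # analyzed; 32 was used for all final runs
    "rho": list(np.linspace(0.0, 1.0, 11)),            # SAM neighborhood size, 11 evenly spaced points
    "lr": [1e-3, 1e-4],
    "scheduler": ["OneCycle", "MultiStepLR"],
}

def grid_search(train_and_validate):
    best_cfg, best_mse = None, float("inf")
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        mse = train_and_validate(cfg)
        if mse < best_mse:
            best_cfg, best_mse = cfg, mse
    return best_cfg, best_mse
```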

Best Model Parameters. Regarding whether the best model parameters were determined based solely on the validation set, we cannot guarantee that all our model hyperparameters were exclusively determined using the validation set (with no reference to the test set performance). This is likely a common issue in the time-series domain. However, we would like to emphasize that our model has relatively few and robust hyperparameters, and we try to avoid excessive hyperparameter tuning, as shown in Table 7 in Appendix A.2.

To be continued...

Comment

[1] Wang, S., Wu, H., Shi, X., Hu, T., Luo, H., Ma, L., ... & ZHOU, J. (2024). TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.

[2] Xu, Z., Zeng, A., & Xu, Q. (2024). FITS: Modeling Time Series with 10k Parameters. In The Twelfth International Conference on Learning Representations.

[3] Huang, Q., Shen, L., Zhang, R., Ding, S., Wang, B., Zhou, Z., & Wang, Y. CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement. In Thirty-seventh Conference on Neural Information Processing Systems.

[4] Yi, K., Zhang, Q., Fan, W., He, H., Hu, L., Wang, P., ... & Niu, Z. FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective. In Thirty-seventh Conference on Neural Information Processing Systems.

[5] Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., & Xiao, Y. (2023). Micn: Multi-scale local and global context modeling for long-term series forecasting. In The Eleventh International Conference on Learning Representations.

[6] Ilbert, R., Odonnat, A., Feofanov, V., Virmaux, A., Paolo, G., Palpanas, T., & Redko, I. SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention. In Forty-first International Conference on Machine Learning.

[7] Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. Sharpness-aware Minimization for Efficiently Improving Generalization. In International Conference on Learning Representations.

[8] Chen, S. A., Li, C. L., Arik, S. O., Yoder, N. C., & Pfister, T. TSMixer: An All-MLP Architecture for Time Series Forecasting. Transactions on Machine Learning Research.

[9] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., & Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.

[10] Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., & Dubrawski, A. MOMENT: A Family of Open Time-series Foundation Models. In Forty-first International Conference on Machine Learning.

[11] Zhou, T., Niu, P., Sun, L., & Jin, R. (2023). One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing systems.

[12] Luo, D., & Wang, X. (2024). Moderntcn: A modern pure convolution structure for general time series analysis. In The Twelfth International Conference on Learning Representations.

[13] Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.

[14] Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., & Jin, R. (2022). Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International conference on machine learning (pp. 27268-27286). PMLR.

[15] Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34, 22419-22430.

[16] Li, Z., Qi, S., Li, Y., & Xu, Z. (2023). Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721.

[17] Kim, T., Kim, J., Tae, Y., Park, C., Choi, J. H., & Choo, J. (2021). Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations.

[18] Smith, L. N., & Topin, N. (2019). Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications.

Comment

Thank you for the response, but my concerns have not been addressed. Specifically, in the experimental section, the experiments mentioned by the authors in W3 and W4 currently lack controlled variables. Some models use "noDL," while others use "DL," which could introduce noise in the model training process, thus making it difficult to ensure the fairness of the experiments. Additionally, do the baseline models selected in the original paper also suffer from similar issues?

On the other hand, the authors mentioned that they fixed the parameters of the forecasting tasks to avoid performance enhancements resulting from complex hyperparameter tuning. This approach is unreasonable because models exhibit varying sensitivities to parameters across different tasks. Even for the same model, it is challenging to guarantee that the optimal parameters are consistent across different tasks. Therefore, the presented experimental results currently struggle to support conclusions.

Furthermore, the authors stated that they cannot guarantee that all their model hyperparameters were exclusively determined using the validation set (with no reference to the test set performance), citing this as a common issue in the time-series domain. Given that the authors acknowledge this issue but seem reluctant to actively address it, this appears incongruent with a positive attitude towards the advancement of the domain. Therefore, it is hoped that the authors can address these issues to enhance the quality of this paper.

Comment

Thank you for your constructive feedback. Your concern seems not about our paper's concern but the concern about time series field. However, we believe discussing these concerns is meaningful and constructive, as it contributes to the advancement of the time series domain. To address your concern, we aim to provide more complete content, which may require a bit more time. We will update our response in a timely manner. Once again, thank you for your timely feedback.

Comment

Thank you for the authors' timely reply. It is important to emphasize that the issues I am concerned about are specific to this paper, and it is the authors who have pointed out that certain problems in this paper may also be issues in the field of time series. To help improve the quality of this paper, I believe that since the authors have identified these problems, they should be addressed rather than just being highlighted as existing in the time series field without resolution in this paper, in order to align with the positive development of the entire field. No matter what the intention behind the author's statement in the reply, 'Your concern seems not about our paper's concern but the concern about the time series field,' may be, given that the authors have expressed willingness to address the issues in this paper, I still look forward to the authors' subsequent response.

Comment

Thank you for your valuable comments; we have tried our best to clarify the concerns about the paper. Additionally, we have uploaded the rebuttal version of the paper, which integrates the content added during the rebuttal period to better showcase the paper's innovations, improve clarity, and enhance the overall quality.

With the discussion period nearing completion in less than two days, we would appreciate it if you could let us know if any aspects remain unclear. We truly appreciate this opportunity to improve our work and shall be most grateful for any feedback you could give us.

Comment

Thank you for your feedback and patience. We have summarized your concerns into the following four questions. For the convenience of reading, we restate the meanings of the related abbreviations as follows:

  • DL: The test set dataloader in the model experiment is set with drop_last=True.
  • noDL: The test set dataloader in the model experiment is set with drop_last=False.
  • F: The input sequence length in the model experiment is fixed.
  • M: The input sequence length in the model experiment is determined through grid search across different lengths.

Q1: Models evaluated with noDL and DL settings make it difficult to ensure the fairness of the experiments.

  1. First of all, the experimental results of these baseline models used for comparison are taken from the results reported in their corresponding original papers, including TimeMixer, CrossGNN, MICN, TimesNet, and FITS. The settings for DL or noDL are derived from the official code repositories of these models, and we only present these settings in the table for better comparison. Similarly, the settings for F or M are also determined by the experimental configurations and official code of the respective papers. A common understanding is that the performance of models under the DL or M settings tends to be better than under the noDL or F settings.
  2. For more discussion on the impact of the noDL and DL settings, we recommend two resources to better understand how this configuration affects model performance:
  • The most direct approach is to analyze line 32 (drop_last=False) in the data_factory.py file of the current version of the Time-Series-Library and verify its impact on prediction performance under different settings. Of course, directly reviewing the relevant code sections of each model can also help in understanding this configuration.
  • Additionally, we appreciate and recommend the discussion in TFB [1], specifically Issue 3 in the introduction section, as well as the discussion in the README of the FITS[2] official code repository. These address the effect of the DL setting, which drops the data from the last batch of the test set. From Table 2 in TFB, it is shown that as the number of samples in the last dropped batch increases, the predictive loss caused by DL continuously decreases. This impact is related to factors such as data stationarity, batch size, and input length.
    A general opinion is that for non-stationary datasets, the DL setting is likely to have a more significant effect on model performance as the batch size increases. This is because a larger batch size may result in more samples being dropped in the last batch, and the dropped samples tend to be more challenging to predict due to the non-stationarity of the data. Consequently, this can lead to a noticeable impact on the overall MSE loss for small-scale datasets. (A minimal sketch illustrating the effect of drop_last is given right after this list.)
  3. Unifying all baseline models under the noDL and F settings and testing the corresponding experimental results may yield outcomes different from those reported in the original papers. However, we have not done so previously for two main reasons:
  • The limited time during the initial rebuttal phase and the need to allocate time across all comments.
  • We consider PSformer to be based on the noDL and F settings. From the perspective of hypothesis inference, if we agree that the DL issue leads to reduced predictive loss, and that the M setting also leads to reduced predictive loss compared to the fixed input length F setting, then if PSformer (evaluated under noDL and F) outperforms the baselines reported under the DL or M settings, it would still outperform those baselines under the noDL and F settings.
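As a concrete illustration of the DL setting (our own sketch, not the Time-Series-Library code itself), the snippet below shows how `drop_last=True` on the test dataloader silently excludes the final partial batch from evaluation:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 test windows with batch size 128: the last batch holds 1000 % 128 = 104 samples.
test_set = TensorDataset(torch.randn(1000, 512, 7), torch.randn(1000, 96, 7))

def num_evaluated(drop_last: bool) -> int:
    loader = DataLoader(test_set, batch_size=128, shuffle=False, drop_last=drop_last)
    return sum(x.shape[0] for x, _ in loader)

print(num_evaluated(drop_last=False))  # 1000 -> every test window contributes to the reported MSE
print(num_evaluated(drop_last=True))   # 896  -> the chronologically last 104 windows are never evaluated
```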

[1] Qiu, X., Hu, J., Zhou, L., Wu, X., Du, J., Zhang, B., ... & Yang, B. (2024). TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods. VLDB2024.

[2] Xu, Z., Zeng, A., & Xu, Q. FITS: Modeling Time Series with 10k Parameters. ICLR2024.

Q2: Do the baseline models selected in the original paper also suffer from DL issues or M issues?

We stated the baseline settings in Section A.2. The experimental results for ModernTCN come from our own experiments. The difference between our experiment and the official implementation of ModernTCN is that we standardized the look-back window to 512 and set drop_last=False for the test set in the dataloader. The results for other baselines were collected from the SAMformer[1], iTransformer[2], and MOMENT[3] papers. Since the DL issue only gained broader recognition this year, these earlier baseline results also suffer from the DL issue. However, all baseline models use a fixed input length.

Comment

[1] Ilbert, R., Odonnat, A., Feofanov, V., Virmaux, A., Paolo, G., Palpanas, T., & Redko, I. SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention. ICLM2024.

[2] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., & Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In The Twelfth International Conference on Learning Representations. ICLR2024.

[3] Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., & Dubrawski, A. MOMENT: A Family of Open Time-series Foundation Models. ICML2024.

Q3: Fixing parameters across different tasks makes it challenging to guarantee that the parameters are optimal.

Our original statement is: "For the number of segments, we maintain the same as the number of patches in PatchTST; we also analyzed values that divide the input length evenly (specifically: 2, 4, 8, 16, 32, 64, 128, 256), and ultimately set all prediction tasks to 32 to avoid performance improvements caused by complex hyperparameter tuning."

With this, we intend to convey two key points:

  1. Although we analyzed the predictive performance of models with different segment lengths, we ultimately set the number of segments to 32 across different tasks.

  2. This does not mean that we believe the same hyperparameters should be used for all tasks. Instead, the choice of hyperparameters depends on the specific characteristics of each task and on how the hyperparameters influence predictive performance.

Specifically, PatchTST is an important early work that introduced the patching technique to the time series forecasting domain, and a series of subsequent works adopted this idea. We identified related models with complete running scripts for various datasets in their official code repositories, such as ModernTCN, GPT4TS, TimeXer[1], and TimeLLM. These works use patching and its derived methods along the temporal dimension, aiming to enhance the extraction of local features in time series data.

We have summarized the patch parameters used by these models across different datasets and prediction tasks in the table below. All data is sourced from the respective official code repositories, and the values represent the patch length for each case. Since we set the segment number to 32 and perform non-overlapping segmentation along the input length of 512, the length of each segment in the temporal dimension is also 16.

| Dataset | Horizon | PSformer | PatchTST | ModernTCN | TimeLLM | GPT4TS | TimeXer |
|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh1 | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh1 | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh1 | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh2 | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh2 | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh2 | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTh2 | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm1 | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm1 | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm1 | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm1 | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm2 | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm2 | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm2 | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| ETTm2 | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
Comment
| Dataset | Horizon | PSformer | PatchTST | ModernTCN | TimeLLM | GPT4TS | TimeXer |
|---|---|---|---|---|---|---|---|
| Weather | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| Weather | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| Weather | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| Weather | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
| Electricity | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| Electricity | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| Electricity | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| Electricity | 720 | 16 | 16 | 8 | 16 | 16 | 16 |
| Traffic | 96 | 16 | 16 | 8 | 16 | 16 | 16 |
| Traffic | 192 | 16 | 16 | 8 | 16 | 16 | 16 |
| Traffic | 336 | 16 | 16 | 8 | 16 | 16 | 16 |
| Traffic | 720 | 16 | 16 | 8 | 16 | 16 | 16 |

From the table, it can be observed that these works tend to provide consistent settings for this parameter in their original papers. Moreover, we did not find configurations for the exchange dataset in the official repositories for most models. However, in the Time-Series-Library repository (a highly cited library in the time-series domain), the default script for the exchange dataset uses a patch length of 16. Additionally, the impact of patch length on forecasting performance is discussed in Section A.4.1 of the PatchTST paper, where it was ultimately shown that the MSE score does not vary significantly with different patch lengths. For specific results, please refer to Figure 4 in PatchTST. In response, we have also provided more analysis on the impact of segment count on predictive performance in Appendix B.13 of the rebuttal version of the paper (or in the Response to Reviewer xM2V’s W1).

Regarding whether different tasks should adopt different hyperparameters, we generally agree with this perspective, but it depends on the specific circumstances. For instance, the non-stationary and non-seasonal characteristics of the exchange dataset make it difficult for RevIN to obtain stable statistical mean and variance over different look-back windows. In this situation, a smaller look-back window for RevIN is more conducive to model training, as this parameter aligns better with the intrinsic properties of the data (more information can be found in Appendix B.7). Similar cases include selecting the number of network layers based on the size of the training dataset (as we do for the number of encoders).

On the other hand, some hyperparameter adjustments may not be particularly constructive, or the improvements they bring may not align with the primary focus of a paper. For example, grid-searching the initial learning rate over values such as 1e-4 and 1.25e-4 may enhance model performance on a given task, but it could also negatively impact the robustness and generalization of the model.

[1] Wang, Y., Wu, H., Dong, J., Qin, G., Zhang, H., Liu, Y., ... & Long, M. (2024). Timexer: Empowering transformers for time series forecasting with exogenous variables. NIPS2024.

Comment

Q4: They cannot guarantee that all their model hyperparameters were exclusively determined using the validation set (with no reference to the test set performance).

  1. We have compiled information from models presented at top conferences regarding whether they determine the best model parameters on the validation set, as well as their methods for hyperparameter tuning as follows:
  • Regarding ModernTCN. There is no information about determining the best model parameters on the validation set. In the Model Parameters section, it defaults to using 1 block and stacks 3 blocks for larger datasets (ETTm1 and ETTm2). For small datasets, it recommends using a smaller FFN ratio to mitigate overfitting.

  • Regarding PatchTST. There is no information about determining the best model parameters on the validation set. In the Model Parameters section, it defaults to three encoder layers with 16 heads and a latent space of size 128. For small datasets, the parameter count is reduced to prevent overfitting, and dropout=0.2 is used for all experiments.

  • Regarding TimeLLM. There is no information about determining the best model parameters on the validation set. In Appendix B.4, they provide a table of model hyperparameters, which remain largely consistent across different long-term forecasting tasks.

  • Regarding SAMformer. There is no information about determining the best model parameters on the validation set. In A.1, they use a one-layer transformer, a batch size of 32, set the model dimension to 16, and keep it consistent across all tasks.

  • Regarding TimeMixer. There is no information about determining the best model parameters on the validation set. Table 7 shows their hyperparameter settings, and in Appendix D, they evaluate the number of layers, ultimately setting it to 2 to balance efficiency and performance.

  • Regarding CrossGNN. There is no information about determining the best model parameters on the validation set. In A.7, they set the scale number to 5 and K to 10 for all datasets, with the channel dimension set to 8 for smaller datasets and 16 for larger datasets.

  • Regarding MICN. There is no information about determining the best model parameters on the validation set. In A.2, they set the batch size to 32 and the hyperparameter i to {12, 16}.

  • Regarding TimesNet. There is no information about determining the best model parameters on the validation set. Table 7 shows that they use the same hyperparameters for all long-term forecasting tasks.

  • Regarding FITS. FITS is the only model that explicitly emphasizes performing parameter grid search on the validation set. In Section 4.2, they state that they choose the hyper-parameter based on the performance of the validation set. From their code scripts, these parameters likely include batch size, sequence length, random seed, and other hyperparameters related to the model. However, as they candidly acknowledged in their README, they initially did not address the DL bug, and their validation set configuration still has drop_last=True. This might have amplified the impact of the DL issue on the model during the parameter search, even though they conducted the parameter search on the validation set.

  2. Our Hyperparameter Settings. Similar to most works presented at top conferences, as stated in Table 7 and B.12 of our paper, the number of encoders is set to 1 by default and adjusted to 3 for larger datasets. For the parameter ρ in SAM, we adopt the same setting as SAMformer. In C.2 of their paper, they note that SAMformer has smooth behavior with ρ, and we have also verified this in B.1 of our paper. Regarding the choice of segment number, we follow the common practice of patch-based methods, setting the segment length in the time-series dimension to 16.
Comment
  3. Validation Set Hyperparameter Grid Search vs. Robust Hyperparameter Settings.
  • As highlighted in our investigation above, most papers in top conferences adopt robust hyperparameter settings, typically making slight adjustments based on dataset size to avoid overfitting on both the validation and test sets.

  • In contrast, while hyperparameter tuning on the validation set intuitively helps mitigate overfitting on the test set, this approach faces a unique challenge in the time series domain. Unlike the CV field, the three data splits in time series forecasting (training, validation, and test sets) follow a strict chronological order. This means that the distributions of the training, validation, and test sets cannot be as consistent as those in image datasets. This issue is prevalent in mainstream time series datasets, especially in non-stationary datasets like exchange, where distinct loss fluctuation patterns can be observed across the training, validation, and test sets. Consequently, hyperparameter tuning on the validation set may lead to overfitting to the validation set (a phenomenon well exemplified by market style shifts in quantitative finance).

  • Moreover, if both the validation and test dataloaders have drop_last=True, hyperparameter search on the validation set can amplify the DL issue’s impact on the test set. From a temporal perspective, the further the samples in the validation set are from the time period of the training set, the more likely they are to exhibit distributions different from those of the training set. Therefore, based on the characteristics of time series data and our analysis of the DL bug, we believe that even on the validation set, hyperparameter overfitting should be avoided.

The above is our discussion of your remaining concerns. We hope this discussion is helpful for improving the quality of our paper. Given the extended rebuttal period, we are willing to provide additional responses to clarify these concerns.

Comment

Thank you for your feedback. The issues regarding noDL and DL have been resolved, which is undoubtedly a positive development. However, the hyperparameter search remains an unresolved key issue. In my initial review comments, I clearly pointed out that if the authors fail to conduct hyperparameter searches for all models on their platform and determine the best parameters through the validation set, this could lead to unfair comparisons and weaken the credibility of the paper's conclusions.

In previous responses, the authors mentioned they cannot guarantee that all their model hyperparameters were exclusively determined using the validation set (with no reference to the test set performance), citing it as a common issue in the field of time series. Furthermore, in their latest response, the authors provided some comparisons from earlier studies regarding hyperparameter determination as supporting evidence.

It is important to emphasize that while studies predating FITS or those contemporaneous with FITS may not have explicitly emphasized the process of conducting hyperparameter grid searches on the validation set, FITS, as a spotlight and representative paper of ICLR 2024, is considered one of the SOTA works, and its emphasis on hyperparameter grid searches on the validation set undoubtedly holds significant constructive value.

Therefore, the issue of hyperparameter search in this paper remains a crucial matter that needs to be addressed. Simultaneously, attention should also be given to the resolution of the DL bugs proposed by the authors to ensure the accuracy and reliability of model training and hyperparameter search. However, from my initial feedback to the present moment, I have yet to see the authors' updated presentation of relevant experimental results.

Comment

Q5: More discussion on hyperparameter search.

Thank you for your feedback, and we appreciate your rigor and dedication. We are pleased to hear that we have reached a consensus on the issue of noDL and DL, leaving the primary concerns to be the application of hyperparameter search techniques to time series models and a fairer comparison with FITS. Our response to these concerns can be divided into three main aspects:

  • We provide experimental results comparing FITS using the same experimental settings of PSformer.

  • We present the experimental results of PSformer following FITS in hyperparameter search on the validation set.

  • Finally, we summarize and further discuss the pros and cons of hyperparameter search on the validation set.

  1. Comparison of PSformer and FITS under the same noDL and F settings.

    Following the experimental settings described in Section A.2 of the paper, we report the results of FITS under seq_len=512 and noDL. Due to the small parameter size of FITS and its fast runtime, we evaluated its performance across 8 datasets. Given the multiple versions of runtime scripts in the official repository, we prioritized using the "Best" script where available; for datasets without a "Best" script, we selected other available scripts. As there were no scripts provided for the Exchange Rate dataset, we followed FITS's hyperparameter search methodology to identify the best experimental settings in terms of cut-off frequency, batch size, and training mode.

    The summarized experimental results are presented in the table below. From these results, we observe that the performance of FITS under the F setting is not significantly different from that under the M setting. The predictive performance degrades primarily on the ETTm2 dataset, suggesting that frequency-domain-based models may be less sensitive to input lengths in the time domain. Overall, under the experimental settings of Section A.2, PSformer outperforms FITS on 6 out of the 8 mainstream datasets.

| Dataset | Model | 96 | 192 | 336 | 720 | Avg |
|---|---|---|---|---|---|---|
| ETTh1 | PSformer (F, noDL) | 0.352 | 0.385 | 0.411 | 0.440 | 0.397 |
| ETTh1 | FITS (F, noDL) | 0.372 | 0.405 | 0.425 | 0.425 | 0.407 |
| ETTh1 | IMP | 0.020 | 0.020 | 0.014 | -0.015 | 0.010 |
| ETTh2 | PSformer (F, noDL) | 0.272 | 0.335 | 0.356 | 0.389 | 0.338 |
| ETTh2 | FITS (F, noDL) | 0.270 | 0.331 | 0.353 | 0.378 | 0.333 |
| ETTh2 | IMP | -0.002 | -0.004 | -0.003 | -0.011 | -0.006 |
| ETTm1 | PSformer (F, noDL) | 0.282 | 0.321 | 0.352 | 0.413 | 0.342 |
| ETTm1 | FITS (F, noDL) | 0.306 | 0.338 | 0.368 | 0.421 | 0.358 |
| ETTm1 | IMP | 0.024 | 0.017 | 0.016 | 0.008 | 0.016 |
| ETTm2 | PSformer (F, noDL) | 0.167 | 0.219 | 0.269 | 0.347 | 0.251 |
| ETTm2 | FITS (F, noDL) | 0.165 | 0.219 | 0.272 | 0.359 | 0.254 |
| ETTm2 | IMP | -0.002 | 0.000 | 0.003 | 0.012 | 0.003 |
| Weather | PSformer (F, noDL) | 0.149 | 0.193 | 0.245 | 0.314 | 0.225 |
| Weather | FITS (F, noDL) | 0.146 | 0.189 | 0.242 | 0.315 | 0.223 |
| Weather | IMP | -0.003 | -0.004 | -0.003 | -0.001 | -0.002 |
| ECL | PSformer (F, noDL) | 0.133 | 0.149 | 0.164 | 0.203 | 0.162 |
| ECL | FITS (F, noDL) | 0.137 | 0.152 | 0.167 | 0.206 | 0.166 |
| ECL | IMP | 0.004 | 0.003 | 0.003 | 0.003 | 0.004 |
| Exchange | PSformer (F, noDL) | 0.081 | 0.179 | 0.328 | 0.842 | 0.358 |
| Exchange | FITS (F, noDL) | 0.099 | 0.202 | 0.350 | 0.933 | 0.396 |
| Exchange | IMP | 0.018 | 0.023 | 0.022 | 0.091 | 0.038 |
| Traffic | PSformer (F, noDL) | 0.367 | 0.390 | 0.404 | 0.439 | 0.400 |
| Traffic | FITS (F, noDL) | 0.386 | 0.398 | 0.411 | 0.449 | 0.411 |
| Traffic | IMP | 0.019 | 0.008 | 0.007 | 0.010 | 0.011 |
Comment
  2. Experimental comparison of PSformer with hyperparameter search on the validation set.

    To better demonstrate PSformer’s performance under hyperparameter search, we followed FITS’s hyperparameter search methodology from its experiments. Since the FITS model consists of only a single linear layer, its training speed is very fast. However, most Transformer-based models require more time for parameter search. To better reflect PSformer’s performance within a limited timeframe, we restricted the search to essential hyperparameters. The hyperparameter search space was based on W4 and extended to include six input lengths: [64, 96, 384, 512, 640, 768], ensuring non-overlapping segment divisibility.

    We conducted experiments on the ETTh2, Exchange, and ETTh1 datasets to capture performance under distinct time series characteristics. To address the engineering complexity introduced by the high-dimensional hyperparameter space, we used Optuna, optimizing validation set loss as the objective, instead of grid search (a minimal sketch of this setup is given after this list). These settings allowed us to present experimental results promptly, as summarized in the table below.

    From the results, PSformer outperforms FITS in most prediction tasks, except for the experiment on ETTh1 and ETTh2 with pred_len=720. This may be due to FITS' ability to filter out minor frequency components, reduce noise interference, and better capture long-term patterns through transformations from the time domain to the frequency domain. Additionally, PSformer demonstrates superior predictive performance on datasets like Exchange, which lack clear seasonality and stationarity.

| Dataset | Model | 96 | 192 | 336 | 720 |
|---|---|---|---|---|---|
| ETTh1 | PSformer | 0.354 | 0.389 | 0.408 | 0.438 |
| ETTh1 | FITS | 0.372 | 0.404 | 0.427 | 0.424 |
| ETTh1 | IMP | 0.018 | 0.015 | 0.019 | -0.014 |
| ETTh2 | PSformer | 0.268 | 0.327 | 0.352 | 0.382 |
| ETTh2 | FITS | 0.270 | 0.331 | 0.353 | 0.378 |
| ETTh2 | IMP | 0.002 | 0.004 | 0.001 | -0.004 |
| Exchange | PSformer | 0.081 | 0.177 | 0.315 | 0.821 |
| Exchange | FITS | 0.096 | 0.192 | 0.345 | 0.933 |
| Exchange | IMP | 0.015 | 0.015 | 0.030 | 0.112 |
  3. Summary and further discussion on the pros and cons of hyperparameter search on the validation set.

    Like two sides of a coin, based on our experimental experience, we summarize the advantages and disadvantages of hyperparameter search as follows:

    Advantages:

    • As discussed above, performing hyperparameter search on the validation set effectively prevents overfitting hyperparameters to the test set data.
    • Moreover, adjusting the model’s hyperparameters to an optimal state based on the characteristics of the validation set helps improve model performance in a targeted manner according to the dataset's specific traits.

    Disadvantages:

    • Regarding time series data. Due to the non-stationarity of time series data and the strict chronological order used in dataset splits, the three subsets often exhibit time-varying statistical characteristics. Even on the validation set, hyperparameter overfitting should therefore be minimized, and the definition of the search space requires caution.

    • Regarding model types. For FITS, as a frequency-domain-based model, the input length in the time domain may mainly act as a hyperparameter for converting information to the frequency domain. As a result, FITS can employ various training modes regarding input length, including converting long sequences formed by concatenating X and Y into frequency-domain information or further fine-tuning pre-trained parameters on Y. However, the impact of input length on time-domain models is likely more complex, requiring careful consideration when treating input sequence length as a hyperparameter.

    • Regarding engineering effort. One of FITS’s key advantages is its small parameter size, typically consisting of only a fully connected layer for frequency upsampling, making it highly efficient to train. In contrast, Transformer-based models, CNN-based models, or large pre-trained models may face substantial engineering challenges from hyperparameter search, especially on large datasets like traffic, significantly increasing research difficulty.
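As referenced in point 2 above, here is a minimal sketch of the Optuna-based search (the `train_psformer` helper is hypothetical and stands in for one training run that returns the validation-set MSE; it is stubbed out here so the snippet runs):

```python
import optuna

def train_psformer(cfg: dict) -> float:
    # Hypothetical stand-in: in practice this trains PSformer with `cfg` and returns
    # the validation-set MSE (the test set is never touched). Stubbed so the snippet runs.
    return 0.3 + 1e-4 * cfg["seq_len"] / (1 + cfg["num_encoders"]) + 0.01 * cfg["rho"]

def objective(trial: optuna.Trial) -> float:
    cfg = {
        "seq_len": trial.suggest_categorical("seq_len", [64, 96, 384, 512, 640, 768]),
        "num_encoders": trial.suggest_int("num_encoders", 1, 3),
        "rho": trial.suggest_float("rho", 0.0, 1.0),          # SAM neighborhood size
        "lr": trial.suggest_categorical("lr", [1e-3, 1e-4]),
    }
    return train_psformer(cfg)

study = optuna.create_study(direction="minimize")             # minimize validation MSE
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```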

Comment

In summary, incorporating hyperparameter search on the validation set undoubtedly enables models to better adapt to dataset characteristics, prevents test set overfitting, and increases confidence in model robustness. However, like the two sides of a coin, the associated drawbacks should not be overlooked. Strategies to mitigate these challenges, such as adopting efficient hyperparameter optimization techniques to reduce computational costs or using differential transformations to alleviate the impact of time series non-stationarity, should be considered.

At present, most previous studies have not reported results on validation set hyperparameter optimization or noDL settings. While we have conducted experiments and attempted to guide time series research toward fairer comparisons, it is indeed difficult to address this problem in a unified and fair manner, and more attention as well as collective effort from the community are urgently needed.

Comment

Thank you to the authors for their efforts. I have reviewed the comparison results between PSformer and FITS, and I agree with the authors' view that the community needs more attention and collective effort to address experimental comparison issues in a unified and fair manner. Unfortunately, I did not see more methods compared fairly against PSformer, as I mentioned in my original review comments, so I believe the experiments in this paper still do not sufficiently support its conclusions. In terms of innovation, presentation, and quality, there is considerable room for improvement, and I believe that with further enhancements this paper will have greater potential.

However, the issues in the time series field highlighted by the authors in the rebuttal are still worth considering, and on this basis I am willing to increase my rating.

Comment

Thank you for your constructive comments and feedback on our work. We have made efforts to address each of your concerns, and we are pleased that we reached a consensus on most of them. We also acknowledge that a more complete comparison is needed, so we have further organized the experimental results. The tables below present the unified results of these models under the (F, noDL) setting, where the results for MICN and TimesNet are taken from Leddam [1]. To better illustrate the comparison between PSformer and each model, we report MSED, the MSE reduction of PSformer relative to the corresponding baseline on each prediction task (baseline MSE minus PSformer MSE; a small computation sketch follows this paragraph). Apart from FITS, PSformer outperforms most models on the majority of tasks, underperforming only MICN on 4 prediction tasks and CrossGNN on 3. Compared with FITS, PSformer performs better on 22 prediction tasks, with MSED greater than 0.005 on 16 of them, and MSED falls below -0.005 on only 2 tasks.
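For clarity, the following minimal sketch shows how MSED and the win counts are computed; the dictionaries hold only the ETTh1 values from the first table below as an example, not the full set of results.

```python
# Per-task MSE values for ETTh1, copied from the (F, noDL) table below.
mse_psformer = {("ETTh1", 96): 0.352, ("ETTh1", 192): 0.385,
                ("ETTh1", 336): 0.411, ("ETTh1", 720): 0.440}
mse_fits     = {("ETTh1", 96): 0.372, ("ETTh1", 192): 0.405,
                ("ETTh1", 336): 0.425, ("ETTh1", 720): 0.425}

# MSED = baseline MSE - PSformer MSE (positive means PSformer is better).
msed = {task: mse_fits[task] - mse_psformer[task] for task in mse_psformer}

wins         = sum(d > 0 for d in msed.values())       # tasks where PSformer improves on the baseline
clear_wins   = sum(d > 0.005 for d in msed.values())   # improvements larger than 0.005
clear_losses = sum(d < -0.005 for d in msed.values())  # degradations larger than 0.005
print(msed, wins, clear_wins, clear_losses)
```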

| Model | ETTh1-96 | ETTh1-192 | ETTh1-336 | ETTh1-720 | ETTh2-96 | ETTh2-192 | ETTh2-336 | ETTh2-720 | ETTm1-96 | ETTm1-192 | ETTm1-336 | ETTm1-720 | ETTm2-96 | ETTm2-192 | ETTm2-336 | ETTm2-720 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimeMixer (F, noDL) | 0.375 | 0.429 | 0.484 | 0.498 | 0.289 | 0.372 | 0.386 | 0.412 | 0.320 | 0.361 | 0.390 | 0.454 | 0.175 | 0.237 | 0.298 | 0.391 |
| CrossGNN (F, noDL) | 0.382 | 0.427 | 0.465 | 0.472 | 0.309 | 0.390 | 0.426 | 0.445 | 0.335 | 0.372 | 0.403 | 0.461 | 0.176 | 0.240 | 0.304 | 0.406 |
| MICN (F, noDL) | 0.421 | 0.474 | 0.569 | 0.770 | 0.299 | 0.441 | 0.654 | 0.956 | 0.324 | 0.366 | 0.408 | 0.481 | 0.179 | 0.307 | 0.325 | 0.502 |
| TimesNet (F, noDL) | 0.384 | 0.436 | 0.491 | 0.521 | 0.340 | 0.402 | 0.452 | 0.462 | 0.338 | 0.374 | 0.410 | 0.478 | 0.187 | 0.249 | 0.321 | 0.408 |
| FITS (F, noDL) | 0.372 | 0.405 | 0.425 | 0.425 | 0.270 | 0.331 | 0.353 | 0.378 | 0.306 | 0.338 | 0.368 | 0.421 | 0.165 | 0.219 | 0.272 | 0.359 |
| PSformer (F, noDL) | 0.352 | 0.385 | 0.411 | 0.440 | 0.272 | 0.335 | 0.356 | 0.389 | 0.282 | 0.321 | 0.352 | 0.413 | 0.167 | 0.219 | 0.269 | 0.347 |
| MSED-TimeMixer | 0.023 | 0.044 | 0.073 | 0.058 | 0.017 | 0.037 | 0.030 | 0.023 | 0.038 | 0.040 | 0.038 | 0.041 | 0.008 | 0.018 | 0.029 | 0.044 |
| MSED-CrossGNN | 0.030 | 0.042 | 0.054 | 0.032 | 0.037 | 0.055 | 0.070 | 0.056 | 0.053 | 0.051 | 0.051 | 0.048 | 0.009 | 0.021 | 0.035 | 0.059 |
| MSED-MICN | 0.069 | 0.089 | 0.158 | 0.330 | 0.027 | 0.106 | 0.298 | 0.567 | 0.042 | 0.045 | 0.056 | 0.068 | 0.012 | 0.088 | 0.056 | 0.155 |
| MSED-TimesNet | 0.032 | 0.051 | 0.080 | 0.081 | 0.068 | 0.067 | 0.096 | 0.073 | 0.056 | 0.053 | 0.058 | 0.065 | 0.020 | 0.030 | 0.052 | 0.061 |
| MSED-FITS | 0.020 | 0.020 | 0.014 | -0.015 | -0.002 | -0.004 | -0.003 | -0.011 | 0.024 | 0.017 | 0.016 | 0.008 | -0.002 | 0.000 | 0.003 | 0.012 |
Comment
| Model | Exchange-96 | Exchange-192 | Exchange-336 | Exchange-720 | Electricity-96 | Electricity-192 | Electricity-336 | Electricity-720 | Weather-96 | Weather-192 | Weather-336 | Weather-720 | Traffic-96 | Traffic-192 | Traffic-336 | Traffic-720 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TimeMixer (F, noDL) | 0.090 | 0.187 | 0.353 | 0.934 | 0.153 | 0.166 | 0.185 | 0.225 | 0.163 | 0.208 | 0.251 | 0.339 | 0.462 | 0.473 | 0.498 | 0.506 |
| CrossGNN (F, noDL) | 0.084 | 0.171 | 0.319 | 0.805 | 0.173 | 0.195 | 0.206 | 0.231 | 0.159 | 0.211 | 0.267 | 0.352 | 0.570 | 0.577 | 0.588 | 0.597 |
| MICN (F, noDL) | 0.102 | 0.172 | 0.272 | 0.714 | 0.164 | 0.177 | 0.193 | 0.212 | 0.161 | 0.220 | 0.278 | 0.311 | 0.519 | 0.537 | 0.534 | 0.577 |
| TimesNet (F, noDL) | 0.107 | 0.226 | 0.367 | 0.964 | 0.168 | 0.184 | 0.198 | 0.220 | 0.172 | 0.219 | 0.280 | 0.365 | 0.593 | 0.617 | 0.629 | 0.640 |
| FITS (F, noDL) | 0.099 | 0.202 | 0.350 | 0.933 | 0.137 | 0.152 | 0.167 | 0.206 | 0.146 | 0.189 | 0.242 | 0.315 | 0.386 | 0.398 | 0.411 | 0.449 |
| PSformer (F, noDL) | 0.081 | 0.179 | 0.328 | 0.842 | 0.133 | 0.149 | 0.164 | 0.203 | 0.149 | 0.193 | 0.245 | 0.314 | 0.367 | 0.390 | 0.404 | 0.439 |
| MSED-TimeMixer | 0.009 | 0.008 | 0.025 | 0.092 | 0.020 | 0.017 | 0.021 | 0.022 | 0.014 | 0.015 | 0.006 | 0.025 | 0.095 | 0.083 | 0.094 | 0.067 |
| MSED-CrossGNN | 0.003 | -0.008 | -0.009 | -0.037 | 0.040 | 0.046 | 0.042 | 0.028 | 0.010 | 0.018 | 0.022 | 0.038 | 0.203 | 0.187 | 0.184 | 0.158 |
| MSED-MICN | 0.021 | -0.007 | -0.056 | -0.128 | 0.031 | 0.028 | 0.029 | 0.009 | 0.012 | 0.027 | 0.033 | -0.003 | 0.152 | 0.147 | 0.130 | 0.138 |
| MSED-TimesNet | 0.026 | 0.047 | 0.039 | 0.122 | 0.035 | 0.035 | 0.034 | 0.017 | 0.023 | 0.026 | 0.035 | 0.051 | 0.226 | 0.227 | 0.225 | 0.201 |
| MSED-FITS | 0.018 | 0.023 | 0.022 | 0.091 | 0.004 | 0.003 | 0.003 | 0.003 | -0.003 | -0.004 | -0.003 | 0.001 | 0.019 | 0.008 | 0.007 | 0.010 |

Additionally, regarding the hyperparameter search results, we further conducted experiments on other datasets. We have organized the ETTh1 results and combined them with the previous results in Q5; the results on ETTh1 are similar to those on ETTh2.

We hope our additional responses address your suggestion to include more methods for a fair comparison with PSformer; we are striving to present stronger experimental comparisons to build your confidence in our work. Furthermore, we acknowledge that we have not achieved the best performance on all prediction tasks, so we will replace the SOTA claim with the statement that our work demonstrates competitive performance. We believe that by incorporating your valuable feedback we can further improve our work, and we hope it will reach a wider audience in the community with your support. Thank you again for your rigor and your contribution to our work.

[1] Yu, G., Zou, J., Hu, X., Aviles-Rivero, A. I., Qin, J., & Wang, S. Revitalizing Multivariate Time Series Forecasting: Learnable Decomposition with Inter-Series Dependencies and Intra-Series Variations Modeling. ICML 2024.

Comment

Thank you to the authors for their efforts. Based on the above, I am willing to keep my score at 5, and I hope the authors can make further improvements to enhance the overall quality of the paper.

Comment

Thank you for your suggestion. We will incorporate the content from the rebuttal period into the next version to enhance the overall quality of the paper. We truly appreciate your valuable feedback and thoughtful review, which have been instrumental in improving our work.

AC Meta-Review

This work proposes a new Transformer architecture consisting of i) pre-processing of time series, ii) spatiotemporal segment attention, iii) a parameter-sharing block, and iv) post-processing. The proposed architecture draws on many recent enhancements for time series, e.g., RevIN, patching, and various attention mechanisms. Overall, the proposed method is a mixture of existing discoveries, and the reviewers' scores are also around the borderline.

Additional Comments from Reviewer Discussion

The authors provided various rebuttal responses on additional baselines, the evaluation protocol of time series forecasting, and the justification of their model designs. However, many reviewers' ratings remain below the marginal acceptance threshold.

Final Decision

Reject