UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting
Abstract
Reviews and Discussion
This paper proposes a novel method to capture inter-series and intra-series information from multivariate time series. It applies a unified attention mechanism to the flattened patch tokens and adds a dispatcher module, which reduces the complexity and makes the model feasible for high-dimensional inputs. The model achieves compelling performance on multiple time series forecasting datasets.
Strengths
The paper is overall well written, and the experiment section is clearly explained with the newest benchmark methods. Effectively capturing inter- and intra-series information is an important issue in LTSF, and the idea of unified attention seems interesting.
Weaknesses
- The paper claims that 'previous transformer models lack ability to simultaneously capture both inter-variate and intra-variate dependencies', which is not accurate. In fact, there is some literature focusing on modeling inter- and intra-series information in a Transformer structure; for example, CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables (ICML 2024) adopts a very similar idea by constructing auxiliary series to capture inter- and intra-series information. The authors should revise their literature review accordingly and explain the differences from CATS.
- Time series forecasting using LLMs has been a trend in recent years. The authors are encouraged to include more LLM-based methods as benchmarks, such as LLM4TS or GPT4TS.
Questions
The author flattens all patches from different variates into a unified sequence. I wonder whether the sequence is (1) normalized, (2) univariate or multivariate. Furthermore, if the input dimension is high, the unified sequence may be too long.
What is the difference between constructing a unified sequence and simple concatenation? It seems very similar to me.
How does the author determine the order of raw input series to construct the unified sequence? If the order is random then I suspect that the encoder may not be time-aware, making the output less reliable without additional input of time.
Q1 and Q2: The author flattens all patches from different variates into a unified sequence. I wonder whether the sequence is (1) normalized, (2) univariate or multivariate. Furthermore, if the input dimension is high, the unified sequence may be too long. What is the difference between constructing a unified sequence and simple concatenation? It seems very similar to me.
The sequence is normalized and multivariate. We conduct instance normalization before patching and de-normalization before output (adding the mean and standard deviation back to the prediction), which is also done in PatchTST [1].
We agree that when the input dimension is high, the unified sequence can be long. That is also the purpose of using dispatchers: to reduce the memory consumption when the sequence is long. The flattened patch sequence is equivalent to concatenating the patches of the multiple variates, which allows the model to capture dependencies across different variates and different times.
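To make this concrete, below is a minimal sketch (not our actual implementation; tensor shapes, function names, and the patch settings are illustrative assumptions) of the preprocessing described above: instance normalization, patching, and flattening the patch tokens of all variates into one unified sequence.

```python
import torch

def build_unified_sequence(x, patch_len=16, stride=16):
    # x: (batch, seq_len, n_vars) raw multivariate input
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True) + 1e-5
    x_norm = (x - mean) / std                      # instance normalization per variate

    x_norm = x_norm.permute(0, 2, 1)               # (batch, n_vars, seq_len)
    patches = x_norm.unfold(2, patch_len, stride)  # (batch, n_vars, n_patches, patch_len)

    b, v, n, p = patches.shape
    tokens = patches.reshape(b, v * n, p)          # unified sequence of v*n patch tokens
    return tokens, mean, std                       # mean/std are added back after prediction
```

Each of the v*n tokens would then be linearly projected to the model dimension, so attention can mix patches from any variate at any time position.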
Q3: How does the author determine the order of raw input series to construct the unified sequence? If the order is random then I suspect that the encoder may not be time-aware, making the output less reliable without additional input of time.
We use the original order of the variates, i.e., we do not change the order. We apply a learnable additive position encoding as in [1] to capture the order information.
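As a small illustration (the variable names and sizes below are assumptions, not our exact code), the position encoding is simply a learnable tensor with one slot per flattened patch token, added before the attention layers:

```python
import torch
import torch.nn as nn

n_vars, n_patches, dim = 7, 6, 128                     # example sizes
pos_embed = nn.Parameter(torch.zeros(1, n_vars * n_patches, dim))
tokens = torch.randn(32, n_vars * n_patches, dim)      # flattened, projected patch tokens
tokens = tokens + pos_embed                            # learnable additive position encoding
```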
Reference:
[1] Nie, Yuqi, et al. "A time series is worth 64 words: Long-term forecasting with transformers." ICLR 2023.
Dear Reviewer uAEX,
We appreciate your valuable feedback and suggestions. We would like to clarify and answer questions below:
W1: The paper claims that 'previous transformer models lack ability to simultaneously capture both inter-variate and intra-variate dependencies', which is not accurate. In fact, there is some literature focusing on modeling inter- and intra-series information in a Transformer structure; for example, CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables (ICML 2024) adopts a very similar idea by constructing auxiliary series to capture inter- and intra-series information. The authors should revise their literature review accordingly and explain the differences from CATS.
Thanks for pointing out this work. We would like to clarify that CATS and our work both aim to capture inter- and intra-series dependencies; however, CATS constructs auxiliary series and captures inter-series dependencies from the auxiliary series. In contrast, our method is applied directly on the original series, treating all variates as a unified sequence. We will revise the literature review accordingly in our pdf.
W2: Time series forecasting using LLMs has been a trend in recent years. The authors are encouraged to include more LLM-based methods as benchmarks, such as LLM4TS or GPT4TS.
Thanks for the suggestions. We provide a table comparing our model UniTST with GPT4TS below. As mentioned in the manuscript, we use an input sequence length of 96.
| Dataset | Pred_len | GPT4TS MSE | GPT4TS MAE | UniTST MSE | UniTST MAE |
|---|---|---|---|---|---|
| ETTh1 | 96 | 0.3806 | 0.3958 | 0.383 | 0.398 |
| | 192 | 0.4308 | 0.4262 | 0.434 | 0.426 |
| | 336 | 0.471 | 0.4474 | 0.471 | 0.445 |
| | 720 | 0.4878 | 0.4699 | 0.479 | 0.469 |
| ETTh2 | 96 | 0.3035 | 0.3549 | 0.292 | 0.342 |
| | 192 | 0.389 | 0.4081 | 0.37 | 0.39 |
| | 336 | 0.4247 | 0.4369 | 0.382 | 0.408 |
| | 720 | 0.4353 | 0.4549 | 0.409 | 0.431 |
| ETTm1 | 96 | 0.3262 | 0.3624 | 0.313 | 0.352 |
| | 192 | 0.3678 | 0.3826 | 0.359 | 0.38 |
| | 336 | 0.3995 | 0.4045 | 0.395 | 0.404 |
| | 720 | 0.4625 | 0.4389 | 0.449 | 0.44 |
| ETTm2 | 96 | 0.1776 | 0.2632 | 0.178 | 0.262 |
| | 192 | 0.244 | 0.3064 | 0.243 | 0.304 |
| | 336 | 0.3064 | 0.3465 | 0.302 | 0.341 |
| | 720 | 0.4098 | 0.4086 | 0.398 | 0.395 |
| Electricity | 96 | 0.1856 | 0.2726 | 0.1348 | 0.2299 |
| | 192 | 0.1901 | 0.277 | 0.1514 | 0.2467 |
| | 336 | 0.205 | 0.2925 | 0.1645 | 0.262 |
| | 720 | 0.2446 | 0.3233 | 0.1941 | 0.2913 |
| Traffic | 96 | 0.4701 | 0.3116 | 0.402 | 0.255 |
| | 192 | 0.4779 | 0.3113 | 0.426 | 0.268 |
| | 336 | 0.4906 | 0.318 | 0.44 | 0.275 |
| | 720 | 0.5103 | 0.325 | 0.489 | 0.297 |
| Weather | 96 | 0.1837 | 0.2237 | 0.156 | 0.202 |
| | 192 | 0.2302 | 0.2625 | 0.207 | 0.25 |
| | 336 | 0.2853 | 0.3019 | 0.263 | 0.292 |
| | 720 | 0.3611 | 0.3505 | 0.34 | 0.341 |
We found that UniTST usually outperforms GPT4TS in most cases (especially on Electricity, Traffic, and Weather), while GPT4TS slightly outperforms UniTST on the ETT datasets at a few prediction lengths. Interestingly, we can also see that GPT4TS favors relatively short prediction lengths (i.e., with long prediction lengths, it is generally worse than UniTST).
In this work, we mainly focus on investigating the fundamental building blocks of Transformer-based multivariate methods. Therefore, we leave the discussion and investigation of LLM-based methods as future work.
The response to other questions is posted in the next comment.
I appreciate the efforts of the authors in revising the paper and addressing my concerns. The literature review has been enriched and many SOTA models have been added as benchmarks, which makes the paper more technically solid. I will gladly raise my score.
The article proposes modeling dependencies across different variates and different times and presents a corresponding Transformer-based method. However, this greatly increases the number of tokens. To reduce complexity, a dispatcher method is proposed.
Strengths
Experiments have been conducted on the major datasets currently available. The experiments are relatively sufficient and the analysis is comprehensive, including hyperparameter analysis and efficiency analysis. It is recommended to add some visualized prediction results and compare them with the state-of-the-art. This is a relatively common way of presenting results.
Weaknesses
- For the current research situation, this contribution is relatively ordinary. There are many almost identical practices in the past. For example, Different sEnsors at Different Timestamps (DEDT) in https://arxiv.org/abs/2309.05305 is completely the same as "across different variates and different time" in the second line of the third paragraph of the Introduction. The cross-correlation coefficient in https://arxiv.org/abs/2401.17548 is almost the same as Definition 1, just with a different name. I think the explanation of the problem and the drawings are not as clear as those in past articles. At the end of the second paragraph of the Introduction, it is claimed that the problems in the two stages are mutually influential; I think this claim lacks experimental proof. Moreover, the past two-stage method https://arxiv.org/pdf/2402.19072 is not inferior to UniTST in terms of performance.
- The Dispatcher is also the same as the router in Crossformer. There is no innovation.
- No code is provided for reproducibility testing, so I have reservations about the authenticity of the experimental results.
- Incidentally, the article uses the template of ICLR 2024. When reviewing, it is not convenient to locate the exact line.
Questions
- What is the value of t' corresponding to the correlations in Figure 3?
- Since the router operation, that is, the Dispatcher proposed in the article, is used, it may not be possible to directly visualize the attention weights between variables learned by the model to correspond to the correlation calculated by Definition 1. Then, can you provide the correlation value calculated for the model's prediction results and compare it with the correlation based on real data given in Figure 3? Does the model truly capture the dependencies across different variates and different times that you proposed or other fitting results?
- It is recommended to provide code for reproducibility.
Q1: What is the value of t' corresponding to the correlations in Figure 3?
The x-axis denotes t in variate 10 and the y-axis denotes t' in variate 0. t and t' are represented as patch indices (a patch basically aggregates several time stamps).
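For illustration, a rough sketch of how such a patch-level cross-variate correlation map can be computed is given below. This is a simplified reading of Definition 1 (aggregating each patch by its mean is an assumption made only for this sketch, and the function and variable names are illustrative), estimated over training windows.

```python
import numpy as np

def patch_correlation_map(x, var_i, var_j, patch_len=16):
    # x: (n_windows, seq_len, n_vars) training windows
    n = x.shape[1] // patch_len
    pi = x[:, :n * patch_len, var_i].reshape(x.shape[0], n, patch_len).mean(-1)  # (windows, n)
    pj = x[:, :n * patch_len, var_j].reshape(x.shape[0], n, patch_len).mean(-1)
    corr = np.zeros((n, n))
    for t in range(n):
        for tp in range(n):
            corr[t, tp] = np.corrcoef(pi[:, t], pj[:, tp])[0, 1]   # Pearson correlation
    return corr  # corr[t, t'] is what the heatmap in Figure 3 visualizes
```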
Q2: Since the router operation, that is, the Dispatcher proposed in the article, is used, it may not be possible to directly visualize the attention weights between variables learned by the model to correspond to the correlation calculated by Definition 1. Then, can you provide the correlation value calculated for the model's prediction results and compare it with the correlation based on real data given in Figure 3? Does the model truly capture the dependencies across different variates and different times that you proposed or other fitting results?
Since Figure 3 in the manuscript shows the input data from the training set, we are not able to plot the correlation of our model's prediction and compare them directly there. However, in the revised manuscript, we provide two case visualizations of the correlation maps of multivariate relationships in the predicted time series from Solar-Energy in Appendix D (Figure 10). Compared with iTransformer, we can see that the correlation map of UniTST is more aligned with that of the ground-truth time series.
W3 and Q3: It is recommended to provide code for reproducibility.
Thanks for the suggestions. We added our code as the supplementary materials.
Dear Reviewer URks,
We appreciate your valuable feedback and suggestions. We have revised our manuscript based on your comments. We would also like to clarify and answer your questions below:
W1: For the current research situation, this contribution is relatively ordinary. There are many almost identical practices in the past. For example, Different sEnsors at Different Timestamps (DEDT) in https://arxiv.org/abs/2309.05305 is completely the same as "across different variates and different time" in the second line of the third paragraph of the Introduction. The cross-correlation coefficient in https://arxiv.org/abs/2401.17548 is almost the same as Definition 1, just with a different name. I think the explanation of the problem and the drawings are not as clear as those in past articles. At the end of the second paragraph of the Introduction, it is claimed that the problems in the two stages are mutually influential; I think this claim lacks experimental proof. Moreover, the past two-stage method https://arxiv.org/pdf/2402.19072 is not inferior to UniTST in terms of performance.
We respectfully disagree on this point. Although "Different sEnsors at Different Timestamps" (DEDT [1]) in https://arxiv.org/abs/2309.05305 is somewhat similar to our "dependencies across different variates and different time" at the conceptual level, we design a different method to capture these dependencies: our method is a simple Transformer-based method while theirs is a GNN-based model.
Compared with LIFT [2] (https://arxiv.org/abs/2401.17548), we provided evidence to demonstrate that the cross-correlation coefficient commonly exists in the real-world datasets. In terms of methodology, we design a method to directly capture cross-variate, cross-time dependencies instead of explicitly calculating the leading indicators.
For the two-stage method TimeXer [3] (https://arxiv.org/pdf/2402.19072) you mentioned, we provide the comparison between UniTST and TimeXer as follows:
| Dataset | UniTST MSE | UniTST MAE | TimeXer MSE | TimeXer MAE |
|---|---|---|---|---|
| ETTh1 | 0.442 | 0.435 | 0.437 | 0.437 |
| ETTh2 | 0.363 | 0.393 | 0.367 | 0.396 |
| ETTm1 | 0.379 | 0.394 | 0.382 | 0.397 |
| ETTm2 | 0.280 | 0.326 | 0.274 | 0.322 |
| ECL | 0.166 | 0.262 | 0.171 | 0.270 |
| Traffic | 0.439 | 0.274 | 0.466 | 0.287 |
| Weather | 0.242 | 0.271 | 0.241 | 0.271 |
We can see that, for MAE, UniTST is better than TimeXer on 5 out of 8 datasets. For MSE, UniTST outperforms TimeXer on 4 out of 8 datasets. We would like to point out that TimeXer still falls into the category where the models sequentially capture cross-time and cross-variate dependencies. In contrast, we aim to propose a simple unified module to simultaneously capture intra-variate and inter-variate dependencies.
References:
[1] Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data. AAAI 2024
[2] Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators. ICLR 2024
[3] TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. NeurIPS 2024.
W2: The Dispatcher is also the same as the router in Crossformer. There is no innovation.
The design of the dispatchers is similar. However, we would like to point out that the whole architecture and the way we capture cross-time and cross-variate dependencies are different. Crossformer is basically a two-stage method that sequentially captures these dependencies, while our method is a one-stage method that simultaneously captures cross-time and cross-variate dependencies.
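To illustrate how the one-stage attention stays tractable, below is a rough sketch of the dispatcher idea (module and argument names are assumptions, not our exact code): a small set of learnable dispatcher tokens first attends to all N flattened patch tokens, and the patch tokens then attend back to the dispatchers, so all variate-time pairs interact in one stage at O(N*d) cost instead of O(N^2).

```python
import torch
import torch.nn as nn

class DispatcherAttention(nn.Module):
    def __init__(self, dim, n_dispatchers=8, n_heads=4):
        super().__init__()
        self.dispatchers = nn.Parameter(torch.randn(n_dispatchers, dim))
        self.gather = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, N, dim) flattened patch tokens from all variates and times
        b = tokens.size(0)
        d = self.dispatchers.unsqueeze(0).expand(b, -1, -1)
        d, _ = self.gather(d, tokens, tokens)    # dispatchers summarize all tokens
        out, _ = self.scatter(tokens, d, d)      # tokens read back from the dispatchers
        return out
```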
W4: Incidentally, the article uses the template of ICLR 2024. When reviewing, it is not convenient to locate the exact line.
Thanks for pointing out this issue. We have submitted a revised manuscript with the correct template. More discussions on related work and additional case studies are provided (highlighted in blue).
Thank you for your reminder.
First, regarding DEDT, what I'm referring to is the degree of innovation in terms of ideas. I'm well aware that the papers I cited are based on the GNN method, while your article uses the Transformer. I've noticed that the opinions of other reviewers are also in line with mine, including Router in Crossformer. I don't think this is a personal cognitive bias on my part.
Secondly, for Definition 1, you stated that "we provided evidence to demonstrate that the cross-correlation coefficient commonly exists in the real-world datasets." However, LIFT has already presented this general theory last year. When it comes to discussing theories, you then shift to the differences in methods. Earlier, when discussing the idea, you also brought up the differences in the models. Also, I haven't seen that your method significantly surpasses TimeXer (ETTm2, ETTh1, Weather).
In conclusion: The ideas (both intra and inter), the method (Router), and the theory in this article are all patchworks of what already exists in this field. I'm maintaining a score of 3.
Dear Reviewer URks,
Thanks for the comments. We respectfully disagree with some of your statements and provide our clarification below:
Secondly, for Definition 1, you stated that "we provided evidence to demonstrate that the cross-correlation coefficient commonly exists in the real-world datasets." However, LIFT has already presented this general theory last year.
We would like to clarify that we provided empirical evidence to demonstrate that the cross-correlation coefficient commonly exists in real-world datasets. In contrast, without any empirical evidence from real data, LIFT presented their Definition 1 on the cross-correlation coefficient (if that is the "theory" you are referring to). Therefore, we do not believe that our statement is imprecise. Meanwhile, we did not claim that we have new theories in our paper.
When it comes to discussing theories, you then shift to the differences in methods. Earlier, when discussing the idea, you also brought up the differences in the models.
We would like to kindly ask if you can provide pointers on these two points. We are truly not aware of them, especially the first one (as we did not provide theories in our paper). Meanwhile, we also believe (which you may not agree with) that the contributions of a paper should be judged from a holistic view rather than by checking whether any individual component has been used in previous papers. For example, before PatchTST [1], the idea and method of patching had already been used in CV (Vision Transformer [2]), which in our opinion does not harm the contribution of PatchTST.
In conclusion: The ideas (both intra and inter), the method (Router), and the theory in this article are all patchworks of what already exists in this field.
We disagree on this point. In our humble opinion, capturing intra- and inter-series dependencies is more about the motivation than the idea. We first show empirical evidence from real data to justify this motivation, which, to the best of our knowledge, has not been provided in previous work. After that, we provide a simple method using flattened sequences with routers to achieve capturing intra- and inter- series dependencies.
We look forward to your further comments and clarification. We are willing to further respond and clarify.
References:
[1] Nie, Yuqi, et al. "A time series is worth 64 words: Long-term forecasting with transformers." ICLR 2023.
[2] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021.
Dear Author, Regarding the ''theory'', ''idea'', and ''method'' I put forward, you can describe them in any words you like. Apart from these three parts, could you please tell me what other contributions this article has? The performance of the model isn't particularly remarkable either. Don't even mention comparing it with the baselines in the article. Not a single model with the same concept has been compared. Your nitpicking here is completely meaningless. And you inadvertently admitted that "Dispatcher" is just another term for "router". ''After that, we provide a simple method using flattened sequences with routers to achieve capturing intra- and inter- series dependencies.''
If you do a good job of piecing something from other fields, maybe I will give a high score. But if you just piece it together directly from this field, aren't you treating the people in this field like fools? Could you please at least show some respect for others? I've made it very clear. I don't think other reviewers would raise the same issue if it's just a deviation in my understanding.
In addition, I noticed an interesting thing. The reviewer who originally gave a score of 10 has just changed it to 8. The score was first increased and then reverted back, yet without any explanation or comment. The reasons lurking behind this are rather thought-provoking.
Dear URks, I was going to score it as 8 but I mistyped 10 and that is why I changed the score as soon as I found out the mistake. The establishment of inter- and intra-series relationships in the paper is an interesting topic which I personally favor, so I chose to give it a higher score and let the AC decide if the paper is qualified for acceptance. I do not know what brought you up, but it is uncivil to accuse someone without evidence just because he holds a different opinion. As a reviewer I have every right to rate a paper independently, regardless of whether you hate this paper or not, as your opinion means nothing to me. In this community, a minimum level of civility is required, so please stop using harassing phrases like "treating the people in this field like fools?" or "Could you please at least show some respect for others". You are no better than anyone. Thank you.
Dear uAEX, If my words have affected you, I hereby solemnly declare that I respect every reviewer and author very much. I don't have any complaints about you personally or your review. Your scoring has nothing to do with me. I am doing my best to complete the review objectively. Moreover, I didn't express any disrespect for anyone. I gave an objective assessment of the content of the reply I received and expressed my subjective feelings. On the contrary, "you are no better than anyone." is a declarative comment about me personally. May I ask whether, in your opinion, your comment on me is a civilized behavior?
Dear Reviewer URks,
First, we appreciate your feedback and suggestions again. We would like to kindly remind you that the deadline for pdf updates is in just a few hours. We are looking forward to your reply, so that we can clarify further if needed. Following your suggestions, we have provided some comparisons with other methods and explained the difference. Additionally, we have also provided the code for reproducibility and revised the manuscript with more literature reviews.
We would like to explain/re-emphasize that our contributions are not mainly on the methodology itself. They are more about how we find the right problem to solve and come up with a simple method: we first identify an important limitation of previous works by showing empirical evidence in real-world data (i.e., failing to capture inter-series and intra-series dependencies, as pointed out in Lines 44 to 82), and then propose a simple and effective method to directly and effectively address this limitation.
Hope you are satisfied with our response. If you still have any concerns, we are also willing to discuss/clarify further. We are looking forward to your reply.
Dear all,
I am the AC of this paper and I have noticed all the messages, as well as the "mistyped score".
I kindly remind you all to please be polite and respectful to the authors and reviewers. The discussion is still open, and please feel free to continue the discussion on results/contributions in a thoughtful and respectful way. Between November 26th and December 3rd, authors can reply to the messages.
I will also initiate an AC-Reviewer discussion soon.
Thank you all for your submission/review/contribution to ICLR-2025.
Your AC
We are aware that you made a comment here: https://openreview.net/forum?id=cuFnNExmdq&noteId=9nj2gK3Gq8. We are quoting your full comment below to better respond:
I share your concerns, and at the same time, the author's initial response was also delayed, failing to genuinely address the doubts regarding motivation.
We would like to ask for clarification on why you are saying "the author's initial response was also delayed, failing to genuinely address the doubts regarding motivation." We did not really get the logic between "the author's initial response was also delayed" and "failing to genuinely address the doubts regarding motivation."
Meanwhile, we would like to also point out that our initial response to you was posted on 24 Nov 2024, 16:04, which is 3+ days before the original deadline. You may consider that a bit late (this really varies from person to person). But we want to point out that we were also asked to add more experimental results by reviewers, and we all know experiments take time. Additionally, we faced a shortage of computational resources within our organization, which was out of our control.
We tried our best during the discussion phase and replied to all comments politely. Therefore, we would like to kindly ask for some respect for our efforts and hope you can understand our situation, if possible.
UniTST is designed to address existing models' limitations in capturing complex dependencies across variate and temporal dimensions in multivariate time series (MTS) data. The authors argue that previous Transformer models have fallen short in simultaneously capturing inter-variate (between different series) and intra-variate (within the same series) dependencies, critical for accurate forecasting in real-world data.
The authors introduce a unified attention mechanism within UniTST that operates on flattened patch tokens, allowing the model to directly and explicitly learn the intricate inter-series and intra-series dependencies. To manage the increased complexity associated with a large number of variates, a dispatcher module is integrated into the model, reducing the computational complexity from quadratic to linear, thus making the model scalable.
The paper contributes to the field by highlighting the importance of inter- and intra-variate dependencies in MTS forecasting, proposing a simple yet effective model architecture to capture these dependencies, and empirically demonstrating its superiority over existing methods. The findings emphasize the necessity of simultaneously modelling variate and temporal dynamics in multivariate time series analysis.
Strengths
The paper clearly outlines the limitations of existing Transformer models in capturing both inter-variable and intra-variable dependencies within multivariate time series data. To address this issue, it introduces the UniTST model.
The proposed UniTST model employs a unified attention mechanism alongside a scheduler module to simultaneously capture inter-variable and intra-variable dependencies. This innovative design effectively enhances the handling of multivariate time series data.
The experimental results presented in the paper are reproducible, reliable, and credible.
Weaknesses
The motivation behind the study is not clearly articulated, which makes it difficult to fully understand the underlying rationale and significance of the research problem. Additionally, the proposed approach lacks sufficient novelty, as it does not introduce substantially new concepts or techniques compared to existing methods. Strengthening the motivation and highlighting the unique contributions of the work would enhance its originality and impact within the field.
In the ablation study, the effectiveness of the dispatcher is evaluated based on memory usage, an approach that is relatively uncommon. While comparing the memory consumption of the dispatcher module with other modules within the model provides useful insights, it is not sufficient to conduct ablation experiments on only one component of the model. It is also essential to assess the whole model's memory impact relative to other models. A more rigorous evaluation would involve additional comparisons with state-of-the-art (SOTA) models, focusing on computational cost and model parameter size both before and after integrating the dispatcher. This broader comparison would offer a clearer understanding of the dispatcher's role and efficiency within the model.
Questions
Training was conducted using an A100 40GB GPU, which, under typical conditions, rarely runs out of memory—except when working with large models such as TimesNet on the Traffic dataset. However, the ablation study does not specify the batch size used, also there are questions about the choice of memory usage as a comparative metric. A detailed explanation is needed to clarify why memory usage was selected as a benchmark for this experiment and why one out of four ablation experiments resulted in an out-of-memory (OOM) error.
The motivation for this study is not clearly defined and appears somewhat unconvincing. Although the results from the experiment in Figure 3 are cited as the primary source of motivation, the rationale behind conducting the experiment in Figure 3 itself is unclear. Most existing models that aim to capture inter-variable and intra-variable dependencies in multivariate data do not employ patch operations. Thus, the introduction of patching in the experiment warrants further explanation. A more rigorous clarification of the motivation, particularly regarding the choice to incorporate patch operations, would strengthen the foundation and relevance of this study.
Dear Reviewer 3TG2,
We appreciate the valuable feedback and suggestions. We would like to clarify and answer the questions as below:
W1: The motivation behind the study is not clearly articulated, which makes it difficult to fully understand the underlying rationale and significance of the research problem. ... Strengthening the motivation and highlighting the unique contributions of the work would enhance its originality and impact within the field.
Thanks for pointing out the issue. We would like to highlight again that our main contribution is first pointing out the importance of simultaneously capturing inter- and intra-series dependencies with evidence from real-world data (see Figures 2 and 3), and showing that previous works lack this ability. We believe this is valuable because, to the best of our knowledge, no prior work provides evidence to explain why we need to simultaneously capture inter- and intra-series dependencies. With this motivation, we propose a simple and effective method to directly and effectively address the limitation of previous methods.
W2: It is also essential to assess the whole model's memory impact relative to other models. A more rigorous evaluation would involve additional comparisons with state-of-the-art (SOTA) models, focusing on computational cost and model parameter size both before and after integrating the dispatcher. This broader comparison would offer a clearer understanding of the dispatcher’s role and efficiency within the model.
Thanks for the suggestions. We provide more evaluation results on computational cost and model parameter size for UniTST w/ and w/o dispatchers, PatchTST, and iTransformer. The memory cost is obtained on the Traffic dataset with batch size 64 for all models.
| | UniTST w/ dispatchers | UniTST w/o dispatchers | PatchTST | iTransformer |
|---|---|---|---|---|
| Model Parameter Size | 1,974,496 | 1,871,200 | 681,186 | 6,411,872 |
| Memory Cost | 22,874 MB | OOM | 9,155 MB | 23,347 MB |
We can see that UniTST w/ dispatchers has slightly more parameters than w/o dispatchers, while the version w/ dispatchers has a lower memory cost by avoiding the huge attention matrices. In addition, we also found that iTransformer has many more model parameters than UniTST and requires slightly more memory compared with UniTST w/ dispatchers.
Q1: The ablation study does not specify the batch size used, and there are questions about the choice of memory usage as a comparative metric. A detailed explanation is needed to clarify why memory usage was selected as a benchmark for this experiment and why one out of four ablation experiments resulted in an out-of-memory (OOM) error.
In the ablation study, as pointed out at the top of page 8, we ensure that these hyperparameters (the number of layers, hidden dimension, batch size) are set the same for both w/ and w/o routers. The specifications are: batch size 64, hidden dimension 128, and 4 layers.
Q2: Although the results from the experiment in Figure 3 are cited as the primary source of motivation, the rationale behind conducting the experiment in Figure 3 itself is unclear. Most existing models that aim to capture inter-variable and intra-variable dependencies in multivariate data do not employ patch operations. Thus, the introduction of patching in the experiment warrants further explanation.
Thanks for raising this issue on the motivation figure and the patching operations. We would like to provide further explanation on why we use patches. Patching essentially aggregates several adjacent time stamps into a sub-series and maps it to an embedding. In fact, patching has been used in several previous works of different types, e.g., PatchTST (channel-independent) and Crossformer (aiming to capture inter-variable and intra-variable dependencies). Models that work on the whole series (such as iTransformer) can also be considered as setting the patch size to the series length.
In our humble opinion, patching avoids the noisy information at the individual time stamp level and enhances local semantic meaning. With patching, our proposed attention module can capture dependencies between patches (sub-series). In contrast, without patching, attention is applied at the individual time stamp level, which has little semantic meaning. Additionally, without patching, we cannot investigate the correlation between two scalar values at different time stamps. In the language domain, each word has real semantic meaning, while in time series, a single value at one time stamp carries little semantic meaning on its own; this difference creates the difficulty of time series modeling. For these reasons, we believe patching helps the modeling ability, and we therefore incorporate patch operations in our method.
The authors claim that "each value at one time stamp has no semantic meaning" in time series data, which is an oversimplification. In reality, every time point in many time series datasets holds important information that's crucial for forecasting and analysis. For example, each timestamp of stock prices or hourly temperature records carries significant value in practical applications. Moreover, this paper hasn't clearly explained how patching specifically enhances the unified Attention mechanism's ability to capture dependencies. Without a solid theoretical foundation, the rationale behind combining patching with unified Attention doesn't seem sufficiently justified. This raises questions about the overall novelty and scientific merit of the approach.
The proposed method is too similar to existing models like PatchTST in some aspects, lacking significant innovation. This kind of incremental improvement based on existing methods doesn't provide enough new value to stand out or justify substantial attention and acceptance.
Dear Reviewer 3TG2,
Thanks for the comments. We understand that your concerns mainly remain on Patching. However, we respectfully disagree with some of your statements and provide the further clarification below:
- We respectfully disagree with your statement "The proposed method is too similar to existing models like PatchTST in some aspects". In our humble opinion, except for using patching, our proposed method is not similar to PatchTST. Our method captures inter- and intra-series dependencies, while PatchTST operates individually on each series (i.e., it only captures intra-series dependencies). We kindly request clarification of your statement "too similar to PatchTST in some aspects": which aspects are you referring to?
- In terms of your question on why we are using patching, we have explained in our previous comments: "to avoid the noisy information at individual time stamp level and enhances the local semantic meaning". In our humble opinion, patching is a common module used to preserve more semantic meaning, which has been used in many previous works, such as PatchTST [1], Crossformer [2], TimeXer [3], and even a recent time series foundation model, TimesFM [4]. We would like to clarify that we did not list patching as a contribution of this submission; it is rather an empirical choice, following previous papers. We don't fully understand your comment "Moreover, this paper hasn't clearly explained how patching specifically enhances the unified Attention mechanism's ability to capture dependencies. Without a solid theoretical foundation, the rationale behind combining patching with unified Attention doesn't seem sufficiently justified." Could you please clarify why this explanation is required? As patching is not our contribution and is out of the scope of our submission, we did not investigate this further; we can leave it as future work.
- Another reason for using patching is that, as we clarified in our previous response, attention is meaningful at the sub-series / patch level. It means that, between two patches, there is semantic meaning about how strongly these two patches are correlated (potential dependencies exist). For example, two patches with similar upward trends may be highly correlated. In contrast, at the individual time stamp level, there is little semantic meaning in how strongly two individual points are correlated.
- We agree with your point that "each timestamp of stock prices or hourly temperature records carries significant value". However, two individual timestamps are largely independent, and there is little semantic meaning in the dependencies/correlations between two individual timestamps.
In addition to our clarification, we also notice that you changed the score from 5 to 3 just after we received the response from Reviewer URks. However, your score remained unchanged at 5 when you replied to our response at 26 Nov 2024, 11:15. We'd like to kindly remind you that, during this period, our submission remained the same for both the manuscript and the supplementary materials. Therefore, may we kindly ask the reason for this score change?
We look forward to your further clarifications/questions/comments. We are willing to respond and clarify further.
References:
[1] Nie, Yuqi, et al. "A time series is worth 64 words: Long-term forecasting with transformers." ICLR 2023.
[2] Zhang, Yunhao, and Junchi Yan. "Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting." ICLR 2023.
[3] Wang, Yuxuan, et al. "Timexer: Empowering transformers for time series forecasting with exogenous variables." NeurIPS 2024.
[4] Das, Abhimanyu, et al. "A decoder-only foundation model for time-series forecasting." ICML 2024.
Dear Authors,
Thank you for your rebuttal and for the clarifications provided. However, I would like to address several points of concern regarding the sequence of interactions and the substance of your response.
1. Clarification on the Scoring and Response Sequence
As mentioned in your rebuttal, you indicated that after I submitted my initial review and score (5), you did not receive a follow-up from me addressing your motivation. However, it is important to clarify that I did respond to your submission after your first reply. In my response, I raised several concerns, including the lack of clear motivation behind certain design choices, which I found to be an important point for further justification.
Instead of addressing these concerns, I noticed that you chose to focus on Reviewer URks' comments and left my questions unresolved. This led me to reconsider my score. After reading the response to Reviewer URks, I came to the conclusion that the initial score of 5 may have been too high, considering that some key theoretical aspects remained unclear. As a result, I revised my score to 3, reflecting these concerns.
It is only after I modified my score that I saw further responses from you. This sequence of events suggests that my comments may not have been fully considered in the earlier stages of the rebuttal process, and I am now addressing these concerns in more detail.
2. On the Relationship to PatchTST
Regarding the comparison to PatchTST [1], I still maintain that your method shares a significant similarity with it. While I agree that your method introduces some novel elements, the core structure of your approach (i.e., the use of patching as a method for data flow modeling) closely follows the work presented in PatchTST. The main distinction, as you pointed out, is how you handle attention routing between the patches, and it is clear that this idea is inspired by CrossFormer [2].
However, the main contribution of your paper appears to be the optimization of the attention routing process to reduce time complexity, which I initially recognized as a potential innovation. This is why I awarded a score of 5 in my first review. However, upon further reflection, the theoretical justification for why this optimization is necessary and how it improves upon previous methods remains unclear. I would expect a deeper analysis of how the attention routing mechanism is fundamentally different and beneficial in your context, especially in comparison to existing solutions like PatchTST and CrossFormer. Without this additional theoretical support, the work seems more like an incremental improvement rather than a groundbreaking contribution.
3. Motivation and Novelty of Your Approach
In your abstract, you mention the importance of capturing both inter-series and intra-series dependencies, which is a central motivation for your proposed model. However, it is important to note that several recent works have also addressed this challenge, and the motivation behind your approach lacks significant novelty. For example, models like FTSCGNN [4], LIFT [5], CARD [6], Client [7], and DSFormer [8] also aim to capture these dependencies in multivariate time series forecasting (MTSF).
- FTSCGNN [4] introduces a spatial-temporal graph to capture both inter- and intra-series dependencies in MTS data.
- LIFT [5] rethinks channel dependence and learns from leading indicators to capture dependencies.
- CARD [6] aligns channels and blends multiple sources of information to improve forecasting.
- Client [7] integrates cross-variable linear dependencies, which are crucial for MTSF.
- DSFormer [8] uses a double sampling strategy to enhance long-term prediction performance.
These works have already demonstrated the significance of capturing both intra- and inter-series dependencies, and I believe this concept is now well-established in the literature. Therefore, the novelty of your motivation may be limited, especially given that it has already been explored by several other recent works.
4. Suggestions for Further Theoretical Analysis
I believe that further clarification on a few points would strengthen your submission:
- First, is the routing scheme you propose inherently tied to the patching approach, or could it be applied in other contexts, such as linear models? Providing this kind of analysis would broaden the scope and theoretical significance of your work.
- Second, regarding your view that individual timestamps contain no meaningful information, I believe this perspective warrants reconsideration. Recent work, such as GLAFF [3], has demonstrated that even individual timestamps can carry important periodic information, such as daily or hourly patterns. This contradicts the idea that timestamps are independent and devoid of semantic meaning. A more detailed discussion of how timestamp-level features can be integrated into your model could strengthen your argument and broaden its applicability.
5. Conclusion
To summarize, I initially gave your paper a score of 5 based on the potential novelty of your attention routing optimization, but after reviewing your rebuttal and considering additional feedback from Reviewer URks, I revised my score to 3. The main reason for the downgrade is the lack of a strong theoretical contribution and the unexplored implications of your routing mechanism. I encourage you to further justify the theoretical aspects of your model and explore how timestamp-level information might be incorporated in future versions of the work.
Best regards,
Reviewer 3TG2
References
[1] Nie, Yuqi, et al. "A time series is worth 64 words: Long-term forecasting with transformers." ICLR 2023.
[2] Zhang, Yunhao, and Junchi Yan. "Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting." ICLR 2023.
[3] Wang, Chengsen. "Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective." NeurIPS 2024.
[4] Wang, Yucheng, et al. "Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data." AAAI 2024.
[5] Zhao, Lifan, et al. "Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators." ICLR 2024.
[6] Wang, Xue, et al. "CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting." ICLR 2024.
[7] Gao, Jiaxin, et al. "Client: Cross-variable Linear Integrated Enhanced Transformer for Multivariate Long-Term Time Series Forecasting." Arxiv 2023.
[8] Yu, Chengqing, et al. "DSformer: A Double Sampling Transformer for Multivariate Time Series Long-term Prediction." CIKM 2023.
Further Analysis and insights on Patching
There might be some misunderstandings. We would like to first clarify two different concepts: 1. individual timestamps, 2. individual time points. In our previous replies, we used time point and timestamp interchangeably (similar to what you mentioned here: "In reality, every time point in many time series datasets holds important information that's crucial for forecasting and analysis. For example, each timestamp of stock prices or hourly temperature records carries significant value in practical applications.").
When we said "on individual time stamp level, no semantic meaning on how strong these two individual points are correlated.", we were referring to the numerical values (e.g., n variates -> n values) at two different time points instead of the timestamp (a digital record like 2024-01-02 07:10).
From here on, to make sure we are on the same page, we use time points to refer to the numerical values and timestamps to refer to digital records like 2024-01-02 07:10 used as additional time features. In our previous replies, we argued that patching avoids the noisy information at individual time points, and that there is little semantic meaning in the dependencies/correlations between two individual time points. To support this, we provide the empirical results below:
| Dataset | Pred_len | Patch len = 1 MSE | Patch len = 1 MAE | UniTST (patch len = 16) MSE | UniTST (patch len = 16) MAE |
|---|---|---|---|---|---|
| ETTh1 | 96 | 0.387 | 0.403 | 0.383 | 0.398 |
| | 192 | 0.443 | 0.434 | 0.434 | 0.426 |
| | 336 | 0.482 | 0.45 | 0.471 | 0.445 |
| | 720 | 0.516 | 0.485 | 0.479 | 0.469 |
| ETTh2 | 96 | 0.298 | 0.345 | 0.292 | 0.342 |
| | 192 | 0.379 | 0.395 | 0.37 | 0.39 |
| | 336 | 0.387 | 0.413 | 0.382 | 0.408 |
| | 720 | 0.429 | 0.445 | 0.409 | 0.431 |
| ETTm1 | 96 | 0.337 | 0.37 | 0.313 | 0.352 |
| | 192 | 0.379 | 0.384 | 0.359 | 0.38 |
| | 336 | 0.414 | 0.418 | 0.395 | 0.404 |
| | 720 | 0.487 | 0.466 | 0.449 | 0.44 |
| ETTm2 | 96 | 0.183 | 0.271 | 0.178 | 0.262 |
| | 192 | 0.251 | 0.31 | 0.243 | 0.304 |
| | 336 | 0.312 | 0.352 | 0.302 | 0.341 |
| | 720 | 0.418 | 0.409 | 0.398 | 0.395 |
| Weather | 96 | 0.162 | 0.212 | 0.156 | 0.202 |
| | 192 | 0.212 | 0.255 | 0.207 | 0.25 |
| | 336 | 0.268 | 0.295 | 0.263 | 0.292 |
| | 720 | 0.345 | 0.343 | 0.34 | 0.341 |
| Exchange | 96 | 0.086 | 0.203 | 0.08 | 0.198 |
| | 192 | 0.171 | 0.294 | 0.173 | 0.296 |
| | 336 | 0.351 | 0.433 | 0.314 | 0.406 |
| | 720 | 1.18 | 0.812 | 0.838 | 0.693 |
| Solar | 96 | 0.202 | 0.2238 | 0.189 | 0.228 |
| | 192 | 0.223 | 0.255 | 0.222 | 0.253 |
| | 336 | 0.268 | 0.295 | 0.242 | 0.275 |
| | 720 | 0.338 | 0.343 | 0.247 | 0.282 |
From the table, we observe that using individual time points instead of patching is generally worse. Specifically, when the prediction length is relatively long (e.g., 720), using individual time points is significantly worse than using patching. This observation verifies the effectiveness of using patching to avoid noisy information at individual time points.
For GLAFF [1], we would like to point out that the timestamp information they use is the digital record (e.g., 2024-01-02 07:10) as additional time features, which is not what we are discussing regarding individual time points. In our work, we mainly focus on using the numeric values in the time series without considering timestamp-level features. We agree that incorporating timestamp-level features might be an interesting topic for future work, although it is out of the scope of our current version.
Additionally, GLAFF has only been publicly available on arXiv since 27 Sep 2024 12:34:08 UTC, which is literally 4 days before the ICLR 2025 submission deadline. Although, according to the reviewer guide (https://iclr.cc/Conferences/2025/ReviewerGuide), we believe we are not necessarily expected to be aware of and discuss this work, we still provide our discussion above.
Reference:
[1] Wang, Chengsen. "Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective." NeurIPS 2024.
Dear Reviewer 3TG2,
Thank you for your comments and clarifications. We provide our further clarifications and questions below:
Clarifications and Questions on Clarification on the Scoring and Response Sequence
You mentioned:
As mentioned in your rebuttal, you indicated that after I submitted my initial review and score (5), you did not receive a follow-up from me addressing your motivation.
We disagree with this statement. We believe that we did not say that we did not receive a follow-up from you. If we are remembering incorrectly, could you please provide some pointers?
Instead of addressing these concerns, I noticed that you chose to focus on Reviewer URks' comments and left my questions unresolved. This led me to reconsider my score.
We tried to collect responses from all reviewers in the first round and respond to all together. Therefore, we sent the reminder to Reviewer URks first before responding to your first-round response. If this behavior (i.e., sending a reminder to a reviewer before responding to another reviewer) is not allowed, we apologize and kindly ask for a reference. We also would like to kindly ask why "choosing to focus on Reviewer URks' comments and leaving your questions unresolved" led you to "reconsider your score". We are not sure how this behavior affects our paper's quality in a way that would lead you to reconsider the score.
Please note that after we received the first-round response from Reviewer URks, we first responded to your first-round comments on 28 Nov 2024, 18:26 UTC, and then responded to Reviewer URks' first-round comments on 28 Nov 2024, 19:07 UTC.
It is only after I modified my score that I saw further responses from you. This sequence of events suggests that my comments may not have been fully considered in the earlier stages of the rebuttal process.
We replied to your first-round comments simply because we had received comments from all reviewers, so we started replying to the first-round comments from the different reviewers. We respectfully disagree with the accusation "This sequence of events suggests that my comments may not have been fully considered in the earlier stages of the rebuttal process". We assure you that this is not the case from our perspective; however, we cannot control others' perceptions.
Now, let us move on to address your remaining concerns.
Clarifications on Motivation
We would like to emphasize that, regarding the motivation of capturing inter-series and intra-series dependencies, we provide empirical evidence to show that these dependencies (i.e., Figure 3 in our manuscript shows correlation between time patches from different variates) do exist in real-world data. This demonstrates the strong reason why we want to capture these dependencies.
For the previous works you mentioned, while they also aim to capture inter-series and intra-series dependencies, they do so without clear evidence to justify the reason for capturing these dependencies. Regarding the statement "These works have already demonstrated the significance of capturing both intra- and inter-series dependencies," we believe they show the "significance" by demonstrating good overall performance. In contrast, we show empirical real-world evidence for our motivation, which we did not see in the references you mentioned, and furthermore, we show the visualization of the multivariate correlations captured by our model in Appendix D, Figure 10, which further verifies the effectiveness of our method. DSFormer [1] only shows the variable correlations in real-world data.
Besides the main difference discussed above, regarding the specific work you mentioned, we discussed CARD [2] in our original manuscript and added LIFT [3] in our revision. For the remaining works, we will add them in our future version (as PDF updates are not currently permitted).
References:
[1] Yu, Chengqing, et al. "DSformer: A Double Sampling Transformer for Multivariate Time Series Long-term Prediction." CIKM 2023.
[2] Wang, Xue, et al. "CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting." ICLR 2024.
[3] Zhao, Lifan, et al. "Rethinking Channel Dependence for Multivariate Time Series Forecasting: Learning from Leading Indicators." ICLR 2024.
Dear Authors,
Thank you for your detailed response and for addressing some of my concerns. I would like to further clarify my position and elaborate on the reasons behind my previous comments and score adjustment.
1. Clarification on the Scoring and Response Sequence
I want to emphasize that my concerns are consistent with those of Reviewer URks. In your responses to Reviewer URks, I did not observe substantial clarification about the essence of your method. Instead, the explanations provided were mostly conceptual and lacked depth, which is insufficient to address the fundamental issues. This aligns with my concerns regarding the lack of novelty in your paper.
My decision to adjust the score was based on the lack of adequate responses to these critical concerns, particularly regarding the empirical analysis related to timestamps and patching. After my initial review and your subsequent reply, I did not receive some clarification on these points until just now. The delay in addressing these important issues contributed to my reassessment of the paper's contribution.
2. Remaining Concerns
a. Questions on Patching and Routing Mechanism
My main question remains: Why is the coupling of patching with the CrossFormer routing mechanism beneficial? Could you provide theoretical analysis or empirical evidence to support this design choice? Additionally, how would the performance be affected if the routing mechanism used linear attention instead? Is there any theoretical foundation that explains the advantages of your approach over existing methods?
Furthermore, can the overall model architecture be applied to linear models? Exploring this possibility could broaden the impact of your work and provide deeper insights into the applicability of your method.
b. Need for Theoretical Analysis
While empirical performance on public datasets is important, it is a necessary but not sufficient condition to demonstrate the novelty and effectiveness of your design. To make your model structure more convincing, it is essential to provide theoretical analysis. Specifically, a mathematical definition of inter-series and intra-series dependencies, and an explanation of how the coupling of routing and patching addresses these dependencies, would strengthen your contribution.
Providing mathematical proofs or theoretical justification would enhance the credibility of your work and offer valuable guidance for future research in this area. Without such analysis, the current approach appears to be a combination of existing modules without a clear underlying rationale.
3. Conclusion
In summary, the lack of theoretical foundation and the over-coupling of modules remain significant concerns. The in-depth analysis of these aspects has not yet been presented. I encourage you to address these points thoroughly, as doing so would significantly strengthen your paper. If you can provide solid theoretical foundations and demonstrate how your approach uniquely solves the identified problems, I would be willing to reconsider my assessment.
Best regards,
Reviewer 3TG2
Dear Reviewer 3TG2,
Thanks for your comments. We provide the clarification below:
My decision to adjust the score was based on the lack of adequate responses to these critical concerns, particularly regarding the empirical analysis related to timestamps and patching. After my initial review and your subsequent reply, I did not receive some clarification on these points until just now. The delay in addressing these important issues contributed to my reassessment of the paper's contribution.
We would like to clarify that we kept clarifying the points about “time points” and patching. We did not provide the empirical analysis until the last response because we (the authors and you) seem to have some misalignment on “time points” versus “timestamps”. Based on your responses so far, we still have not received your clarification on what you mean by “timestamp” vs. “time points”. Could you clarify here?
From our side, in our last response, we provided clarification on these points. Additionally, we would like to kindly ask if you have any comments or suggestions on our empirical analysis of individual time points vs. patching.
My main question remains: **Why is the coupling of patching with the CrossFormer routing mechanism beneficial?** Could you provide theoretical analysis or empirical evidence to support this design choice?
We would like to clarify that, although we did not provide a theoretical analysis, we did provide empirical evidence showing that patching gives better performance than using individual time points, since patching avoids the noisy information at individual time points.
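For illustration, below is a minimal sketch (simplified, with illustrative names and sizes; not our exact implementation) of what feeding flattened patch tokens means: each token embeds `patch_len` consecutive time points of one variate before attention, which is why pointwise noise is smoothed relative to per-time-point tokens.

```python
import torch
import torch.nn as nn

def flatten_into_patch_tokens(x: torch.Tensor, patch_len: int, stride: int,
                              embed: nn.Linear) -> torch.Tensor:
    # x: (batch, seq_len, n_vars)
    x = x.permute(0, 2, 1)                        # (batch, n_vars, seq_len)
    patches = x.unfold(-1, patch_len, stride)     # (batch, n_vars, n_patches, patch_len)
    tokens = embed(patches)                       # (batch, n_vars, n_patches, d_model)
    # Flatten the variate and patch axes into one sequence so that a single attention
    # map can relate any (variate, time-patch) pair directly.
    return tokens.flatten(1, 2)                   # (batch, n_vars * n_patches, d_model)

# Illustrative sizes: 7 variates, 96 input steps, patch length 16 -> 6 patches per variate.
x = torch.randn(8, 96, 7)
embed = nn.Linear(16, 64)
tokens = flatten_into_patch_tokens(x, patch_len=16, stride=16, embed=embed)  # (8, 42, 64)
```

With individual time points, each variate would instead contribute 96 single-value tokens, so the sequence would be both longer and more sensitive to pointwise noise.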
Specifically, a mathematical definition of inter-series and intra-series dependencies, and an explanation of how the coupling of routing and patching addresses these dependencies, would strengthen your contribution. ..... Providing mathematical proofs or theoretical justification would enhance the credibility of your work and offer valuable guidance for future research in this area.
We would like to point out that we provide the definition of cross-time, cross-variate correlation in Definition 1. We leave a theoretical proof justifying our model structure for future work. Moreover, we would like to point out that the references you provided (such as FTSCGNN, LIFT, and Client) also do not include theoretical analysis.
This paper addresses the issue that previous attention-based methods were unable to model temporal and channel dependencies simultaneously. It proposes a flattening approach and draws inspiration from the attention mechanisms of ETC and Crossformer to reduce the complexity of the attention computation. Finally, it attempts to demonstrate the effectiveness of the method through experiments.
Strengths
S1: The multivariate time series forecasting problem focused on in this paper is worthy of investigation.
S2: The author's motivation for the study makes sense.
S4: The paper is well-written and well-organized, and the content reads smoothly.
Weaknesses
W1: The authors' model design lacks innovation. While flattening patches makes sense, the Transformer architecture used does not appear to offer any significant novelty, and the attention-with-dispatchers setup is nearly identical to those of ETC [1] and Crossformer [2].
[1] Ravula, A., Alberti, C., Ainslie, J., Yang, L., Pham, P. M., Wang, Q., ... & Fisher, Z. (2020, June). ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[2] Zhang, Y., & Yan, J. (2023, May). Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The eleventh international conference on learning representations.
W2: The authors' lack of thorough research on past methods is concerning. While the authors took note of iTransformer from ICLR 2024, they failed to consider contemporary state-of-the-art methods such as TimeMixer [3], FITS [4], and ModernTCN [5]. Furthermore, the paper still lacks a comparison with GNN-based methods like CrossGNN [6] and FourierGNN [7] from NeurIPS 2023.
[3] Wang, S., Wu, H., Shi, X., Hu, T., Luo, H., Ma, L., ... & ZHOU, J. (2024). TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.
[4] Xu, Z., Zeng, A., & Xu, Q. (2024). FITS: Modeling Time Series with 10k Parameters. In The Twelfth International Conference on Learning Representations.
[5] Luo, D., & Wang, X. (2024). Moderntcn: A modern pure convolution structure for general time series analysis. In The Twelfth International Conference on Learning Representations.
[6] Huang, Q., Shen, L., Zhang, R., Ding, S., Wang, B., Zhou, Z., & Wang, Y. CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement. In Thirty-seventh Conference on Neural Information Processing Systems.
[7] Yi, K., Zhang, Q., Fan, W., He, H., Hu, L., Wang, P., ... & Niu, Z. FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective. In Thirty-seventh Conference on Neural Information Processing Systems.
W3: The experimental comparisons are insufficient. The methods mentioned in W2 were not compared against by the authors; therefore, it cannot be concluded that UniTST achieves SOTA performance.
W4: The paper lacks details needed for reproducibility.
Questions
- Could the authors further expand the search range for hidden dimensions to evaluate all models' performance fairly? Also, please publicly disclose the optimal parameters for all models across various tasks as determined through validation sets (batch size, hidden dimensions, etc.). This is crucial to validate the performance of UniTST and mitigate any impact from selective parameter choices.
- I noticed the experimental results in Table 6, which seem to show minimal improvement compared to iTransformer in many cases. Could the authors, after addressing question 1, provide new experimental results, analyze the reasons behind them, and conduct a comparative analysis of the efficiency (time/memory) of these two methods?
- Can the authors provide a case study of a variable prediction curve?
Q1: Could the authors further expand the search range for hidden dimensions to evaluate all models' performance fairly? Also, please publicly disclose the optimal parameters for all models across various tasks as determined through validation sets (batch size, hidden dimensions, etc.). This is crucial to validate the performance of UniTST and mitigate any impact from selective parameter choices.
For the results of previous models (including iTransformer), we reuse the results from the iTransformer paper [1], as we use the same experimental setting; we believe these should be the best results of these models (or at least for iTransformer, which is generally the second-best model, as shown in our evaluation result tables).
Here, we disclose the hyper-parameters of UniTST used on all datasets:
| Dataset | Batch Size | Hidden Dimension | Number of Layers |
|---|---|---|---|
| ETTh1 | 128 | 64 | 3 |
| ETTh2 | 128 | 64 | 3 |
| ETTm1 | 128 | 64 | 3 |
| ETTm2 | 128 | 64 | 3 |
| Electricity | 32 | 512 | 3 |
| Traffic | 64 | 128 | 3 |
| Weather | 128 | 512 | 3 |
| Exchange | 32 | 512 | 3 |
| Solar | 128 | 512 | 3 |
| PEMS03 | 64 | 512 | 4 |
| PEMS04 | 64 | 512 | 4 |
| PEMS07 | 64 | 512 | 3 |
| PEMS08 | 128 | 512 | 3 |
Q2: I noticed the experimental results in Table 6, which seem to show minimal improvement compared to iTransformer in many cases. Could the authors, after addressing question 1, provide new experimental results, analyze the reasons behind them, and conduct a comparative analysis of the efficiency (time/memory) of these two methods?
Training time comparison (seconds/iteration):
| Dataset | UniTST (ours) | iTransformer |
|---|---|---|
| ECL | 0.1472 | 0.0290 |
| Exchange | 0.0678 | 0.0196 |
| Weather | 0.0492 | 0.1034 |
Memory and model parameter size comparison:
| Metric | UniTST | iTransformer |
|---|---|---|
| Model Parameter Size | 1,974,496 | 6,411,872 |
| Memory Cost | 22,874MB | 23,347MB |
The memory cost is calculated on Traffic dataset with batch size 64 for both models.
From the above results, UniTST trains faster than iTransformer on the Weather dataset, uses significantly fewer parameters, and has a slightly lower memory cost.
Q3: Can the authors provide a case study of a variable prediction curve?
Sorry, we are not sure what this refers to. Could you clarify what you mean by a variable prediction curve? Do you mean varying the prediction length and observing how the performance changes, or plotting the predictions generated by different models directly?
Reference:
[1] iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
Thank you for your comprehensive rebuttal. I appreciate the effort you've made to address the concerns raised in my initial review. After careful consideration of your responses, although I still have concerns about innovation, specifically the difference from Crossformer, I have decided to revise my score to 6, reflecting the improvements you've made.
Dear Reviewer wpST,
We appreciate the valuable feedback and suggestions. We would like to clarify and answer the questions below:
W1: The authors' model design lacks innovation. While flattening patches makes sense, the Transformer architecture used does not appear to offer any significant novelty, and the attention-with-dispatchers setup is nearly identical to those of ETC [1] and Crossformer [2].
We would like to point out that we provided a discussion of our model versus Crossformer in Sections 2 and 4.2. Although the attention-with-dispatchers design may look similar, we use it for a different purpose: our method uses dispatchers to reduce the memory complexity so that attention can simultaneously capture inter- and intra-series dependencies on the unified sequence. Crossformer uses a two-stage method that sequentially captures intra-series dependencies and then inter-series dependencies, which is relatively limited, as we discussed in Section 3.
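For illustration, one way to realize dispatchers as cross-attention with a small set of learnable tokens over the flattened sequence is sketched below (a simplified sketch with illustrative names, not our exact implementation):

```python
import torch
import torch.nn as nn

class DispatcherAttention(nn.Module):
    """Tokens -> dispatchers -> tokens cross-attention over the flattened sequence."""
    def __init__(self, d_model: int, n_heads: int, n_dispatchers: int):
        super().__init__()
        self.dispatchers = nn.Parameter(torch.randn(n_dispatchers, d_model))
        self.collect = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, L, d_model) with L = n_vars * n_patches
        b = tokens.size(0)
        disp = self.dispatchers.unsqueeze(0).expand(b, -1, -1)   # (batch, D, d_model)
        # Step 1: dispatchers gather information from all tokens (queries = dispatchers).
        disp, _ = self.collect(disp, tokens, tokens)
        # Step 2: every token reads the gathered summary back (queries = tokens).
        out, _ = self.distribute(tokens, disp, disp)
        return out

# Illustrative usage: 7 variates x 6 patches flattened into 42 tokens of width 64.
tokens = torch.randn(8, 42, 64)
layer = DispatcherAttention(d_model=64, n_heads=4, n_dispatchers=8)
out = layer(tokens)  # (8, 42, 64)
```

With a small, fixed number of dispatchers D, the two cross-attention steps cost O(L x D) for a flattened sequence of length L = n_vars x n_patches, instead of the O(L^2) of full attention, while any pair of (variate, time-patch) tokens can still exchange information within a single layer.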
W2 and W3: The authors' lack of thorough research on past methods is concerning. While the authors took note of iTransformer from ICLR 2024, they failed to consider contemporary state-of-the-art methods such as TimeMixer [3], FITS [4], and ModernTCN [5]. Furthermore, the paper still lacks a comparison with GNN-based methods like CrossGNN [6] and FourierGNN [7] from NeurIPS 2023. The experimental comparisons are insufficient. The methods mentioned in W2 were also not compared by the authors; therefore, it cannot be concluded that UniTST achieves SOTA performance.
Thanks for the suggestions. We added several methods, ModernTCN [1] (TCN-based), TimeMixer [2] (MLP-based), CrossGNN [3] (GNN-based), and TSLANet [4] (CNN-based), as comparisons with our method. Following iTransformer, we use a sequence length of 96 for all models. We show the average results over 4 prediction lengths (i.e., 96, 192, 336, 720) in the following table.
| Dataset | TSLANet MSE | TSLANet MAE | ModernTCN MSE | ModernTCN MAE | UniTST MSE | UniTST MAE | TimeMixer MSE | TimeMixer MAE | CrossGNN MSE | CrossGNN MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 0.447 | 0.441 | 0.435 | 0.428 | 0.442 | 0.435 | 0.447 | 0.440 | 0.437 | 0.434 |
| ETTh2 | 0.354 | 0.391 | 0.353 | 0.387 | 0.363 | 0.393 | 0.364 | 0.395 | 0.393 | 0.418 |
| ETTm1 | 0.379 | 0.397 | 0.385 | 0.400 | 0.379 | 0.394 | 0.382 | 0.381 | 0.395 | 0.404 |
| ETTm2 | 0.283 | 0.327 | 0.281 | 0.327 | 0.280 | 0.326 | 0.275 | 0.323 | 0.282 | 0.330 |
| ECL | 0.199 | 0.282 | 0.202 | 0.285 | 0.166 | 0.262 | 0.182 | 0.272 | 0.201 | 0.300 |
| Exchange | 0.353 | 0.400 | 0.353 | 0.401 | 0.351 | 0.398 | N.A. | N.A. | 0.345 | 0.395 |
| Traffic | 0.494 | 0.314 | 0.625 | 0.377 | 0.439 | 0.274 | 0.484 | 0.297 | 0.583 | 0.323 |
| Weather | 0.259 | 0.279 | 0.243 | 0.273 | 0.242 | 0.271 | 0.240 | 0.271 | 0.247 | 0.289 |
In the table, we can see that, for MSE, UniTST achieves the best performance on ETTm1, ECL, and Traffic (3 out of 8 datasets). TimeMixer and ModernTCN are each best for MSE on 2 out of 8 datasets. For MAE, UniTST is the best model on ETTm1, ECL, Traffic, and Weather (4 out of 8 datasets).
W4: The lack of details in reproducibility.
Thanks for the suggestions. We added the code to the supplementary materials.
References:
[1] Luo, Donghao, and Xue Wang. Moderntcn: A modern pure convolution structure for general time series analysis. ICLR 2024.
[2] Wang, Shiyu, et al. Timemixer: Decomposable multiscale mixing for time series forecasting. ICLR 2024
[3] Huang, Qihe, et al. Crossgnn: Confronting noisy multivariate time series via cross interaction refinement. NeurIPS 2023.
[4] Eldele, Emadeldeen, et al. Tslanet: Rethinking transformers for time series representation learning. ICML 2024
Hi all,
I am the AC of this paper and I have finished a deep dive into the paper as well as all the discussion before.
I would like to give a few clarifications and try to highlight the next-step discussion points to help the opinions converge. Authors and reviewers, please read my clarifications and follow my highlights to continue the discussion.
I kindly remind you all to please be polite and respectful to the authors and reviewers. The discussion is still open, so please feel free to continue discussing the results and contributions in a thoughtful and respectful way. Between November 26th and December 3rd, authors can reply to the messages.
Best,
AC
Clarifications
- There were a few impolite messages from the reviewers. Please be sure to be polite and constructive.
- ICLR 2025 has extended the discussion period to encourage author-reviewer discussion. So it is OK if the authors make a late initial response or reply a couple of days ahead of the deadline rather than at the last minute. Currently, the authors have replied to all the questions from the reviewers; it is recommended that the reviewers decide whether the authors' replies have resolved their concerns. A slightly late reply from the authors should not be blamed, since we still have a few days of discussion.
- According to the current information, I would like to assume that Reviewer uAEX made a "mistyped score" and left this mistake behind. If anyone has further information, please send me a private message about this "mistyped score".
- I have noticed a template misuse in the initial submission: the ICLR 2024 template was used. I will get back to you after I figure out whether this mistake affects the review process.
Highlights on Next-step Discussion
I would like to highlight a few points to focus the discussion here.
Innovation and Uniqueness of the Contribution
Some reviewers pointed out that this paper "simply combines" the ideas from PatchTST and Crossformer.
- May I ask Reviewers 3TG2 and URks to read the following authors' response and see if it resolves your concerns?
W1: The authors' model design lacks innovation. While flattening patches makes sense, the Transformer architecture used does not appear to offer any significant novelty, and the attention-with-dispatchers setup is nearly identical to those of ETC [1] and Crossformer [2].
We would like to point out that we provided a discussion of our model versus Crossformer in Sections 2 and 4.2. Although the attention-with-dispatchers design may look similar, we use it for a different purpose: our method uses dispatchers to reduce the memory complexity so that attention can simultaneously capture inter- and intra-series dependencies on the unified sequence. Crossformer uses a two-stage method that sequentially captures intra-series dependencies and then inter-series dependencies, which is relatively limited, as we discussed in Section 3.
- May I ask Reviewers wpST and uAEX to check Reviewers 3TG2 and URks's comments: do their comments affect your opinion?
- I have a further question here: could the authors provide some insight into why UniTST can be trained faster and consumes less memory than iTransformer?
Theoretical analysis
One reviewer asked about the theoretical analysis/insights in this thread: https://openreview.net/forum?id=cuFnNExmdq&noteId=Yma5V5YDuK. May I ask the authors to give a response here?
I have contacted the program committee about the template misuse issue. We agree that the template deviation of this submission is OK, since 1) there are only tiny differences from the official template and 2) this submission still has half a page left over, so the deviation does not give the authors more space.
Best,
AC
This paper proposes UniTST for multivariate time series forecasting. While the authors have made efforts in response to reviewers, significant issues remain. The model design lacks substantial innovation, with similarities to existing methods such as PatchTST and Crossformer. The motivation and theoretical underpinnings are not convincingly presented, and the empirical evidence provided does not adequately support the claimed contributions. Finally, the reviewers did not reach a consensus, which suggests a reject.
Additional Comments on Reviewer Discussion
During the rebuttal period, several key points were raised by reviewers and addressed by the authors. One reviewer (anon id: 3TG2) pointed out the lack of novelty in the model design, citing similarities to existing methods like PatchTST and Crossformer. The authors responded by highlighting the differences in how their method uses dispatchers to capture dependencies compared to Crossformer. They also noted that their main contribution was identifying the importance of simultaneously capturing inter- and intra-series dependencies with empirical evidence, which they claimed was lacking in previous works. Another reviewer (anon id: URks) questioned the innovation of the Dispatcher, to which the authors explained that while the design is similar, their overall architecture and the way of capturing cross-time and cross-variate dependencies differed from Crossformer. Regarding the concern about the lack of theoretical analysis, the authors provided some empirical evidence related to patching but admitted that a full theoretical proof was left for future work.
In my final decision, the lack of clear novelty in the model design was a significant factor. Although the authors tried to distinguish their work, the similarities to existing methods remained a concern. The insufficient theoretical foundation also weighed heavily, as it made it difficult to assess the true value and uniqueness of the proposed approach. The empirical evidence provided was not strong enough to overcome these drawbacks. Additionally, the overall contribution in the context of the existing literature did not seem substantial enough to merit acceptance, leading to a reject decision.
Reject