PaperHub

Rating: 6.3/10 · Poster · 4 reviewers (ratings 7, 4, 7, 7; min 4, max 7, std 1.3)
Confidence: 3.8 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.5

NeurIPS 2024

Ada-MSHyper: Adaptive Multi-Scale Hypergraph Transformer for Time Series Forecasting

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We propose an adaptive multi-scale hypergraph transformer for time series forecasting.

Abstract

Keywords
Time series forecasting, transformer, multi-scale modeling, hypergraph neural network, hypergraph learning

Reviews and Discussion

Official Review
Rating: 7

This paper presents an adaptive hypergraph learning module for modeling group-wise multi-scale interactions to improve transformer-based models for time series data. Given a time series, the Multi-Scale Feature Extraction (MFE) module first converts it to a hypergraph. Then, intra-scale and inter-scale learning modules are applied. Comprehensive experiments demonstrate the effectiveness of the proposed method.

Strengths

  1. The motivation behind Ada-MSHyper is well-founded and insightful.
  2. The design of the proposed method effectively addresses the target challenge.
  3. Extensive experiments were carried out, demonstrating the clear superiority of Ada-MSHyper.
  4. The writing of the paper is clear and easy to follow.

Overall, this is a solid paper with excellent presentation.

Weaknesses

This paper is generally well-written. My only suggestion is to provide more insights into the performance differences of Ada-MSHyper across different datasets. Consider analyzing the reasons behind its performance on long-range, short-range, and ultra-long-range datasets. Does Ada-MSHyper perform better on one type compared to the others? What might be the reasons for this? For example, the node constraints on different datasets could be quite different. Node clustering might behave poorly on datasets with weaker temporal patterns. Answering these questions would help clarify the effectiveness and limitations of Ada-MSHyper.

Questions

  1. Considering that one of the challenges addressed by this paper is "temporal variations entanglement," how do the authors compare their method with frequency domain analysis techniques? For instance, have they considered converting the time-series data into a spectrogram using Short-Time Fourier Transform? Can Ada-MSHyper easily adapt to the 2D spectrogram data in time-frequency domain?

Limitations

More insightful analysis could be added to the experiments. See weaknesses and questions.

Author Response

Comment:

Many thanks to Reviewer VVbq for providing the insightful reviews and comments.

Q1: In long-range, short-range, and ultra-long-range time series forecasting, does Ada-MSHyper perform better on one type compared to the others? What might be the reasons for this?

Thanks for your valuable feedback and scientific rigor. Compared to long-range and ultra-long-range forecasting, Ada-MSHyper performs better on short-range time series forecasting (reducing prediction errors by an average of 10.38%). The reason may be that the PEMS datasets used in short-range forecasting are influenced by human activities and exhibit obvious multi-scale pattern information. This also explains why Ada-MSHyper performs better on the Electricity and Traffic datasets in long-range forecasting (achieving the best performance on all forecasting horizons).

As observed by the reviewer, the cause of the above phenomenon lies in the design of the node constraint, which is used to cluster nodes with similar semantic information. To investigate the impact of the node constraint, we have newly added ablation studies on the Electricity dataset by carefully designing the following variant:

  • -w/o NC: It removes the node constraint.

The experimental results are shown in Table 1, which has also been included in the revised paper.

Table 1. Ablation of the node constraint on the Electricity dataset.

| Horizon | -w/o NC (MSE / MAE) | Ada-MSHyper (MSE / MAE) |
|---|---|---|
| 96 | 0.169 / 0.245 | 0.135 / 0.238 |
| 336 | 0.184 / 0.275 | 0.168 / 0.266 |
| 720 | 0.237 / 0.403 | 0.212 / 0.293 |

From Table 1 and the experimental results of -w/o NC in Section 5.3 of the original paper, we have the following observations: (1) The performance drop of -w/o NC is smaller compared to other variants (e.g., -w/o NHC) on the ETTh1 dataset, possibly because the temporal patterns on ETTh1 are weaker, so Ada-MSHyper may focus more on macroscopic variation interactions rather than detailed group-wise interactions between nodes with similar semantic information. (2) Compared to its performance on the ETTh1 dataset, the performance of -w/o NC drops significantly on the Electricity dataset. The reason may be that the Electricity dataset has obvious multi-scale pattern information, making the NC mechanism, which clusters nodes with similar semantic information, more important. (3) Ada-MSHyper still performs better than -w/o NC in almost all cases, showing the effectiveness of the node constraint.

Q2: How do the authors compare their method with frequency domain analysis techniques and can Ada-MSHyper easily adapt to the 2D spectrogram data in time-frequency domain?

To solve the problem of temporal variations entanglement, some frequency domain analysis methods (e.g., TimesNet [1], FiLM [2], FEDformer [3], and Autoformer [4]) adopt simplistic series decomposition with frequency domain analysis techniques to differentiate temporal variations at different scales. We have compared Ada-MSHyper with these methods and added the analysis in Section 5.2 of the paper.

In addition, Ada-MSHyper may not be directly applicable to 2D data. To investigate the performance of Ada-MSHyper in the frequency domain, we flatten the 2D spectrogram data into 1D features and conduct ablation studies by carefully designing the following variant:

  • -w/ STFT: It converts the multi-scale subsequences into 2D spectrogram data using the Short-Time Fourier Transform (STFT) and then flattens them into 1D feature representations before sending them to the AHL module.
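
For concreteness, here is a rough sketch of what this variant's preprocessing could look like (our illustration, not the authors' code; `stft_flatten`, the window parameters, and the use of magnitude spectra are all assumptions):

```python
import torch

def stft_flatten(subseq: torch.Tensor, n_fft: int = 16, hop: int = 4) -> torch.Tensor:
    """Convert 1D subsequences to 2D spectrograms via STFT, then flatten
    back to 1D features, as in the -w/ STFT variant described above.

    subseq: (batch, length) real-valued tensor.
    Returns: (batch, (n_fft // 2 + 1) * n_frames) flattened magnitudes.
    """
    spec = torch.stft(
        subseq, n_fft=n_fft, hop_length=hop,
        window=torch.hann_window(n_fft), return_complex=True,
    )  # (batch, n_fft // 2 + 1, n_frames), complex-valued
    # Flattening mixes the frequency and time axes into one feature axis,
    # which is exactly the information loss discussed below.
    return spec.abs().flatten(start_dim=1)

# e.g., a batch of 32 subsequences of length 96
feats = stft_flatten(torch.randn(32, 96))
```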

The experimental results on the ETTh1 dataset are shown in Table 2, which has also been included in the revised paper. From Table 2 we can observe that:

-w/ STFT performs worse than Ada-MSHyper. The reason may be that flattening 2D spectrogram data into 1D features mixes frequency domain features with time domain features, thereby impairing the forecasting performance of the model. However, the valuable feedback from the reviewer inspired us to consider that time series features need not be limited to "scalars" or "vectors". "Tensors" (e.g., 2D spectrogram data) may offer a better representation for time series forecasting, and modifying the model to support 2D spectrogram data instead of flattening it into 1D features may yield better performance. Considering the scope of our paper, we would like to leave this exploration for future work.

Table 2. The -w/ STFT variant vs. Ada-MSHyper on the ETTh1 dataset.

| Horizon | -w/ STFT (MSE / MAE) | Ada-MSHyper (MSE / MAE) |
|---|---|---|
| 96 | 0.390 / 0.399 | 0.372 / 0.393 |
| 336 | 0.478 / 0.443 | 0.422 / 0.433 |
| 720 | 0.479 / 0.466 | 0.445 / 0.459 |

[1] Wu H, Hu T, Liu Y, et al. TimesNet: Temporal 2D-variation modeling for general time series analysis. ICLR, 2023.

[2] Zhou T, Ma Z, Wen Q, et al. FiLM: Frequency improved legendre memory model for long-term time series forecasting. NeurIPS, 2022.

[3] Zhou T, Ma Z, Wen Q, et al. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. ICLR, 2022.

[4] Wu H, Xu J, Wang J, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. NeurIPS, 2021.

Comment

Thank you for your feedback. That addresses my concern. I would like to raise my rating.

Comment

We would like to thank Reviewer VVbq for providing a detailed and valuable review, which has greatly assisted us in the paper revision.

Thanks again for your dedication in reviewing our paper. It helps us a lot.

Official Review
Rating: 4

(1) Design an AHL module to model the abundant and implicit group-wise node interactions and a multi-scale interaction module to model group-wise pattern interactions at different scales. (2) Introduce an NHC mechanism to cluster nodes with similar semantic information and differentiate the temporal variations within each scale.

Strengths

(1) The first work that incorporates adaptive hypergraph modeling into time series forecasting; (2) the design of the AHL module to address semantic information sparsity and the NHC mechanism to address temporal variations entanglement is interesting; (3) achieves state-of-the-art (SOTA) performance; (4) extensive experiments, ablation studies, and visualizations demonstrate the effectiveness of the proposed methodology.

Weaknesses

(1) The pipeline is somewhat complicated, and its color scheme can be improved; (2) some expressions in the paper could be more formal, e.g., "entries h_nm" could be replaced by "entries H_nm".

Questions

(1) In your hyperedge constraint, you use both cosine similarity α and Euclidean distance D. Is your Euclidean distance normalized? Cosine similarity and normalized Euclidean distance seem similar, so would using αD as your hyperedge loss provide too much redundant information? Can you do some ablation studies? (2) I am somewhat confused about your Multi-Scale Interaction Module: why does the intra-scale interaction part use an HGNN for message passing while the inter-scale interaction part uses a transformer for updating features? Is there a reason for this? Since both interaction parts use attention, could they just use the same method for updating features? (3) Note that you use attention mechanisms extensively to model group-wise pattern interactions, together with cosine similarity and Euclidean distance, both of which are well known. I wonder whether there are other distances for modeling the pattern interactions that could be tried, such as the Wasserstein distance.

Limitations

The authors addressed the limitations in Appendix I: their datasets are not very large, so the generalization capabilities of their models may be limited. The paper mainly focuses on scientific research and has no obvious negative societal impact.

Author Response

Comment:

Many thanks to Reviewer HJMp for providing the insightful reviews and comments.

Q1: The pipeline can be improved and some expressions in the paper can be more formal.

Thanks for your valuable suggestions and scientific rigor. We have improved the color of the pipeline; see Figure 1 in the global response. In addition, we have performed thorough proofreading and used more formal expressions.

Q2: Is your Euclidean distance normalized? Do some ablation studies to verify whether using αD as the hyperedge loss will introduce redundant information.

We do not normalize the Euclidean distance D. The cosine similarity α and normalized Euclidean distance are indeed similar, as both compare the direction of two vectors in feature space, reducing the impact of feature scale differences. However, relying solely on cosine similarity neglects differences in hyperedge representations regarding relative scale and distance (i.e., the magnitude of two vectors). For instance, one hyperedge may connect group-wise nodes with larger values indicating a "peak variation", while another hyperedge connects group-wise nodes with smaller values indicating a "trough variation". After normalization, the differences between such hyperedge representations would become less noticeable. To address this, we add the Euclidean distance and use αD as the hyperedge loss.
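
Read literally, the combined term could be implemented as below (a minimal sketch under our own reading of αD as a pairwise penalty over hyperedge embeddings; `hyperedge_loss` and the mean reduction are assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def hyperedge_loss(he: torch.Tensor) -> torch.Tensor:
    """Pairwise hyperedge loss combining cosine similarity (alpha) with
    the unnormalized Euclidean distance (D), i.e. alpha * D.

    he: (M, d) hyperedge embeddings.
    """
    # alpha: direction agreement between every pair of hyperedges, (M, M)
    alpha = F.cosine_similarity(he.unsqueeze(1), he.unsqueeze(0), dim=-1)
    # D: unnormalized Euclidean distances, so magnitude differences
    # (e.g., "peak" vs. "trough" variations) still matter, (M, M)
    dist = torch.cdist(he, he)
    return (alpha * dist).mean()
```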

To investigate the effectiveness of the hyperedge loss, we conduct ablation studies by carefully designing the following three variants:

  • -w/o α: It removes the cosine similarity and uses the Euclidean distance as the hyperedge loss.

  • -w/o D: It removes the Euclidean distance and uses the cosine similarity as the hyperedge loss.

  • -Wass: It uses Wasserstein distance as the hyperedge loss.

The experimental results on the ETTh1 dataset are shown in Table 1, which has also been included in the revised paper. From Table 1 we can observe that: (1) -w/o α and -w/o D perform worse than Ada-MSHyper, showing the effectiveness of the Euclidean distance and cosine similarity in the hyperedge loss, respectively. (2) Ada-MSHyper performs better than -Wass, which demonstrates the effectiveness of the hyperedge loss in the adaptive hypergraph learning module.

Table 1. Ablation of the hyperedge loss on the ETTh1 dataset.

| Horizon | -w/o α (MSE / MAE) | -w/o D (MSE / MAE) | -Wass (MSE / MAE) | Ada-MSHyper (MSE / MAE) |
|---|---|---|---|---|
| 96 | 0.400 / 0.415 | 0.406 / 0.414 | 0.405 / 0.421 | 0.372 / 0.393 |
| 336 | 0.494 / 0.460 | 0.457 / 0.440 | 0.482 / 0.447 | 0.422 / 0.433 |
| 720 | 0.525 / 0.495 | 0.536 / 0.502 | 0.492 / 0.479 | 0.445 / 0.459 |

Q3: Explain the reasons for the different designs of the intra-scale and inter-scale modules. Can they just use the same method for updating features?

The intra-scale interaction module is used to capture detailed group-wise interactions between nodes with similar semantic information, while the inter-scale interaction module focuses on capturing the interactions of macroscopic variations through hyperedges. If both modules used hypergraph convolution attention for updating features, there would be two limitations for the inter-scale interaction module: (1) There is no hypergraph structure describing the relationships between hyperedges. (2) Hyperedges already represent group-wise interactions by connecting multiple nodes; modeling group-wise hyperedge interactions through a hypergraph structure may introduce redundant information and cause overfitting.

If we want to use the same method for updating features, one direct way is to use attention for both modules. However, this would deprive the intra-scale interaction module of the ability to capture group-wise interactions between nodes with similar semantic information, and pair-wise attention would incur O(N²) computation cost.
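
To make the contrast concrete, here is the standard (attention-free) hypergraph convolution that the intra-scale update builds on; the paper's module adds attention weights on top, and all names here are ours:

```python
import torch

def hypergraph_conv(X: torch.Tensor, H: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Simplified hypergraph convolution in the spirit of HGNN:
    X' = Dv^{-1} H De^{-1} H^T X Theta.

    Because Dv and De are diagonal and H is sparse, the cost is O(MN),
    versus the O(N^2) of pair-wise attention over nodes.

    X: (N, d) node features; H: (N, M) incidence matrix; theta: (d, d_out).
    """
    De = H.sum(dim=0).clamp(min=1.0)                 # hyperedge degrees, (M,)
    Dv = H.sum(dim=1).clamp(min=1.0)                 # node degrees, (N,)
    edge_feat = (H / De).T @ X                       # nodes -> hyperedges, (M, d)
    node_feat = (H @ edge_feat) / Dv.unsqueeze(-1)   # hyperedges -> nodes, (N, d)
    return node_feat @ theta
```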

To investigate the feasibility of using the same method for updating features, we conduct ablation studies by carefully designing the following variant:

  • -r/ att: It replaces the hypergraph convolution attention with the attention mechanism used in the inter-scale interaction module to update node features.

The experimental results on the ETTh1 dataset are shown in Table 2, which have also been included in the revised paper. From Table 2 we can observe that Ada-MSHyper performs better than -r/ att, which demonstrates the effectiveness of the hypergraph convolution attention used in the intra-scale interaction module.

Table 2. The -r/ att variant vs. Ada-MSHyper on the ETTh1 dataset.

| Horizon | -r/ att (MSE / MAE) | Ada-MSHyper (MSE / MAE) |
|---|---|---|
| 96 | 0.418 / 0.419 | 0.372 / 0.393 |
| 336 | 0.483 / 0.454 | 0.422 / 0.433 |
| 720 | 0.514 / 0.507 | 0.445 / 0.459 |

Q4: Are there other distances for modeling the pattern interactions we can try, like the Wasserstein distance?

Other distances can also be introduced as constraints for modeling pattern interactions. As per your suggestion, we normalized the hyperedge features and used the Wasserstein distance as the hyperedge loss; see the response to Q2 for the detailed analysis.
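
For reference, a minimal sketch of the -Wass variant's distance (assuming the 1D Wasserstein-1 distance between sorted feature values; the rebuttal does not specify the exact formulation):

```python
import torch

def wasserstein_1d(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Wasserstein-1 distance between two equal-length feature vectors,
    treated as empirical distributions: sort both, then average the
    absolute differences.
    """
    u_sorted, _ = torch.sort(u)
    v_sorted, _ = torch.sort(v)
    return (u_sorted - v_sorted).abs().mean()
```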

Official Review
Rating: 7

This paper introduces a hypergraph-based multi-scale time series forecasting model. By treating multi-scale feature representations as nodes, the proposed AHL module automatically generates incidence matrices to model implicit group-wise node interactions at different scales. Node constraints and hyperedge constraints are introduced to effectively aggregate semantically similar nodes and distinguish temporal variations at different scales. The experimental results confirm the predictive ability of the model, which maintains low MSE and MAE even on ultra-long-range series.

Strengths

  1. The authors propose the AHL module, which is applied to a hypergraph constructed from time series, enabling the discovery of pattern interactions at various scales.

  2. The experimental results demonstrate that the model has achieved state-of-the-art (SOTA) performance in short, long, and ultra-long sequence prediction.

  3. The paper exhibits a clear structure and substantial content. It provides a detailed introduction to the structure and function of each module in the model.

Weaknesses

  1. In the ablation experiment, only the ETTh1 dataset was utilized, with experiments at three prediction lengths: {96, 336, 720}. This setup lacks support for evaluating ultra-long-range performance. For instance, in the case of -w/o NC (without the node constraint), the performance difference between the proposed model and this variant becomes only marginally noticeable as the sequence length increases. Since additional datasets and ultra-long-range experiments are lacking, it is reasonable to interpret this result as normal experimental error rather than a significant performance difference.

  2. The experiments conducted to assess model efficiency are too simplistic to effectively validate the authors' viewpoint.

Questions

  1. The paper claims that most of the experimental results for long-range prediction come from the DLinear paper. However, there are no experimental results for the Crossformer model in the DLinear paper. Moreover, the results presented in this paper differ significantly from those in the iTransformer paper for certain datasets. It would be helpful to have an explanation for these differences.

  2. In Section 4.1, the paper introduces multi-scale sequence construction, while Section 4.2 discusses the AHL module. One aspect that raises curiosity is the process of mapping from the sequence to the hypergraph. Specifically, is each time point with multiple features in the time series mapped to a node in the hypergraph?

  3. Regarding model efficiency, the paper provides a comparison of model parameters and training time on a specific dataset. However, a single experiment result may not be sufficient to establish convincing evidence. It would be beneficial to explain the model's efficiency from the perspective of theoretical complexity or provide information on model parameters and training time for longer input sequences and on additional datasets.

  4. In the ablation experiment section, it would be beneficial to include experiments with longer prediction lengths to further evaluate the model's performance.

Limitations

The authors adequately addressed the limitations.

Author Response

Comment:

Many thanks to Reviewer Y3pA for providing the insightful reviews and comments.

Q1: About some long-range prediction results that differ from those in the DLinear and iTransformer papers.

Thanks for your careful check and scientific rigor. For long-range time series forecasting, we have two kinds of settings, i.e., multivariate settings and univariate settings. Because some methods (e.g., iTransformer and Crossformer) do not report results under univariate settings, we ran their official code and fine-tuned their key hyperparameters. We have added an explanation about the Crossformer results in the revised paper. As for the multivariate settings, some newly added baselines were rerun by us and the other results are from iTransformer. We have carefully rechecked the experimental results and addressed the aforementioned issues in the revised paper. See Table 1 in the global response.

Q2: Is each time point with multiple features in the time series mapped to a node in the hypergraph?

Yes. To be precise, Ada-MSHyper maps the input sequence into subsequences at different scales, and the features of these subsequences are then treated as nodes in the hypergraph.
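
A minimal sketch of such multi-scale construction (assuming average-pooling aggregation; the paper's MFE module may use a learned aggregation instead, and all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def build_scales(x: torch.Tensor, window: int = 4, num_scales: int = 3):
    """Repeatedly aggregate the series with a window of size `window`;
    each aggregated subsequence becomes one node at the next coarser scale.

    x: (batch, length) input series.
    Returns a list of (batch, length_s) tensors, finest scale first.
    """
    scales = [x]
    for _ in range(num_scales - 1):
        x = F.avg_pool1d(x.unsqueeze(1), kernel_size=window).squeeze(1)
        scales.append(x)
    return scales
```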

Q3: It would be beneficial to explain the model's efficiency from the perspective of theoretical complexity or provide results on longer input sequences and additional datasets.

To provide a more comprehensive evaluation of the model's efficiency, we have included a theoretical complexity analysis and added computation cost results for longer input sequences and additional datasets. The results are shown as follows and have also been included in the revised paper.

Theoretical complexity analysis: For the MFE module, the time complexity is O(Nl), where N is the number of nodes at the finest scale (N is equal to the input length T) and l is the aggregation window size at the finest scale. For the AHL module, the time complexity is O(MN + M²), where M is the number of hyperedges at the finest scale. For the intra-scale interaction module, since D_v and D_e are diagonal matrices, the time complexity is O(MN). For the inter-scale interaction module, the time complexity is O(M²). In practice, M and l are hyperparameters and are much smaller than N. As a result, the total time complexity is bounded by O(N).
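
Collecting these terms (notation as in the paragraph above):

$$
\underbrace{\mathcal{O}(Nl)}_{\text{MFE}}
+ \underbrace{\mathcal{O}(MN + M^2)}_{\text{AHL}}
+ \underbrace{\mathcal{O}(MN)}_{\text{intra-scale}}
+ \underbrace{\mathcal{O}(M^2)}_{\text{inter-scale}}
= \mathcal{O}(N)
\quad \text{for fixed } M,\, l \ll N.
$$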

Computation cost results: We have newly added computation cost results on the Traffic and Weather datasets with the 720-96 input-output setting. The experimental results are shown in Table 1 and Table 2, from which we can observe that:

  • Although Ada-MSHyper has a large number of parameters, it achieves lower training time and lower GPU occupation thanks to the matrix sparsity strategy in the model and the optimized hypergraph computation provided by torch_geometric (PyTorch Geometric).

  • Ada-MSHyper maintains better performance even with longer input lengths. Considering both forecasting performance and computation cost, Ada-MSHyper demonstrates its superiority over existing methods.

Table 1. Computation cost on the Weather dataset (720-96).

| Methods | Input-Output | Dataset | Training Time/epoch | Parameters | GPU Occupation | MSE |
|---|---|---|---|---|---|---|
| Ada-MSHyper | 720-96 | Weather | 2.273s | 5,684,229 | 1,249MB | 0.149 |
| iTransformer | 720-96 | Weather | 2.482s | 5,153,376 | 1,538MB | 0.180 |
| PatchTST | 720-96 | Weather | 11.546s | 1,517,152 | 14,100MB | 0.152 |

Table 2. Computation cost on the Traffic dataset (720-96).

| Methods | Input-Output | Dataset | Training Time/epoch | Parameters | GPU Occupation | MSE |
|---|---|---|---|---|---|---|
| Ada-MSHyper | 720-96 | Traffic | 10.093s | 19,575,100 | 6,154MB | 0.342 |
| iTransformer | 720-96 | Traffic | 12.352s | 18,490,786 | 7,113MB | 0.348 |
| PatchTST | 720-96 | Traffic | | | | |

Q4: In the ablation experiment section, include longer prediction lengths to further evaluate the model's performance.

We have added additional ablation studies on the ETTh1 dataset to verify the performance of Ada-MSHyper at longer prediction lengths. The results are shown in Table 3, which has also been included in the revised paper. From Table 3 we make the following additional observations:

  • For longer prediction lengths, -w/o NC shows smaller performance degradation than the other variants. The reason may be that as the prediction length increases, the model tends to focus more on macroscopic variation interactions and places less emphasis on the fine-grained node constraint.
  • Ada-MSHyper performs better than -w/o NC and -w/o HC even at longer prediction lengths, showing the effectiveness of the node constraint and hyperedge constraint, respectively.

Table 3. Ablation results on ETTh1 with longer prediction lengths (MSE / MAE per cell).

| Horizon | AGL | one | PH | -w/o NC | -w/o HC | -w/o NHC | Ada-MSHyper |
|---|---|---|---|---|---|---|---|
| 1080 | -- / -- | 0.685 / 0.679 | 0.640 / 0.591 | 0.539 / 0.515 | 0.574 / 0.516 | 0.597 / 0.525 | 0.534 / 0.509 |
| 1440 | -- / -- | 0.855 / 0.857 | 0.783 / 0.673 | 0.621 / 0.503 | 0.679 / 0.568 | 0.734 / 0.585 | 0.616 / 0.498 |
Official Review
Rating: 7

This paper presents a time series forecasting method, Ada-MSHyper, that uses a hypergraph to capture group-wise interactions at different time scales rather than point-wise interactions. Experiments are performed on eight datasets and the proposed method is compared with SOTA methods.

Strengths

  1. The use of hypergraphs
  2. Differentiation of variations at each time scale
  3. Comparisons with SOTA methods
  4. Ablation study
  5. Comparisons of computational cost with 3 SOTA methods

Weaknesses

  1. Related work does not include graph-transformer methods, e.g., STGNN.
  2. Graph-transformer methods, e.g., STGNN, are not used as SOTA baselines for comparison.
  3. The datasets should include financial data, e.g., stock market data.
  4. In the results of "Ultra-Long-Range Forecasting", the reason for the comparable accuracies of WITRAN on ETTm2 is not explained.
  5. The ablation study should include the effect of η, the threshold of the TopK function.

Questions

  1. Why do the SOTA baselines not include graph-transformer methods, e.g., STGNN?
  2. Will the process of "reducing subsequent computational costs and noise interference" introduce any loss of useful information?
  3. Have you studied and quantified the computational cost reduction achieved by the above approach?
  4. Is the linear layer for forecasting a linear regression?
  5. Will the proposed method work on stock market data?
  6. Have you analyzed the datasets to check and quantify the presence of "multi-scale pattern interactions"?

Limitations

Not applicable.

Author Response

Comment:

Many thanks to Reviewer HQ9E for providing the insightful reviews and comments.

Q1: Graph-transformer methods should be included in the related work and used for comparison.

Thanks for your valuable suggestions and scientific rigor. We have added two recent graph-transformer methods, i.e., MSGNet (AAAI 2024) and CrossGNN (NeurIPS 2023), for comparison. The descriptions of these methods and their long-range time series forecasting results are shown as follows and have also been included in the revised paper.

  • MSGNet: MSGNet leverages frequency domain analysis to extract periodic patterns and combines an attention mechanism with adaptive graph convolution to capture multi-scale pattern interactions.
  • CrossGNN: CrossGNN uses an adaptive multi-scale identifier to construct multi-scale representations and utilizes a cross-scale GNN to capture multi-scale pattern interactions.

The long-range time series forecasting results under multivariate settings are shown in Table 1 of the global response, and the following tendencies can be discerned:

  • MSGNet and CrossGNN are state-of-the-art graph learning methods that use graph learning modules to capture multi-scale pattern interactions. However, they can only capture pair-wise interactions instead of group-wise interactions, and they achieve worse performance than Ada-MSHyper in most cases.

Q2: Will the sparsity strategy introduce any loss of useful information?

The sparsity strategy is employed to reduce subsequent computation costs and noise interference. However, the effectiveness of the sparsity strategy is influenced by the hyperparameter η. When η is set to a smaller value, some useful information may be filtered out. We have performed parameter studies to measure the impact of η. The results are shown in Table 2, which has also been included in the revised paper (a sketch of this TopK step follows the table). From Table 2 we have the following observation:

  • The best performance is obtained when η = 3. The reason may be that a small η may filter out useful information, while a large η would introduce noise interference.

Table 2. Parameter study of η (MSE / MAE per cell).

| Horizon | η=1 | η=2 | η=3 | η=4 | η=5 |
|---|---|---|---|---|---|
| 96 | 0.407 / 0.415 | 0.390 / 0.397 | 0.372 / 0.393 | 0.387 / 0.396 | 0.419 / 0.418 |
| 336 | 0.547 / 0.500 | 0.476 / 0.443 | 0.422 / 0.433 | 0.438 / 0.435 | 0.560 / 0.510 |
| 720 | 0.450 / 0.463 | 0.476 / 0.465 | 0.445 / 0.459 | 0.460 / 0.459 | 0.473 / 0.474 |
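
A minimal sketch of the TopK sparsification referenced above (our illustration; the exact placement of TopK inside the AHL module and the interpretation of η as the number of kept connections per node are assumptions):

```python
import torch

def sparsify_incidence(scores: torch.Tensor, eta: int) -> torch.Tensor:
    """Keep each node's eta highest-scoring hyperedge connections and zero
    out the rest, reducing downstream computation and noise interference.

    scores: (N, M) dense node-to-hyperedge scores.
    """
    topk_vals, topk_idx = scores.topk(eta, dim=1)   # (N, eta)
    H = torch.zeros_like(scores)
    H.scatter_(1, topk_idx, topk_vals)              # sparsified incidence matrix
    return H
```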

Q3: Study and quantify the computational cost reduction achieved by the above approach.

To investigate the effectiveness of the sparsity strategy, we conduct ablation studies by carefully designing the following variant:

  • -w/o SO: It removes the sparsity strategy in the AHL module.

We compare the computation cost on the Electricity dataset under the 96-96 input-output setting. The experimental results are shown in Table 3, which has also been included in the revised paper.

Table 3. Computation cost of the sparsity strategy on the Electricity dataset (96-96).

| Methods | Training Time/epoch | Parameters | GPU Occupation | MSE |
|---|---|---|---|---|
| -w/o SO | 9.525s | 14,519,292 | 8,454MB | 0.392 |
| Ada-MSHyper | 6.499s | 8,965,392 | 6,542MB | 0.384 |

From Table 3 we can observe that Ada-MSHyper achieves better performance with faster training and lower GPU occupation compared to -w/o SO, which demonstrates the effectiveness of the sparsity strategy.

Q4: Is the linear layer for forecasting a linear regression?

Yes, the linear layer used for forecasting can be considered a form of linear regression, as it maps the updated multi-scale features to the final predictions through a linear relationship.
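
A one-line sketch of such a head (the feature dimension and horizon below are illustrative assumptions, not the paper's actual values):

```python
import torch.nn as nn

# Affine map from the updated multi-scale features (flattened) to the
# prediction horizon: exactly a linear-regression-style layer.
head = nn.Linear(in_features=512, out_features=96)  # d_feat=512, horizon=96 assumed
```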

Q5: Datasets should include financial data, e.g., stock market data.

We have added the Nasdaq 100 Stock dataset for comparison. The detailed description of this public dataset is as follows:

  • Nasdaq: This dataset includes the stock prices of 82 major corporations, which are sampled 390 times every day from July 2016 to December 2016.

The full long-range time series forecasting results on the Nasdaq dataset will be included in the revised paper. Due to time and space limitations, we list the comparison results between Ada-MSHyper and three recent baselines. The experimental results are shown as follows:

Table 4. Long-range forecasting results on the Nasdaq dataset (MSE / MAE per cell).

| Horizon | Ada-MSHyper | iTransformer | MSHyper | TimeMixer |
|---|---|---|---|---|
| 96 | 0.027 / 0.090 | 0.057 / 0.141 | 0.034 / 0.102 | 0.027 / 0.094 |
| 192 | 0.054 / 0.131 | 0.095 / 0.183 | 0.076 / 0.153 | 0.059 / 0.137 |
| 336 | 0.091 / 0.179 | 0.153 / 0.237 | 0.111 / 0.199 | 0.092 / 0.182 |
| 720 | 0.182 / 0.268 | 0.315 / 0.349 | 0.257 / 0.310 | 0.184 / 0.270 |

From Table 4 we can observe that Ada-MSHyper achieves the best performance in almost all cases. The experimental results demonstrate the effectiveness of Ada-MSHyper on the stock market dataset.

Q6: Analyze the datasets to check and quantify the presence of "multi-scale pattern interactions".

We have added weight visualization results (see Figure 2 in the global response); the detailed analysis has been included in the revised paper.

Q7: The reason for the comparable accuracies of WITRAN on the ETTm2 dataset.

WITRAN employs a Recurrent Acceleration Network within its framework. For the ETTm2 dataset (high forecastability), it can make effective predictions with fewer parameters and a shallower network structure. However, the recurrent structure may face underfitting risks when handling more challenging datasets (low forecastability), e.g., ETTh1 and ETTh2. This is why WITRAN achieves comparable accuracy on ETTm2 but performs worse on other datasets (e.g., ETTh1 and ETTh2).

Comment

Thanks for accepting my suggestions and answering my questions; I am updating my score.

Comment

We would like to thank Reviewer HQ9E for providing a valuable and constructive review, which has inspired us to improve our paper substantially.

Thanks again for your response and raising the score!

Author Response

We sincerely thank all the reviewers for their insightful reviews and valuable comments, which are instructive for us to improve our paper further.

The reviewers generally hold positive opinions of our paper, perceiving our approach as interesting, detailed, and clear. The reviewers also acknowledge that ours is the first work to incorporate adaptive hypergraph modeling into time series forecasting and that the motivation is well-founded and insightful. In addition, the reviewers deem that our paper exhibits a clear structure and substantial content and that the experiments are extensive, solid, and effective.

The reviewers also raise insightful and constructive concerns. We made every effort to address all the concerns by providing sufficient evidence and requested results. Here is the summary of the major revisions:

Add more ablation studies and model analysis (Reviewers Y3pA, HQ9E, HJMp, and VVbq): Following the suggestions of the reviewers, we have added more than 10 ablation studies to investigate the effectiveness of the node constraint, the sparsity strategy, the cosine similarity and Euclidean distance in the hyperedge loss, and the hypergraph convolution attention. We have added parameter studies to measure the impact of η. In addition, we have added more computation cost analysis for longer input sequences and additional datasets.

Add additional baselines and datasets (Reviewer HQ9E): Following the suggestions of the reviewer, we have added stock market data for comparison and included two graph-transformer methods as baselines. The experimental results demonstrate the effectiveness of Ada-MSHyper over existing methods.

Polish the writing (Reviewers Y3pA and HJMp): We have performed thorough proofreading and revisions following the helpful suggestions from the reviewers. We have improved the color of the pipeline, used more formal symbols, and added additional explanations about the experimental results.

Provide frequency domain analysis results (Reviewer VVbq): Following the suggestions of the reviewer, we have added a Short-Time Fourier Transform (STFT) variant of our model for comparison.

After 7 full days of experiments (with 4 RTX 3090 GPUs), we have added more than 150 new experimental results to address the mentioned issues. All the revisions have been included in the revised paper.

The valuable suggestions from reviewers are very helpful for us to revise the paper to a better shape. We would be very happy to answer any further questions.

Looking forward to the feedback of the reviewers.

Final Decision

This paper introduces a novel time series forecasting method, Ada-MSHyper, which leverages a hypergraph-based approach to model group-wise multi-scale interactions rather than traditional point-wise interactions. The proposed adaptive hypergraph learning (AHL) module, along with multi-scale interaction and node clustering mechanisms, effectively captures implicit group-wise patterns and distinguishes temporal variations at different scales. Extensive experiments on multiple datasets demonstrate the model's superior predictive performance, maintaining low MSE and MAE even on ultra-long-range time series.

Reviewers have expressed concerns regarding the presentation of the paper and the experimental setup. Despite these issues, the significance of the contributions justifies recommending acceptance. It is advised that the authors incorporate the reviewers' feedback when preparing the final version of the paper.