PaperHub

Overall rating: 5.5 / 10 · Poster · 4 reviewers (min 5, max 6, std 0.5)
Individual ratings: 5, 6, 6, 5
Confidence: 3.3 · Correctness: 2.3 · Contribution: 2.5 · Presentation: 3.3

ICLR 2025

S4M: S4 for multivariate time series forecasting with Missing values

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-03-15

Abstract

Keywords

S4 Models · Multivariate Time Series Forecasting · Missing Value · Prototype Bank

Reviews and Discussion

Review (Rating: 5)

This paper explicitly models missing patterns within the Structured State Space Sequence (S4) architecture, developing two key modules: the Adaptive Temporal Prototype Mapper and the Missing-Aware Dual Stream. The experiments demonstrate that S4M achieves state-of-the-art performance.

Strengths

  1. This paper explores the integration of State Space Models with missing patterns for long-term time series forecasting.

  2. The proposed Adaptive Temporal Prototype Mapper and Missing-Aware Dual Stream S4 modules effectively capture rich historical patterns and learn robust representations.

  3. Experimental results illustrate that the proposed model achieves state-of-the-art performance in handling missing data.

Weaknesses

  1. The authors should further explain the motivation for introducing $\bar{\mathbf{E}} E_m(m_t;\theta_m)$ to the SSM model in Equation 4.

  2. This paper lacks a discussion comparing S4M with existing methods designed for handling missing values, which diminishes the significance of the proposed model.

  3. The settings of the hyperparameters $K_1$, $K_2$, $\tau_1$, and $\tau_2$ are empirical. The authors should provide guidance on how to set these hyperparameters across datasets with varying characteristics.

Questions

  1. The proposed model struggles when the input length is shorter than the output length. In addition, it is common practice to fix the lookback length and adapt to various prediction lengths [1]. Thus, it would strengthen this paper to add experimental results with various horizons.

  2. Baselines such as Transformer and Autoformer are designed for complete data. The authors should clarify how these models can be adapted for partially observed data. Besides, the experimental analysis should include state-of-the-art baselines specifically designed for long-term forecasting tasks, such as iTransformer [1], CARD [2], and Crossformer [3].

[1] Liu Y, Hu T, Zhang H, et al. itransformer: Inverted transformers are effective for time series forecasting[C]. The eleventh international conference on learning representations. 2023.

[2] Wang X, Zhou T, Wen Q, et al. CARD: Channel aligned robust blend transformer for time series forecasting[C]. The Twelfth International Conference on Learning Representations. 2024.

[3] Zhang Y, Yan J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting[C]. The eleventh international conference on learning representations. 2023.

Comment

Q2.1 How were the baseline models adapted for partially observed data?

Thank you for your question. In our experiments, the missing values were imputed using the mean value, after which the Transformer and Autoformer models were applied. The same procedure was used for the additional experiments with iTransformer, CARD, and PatchTST.
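For concreteness, the mean-imputation preprocessing described above can be sketched as follows (a minimal NumPy sketch of the generic technique, not the authors' actual preprocessing code; the array and function names are illustrative):

```python
import numpy as np

def mean_impute(x, mask):
    """Replace missing entries (mask == 0) with the per-variable mean
    computed over the observed entries only.

    x    : (T, D) array of values; content where mask == 0 is ignored
    mask : (T, D) binary array, 1 = observed, 0 = missing
    """
    x = np.where(mask == 1, x, np.nan)          # hide missing entries
    col_mean = np.nanmean(x, axis=0)            # per-variable observed mean
    return np.where(np.isnan(x), col_mean, x)   # fill gaps with the mean

# Example: a 4-step, 2-variable series with two missing entries
x = np.array([[1.0, 10.0], [2.0, 0.0], [3.0, 30.0], [0.0, 20.0]])
m = np.array([[1, 1], [1, 0], [1, 1], [0, 1]])
imputed = mean_impute(x, m)
# Missing cells become the column means: 2.0 for var 0, 20.0 for var 1
```

The imputed series can then be fed to any forecaster that expects complete data, which is how such baselines are typically adapted.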

Q2.2 Justification for deliberately omitting methods not designed for missing data. Besides, the experimental analysis should include SOTA baselines specifically designed for long-term forecasting tasks, such as iTransformer [1], CARD [2], and Crossformer [3].

Thank you for raising this issue. In our initial submission, we did not include iTransformer and CARD because these methods are not specifically designed for time series with missing values. Comparing them directly with our method might lead to an unfair evaluation. Instead, we focused on comparisons with SOTA methods tailored for time series prediction with missing values, such as BiTGraph, as well as S4-based methods. For transformer-based baselines, we selected two classic architectures: Transformer and Autoformer.

To provide a more comprehensive evaluation and address your request, we have now included PatchTST, iTransformer, and CARD in our experiments. The results on four benchmark datasets are presented in the table below, with additional results on a real-world dataset. Furthermore, we analyze the computational cost and the performance across different horizon windows.

Among the three additional methods, PatchTST exhibits strong performance in handling missing values, particularly on the Electricity dataset, and also performs well in scenarios without missing values. However, in other settings, its results are less competitive compared to S4M. Additionally, as shown in Table X, PatchTST incurs significantly higher training and inference times than S4M.

| Dataset | Metric | S4M | PatchTST | iTransformer | CARD |
| --- | --- | --- | --- | --- | --- |
| Electricity | MAE | 0.418 | 0.420 | 0.452 | 0.440 |
| Electricity | MSE | 0.359 | 0.344 | 0.389 | 0.366 |
| ETTh1 | MSE | 0.627 | 0.583 | 0.668 | 0.780 |
| ETTh1 | MAE | 0.742 | 0.650 | 0.786 | 1.041 |
| Weather | MSE | 0.370 | 0.399 | 0.510 | 0.422 |
| Weather | MAE | 0.294 | 0.327 | 0.459 | 0.376 |
| Traffic | MSE | 0.499 | 0.530 | 0.519 | 0.554 |
| Traffic | MAE | 0.943 | 0.927 | 0.897 | 0.965 |
Comment

W3. Highlighting the existing discussion on hyperparameters $K_1$, $K_2$, $\tau_1$, $\tau_2$.

We agree on the importance of discussing hyperparameters. Therefore, in our original submission, we provided an extensive sensitivity analysis of these hyperparameters in Appendix D. Specifically, we discussed $K_1$ and $K_2$ in Appendix D.4.1. According to our results, a suggested $K_2$ is between 5 and 10, as this range is effective and performance is relatively insensitive within it. The recommended ranges for $\tau_1$ and $\tau_2$ are [0.3, 0.6] and [0.8, 1.0), respectively. Both parameters can be selected using the validation set. Most of the experiments are robust to the choice of $K_1 = 30$ or $K_1 = 50$. If a dataset exhibits a large number of clusters (over 100), which can be checked by running a preliminary experiment with a large value of $K_1$ and letting the data adaptively determine the number of clusters, this value can be adjusted accordingly.

Q1. Adding experimental results with various horizons.

Thank you for your question. In response to your comment, we have included new results with a fixed lookback window of size 192 and various prediction lengths. The results show that our model retains top performance across most horizons for the Traffic, ETTh1, and Weather datasets. Specifically, for the Electricity dataset, our proposed model ranks among the top two performers in most cases.

| Dataset | Horizon | Metric | S4(Mean) | S4(FFill) | S4(Decay) | BRITS | GRUD | Transformer | Autoformer | BiTGraph | iTransformer | PatchTST | S4M (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Electricity | 24 | MAE | 0.392 | 0.406 | 0.396 | 0.624 | 0.442 | 0.412 | 0.457 | 0.361 | 0.415 | 0.349 | 0.369 |
| Electricity | 24 | MSE | 0.312 | 0.322 | 0.303 | 0.621 | 0.372 | 0.323 | 0.390 | 0.256 | 0.326 | 0.243 | 0.277 |
| Electricity | 48 | MAE | 0.389 | 0.400 | 0.397 | 0.658 | 0.393 | 0.424 | 0.437 | 0.370 | 0.421 | 0.357 | 0.377 |
| Electricity | 48 | MSE | 0.310 | 0.311 | 0.310 | 0.669 | 0.452 | 0.344 | 0.357 | 0.272 | 0.336 | 0.254 | 0.290 |
| Electricity | 96 | MAE | 0.464 | 0.410 | 0.420 | 0.681 | 0.456 | 0.434 | 0.453 | 0.410 | 0.427 | 0.365 | 0.381 |
| Electricity | 96 | MSE | 0.409 | 0.324 | 0.336 | 0.713 | 0.394 | 0.356 | 0.396 | 0.322 | 0.344 | 0.266 | 0.305 |
| Electricity | 192 | MAE | 0.424 | 0.462 | 0.465 | 0.776 | 0.508 | 0.451 | 0.443 | 0.444 | 0.430 | 0.375 | 0.405 |
| Electricity | 192 | MSE | 0.363 | 0.414 | 0.411 | 1.047 | 0.472 | 0.378 | 0.366 | 0.365 | 0.348 | 0.278 | 0.331 |
| ETTh1 | 24 | MAE | 0.529 | 0.538 | 0.585 | 0.750 | 0.600 | 0.617 | 0.656 | 0.558 | 0.586 | 0.575 | 0.554 |
| ETTh1 | 24 | MSE | 0.532 | 0.574 | 0.681 | 0.979 | 0.708 | 0.687 | 0.740 | 0.591 | 0.621 | 0.617 | 0.585 |
| ETTh1 | 48 | MAE | 0.577 | 0.553 | 0.605 | 0.756 | 0.646 | 0.623 | 0.699 | 0.639 | 0.630 | 0.574 | 0.573 |
| ETTh1 | 48 | MSE | 0.630 | 0.588 | 0.701 | 1.040 | 0.807 | 0.680 | 0.844 | 0.831 | 0.591 | 0.623 | 0.633 |
| ETTh1 | 96 | MAE | 0.644 | 0.659 | 0.640 | 0.776 | 0.739 | 0.685 | 0.707 | 0.650 | 0.689 | 0.578 | 0.604 |
| ETTh1 | 96 | MSE | 0.739 | 0.792 | 0.782 | 1.047 | 1.004 | 0.893 | 0.856 | 0.815 | 0.781 | 0.622 | 0.672 |
| ETTh1 | 192 | MSE | 0.691 | 0.672 | 0.692 | 0.770 | 0.750 | 0.743 | 0.715 | 0.691 | 0.661 | 0.598 | 0.655 |
| ETTh1 | 192 | MAE | 0.864 | 0.833 | 0.894 | 1.040 | 1.011 | 0.975 | 0.874 | 0.935 | 0.754 | 0.662 | 0.801 |
| Weather | 24 | MAE | 0.360 | 0.312 | 0.310 | 0.494 | 0.394 | 0.449 | 1.020 | 0.308 | 0.511 | 0.314 | 0.304 |
| Weather | 24 | MSE | 0.288 | 0.224 | 0.222 | 0.438 | 0.314 | 0.376 | 1.565 | 0.234 | 0.468 | 0.231 | 0.212 |
| Weather | 48 | MSE | 0.400 | 0.347 | 0.352 | 0.490 | 0.431 | 0.535 | 1.032 | 0.356 | 0.511 | 0.347 | 0.339 |
| Weather | 48 | MAE | 0.340 | 0.264 | 0.271 | 0.440 | 0.358 | 0.485 | 1.604 | 0.277 | 0.467 | 0.268 | 0.250 |
| Weather | 96 | MAE | 0.386 | 0.357 | 0.353 | 0.459 | 0.372 | 0.593 | 1.034 | 0.488 | 0.515 | 0.377 | 0.357 |
| Weather | 96 | MSE | 0.324 | 0.283 | 0.282 | 0.413 | 0.296 | 0.604 | 1.615 | 0.473 | 0.470 | 0.303 | 0.276 |
| Weather | 192 | MAE | 0.532 | 0.503 | 0.505 | 0.519 | 0.559 | 0.588 | 1.035 | 0.628 | 0.521 | 0.416 | 0.410 |
| Weather | 192 | MSE | 0.538 | 0.485 | 0.489 | 0.489 | 0.561 | 0.586 | 1.626 | 0.664 | 0.479 | 0.351 | 0.386 |
| Traffic | 24 | MAE | 0.441 | 0.459 | 0.435 | 0.672 | 0.569 | 0.472 | 0.561 | 0.496 | 0.463 | 0.461 | 0.420 |
| Traffic | 24 | MSE | 0.787 | 0.833 | 0.788 | 1.207 | 1.082 | 0.821 | 0.966 | 0.496 | 0.696 | 0.713 | 0.762 |
| Traffic | 48 | MAE | 0.442 | 0.472 | 0.449 | 0.682 | 0.600 | 0.485 | 0.519 | 0.527 | 0.471 | 0.472 | 0.420 |
| Traffic | 48 | MSE | 0.825 | 0.831 | 0.806 | 1.220 | 1.104 | 0.871 | 0.889 | 0.930 | 0.718 | 0.739 | 0.709 |
| Traffic | 96 | MAE | 0.442 | 0.480 | 0.452 | 0.695 | 0.617 | 0.512 | 0.472 | 0.533 | 0.480 | 0.478 | 0.434 |
| Traffic | 96 | MSE | 0.826 | 0.870 | 0.812 | 1.267 | 1.110 | 0.950 | 0.804 | 0.949 | 0.746 | 0.763 | 0.810 |
| Traffic | 192 | MAE | 0.486 | 0.547 | 0.498 | 0.695 | 0.566 | 0.478 | 0.512 | 0.550 | 0.488 | 0.483 | 0.478 |
| Traffic | 192 | MSE | 0.869 | 0.992 | 0.901 | 0.695 | 1.037 | 0.861 | 0.866 | 0.997 | 0.766 | 0.770 | 0.886 |
Comment

We thank the reviewer for the thoughtful and detailed review. Also, we appreciate that the reviewer acknowledges our proposed model achieves SOTA performance in handling missing data. We address the raised concerns below.

W1. The authors should further explain the motivation for introducing $\overline{\mathbf{E}} E_m(m_t;\theta_m)$ to the SSM model in Equation 4.

Thank you for your question about our motivation for incorporating $\overline{\mathbf{E}} E_m(m_t;\theta_m)$ into the model.

To address the missing data problem in S4 models, we aim to (1) distinguish the missing time points, enabling the model to treat them differently from the observed data (e.g., by referring to data in the prototype bank), and (2) ensure that the core properties of the S4 model are preserved. To this end, we seek a term that can flag missing values while preserving the HiPPO structure of S4. We found that integrating an additional masking term $M$, inspired by the literature [1], serves as a simple yet effective indicator for the model to recognize missing values. However, since the elements of $M$ take binary values (0 or 1), they are not naturally on the same scale as the other terms in Equation 4. To address this, we designed an encoder to transform the mask information to an appropriate scale. Incorporating this term still preserves the HiPPO structure of S4, thereby enriching the model with additional information while maintaining its core advantages.
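To make this concrete, here is a toy sketch of a state update augmented with an encoded-mask term, in the spirit described above (a hypothetical illustration, not the authors' implementation: the matrices, dimensions, and mask-encoder weights are random placeholders standing in for the discretized S4 parameters and $E_m(\cdot;\theta_m)$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: state size N, input size D, sequence length T (placeholders)
N, D, T = 4, 2, 5
A = 0.9 * np.eye(N)            # stand-in for the discretized state matrix
B = rng.normal(size=(N, D))    # stand-in for the input matrix
E = rng.normal(size=(N, D))    # stand-in for E-bar, scaling the encoded mask
W_m = rng.normal(size=(D, D))  # hypothetical mask-encoder weights theta_m

def mask_encoder(m, W):
    """Toy stand-in for E_m(m_t; theta_m): maps a binary mask to a
    continuous embedding so it lives on the same scale as the inputs."""
    return np.tanh(W @ m)

def masked_step(x, u, m):
    """One recurrence step: x_t = A x_{t-1} + B u_t + E * E_m(m_t)."""
    return A @ x + B @ u + E @ mask_encoder(m, W_m)

x = np.zeros(N)
u_seq = rng.normal(size=(T, D))
m_seq = np.array([[1, 1], [1, 0], [0, 0], [1, 1], [0, 1]], dtype=float)
for u, m in zip(u_seq, m_seq):
    x = masked_step(x, u, m)
# The state now carries both the inputs and a continuously scaled
# signal about which entries were missing at each step.
```

The point of the sketch is the extra additive term: because the mask passes through an encoder rather than entering as raw 0/1 values, the recurrence keeps its linear structure while receiving missingness information on an appropriate scale.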

W2. Highlighting the existing discussion comparing S4M with existing methods designed for handling missing values.

In fact, we discussed the differences between our method and other existing methods for handling missing values in both the Introduction (see lines 62-73) and Appendix A.2. To recap, traditional approaches for handling missing values use a two-step process: imputing missing values first and then performing standard analysis. This can lead to errors and suboptimal results, especially in multivariate time series with complex missing patterns and high missing ratios. We also refer to methods that directly forecast with missing data. RNN-based methods, such as BRITS and GRUD, typically require long training times and exhibit inferior forecasting performance. Graph network-based models, like BiTGraph, are effective at navigating temporal dependencies and spatial structures but often suffer from high memory usage. ODE-based methods, such as Neural ODE, generally incur high computational costs.

In contrast, we propose S4M, which combines a prototype bank with a structured state space model (S4). Our approach focuses on recognizing and representing missing data patterns in the latent space, thereby enhancing model performance by better capturing underlying dependencies while maintaining the high performance of S4.

Comment

Dear reviewer EZj4,

We sincerely thank you for your thoughtful review and valuable feedback. We have carefully addressed each of your questions and provided detailed responses in the rebuttal. We hope to have resolved all your concerns. If you have any further comments, we would be glad to address them before the rebuttal period ends. If our responses address your concerns, we would deeply appreciate it if you could consider raising your score. Your recognition of our work means a lot. Thanks again for your time and effort in reviewing our work.

Regards, S4M authors

Comment

Dear Reviewer EZj4,

We are at the end of the discussion period, so please take some time to read the response to your review for submission 10665.

Also, please indicate the extent to which the response addresses your concerns and whether it changes your score; try to explain your decision.

All the best,

The AC

Review (Rating: 6)

In this paper, the authors propose S4M. S4M is an adaptation of S4 that can handle missing values by:

  • Using prototype clusters, look-back information and an encoder to find representations also for time-points where values are missing
  • By explicitly also incorporating the masking matrix M into the S4-Layers.

They evaluate S4M on the standard datasets for regular time-series forecasting when some data is hidden and show that S4M is competitive with, and often outperforms, other S4-based and missing-value approaches.

Strengths

  • The idea of the prototype bank is very compelling and thoughtful; I like it a lot.
  • The presentation is very good. The paper is written in a manner that makes it comfortable to follow. Especially having an algorithm for each of the crucial parts helped me a lot.
  • The results do not look like fundamental breakthroughs, but they are very promising for such a novel approach, and there are a lot of ablation studies/hyperparameter experiments.

Weaknesses

The two main weaknesses I identified:

  • My largest critique point is that the authors are not comparing at all with recent results from the "Irregularly Sampled Time Series with Missing Values" literature. There is a plethora of recent works solving irregular time-series forecasting in an end-to-end manner via ODEs, modelling latent dynamics, or graph modelling [1-5]. Furthermore, these papers provide a set of standard datasets for time-series forecasting with missing values, so there is no need to synthetically make the normal regular datasets irregular.

  • There are a lot of standard methods missing where one could simply do linear interpolation to use them in the experiments on which S4M is tested. For example, this work does not refer to important forecasting works like PatchTST or iTransformer at all. The results of S4M in Table 1 are so much worse than the results of PatchTST (see the table at https://github.com/yuqinie98/PatchTST?tab=readme-ov-file) that it may be the case that PatchTST with 0.06 missing values indeed outperforms S4M.

[1] De Brouwer, E., Simm, J., Arany, A., & Moreau, Y. (2019). GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. Advances in neural information processing systems, 32.

[2] Yalavarthi, Vijaya Krishna, et al. "GraFITi: Graphs for Forecasting Irregularly Sampled Time Series." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 15. 2024.

[3] Schirmer, Mona, et al. "Modeling irregular time series with continuous recurrent units." International conference on machine learning. PMLR, 2022.

[4] Biloš, Marin, et al. "Neural flows: Efficient alternative to neural ODEs." Advances in neural information processing systems 34 (2021): 21325-21337.

[5] Klötergens, Christian, et al. "Functional Latent Dynamics for Irregularly Sampled Time Series Forecasting." Joint European Conference on Machine Learning and Knowledge Discovery in Databases.

My Current Rating

I do really like the idea and think that it has potential, even beyond S4 models for irregular time-series forecasting. However, for a top conference like ICLR, the amount of missing comparison to important related work is too high for recommending acceptance.

Questions

In addition to the critique points mentioned above, I have the following comments/questions:

  • Have you tested S4M when there are no missing values at all? I would be curious whether your prototype bank is also useful if no values are missing.
  • Table 3: Do I understand correctly that you are comparing Prototype Bank + Masking (i.e. having m_t in (4)) against only having the prototype bank? I would also like to see Only Masking, i.e. having the mask m_t in (4) but no prototype bank, i.e. replacing o_t with X_t. Is your prototype bank really needed for irregular time-series forecasting?
Comment

Q2. I would also like to see Only Masking, i.e. having the mask m_t in (4) but no prototype bank, i.e. replacing o_t with X_t. Is your prototype-bank really needed for irregular time-series forecasting?

Thank you for your valuable feedback. We have included an ablation study for the first module, ATPM. In this study, we compare S4M with and without ATPM, highlighting the improvements brought by ATPM, particularly on the Traffic and ETTh1 datasets. The following experiments were conducted under the same settings as those in the ablation studies presented in the paper.

Each cell shows S4M (Ours) followed by the increase in error when the prototype module is removed.

| $\ell_L$ | Metric | Electricity | ETTh1 | Weather | Traffic |
| --- | --- | --- | --- | --- | --- |
| **Variable Missing** | | | | | |
| 96 | MAE | 0.369 / +0.011 | 0.571 / +0.044 | 0.336 / +0.020 | 0.442 / +0.024 |
| 96 | MSE | 0.282 / +0.010 | 0.624 / +0.091 | 0.267 / +0.206 | 0.786 / +0.125 |
| 192 | MAE | 0.357 / +0.010 | 0.568 / +0.045 | 0.320 / +0.600 | 0.381 / +0.030 |
| 192 | MSE | 0.261 / +0.009 | 0.598 / +0.090 | 0.261 / +0.002 | 0.685 / +0.092 |
| 384 | MAE | 0.359 / +0.009 | 0.584 / +0.029 | 0.334 / +0.006 | 0.383 / +0.026 |
| 384 | MSE | 0.264 / +0.009 | 0.613 / +0.064 | 0.256 / +0.008 | 0.700 / +0.065 |
| 768 | MAE | 0.362 / +0.020 | 0.599 / +0.028 | 0.341 / +0.016 | 0.383 / +0.026 |
| 768 | MSE | 0.269 / +0.002 | 0.649 / +0.058 | 0.266 / +0.011 | 0.697 / +0.074 |
| **Timepoint Missing** | | | | | |
| 96 | MAE | 0.372 / +0.025 | 0.571 / +0.049 | 0.313 / +0.021 | 0.428 / +0.045 |
| 96 | MSE | 0.287 / +0.030 | 0.624 / +0.110 | 0.237 / +0.017 | 0.809 / +0.116 |
| 192 | MAE | 0.367 / +0.004 | 0.574 / +0.039 | 0.305 / +0.006 | 0.385 / +0.005 |
| 192 | MSE | 0.274 / +0.004 | 0.593 / +0.110 | 0.225 / +0.001 | 0.687 / +0.023 |
| 384 | MAE | 0.370 / +0.014 | 0.571 / +0.057 | 0.306 / +0.012 | 0.385 / +0.013 |
| 384 | MSE | 0.277 / +0.004 | 0.624 / +0.112 | 0.220 / +0.015 | 0.702 / +0.047 |
| 768 | MAE | 0.373 / +0.013 | 0.588 / +0.048 | 0.316 / +0.005 | 0.388 / +0.000 |
| 768 | MSE | 0.282 / +0.016 | 0.647 / +0.079 | 0.232 / +0.004 | 0.699 / +0.024 |


Comment

Dear Authors, thank you for the rebuttal. My concerns are only partially addressed:

  • I think that the differentiation from irregularly sampled time series has to be made more explicit in the paper. Furthermore, the fact that you are only considering irregularly sampled time series with specific patterns of missingness is not clear in the current version of the paper.
  • Your response part 2: My request was more about doing linear interpolation etc and then having models like PatchTST and iTransformer on top, not S4. Because having a look at PatchTST results without missing values, it stands to reason that it outperforms S4M.
Comment

W2.2. Justification of the methods comparison.

Thank you for your comments on the comparison between PatchTST and our method. There might be a misunderstanding of the comparison. PatchTST does not include straightforward forecasting with missing values. Masking is only used in the context of self-supervised learning. Therefore, the results with 0.06 missing values (~40% missing ratio) are not directly comparable. In our initial submission, we did not include iTransformer and PatchTST because these methods are not specifically designed for time series with missing values. Comparing them directly with our method might lead to an unfair evaluation. Instead, we focused on comparisons with SOTA methods tailored for time series prediction with missing values, such as BiTGraph, as well as S4-based methods. For transformer-based baselines, we selected two representative architectures: Transformer and Autoformer.

To address your request, we have now included PatchTST, iTransformer, and CARD in our experiments. The results on four benchmark datasets are presented in the table below, with additional results on a real-world dataset (https://openreview.net/forum?id=BkftcwIVmR&noteId=qh9vNw3wyX). Furthermore, we analyze the computational cost (https://openreview.net/forum?id=BkftcwIVmR&noteId=8TefCTqULs) and the performance across different horizon windows (https://openreview.net/forum?id=BkftcwIVmR&noteId=bcEimHMmTq).

Among the three additional methods, PatchTST exhibits strong performance in handling missing values, particularly on the Electricity dataset, and also performs well in scenarios without missing values. However, in most of the settings, S4M achieves consistently superior performance. Additionally, as the table on computational cost (https://openreview.net/forum?id=BkftcwIVmR&noteId=8TefCTqULs) shows, S4M is significantly more efficient than the three suggested methods.

Q1. Have you tested S4M when there are no missing values at all? I would be curious whether your prototype bank is also useful if no values are missing.

Thank you for the question. By design, our method is suited to time series with block-based missing data. The historical features stored in the prototype bank are particularly helpful when the missing ratio is high. If there is no missing data, as expected, our method will not show a significant advantage over the other methods but will still maintain very competitive performance. For this experiment, we report the results using a horizon window of 96 and a lookback window of 96, with no missing values in the original dataset.

| Dataset | Metric | BRITS | GRUD | Transformer | Autoformer | S4 | BiTGraph | S4M (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Electricity | MAE | 0.398 | 0.413 | 0.411 | 0.352 | 0.386 | 0.348 | 0.383 |
| Electricity | MSE | 0.318 | 0.332 | 0.321 | 0.242 | 0.301 | 0.254 | 0.295 |
| ETTh1 | MAE | 0.676 | 0.571 | 0.604 | 0.556 | 0.538 | 0.530 | 0.538 |
| ETTh1 | MSE | 0.867 | 0.636 | 0.677 | 0.588 | 0.560 | 0.571 | 0.560 |
| Weather | MAE | 0.373 | 0.370 | 0.383 | 0.306 | 0.363 | 0.504 | 0.332 |
| Weather | MSE | 0.291 | 0.301 | 0.298 | 0.235 | 0.301 | 0.494 | 0.259 |
| Traffic | MAE | 0.428 | 0.446 | 0.405 | 0.454 | 0.425 | 0.504 | 0.414 |
| Traffic | MSE | 0.770 | 0.840 | 0.707 | 0.705 | 0.425 | 0.879 | 0.761 |
Comment

W1.2 Justification on Datasets Selection.

There appears to be a misunderstanding about the type of missing data problem we focus on in this study. Our work considers block missing patterns in regularly sampled time series, where the observed values occur at consecutive time points (see our Fig. 4). This structure enables the design of an informative representation $o_t$, which is crucial for capturing temporal dependencies effectively. In contrast, standard irregularly sampled time series, like MIMIC and Physionet, do not contain such patterns of consecutive observations at the non-missing time points, making them outside the scope of our study.

Although our current design does not consider general irregularly sampled data, we appreciate your encouraging recognition that "I do really like the idea and think that it has potential, even beyond S4 models for irregular time-series forecasting." We also believe in the benefits of introducing a prototype bank beyond the S4 model, and we hope our work lays the foundation to inspire future work along this valuable direction.

W2.1 Comparison with the linear interpolation method.

Thank you for your advice. We have included simple and standard imputation methods such as mean, forward fill (Ffill), and linear decay interpolation in their original forms. To incorporate your suggestion, we have added linear interpolation to the following table, which presents experiments conducted on four datasets with $r = 0.24$ and a horizon window of 96. The results indicate that linear interpolation's performance is inferior to both our proposed approach and the decay method in most cases.

| Dataset | Horizon | Metric | S4(Mean) | S4(Ffill) | S4(Decay) | S4(Linear) | S4M (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Electricity | 96 | MAE | 0.556 | 0.501 | 0.460 | 0.468 | 0.418 |
| Electricity | 96 | MSE | 0.570 | 0.479 | 0.409 | 0.425 | 0.366 |
| Electricity | 192 | MAE | 0.464 | 0.410 | 0.420 | 0.395 | 0.391 |
| Electricity | 192 | MSE | 0.409 | 0.324 | 0.336 | 0.306 | 0.305 |
| Electricity | 384 | MAE | 0.472 | 0.420 | 0.424 | 0.403 | 0.389 |
| Electricity | 384 | MSE | 0.417 | 0.334 | 0.341 | 0.311 | 0.304 |
| Electricity | 768 | MAE | 0.469 | 0.413 | 0.415 | 0.402 | 0.399 |
| Electricity | 768 | MSE | 0.413 | 0.328 | 0.331 | 0.315 | 0.318 |
| ETTh1 | 96 | MAE | 0.710 | 0.717 | 0.681 | 0.696 | 0.627 |
| ETTh1 | 96 | MSE | 0.908 | 0.946 | 0.879 | 0.943 | 0.742 |
| ETTh1 | 192 | MAE | 0.644 | 0.659 | 0.640 | 0.671 | 0.609 |
| ETTh1 | 192 | MSE | 0.739 | 0.792 | 0.782 | 0.872 | 0.703 |
| ETTh1 | 384 | MAE | 0.632 | 0.648 | 0.648 | 0.646 | 0.628 |
| ETTh1 | 384 | MSE | 0.710 | 0.768 | 0.779 | 0.782 | 0.710 |
| ETTh1 | 768 | MAE | 0.639 | 0.661 | 0.672 | 0.659 | 0.632 |
| ETTh1 | 768 | MSE | 0.714 | 0.800 | 0.827 | 0.823 | 0.744 |
| Weather | 96 | MAE | 0.421 | 0.381 | 0.378 | 0.399 | 0.362 |
| Weather | 96 | MSE | 0.379 | 0.321 | 0.317 | 0.339 | 0.286 |
| Weather | 192 | MAE | 0.386 | 0.357 | 0.353 | 0.354 | 0.350 |
| Weather | 192 | MSE | 0.324 | 0.283 | 0.282 | 0.276 | 0.269 |
| Weather | 384 | MAE | 0.381 | 0.349 | 0.343 | 0.349 | 0.358 |
| Weather | 384 | MSE | 0.315 | 0.273 | 0.270 | 0.272 | 0.276 |
| Weather | 768 | MAE | 0.381 | 0.351 | 0.342 | 0.399 | 0.375 |
| Weather | 768 | MSE | 0.312 | 0.276 | 0.268 | 0.339 | 0.300 |
| Traffic | 96 | MAE | 0.487 | 0.569 | 0.529 | 0.568 | 0.485 |
| Traffic | 96 | MSE | 0.910 | 1.063 | 0.984 | 1.043 | 0.933 |
| Traffic | 192 | MAE | 0.442 | 0.480 | 0.452 | 0.466 | 0.433 |
| Traffic | 192 | MSE | 0.826 | 0.870 | 0.812 | 0.842 | 0.787 |
| Traffic | 384 | MAE | 0.431 | 0.456 | 0.440 | 0.524 | 0.433 |
| Traffic | 384 | MSE | 0.795 | 0.842 | 0.809 | 0.953 | 0.788 |
| Traffic | 768 | MAE | 0.432 | 0.449 | 0.434 | 0.439 | 0.429 |
| Traffic | 768 | MSE | 0.799 | 0.823 | 0.789 | 0.790 | 0.789 |
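For reference, the forward-fill and linear-interpolation baselines compared above can be sketched as follows (a minimal NumPy sketch of the generic techniques, not the exact preprocessing pipeline used in the experiments; names are illustrative):

```python
import numpy as np

def ffill(x, mask):
    """Forward fill: propagate the last observed value of each variable."""
    x = np.where(mask == 1, x.astype(float), np.nan)
    for t in range(1, x.shape[0]):
        x[t] = np.where(np.isnan(x[t]), x[t - 1], x[t])
    return x

def linear_interp(x, mask):
    """Per-variable linear interpolation between observed time points
    (np.interp clamps to the nearest observed value at the boundaries)."""
    x = np.where(mask == 1, x.astype(float), np.nan)
    t = np.arange(x.shape[0])
    out = x.copy()
    for d in range(x.shape[1]):
        obs = ~np.isnan(x[:, d])
        out[:, d] = np.interp(t, t[obs], x[obs, d])
    return out

x = np.array([[1.0, 0.0], [0.0, 4.0], [3.0, 0.0], [0.0, 8.0]])
m = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
fi = ffill(x, m)          # var 0 at t=1 becomes 1.0 (copied forward)
li = linear_interp(x, m)  # var 0 at t=1 becomes 2.0 (midway between 1.0 and 3.0)
```

The contrast between the two fills (a held constant value versus a straight line between observations) is exactly what separates the S4(Ffill) and S4(Linear) columns above.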
Comment

We thank the reviewer for the thoughtful and detailed review. Also, we appreciate that the reviewer acknowledges our prototype bank presents an interesting practical way to tackle missing values. We address the raised concerns below.

W1.1 Justification on the comparison baseline selections.

We thank the reviewer for the question. The reasons we did not directly compare S4M with irregularly sampled time-series methods are severalfold. First, we focus on block-based missing patterns where the observed values occur at consecutive time points (see our Fig. 4); irregularly sampled time-series problems typically do not directly consider the properties of such missing patterns. Second, efficiency is an essential consideration for our work, as stated in lines 51-52 of our original submission. Compared with S4, the suggested ODE- and graph-based irregularly sampled methods are computationally costly. For instance, experiments with GraFITi on the Traffic and Electricity datasets often result in out-of-memory (OOM) errors in most settings. Similarly, CRU is extremely slow due to its iterative computations over variable dimensions. Detailed comparisons and computational costs for these methods are provided in the table below.

| Method | FLOPs (M) | Training Time (s) | Inference Time (s) |
| --- | --- | --- | --- |
| S4 (Mean) | 12463.39 | 0.11282 | 0.07416 |
| S4 (Ffill) | 12463.39 | 0.11498 | 0.07983 |
| S4 (Decay) | 12618.52 | 0.08325 | 0.06152 |
| BRITS | 9091.16 | 0.46920 | 0.21126 |
| GRUD | 3813.82 | 0.19958 | 0.08756 |
| Transformer | 17627.87 | 0.09035 | 0.06088 |
| Autoformer | 18734.88 | 0.09613 | 0.07662 |
| BiTGraph | 3185.64 | 0.24546 | 0.08122 |
| iTransformer | 565.36 | 0.06744 | 0.04009 |
| PatchTST | 392299.02 | 0.46017 | 0.16896 |
| CRU | 219.57 | 49.40020 | 4.76765 |
| GraFITi | 265118.32 | OOM | OOM |
| S4M (Ours) | 139191.88 | 0.219381 | 0.099314 |

In response to your comment, we have added two additional SOTA methods for irregular time-series forecasting, GraFITi and CRU, for comparison. However, due to the high computational costs and significant memory requirements of methods designed for irregularly sampled time-series data, our experiments were restricted to the smaller-scale ETTh1 dataset, as detailed below. The results show that ODE-based models perform suboptimally in this context.

| Lookback Window ($L$) | Metric | S4M (Ours) | CRU | GraFITi |
| --- | --- | --- | --- | --- |
| 96 | MAE | 0.630 | 0.774 | 0.821 |
| 96 | MSE | 0.779 | 1.093 | 1.162 |
| 192 | MAE | 0.604 | 0.802 | 0.811 |
| 192 | MSE | 0.670 | 1.150 | 1.161 |
| 384 | MAE | 0.628 | 0.802 | 0.805 |
| 384 | MSE | 0.748 | 1.184 | 1.162 |
| 768 | MAE | 0.619 | 0.774 | 0.821 |
| 768 | MSE | 0.693 | 1.093 | 1.060 |
Comment

Dear reviewer, thank you for acknowledging that our responses have addressed your concerns in part. We are delighted to engage further and address your remaining questions.

Clarifications on the Time-Series Setting: In our revision, we have made explicit the distinction between irregularly sampled time series and our focus on the missing data setting. This distinction is now emphasized in both the abstract and the introduction, ensuring clarity for all readers. Additionally, to guide understanding, we have highlighted Figure 4, which comprehensively illustrates this setting. Thanks for your suggestions. We hope this addresses your concern.

Focus on S4-based Models: We respectfully request the reviewer to consider that the primary focus of our study is on S4-based models, as stated in our introduction. S4 was selected due to its efficiency, which has been well-documented in the literature. We further demonstrated this efficiency in response to your W2.2 (Part 3), where we compared S4M against the suggested methods and many others. For your convenience, we have included the table from our response, which underscores that S4M is significantly more efficient than the three suggested methods.

| Method | FLOPs (M) | Training Time (s) | Inference Time (s) |
| --- | --- | --- | --- |
| S4 (Mean) | 12463.39 | 0.11282 | 0.07416 |
| S4 (Ffill) | 12463.39 | 0.11498 | 0.07983 |
| S4 (Decay) | 12618.52 | 0.08325 | 0.06152 |
| BRITS | 9091.16 | 0.46920 | 0.21126 |
| GRUD | 3813.82 | 0.19958 | 0.08756 |
| Transformer | 17627.87 | 0.09035 | 0.06088 |
| Autoformer | 18734.88 | 0.09613 | 0.07662 |
| BiTGraph | 3185.64 | 0.24546 | 0.08122 |
| iTransformer | 565.36 | 0.06744 | 0.04009 |
| PatchTST | 392299.02 | 0.46017 | 0.16896 |
| CRU | 219.57 | 49.40020 | 4.76765 |
| GraFITi | 265118.32 | OOM | OOM |
| S4M (Ours) | 139191.88 | 0.219381 | 0.099314 |

We hope the reviewer agrees with the trade-offs inherent in various backbone architectures (no free lunch) and recognizes improving S4 is central to this study, reflected in the title and justified in our introduction.

Response to Interpretation of Reviewer Comments: We appreciate your clarification regarding your request. Initially, we interpreted your comments as two distinct questions, leading us to provide detailed responses in W2.1 (Part 2) and W2.2 (Part 3) separately.

Experimental Comparison with Your Suggested Methods: We appreciate your thoughtful feedback. We conducted experiments incorporating your suggested methods with the mean interpolation approach (specified in W2.2, Part 3). Furthermore, we conducted experiments combining PatchTST with linear interpolation. The results are shown below. In most settings, S4M consistently demonstrated superior performance. Both linear and mean interpolation with PatchTST do not work well. Linear interpolation outperformed mean interpolation only on the ETTh1 dataset. For datasets exhibiting clear seasonality like electricity and traffic, linear interpolation may perform worse than mean interpolation.

| Dataset | Metric | S4M (Ours) | PatchTST (Linear Interpolation) | PatchTST (Mean Interpolation) |
| --- | --- | --- | --- | --- |
| Electricity | MAE | 0.418 | 0.501 | 0.420 |
| Electricity | MSE | 0.359 | 0.466 | 0.344 |
| ETTh1 | MAE | 0.627 | 0.587 | 0.583 |
| ETTh1 | MSE | 0.742 | 0.649 | 0.650 |
| Weather | MAE | 0.370 | 0.348 | 0.399 |
| Weather | MSE | 0.294 | 0.309 | 0.327 |
| Traffic | MAE | 0.499 | 0.655 | 0.530 |
| Traffic | MSE | 0.943 | 0.929 | 0.927 |
Comment

Dear Authors, thanks for the additional results and changes. I think the modifications and additional experiments strengthen the paper. The authors spent a lot of effort to incorporate the changes I proposed. I thus increased my score.

Comment

Dear reviewer, we really appreciate the time and effort that you have dedicated to providing your valuable feedback on improving our manuscript. We are grateful for your insightful comments. Thank you.

Review (Rating: 6)

The paper introduces S4M, an extension of the S4 framework to multivariate time series forecasting with missing values. It combines a prototype-based representation learning module (ATPM) with a dual-stream S4 architecture (MDS-S4) to handle missing values directly rather than through preprocessing. The method is evaluated on four datasets under various missing data scenarios.

Strengths

The paper addresses a practical and relevant problem. Real-world data often contains missing values (or mis-recorded values) which makes developing principled methods to handle them in forecasting models a worthwhile endeavor. The paper is well written and relatively easy to follow. The empirical evaluation does employ a set of strong baselines for comparison. The use of a "prototype bank" for backfilling is new to this reviewer and represents an interesting practical way to tackle missing values.

Weaknesses

The paper introduces a few new key components, notably the prototype bank and the MDS-S4 architecture which, while well described, are only subjected to limited analysis and theoretical justification. For instance, complexity analysis is missing and ablation studies are partial. Some architectural choices seem arbitrary. The datasets chosen in the empirical evaluation (Traffic, Electricity, ETTh1, Weather) are all fairly simple datasets. Given that there are many more publicly available time-series evaluation datasets, I would like to see a more comprehensive evaluation.

Questions

What is the computational complexity of ATPM vs traditional approaches? Can you provide theoretical justification for the prototype bank design? How sensitive is performance to prototype bank initialization?

Comment

We thank the reviewer for the thoughtful and detailed review. Also, we appreciate that the reviewer acknowledges our prototype bank presents an interesting practical way to tackle missing values. We address the raised concerns below.

1. What is the computational complexity of ATPM vs traditional approaches?

Thanks for the question; we provide the complexity analysis below. An empirical comparison of the computational costs of S4M and other methods is given in the following table. S4M demonstrates superior efficiency in both training and inference compared to other baselines.

| Method | FLOPs (M) | Training time (s) | Inference time (s) |
|---|---|---|---|
| S4 (Mean) | 12463.39 | 0.11282 | 0.07416 |
| S4 (Ffill) | 12463.39 | 0.11498 | 0.07983 |
| S4 (Decay) | 12618.52 | 0.08325 | 0.06152 |
| BRITS | 9091.16 | 0.46920 | 0.21126 |
| GRUD | 3813.82 | 0.19958 | 0.08756 |
| Transformer | 17627.87 | 0.09035 | 0.06088 |
| Autoformer | 18734.88 | 0.09613 | 0.07662 |
| BiTGraph | 3185.64 | 0.24546 | 0.08122 |
| iTransformer | 565.36 | 0.06744 | 0.04009 |
| PatchTST | 392299.02 | 0.46017 | 0.16896 |
| CRUD | 219.57 | 49.40020 | 4.76765 |
| Grafiti | 265118.32 | OOM | OOM |
| S4M (Ours) | 139191.88 | 0.219381 | 0.099314 |

We also provide the complexity analysis for the core steps in bank writing and reading operations.

  • Bank Writing: Given B training data points in a batch, each with L segments, as detailed in Algorithm 1, writing to the prototype bank involves four main steps: (1) randomly selecting n out of B·L representations, with complexity O(n); (2) computing the similarity between the n selected representations and the s centroids of dimension R, which costs O(n·s·R); (3) selecting the maximum similarity for each representation, which requires O(n·s); and (4) updating the clusters via a FIFO-based mechanism, at a cost of O(n). Assuming standard operations for similarity computation and FIFO updates, the overall computational complexity is dominated by O(n·s·R), reflecting the influence of the embedding dimension R and the number of centroids s.
  • Bank Reading: Given B training data points, each with l segments, the procedure involves the following steps: (1) for all B·l segments, compute the cosine similarity with the s centroids, at a cost of O(B·l·s·R), where R is the embedding dimension; (2) select the top K centroids for each segment, which takes O(B·l·s) using a partial sort; (3) normalize the similarity values of these K centroids with an exponential function, costing O(B·l·K); and (4) compute the weighted average of these K centroids, which takes O(B·l·K·R). The overall computational complexity is dominated by O(B·l·s·R), driven primarily by the initial cosine similarity computation.
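For concreteness, the bank-reading steps above can be sketched in a few lines of NumPy. The `read_bank` helper below is a hypothetical illustration, not the paper's actual implementation; the dominant O(B·l·s·R) cost comes from the dense similarity product in step (1).

```python
import numpy as np

def read_bank(segments, centroids, K=3):
    """Sketch of bank reading: cosine similarity with all s centroids,
    top-K selection, exponential normalization, weighted average.
    segments: (B*l, R) query representations; centroids: (s, R) prototypes."""
    # (1) cosine similarity between every segment and every centroid: O(B*l*s*R)
    seg_n = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    cen_n = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = seg_n @ cen_n.T                                # (B*l, s)
    # (2) top-K centroids per segment via a partial sort: O(B*l*s)
    topk = np.argpartition(-sim, K - 1, axis=1)[:, :K]   # (B*l, K)
    topk_sim = np.take_along_axis(sim, topk, axis=1)
    # (3) exponential (softmax-style) normalization over the K similarities: O(B*l*K)
    w = np.exp(topk_sim)
    w /= w.sum(axis=1, keepdims=True)
    # (4) weighted average of the K selected centroids: O(B*l*K*R)
    return np.einsum('nk,nkr->nr', w, centroids[topk])

proto = read_bank(np.random.randn(8, 16), np.random.randn(5, 16))
print(proto.shape)  # (8, 16)
```

The partial sort in step (2) is what keeps the top-K selection at O(B·l·s) rather than the O(B·l·s log s) of a full sort.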
评论

4. More comprehensive evaluation.

The datasets for empirical evaluation are widely used in the time series forecasting literature [2,3]. We selected these four datasets because they vary significantly in size, number of variables, and the presence or absence of seasonality. We consider block missing patterns in regularly sampled time series, where the observed values occur at consecutive time points (see our Fig. 4). This structure enables the design of an informative representation o_t, which is crucial for capturing temporal dependencies effectively.

Following your comment, we included the real-world USHCN climate dataset [1] in our analysis, using a lookback window of size 96 and a horizon of size 96. The results further confirm that S4M outperforms the other methods on this real-world dataset.

| r | Metric | S4 (Mean) | S4 (Ffill) | S4 (Decay) | BRITS | GRUD | Transformer | Autoformer | BiTGraph | iTransformer | PatchTST | CARD | S4M (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.12 | MAE | 0.477 | 0.489 | 0.466 | 0.644 | 0.477 | 0.461 | 0.511 | 0.474 | 0.478 | 0.494 | 0.451 | 0.447 |
| 0.12 | MSE | 0.455 | 0.414 | 0.447 | 0.668 | 0.452 | 0.406 | 0.499 | 0.439 | 0.460 | 0.457 | 0.411 | 0.417 |
| 0.24 | MAE | 0.507 | 0.522 | 0.502 | 0.644 | 0.499 | 0.475 | 0.534 | 0.495 | 0.504 | 0.528 | 0.477 | 0.473 |
| 0.24 | MSE | 0.503 | 0.517 | 0.503 | 0.689 | 0.484 | 0.403 | 0.530 | 0.469 | 0.502 | 0.502 | 0.444 | 0.433 |

References:

[1] Long-term daily climate records from stations across the contiguous United States, 2015.

[2] Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems, 2021.

[3] CARD: Channel aligned robust blend transformer for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.

5. How sensitive is performance to prototype bank initialization?

In our experiments, we found that performance is insensitive to the initial cluster configuration, as the clusters are updated continuously throughout training. To provide evidence, we present results on four datasets using different numbers of clusters for initialization, as shown below. In practice, we recommend using 3 to 5 clusters for initialization, or determining the number of clusters from the within-cluster sum of squares.

| Dataset | Metric | 1 | 2 | 3 | 4 | 8 | 12 | 16 |
|---|---|---|---|---|---|---|---|---|
| Electricity | MAE | 0.415 | 0.415 | 0.415 | 0.415 | 0.415 | 0.415 | 0.415 |
| Electricity | MSE | 0.356 | 0.356 | 0.356 | 0.356 | 0.357 | 0.358 | 0.356 |
| ETTh1 | MAE | 0.647 | 0.647 | 0.648 | 0.648 | 0.647 | 0.648 | 0.650 |
| ETTh1 | MSE | 0.768 | 0.767 | 0.770 | 0.770 | 0.767 | 0.767 | 0.773 |
| Weather | MAE | 0.386 | 0.390 | 0.387 | 0.385 | 0.388 | 0.388 | 0.385 |
| Weather | MSE | 0.307 | 0.310 | 0.308 | 0.306 | 0.310 | 0.310 | 0.307 |
| Traffic | MAE | 0.510 | 0.505 | 0.504 | 0.515 | 0.509 | 0.513 | 0.509 |
| Traffic | MSE | 0.966 | 0.954 | 0.944 | 0.999 | 0.974 | 0.992 | 0.985 |
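The within-cluster-sum-of-squares heuristic mentioned above can be sketched as follows. The `kmeans_wcss` helper is a toy NumPy Lloyd's k-means, a stand-in assumption rather than the clustering used inside S4M; one looks for the elbow where the curve stops dropping sharply.

```python
import numpy as np

def kmeans_wcss(X, k, iters=50, seed=0):
    """Within-cluster sum of squares after a plain Lloyd's k-means run."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers, keeping the old one if a cluster empties out
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(1).sum()

X = np.random.default_rng(0).normal(size=(200, 8))   # stand-in for segment embeddings
curve = [kmeans_wcss(X, k) for k in range(1, 7)]     # WCSS for k = 1..6
```

Since the WCSS always decreases as k grows, the choice is where the marginal drop flattens, consistent with the 3-to-5-cluster recommendation above.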
评论

2. Theoretical justification.

Thank you for the questions. The long-term dependency in S4 is achieved using the HiPPO matrix A, as shown in (1). This long-term dependency is evident because the current state can be expressed as a convolution of past inputs, with the convolution kernels being polynomials in the HiPPO matrix A. Our dual-stream processing maintains the HiPPO structure, as described in line 273 of our manuscript. Specifically, the current state remains a convolution of past inputs, with the convolution kernel a polynomial in the HiPPO matrix, just as in S4. Therefore, the theoretical results for S4 remain valid in our approach.
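The claim that the state remains a convolution of past inputs can be checked numerically. The sketch below uses a small generic stable matrix as a stand-in for the actual HiPPO-derived A, and compares a recurrent rollout against the explicit kernel K_k = C A^k B.

```python
import numpy as np

# A discretized linear SSM x_t = A x_{t-1} + B u_t, y_t = C x_t: its output
# is a convolution of past inputs with kernel K_k = C A^k B, the structural
# property the dual stream is said to preserve.
rng = np.random.default_rng(0)
N, T = 4, 12
A = 0.9 * np.eye(N) + 0.05 * rng.normal(size=(N, N))  # stand-in for HiPPO-derived A
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=T)

# recurrent rollout
x = np.zeros((N, 1))
y_rec = []
for t in range(T):
    x = A @ x + B * u[t]
    y_rec.append((C @ x).item())

# identical output via the explicit convolution kernel K_k = C A^k B
K = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(T)]
y_conv = [sum(K[k] * u[t - k] for k in range(t + 1)) for t in range(T)]

assert np.allclose(y_rec, y_conv)
```

The same identity holds whichever per-step input is fed in, which is why swapping u_t for a richer representation leaves the convolutional structure intact.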

3. Additional ablation studies.

Thank you for your suggestion on additional ablation studies. We have included an ablation study for the first module, ATPM. In this study, we compare S4M with and without ATPM, highlighting the improvements brought by ATPM, particularly on the Traffic and ETTh1 datasets. The following experiments were conducted under the same settings as those in the ablation studies presented in the paper.

Entries in the Δ columns give the change in error for S4M without the prototype module, relative to full S4M.

Variable Missing

| ℓ | Metric | Electricity S4M | Electricity Δ | ETTh1 S4M | ETTh1 Δ | Weather S4M | Weather Δ | Traffic S4M | Traffic Δ |
|---|---|---|---|---|---|---|---|---|---|
| 96 | MAE | 0.369 | +0.011 | 0.571 | +0.044 | 0.336 | +0.020 | 0.442 | +0.024 |
| 96 | MSE | 0.282 | +0.010 | 0.624 | +0.091 | 0.267 | +0.206 | 0.786 | +0.125 |
| 192 | MAE | 0.357 | +0.010 | 0.568 | +0.045 | 0.320 | +0.600 | 0.381 | +0.030 |
| 192 | MSE | 0.261 | +0.009 | 0.598 | +0.090 | 0.261 | +0.002 | 0.685 | +0.092 |
| 384 | MAE | 0.359 | +0.009 | 0.584 | +0.029 | 0.334 | +0.006 | 0.383 | +0.026 |
| 384 | MSE | 0.264 | +0.009 | 0.613 | +0.064 | 0.256 | +0.008 | 0.700 | +0.065 |
| 768 | MAE | 0.362 | +0.020 | 0.599 | +0.028 | 0.341 | +0.016 | 0.383 | +0.026 |
| 768 | MSE | 0.269 | +0.002 | 0.649 | +0.058 | 0.266 | +0.011 | 0.697 | +0.074 |

Timepoint Missing

| ℓ | Metric | Electricity S4M | Electricity Δ | ETTh1 S4M | ETTh1 Δ | Weather S4M | Weather Δ | Traffic S4M | Traffic Δ |
|---|---|---|---|---|---|---|---|---|---|
| 96 | MAE | 0.372 | +0.025 | 0.571 | +0.049 | 0.313 | +0.021 | 0.428 | +0.045 |
| 96 | MSE | 0.287 | +0.030 | 0.624 | +0.110 | 0.237 | +0.017 | 0.809 | +0.116 |
| 192 | MAE | 0.367 | +0.004 | 0.574 | +0.039 | 0.305 | +0.006 | 0.385 | +0.005 |
| 192 | MSE | 0.274 | +0.004 | 0.593 | +0.110 | 0.225 | +0.001 | 0.687 | +0.023 |
| 384 | MAE | 0.370 | +0.014 | 0.571 | +0.057 | 0.306 | +0.012 | 0.385 | +0.013 |
| 384 | MSE | 0.277 | +0.004 | 0.624 | +0.112 | 0.220 | +0.015 | 0.702 | +0.047 |
| 768 | MAE | 0.373 | +0.013 | 0.588 | +0.048 | 0.316 | +0.005 | 0.388 | +0.000 |
| 768 | MSE | 0.282 | +0.016 | 0.647 | +0.079 | 0.232 | +0.004 | 0.699 | +0.024 |
评论

Dear reviewer iDLq,

We sincerely thank you for your thoughtful review and valuable feedback. We have carefully addressed each of your questions and provided detailed responses in the rebuttal. We hope to have resolved all your concerns. If you have any further comments, we would be glad to address them before the rebuttal period ends. If our responses address your concerns, we would deeply appreciate it if you could consider raising your score. Your recognition for our novel work means a lot. Thanks again for your time and effort in reviewing our work.

Regards, S4M authors

评论

Dear Reviewer iDLq,

You have indicated that submission 10665 is marginally below acceptance. The authors have provided a detailed response.

Please indicate the extent to which their response addresses your concerns, and explain your decision to update (or not update) your score.

All the best,

The AC

审稿意见
5

The paper presents S4M, an innovative end-to-end framework consisting of the Adaptive Temporal Prototype Mapper (ATPM) and the Missing-Aware Dual Stream S4 (MDS-S4) for multivariate time series forecasting that addresses the challenge of missing data.

优点

  1. The proposed S4M model is innovative, integrating missing data handling within the model architecture.

缺点

  1. Although the results are promising, the authors did not provide their source code which results in very low reproducibility;
  2. Computational efficiency comparison is missing;

问题

  1. What are the computational costs and scalability of the S4M model, especially when dealing with large-scale multivariate time series data with high missing ratios? How does it compare to the baseline models in terms of training and inference time?
  2. How does the dual stream processing impact the model's ability to capture temporal dependencies?
  3. Can the authors confirm whether they implemented these baseline methods using official code or leveraged existing unified Python libraries, such as the Time-Series-Library [1] or PyPOTS [2]? It's important to note that data processing varies significantly among different imputation algorithms. Utilizing unified interfaces could help ensure that the experimental comparisons are conducted fairly.

References

[1] https://github.com/thuml/Time-Series-Library

[2] Wenjie Du. PyPOTS: a Python toolbox for data mining on Partially-Observed Time Series. In KDD MiLeTS Workshop, 2023. https://github.com/WenjieDu/PyPOTS

评论

We thank the reviewer for the thoughtful and detailed review. Also, we appreciate that the reviewer acknowledges our proposed S4M model is innovative. We address your concerns below.

W1. Source code.

Thanks for the question; the code can be found at the anonymous link.

W2 & Q1. Computational costs of S4M model and its comparison to the baseline models in terms of training and inference time?

Thank you for your valuable feedback. S4M demonstrates superior efficiency in both training and inference compared to other baselines. To evaluate the computational cost of S4M, we conducted experiments using the Electricity dataset under the highest missing ratio setting. The experiments were performed with a batch size of 16 and a hidden size of 512.

We observe that S4M (ours) achieves a lower FLOPs count than other SOTA transformer-based methods, including Grafiti, and is comparable to the S4-based methods. These results confirm our motivation to focus on an S4-based architecture, given its efficiency (see lines 51-52 of our original submission). Furthermore, S4M demonstrates shorter training times than CRUD, PatchTST, BiTGraph, and BRITS. For inference, S4M also outperforms CRUD, PatchTST, and BRITS, making it an efficient choice for both training and inference. (For clarification, "OOM" in the tables refers to "Out of Memory.")

| Method | FLOPs (M) | Training time (s) | Inference time (s) |
|---|---|---|---|
| S4 (Mean) | 12463.39 | 0.11282 | 0.07416 |
| S4 (Ffill) | 12463.39 | 0.11498 | 0.07983 |
| S4 (Decay) | 12618.52 | 0.08325 | 0.06152 |
| BRITS | 9091.16 | 0.46920 | 0.21126 |
| GRUD | 3813.82 | 0.19958 | 0.08756 |
| Transformer | 17627.87 | 0.09035 | 0.06088 |
| Autoformer | 18734.88 | 0.09613 | 0.07662 |
| BiTGraph | 3185.64 | 0.24546 | 0.08122 |
| iTransformer | 565.36 | 0.06744 | 0.04009 |
| PatchTST | 392299.02 | 0.46017 | 0.16896 |
| CRUD | 219.57 | 49.40020 | 4.76765 |
| Grafiti | 265118.32 | OOM | OOM |
| S4M (Ours) | 139191.88 | 0.219381 | 0.099314 |
评论

Q2. How does the dual stream processing impact the model's ability to capture temporal dependencies?

Thank you for the question. The motivation behind the dual-stream processing is to take advantage of the strengths of S4 (namely, its ability to capture long-term temporal dependencies and its computational efficiency) when addressing block missing patterns in time series. The long-term dependency in S4 is achieved through the use of the HiPPO matrix A, as shown in (1). In our dual-stream processing, we build on this structure by incorporating o_t (the representation of a shorter look-back window) instead of u_t (the observation at only the current time point) into the model. The term o_t inherently captures additional temporal-dependency information.
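As a toy scalar illustration of this idea (the paper's o_t is a learned representation, not the masked window mean used here; `lookback_repr` and its window size are illustrative assumptions), replacing u_t with a summary of the recent observed values looks like:

```python
import numpy as np

def lookback_repr(u, mask, t, w=4):
    """Hypothetical o_t: mean of the observed values (mask == 1) in the
    window of length w ending at time t, instead of the single value u_t."""
    lo = max(0, t - w + 1)
    vals, m = u[lo:t + 1], mask[lo:t + 1]
    if m.sum() == 0:
        return 0.0               # nothing observed in the window
    return float((vals * m).sum() / m.sum())

u = np.array([1.0, 2.0, 0.0, 4.0, 5.0])
mask = np.array([1.0, 1.0, 0.0, 1.0, 1.0])   # third value is missing
o_t = lookback_repr(u, mask, t=3)            # mean of [1, 2, 4] = 7/3
```

Even when u_t itself is missing, such an o_t can still carry information from the surrounding observed points, which is what lets the SSM update remain informative under block missingness.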

Q3. Can the authors confirm whether they implemented these baseline methods using official code or leveraged existing unified Python libraries.

Thanks for checking the experiment details. For a fair comparison, we used the official implementation for all baselines.

评论

Dear reviewer UFK6,

We sincerely thank you for your thoughtful review and valuable feedback. We have carefully addressed each of your questions and provided detailed responses in the rebuttal. We hope to have resolved all your concerns. If you have any further comments, we would be glad to address them before the rebuttal period ends. If our responses address your concerns, we would deeply appreciate it if you could consider raising your score. Your recognition for our novel work means a lot. Thanks again for your time and effort in reviewing our work.

Regards, S4M authors

评论

Dear Reviewer UFK6,

The discussion period is almost over, so please read the response of the authors of submission 10665 to your review.

Does their response address your concerns? Will you modify your score? Please explain your decision.

All the best,

The AC

评论

We sincerely thank all reviewers for their time and valuable feedback. We are thrilled that the reviewers recognized the strengths of our work, describing our approach as “practical” and “innovative” and noting that the “idea of the prototype bank is very compelling and thoughtful.” We are also encouraged by the positive comments on the experimental results, which were described as “promising” and comprehensive, with “a lot of ablation studies/hyper-parameter experiments.” Additionally, we appreciate the acknowledgment that our paper is “well written” and “easy to follow.” We have carefully addressed your comments point by point. We appreciate the time and effort you have put into your review, and we welcome any further questions you may have.

AC 元评审

Most of the reviewers have appreciated the novelty of the method and its practical relevance, as well as the baselines chosen. Some design choices (such as the prototype bank) were deemed to be new solutions to the problem and of potential interest to the community.

There were concerns about the reproducibility of the method and its computational efficiency, raised by reviewer UFK6. The authors have shared their code and conducted experiments, showing their method does not introduce a large computational overhead compared to S4, and is competitive against baselines from the transformer family. The reviewer (who gave a score of borderline reject) did not participate in the discussion, even when prompted. However, I consider the issues they raised as having been addressed by the authors. There were no other reasons stated in the review as arguments to reject the paper.

An issue raised by Reviewer iDLq was the simplicity of the datasets and the need for more ablation studies. The authors have included a more complex dataset and additional ablation studies, which seemed to have convinced the reviewer since he raised his score. I also find these experiments a good addition to the paper.

Reviewer iyXX also appreciated the new ideas put forward in the paper as well as the authors’ response with additional experiments comparing against PatchTST and other models, which the reviewer found convincing.

Reviewer EZj4 raised some questions about modeling choices and hyperparameters, as well as the addition of experiments with various horizons. The authors have, in my opinion, addressed these issues, though the reviewer did not respond, even when prompted to do so.

All in all, there is sufficient novelty in this method to make it a valuable contribution to ICLR. Reviewers have requested additional experiments, which the authors provided, and which will also strengthen the paper. Thus, I recommend acceptance.

审稿人讨论附加意见

The meta-review contains a summary of the issues raised, and the author responses. The two reviewers who opted to marginally reject the paper did not participate in the discussion. I used my own judgement and determined that the authors addressed their comments.

最终决定

Accept (Poster)