PaperHub
Average rating: 5.2 / 10
Decision: Rejected · 5 reviewers
Ratings: 5, 5, 5, 5, 6 (min 5, max 6, std 0.4)
Confidence: 3.6
Correctness: 2.4
Contribution: 2.4
Presentation: 3.0
ICLR 2025

Towards Generalisable Time Series Understanding Across Domains

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Time Series Analysis · Multi-Domain · Self-Supervised Learning

Reviews and Discussion

Review
Rating: 5

The paper presents OTiS, a deep model pre-trained on a large corpus (11B) for general time series analysis. In this paper, the authors highlight the challenge of heterogeneity when applying self-supervised pre-training on time series. An MAE-style pre-training method is adopted to obtain a general tokenizer for multivariate time series, and then different task heads are introduced to complete time series analysis tasks. The model demonstrates strong performance across 15 diverse applications, including time series classification, regression, and forecasting.

Strengths

  1. This paper researches an important question about generalizable time series understanding across diverse domains.
  2. This work presents a large pre-training corpus, which can be greatly beneficial if the datasets are released.
  3. The method exhibits promising results in handling multivariate time series analysis by leveraging variate and domain signatures.

Weaknesses

  1. My major concern is the novelty of the proposed method: the design of the encoder/decoder is nearly identical to MAE. Is there any adaptation for the time series modality? For example, considering the inherent reconstruction difficulties of time series and adjusting the mask ratio compared with the vision modality?
  2. About the model design towards generalizable time series understanding: as the authors mention the important challenge of heterogeneity, I am slightly unconvinced that a shared unified patch embedding/projector can reflect different semantics among variates and domains, even if the patches themselves are identical. Prior to this, Moirai adopted different patch sizes for different frequencies; would this further enhance OTiS?
  3. This work adopts learnable embeddings as variate/domain signatures. I am convinced that the signatures can "distinguish" them, but how can they explicitly "capture inter-variate relationships"? This approach may also limit the generalization scope, as the learned signatures do not apply to variates/domains unseen during inference.
  4. About the experiments: the classification results are not compared with supervised deep models trained from scratch, for example, TimesNet and ModernTCN. For the regression task, can you introduce some variate-centric models, such as iTransformer, into the baselines? As for forecasting, the average improvement does not seem significant compared with PatchTST. Also, regarding Table 3, can you explain why OTiS shows a significant improvement on some datasets (such as ETTh2) but a large degradation on similar datasets like ETTh1?
  5. A minor suggestion: the name "dual masking strategy" seems somewhat overstated to me, since "dual" generally refers to dual or antagonistic behavior (e.g., minimax). I would prefer to describe the contribution as a "mixture" (of masked modeling and generative modeling in this paper), which is in fact a common technique. Also, I would like to know how the ratios of the two strategies (25% / 75% in this paper) are determined.
  6. The pipeline for using the masked pre-trained model still seems somewhat tedious, i.e., lacking in generalization, since supervised training must still be performed after large-scale pre-training. Can the authors report the overall improvement compared with training from random initialization, or try zero-shot generalization on downstream tasks?

Questions

  1. Have you tried to pre-train separately according to different domains and then fine-tune it for domain-specific downstream tasks? As observed from Table 1, there are several discrepancies in different domains, such as the frequencies of Economics and EEG. Is it possible that separating datasets to pre-train domain-specific models works better?
  2. The proposed method uses a fixed context for pre-training, padding the large pre-training corpus, which generally contains univariate time series, to a fixed temporal/variate dimension. Will this waste computing resources?
Comment

(Q1) Do models pre-trained on a specific domain outperform those pre-trained across domains?

We evaluate our model against several domain-specific baselines that are either i) fully supervised or ii) pre-trained and fine-tuned exclusively on the target dataset. These include N-BEATS [15], TimesNet [16], Autoformer [20], DLinear [18], MAE [21], ViT [21], iTransformer [22], CM-AE [19], MMCL [21], and PatchTST [10]. The experiments show that OTiS outperforms such domain-specific approaches in 10 out of 15 benchmarks, with inferior performance observed in only 2 out of 15 benchmarks. We have conducted additional ablation studies to investigate different pre-training strategies for OTiS in the context of EEG event type classification. The results show that domain-specific pre-training does not provide improved downstream performance compared to pre-training across domains. We have added these observations to Appendix G.2.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[15] Oreshkin, B. et al. "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting." International Conference on Learning Representations (ICLR). 2019.

[16] Wu, H. et al. "TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis." International Conference on Learning Representations (ICLR). 2022.

[18] Zeng, A. et al. "Are transformers effective for time series forecasting?" AAAI Conference on Artificial Intelligence (AAAI). 2023.

[19] Radhakrishnan, A. et al. "Cross-modal autoencoder framework learns holistic representations of cardiovascular state." Nature Communications. 2023.

[20] Wu, H. et al. “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.” Advances in Neural Information Processing Systems (NeurIPS). 2021.

[21] Turgut, O. et al. "Unlocking the diagnostic potential of ecg through knowledge transfer from cardiac mri." arXiv preprint arXiv:2308.05764. 2023.

[22] Liu, Y. et al. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting." International Conference on Learning Representations (ICLR). 2023.


(Q2) Does padding a large pre-training corpus to a fixed temporal/variate dimension waste computational resources?

We would like to clarify that we do not pad our pre-training corpus offline, as doing so would waste memory and limit scalability. Instead, we pad the variate dimension to the maximum number of variates within each batch. Furthermore, we use attention masking to ignore the padded tokens during gradient calculation, thus preventing a waste of computational resources.
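For illustration, here is a minimal PyTorch-style sketch of the batching logic described above (padding the variate dimension to the batch maximum and tracking a mask); the function and tensor names are ours, not taken from the OTiS codebase.

```python
import torch

def collate_variate_padded(batch):
    """Pad each sample's variate dimension to the batch maximum.

    batch: list of tensors of shape (V_i, T) with a shared time length T.
    Returns data of shape (B, V_max, T) and a boolean mask (B, V_max) that is
    True for real variates and False for padding, so padded tokens can be
    excluded from attention and from the loss/gradient computation.
    """
    v_max = max(x.shape[0] for x in batch)
    t = batch[0].shape[1]
    data = torch.zeros(len(batch), v_max, t)
    mask = torch.zeros(len(batch), v_max, dtype=torch.bool)
    for i, x in enumerate(batch):
        data[i, : x.shape[0]] = x
        mask[i, : x.shape[0]] = True
    return data, mask

# Example: a 3-variate and a 12-variate sample are padded to 12 variates each.
data, mask = collate_variate_padded([torch.randn(3, 240), torch.randn(12, 240)])
print(data.shape, mask.sum(dim=1))  # torch.Size([2, 12, 240]) tensor([ 3, 12])
```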

Comment

(4) Additional baselines and interpretation of the results

We have added TimesNet [16] and iTransformer [22] as baselines for the classification and regression tasks, respectively. Moreover, we thank the reviewer for pointing out the performance differences between the ETT*1 and ETT*2 (both Electricity Transformer Temperature) datasets, which we also noticed during our study. The experiments indicate that predicting ETT*2 is generally easier than predicting ETT*1 across all baselines. For both datasets, we have analysed the distribution shapes and the frequency components. Our findings reveal that ETT*1 exhibits long-tailed distributions and consistently includes large spikes, which may contribute to the increased difficulty in forecasting. Since ETT*1 and ETT*2 were collected from two distinct regions in China [25], the external influences on the two transformers may differ greatly. For instance, one transformer may be positioned near a steam vent or in a sunny spot, making its temperature harder to predict due to the influence of undocumented external signals.

[16] Wu, H. et al. "TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis." International Conference on Learning Representations (ICLR). 2022.

[22] Liu, Y. et al. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting." International Conference on Learning Representations (ICLR). 2023.

[25] Zhou, H. et al. "Informer: Beyond efficient transformer for long sequence time-series forecasting." AAAI Conference on Artificial Intelligence (AAAI). 2021.


(5) Additional ablation study on the composition of the dual masking strategy

The composition of the masking schemes is empirically set to 75% random masking and 25% post-fix masking. We have included an ablation study on the composition of the masking schemes in Appendix G.1.
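For concreteness, a minimal sketch of how such a mixture of masking schemes could be implemented; the 75%/25% split between the two schemes follows the numbers above, the per-sample 75% masking ratio is the one quoted elsewhere in this thread, and the function name is ours.

```python
import torch

def dual_mask(num_tokens: int, mask_ratio: float = 0.75, p_random: float = 0.75):
    """Return a boolean mask (True = masked) for one sample's token sequence.

    With probability p_random the tokens are masked uniformly at random
    (MAE-style); otherwise the last mask_ratio fraction of the sequence is
    masked (post-fix masking), mimicking a forecasting setup.
    """
    n_masked = int(mask_ratio * num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    if torch.rand(1).item() < p_random:
        idx = torch.randperm(num_tokens)[:n_masked]   # random masking
        mask[idx] = True
    else:
        mask[num_tokens - n_masked:] = True           # post-fix masking
    return mask

print(dual_mask(16))
```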


(6) Comparison with training from random initialisation and additional experiments in zero-shot settings

We have reworked the experiments section to include a randomly initialised OTiS that is trained fully supervised. The results confirm the widely reported advantages of pre-training [4][5][6][7][8][9]. Additionally, we have conducted an ablation study to investigate different pre-training strategies for OTiS on EEG event type classification, as detailed in Appendix G.2, which further underscores these findings. Moreover, we have conducted experiments under zero-shot conditions. The zero-shot results in unseen domains, such as EMG, reveal that OTiS outperforms baseline models even without domain-specific training, underscoring the generalisability of its extracted time series features. We have included these observations in Appendix F and reworked the experiments section to present the zero-shot results.

[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2023.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2024.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2024.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.

Comment

Thank you for your extensive evaluation and constructive feedback on our work. We hope the following clarifications and additional experiments adequately address the points raised.


(1) Adaptation of the masked data modelling for time series analysis (e.g. regarding the masking ratio)

We would like to clarify that our contributions include the domain-specific tokenisation, the dual masking strategy, and the normalised cross-correlation loss, all of which are specifically designed for time series analysis. Additionally, we would like to emphasise that masked data modelling (MDM) is a widely adopted pre-training strategy for time series [6][7][8][9][10][12][21][24], primarily because it does not rely on heavy data augmentations, which are difficult to design for sequential data [23]. Time series variates often exhibit high correlations, making higher masking ratios beneficial compared to the imaging modality, as they help eliminate redundancies in the learned representations. In our pre-training, we empirically set the masking ratio to 75%. Prior studies on MDM for time series, such as Ti-MAE [24], have explored the optimal masking ratio for this modality. Their findings suggest that, similar to MDM in imaging, a masking ratio of 75% yields the best downstream performance.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2024.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[12] Dong, J. et al. “SimMTM: A simple pre-training framework for masked time-series modeling.” Advances in Neural Information Processing Systems (NeurIPS). 2024.

[21] Turgut, O. et al. "Unlocking the diagnostic potential of ecg through knowledge transfer from cardiac mri." arXiv preprint arXiv:2308.05764. 2023.

[23] Assran, M. et al. "Self-supervised learning from images with a joint-embedding predictive architecture." Conference on Computer Vision and Pattern Recognition (CVPR). 2023.

[24] Li, Z. et al. "Ti-mae: Self-supervised masked time series autoencoders." arXiv preprint arXiv:2301.08871. 2023.


(2) Can a shared patch projector reflect different semantics among variates and domains? Could using different patch sizes for different frequencies, as in MOIRAI [7], lead to further improvements?

The authors of MOIRAI [7] (2024) presented a subsequent study [26] (2024) in which they eliminate the dependency on multiple projection layers for different frequencies. Instead, they employ a shared projection layer with a unified patch size across all frequencies (i.e. domains). They argue that frequencies are not a reliable indicator of the underlying patterns in time series, and that human-imposed inductive biases may hinder model generalisability. We agree with the authors and believe that projection layers should be viewed as general feature extractors, independent of frequency, variate, or domain. The extracted features serve as a learned vocabulary, which can then be slightly modulated to the domain and variate through specific positional embeddings, as implemented in OTiS.

[26] Liu, X. et al. "Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts." arXiv preprint arXiv:2410.10469. 2024.
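To make this view concrete, a minimal sketch of a tokeniser with one patch projector shared across all domains, whose output is modulated by learnable domain-specific variate embeddings; layer names, sizes, and the domain dictionary are illustrative and not taken from the released OTiS code.

```python
import torch
import torch.nn as nn

class SharedTokenizer(nn.Module):
    """One patch projector shared across all domains; domain-specific variate
    embeddings are added on top of the projected patches (shared temporal
    embeddings are omitted here for brevity)."""

    def __init__(self, patch_size=24, dim=192, domains=None):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size, dim)  # shared feature extractor
        # one learnable embedding per variate, per domain (illustrative sizes)
        self.variate_emb = nn.ParameterDict({
            name: nn.Parameter(torch.randn(num_variates, dim) * 0.02)
            for name, num_variates in (domains or {}).items()
        })

    def forward(self, x, domain):
        # x: (B, V, T) -> non-overlapping patches: (B, V, N, patch_size)
        v = x.shape[1]
        patches = x.unfold(2, self.patch_size, self.patch_size)
        tokens = self.proj(patches)                              # (B, V, N, dim)
        tokens = tokens + self.variate_emb[domain][:v, None, :]  # modulate
        return tokens

tok = SharedTokenizer(domains={"EEG": 62, "Weather": 21})
out = tok(torch.randn(2, 21, 240), domain="Weather")
print(out.shape)  # torch.Size([2, 21, 10, 192])
```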


(3) How can domain-specific variate embeddings capture inter-variate relationships? These learned embeddings may limit generalisability, as they do not translate to unseen variates/domains during inference.

To investigate whether adaptation to unseen domains is required for competitive performance, we have conducted additional experiments under zero-shot conditions, as detailed in Appendix F. The zero-shot results in unseen domains, such as EMG, reveal that OTiS outperforms baseline models even without domain-specific fine-tuning, underscoring the generalisability of its extracted time series features. We have included these observations in Appendix F and reworked the experiments section to present the zero-shot results.

Comment

Thank you for the replies. I have read the rebuttal and the revision, which addressed my concerns regarding the performance and method design. But there are a few unsolved concerns:

Regarding W2: Although the authors provide another work to support it, I would be interested to know if the authors have done some specific empirical evaluations to draw this conclusion.

Regarding W3: I don't think the author answered my question. I agree that learnable embeddings can help the model distinguish the heterogeneity of data in different domains (which may lead to what the rebuttal has mentioned: OTiS can outperform baseline models without domain-specific fine-tuning). However, I still cannot be convinced that the "learnable embeddings can explicitly capture inter-variate relationships" mentioned in this work.

Regarding Q1: I read Appendix G.2 carefully: the authors provide further experiments to show that the pre-trained model can outperform models that are supervised-trained (or self-supervised trained and then fine-tuned) on one dataset. The results are not convincing to me because of (1) the lack of state-of-the-art baseline models and self-supervised training methods, such as PatchTST, and (2) the results do not resolve my concern about OTiS with respect to data scaling. Concretely, consider training OTiS on (1) a set of datasets (not one) related to the target dataset and (2) a domain-universal dataset (which may include the former set and some less related datasets). Is it possible that the first model, pre-trained at a smaller scale, works better?

Comment

Thanks a lot for the constructive interaction! We are happy to hear that our rebuttal addressed most of your concerns. Below, we provide detailed responses to your new comments on a point-by-point basis.


(1) Can a shared patch projector reflect different semantics among variates and domains?

We see projection layers with a unified patch size as general feature extractors, independent of the sampling frequency, variate, or domain. This hypothesis is not derived from empirical evaluation, but based on conceptual considerations that we made prior to our study on time series foundation models, as elaborated in the following.

The sampling frequency refers to the number of observations collected from a continuous signal per unit of time. The choice of sampling frequency depends on the goal of the analysis: some studies require low frequencies (e.g. f = 386 nHz to capture long-term economic trends spanning 60 years within only 728 time points [31]), while others require high frequencies (e.g. f = 44.1 kHz to capture rapid fluctuations in 10-second audio signals, resulting in 441,000 time points [32]). However, all sampling frequencies share the same purpose: to ensure that the information relevant to the analysis is captured within the observation period (i.e., the time series).

Hence, we assume that a model will have access to all of the relevant information captured in a time series, if its context length is sufficiently long. Consequently, the context length, rather than the frequency itself, is the critical factor for model performance. Ideally, the model would analyse the entire time series to ground its prediction. However, especially for high-frequency time series, this is often infeasible with small patch sizes due to the computational complexity of attention-based models (the smaller the patch size, the more tokens need to be analysed for a specific context length).

We hypothesise that adopting different patch sizes for different frequencies may be beneficial, not to reflect different semantics as assumed by the reviewer, but to enable sufficiently long context lengths. This also aligns with the authors of MOIRAI, who opted “for a larger patch size to handle high-frequency data, thereby lower[ing] the burden of the quadratic computation cost of attention while maintaining a long context length” [7].
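A quick back-of-the-envelope helper illustrating the context-length argument; the numbers are examples only.

```python
def num_tokens(duration_s: float, frequency_hz: float, patch_size: int) -> int:
    """Number of patch tokens needed to cover a series of the given duration."""
    return int(duration_s * frequency_hz) // patch_size

# 10 s of 44.1 kHz audio: small patches blow up the token count for attention.
print(num_tokens(10, 44_100, 24))    # 18375 tokens
print(num_tokens(10, 44_100, 240))   # 1837 tokens, a far more manageable context
```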

Based on this reasoning, we agree with the reviewer that it would be interesting to see how different patch sizes, and effectively different context lengths, affect the downstream performance of our model. Therefore, we will include a small empirical study in the final version of our manuscript, analysing the effect of different patch sizes.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[31] McCracken, M. W. et al. "FRED-MD: A monthly database for macroeconomic research." Journal of Business & Economic Statistics. 2016.

[32] Gemmeke, J. F. et al. "Audio set: An ontology and human-labeled dataset for audio events." IEEE international conference on acoustics, speech and signal processing (ICASSP). 2017.


(2) Do domain-specific variate embeddings capture inter-variate relationships?

Our model effectively learns the relationships between variates within a domain, purely from the data it has seen during training, as showcased in Figures 3, 7, 8, 9 of our manuscript.

For example, the principal component analysis (PCA) presented in Figures 3 and 7 demonstrates that EEG-specific variate embeddings accurately capture the spatial arrangement of EEG variates, which correspond to actual electrodes placed on the scalp. In this context, the spatial arrangement represents the inter-variate relationships.

Similarly, the PCA in Figure 8 indicates that ECG-specific variate embeddings correctly capture the spatial arrangement of ECG variates, which partially correspond to actual electrodes placed on the human body (e.g. V1-V6). In this context, the spatial arrangement again denotes the inter-variate relationships.

Finally, the embedding similarity analysis in Figure 9 reveals that Weather-specific embeddings capture the physical relationships among the 21 climatological indicators described in Appendix E.2. In this case, these physical relationships represent the inter-variate relationships.
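For reference, a minimal sketch of the kind of PCA analysis described above, assuming the learned variate embeddings are available as a (V, D) array; the function name and the random stand-in data are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_variate_embeddings(variate_emb: np.ndarray) -> np.ndarray:
    """Project learned variate embeddings (V, D) onto their first two principal
    components, e.g. to compare the layout with the known electrode positions."""
    return PCA(n_components=2).fit_transform(variate_emb)

# Example with random stand-ins for 19 EEG-channel embeddings of dimension 192.
coords_2d = project_variate_embeddings(np.random.randn(19, 192))
print(coords_2d.shape)  # (19, 2)
```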

If the reviewer has alternative interpretations of the term “inter-variate relationship”, we welcome further discussion.

Comment

(3) ctd, Does OTiS pre-trained on domain-specific datasets outperform OTiS pre-trained across domains?

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2024.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[33] Van Dijk, H. et al. "The two decades brainclinics research archive for insights in neurophysiology (TDBrain) database." Scientific data. 2022.

[34] Zheng, W. et al. "Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks." IEEE Transactions on autonomous mental development. 2015.

[35] Obeid, Iyad, and Joseph Picone. "The temple university hospital EEG data corpus." Frontiers in neuroscience. 2016.

[36] Song, Y. et al. "Transformer-based spatial-temporal feature learning for EEG decoding." arXiv preprint arXiv:2106.11170. 2021.

[37] Peh, W. et al. "Transformer convolutional neural networks for automated artifact detection in scalp EEG." IEEE Engineering in Medicine & Biology Society (EMBC). 2022.

[38] Li, H. et al. "Motor imagery EEG classification algorithm based on CNN-LSTM feature fusion network." Biomedical signal processing and control. 2022.

[39] Jing, J. et al. "Development of expert-level classification of seizures and rhythmic and periodic patterns during eeg interpretation." Neurology. 2023.

[40] Yang, C. et al. "Self-supervised electroencephalogram representation learning for automatic sleep staging: model development and evaluation study." JMIR AI. 2023.

[41] Buckwalter, G. et al. "Recent advances in the TUH EEG corpus: improving the interrater agreement for artifacts and epileptiform events." IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 2021.

[42] Veloso, L. et al. "Big data resources for EEGs: Enabling deep learning research." IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 2017.

[43] Shah, V. et al. "The temple university hospital seizure detection corpus." Frontiers in neuroinformatics. 2018.

[44] Von Weltin, E. et al. "Electroencephalographic slowing: A primary source of error in automatic seizure detection." IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 2017.

Comment

Dear Reviewer tXLU,

Thanks again for engaging in a discussion.

As you acknowledged in your response, our rebuttal has addressed your main concerns regarding performance and methodology. Additionally, we have worked to address your remaining concerns, by (1) providing a chain of thought on shared patch projectors, (2) elaborating on the terminology behind inter-variate relationships, and (3) clarifying the effectiveness of pre-training strategies, including the introduction of PatchTST [10] as a new baseline.

We hope these efforts adequately address the remaining points you raised. If you believe so, we would greatly appreciate a final adjustment of your scores to reflect this.

Authors


[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

Comment

(3) ctd, Does OTiS pre-trained on domain-specific datasets outperform OTiS pre-trained across domains?

We have updated Table 12 accordingly with the results of PatchTST [10], as outlined in the following.

| Methods | Parameters | Balanced ACC ⬆️ | Cohen’s Kappa ⬆️ | Weighted F1 ⬆️ |
| --- | --- | --- | --- | --- |
| ST-Transformer [36] | 3.5M | 0.3984 ± 0.0228 | 0.3765 ± 0.0306 | 0.6823 ± 0.0190 |
| CNN-Transformer [37] | 3.2M | 0.4087 ± 0.0161 | 0.3815 ± 0.0134 | 0.6854 ± 0.0293 |
| FFCL [38] | 2.4M | 0.3979 ± 0.0104 | 0.3732 ± 0.0188 | 0.6783 ± 0.0120 |
| SPaRCNet [39] | 0.79M | 0.4161 ± 0.0262 | 0.4233 ± 0.0181 | 0.7024 ± 0.0104 |
| ContraWR [40] | 1.6M | 0.4384 ± 0.0349 | 0.3912 ± 0.0237 | 0.6893 ± 0.0136 |
| PatchTST [10] | 3.3M | 0.4677 ± 0.0243 | 0.5051 ± 0.0169 | 0.7526 ± 0.0203 |
| BIOT [8] | 3.2M | 0.5281 ± 0.0225 | 0.5273 ± 0.0249 | 0.7492 ± 0.0082 |
| LaBraM [9] | 369M | 0.6616 ± 0.0170 | 0.6745 ± 0.0195 | 0.8329 ± 0.0086 |
| OTiS-Base$_\text{w/o pre-training}$* | 8M | 0.5361 ± 0.0350 | 0.5183 ± 0.0316 | 0.7642 ± 0.0157 |
| OTiS-Base$_\text{EEG}$$^\dagger$ | 8M | 0.5562 ± 0.0106 | 0.5504 ± 0.0204 | 0.7784 ± 0.0095 |
| OTiS-Base | 8M | 0.5743 ± 0.0257 | 0.5913 ± 0.0146 | 0.8004 ± 0.0071 |

* Model was randomly initialized and trained fully supervised.

$^\dagger$ Model was pre-trained only with the EEG data of our pre-training corpus (i.e. TDBrain [33] and SEED [34]).

The experiments reveal that general models (i.e., OTiS-Base$_\text{EEG}$), trained on a smaller scale, do not perform better than foundation models (i.e., OTiS-Base and LaBraM), trained on a large scale.

We have revised Appendix G.2 to more carefully highlight these observations and hope that our discussion and clarifications adequately address the points you raised. If there are any remaining questions or concerns, we are happy to discuss them further.

Comment

(3) Does OTiS pre-trained on domain-specific datasets outperform OTiS pre-trained across domains?

Thank you for pointing this out; we believe there may be a misunderstanding here. Note that OTiS-Base$_\text{EEG}$ in Table 12 refers to OTiS pre-trained on TDBrain [33] and SEED [34], i.e. a set of two EEG datasets related to the target TUEV dataset [35]. In contrast, OTiS-Base refers to OTiS pre-trained on the full pre-training corpus detailed in Table 1 of our manuscript.

The additional experiments on the TUEV [35] data provided in Appendix G.2 show that both OTiS-Base$_\text{EEG}$ and OTiS-Base outperform i) domain-specific baselines (either fully supervised or pre-trained and fine-tuned on the target dataset, i.e. one dataset), ii) general baselines (pre-trained on a few external source datasets and fine-tuned on the target dataset), and even iii) foundation models (pre-trained on multiple external source datasets and fine-tuned on the target dataset). The domain-specific baselines include ST-Transformer [36], CNN-Transformer [37], FFCL [38], and SPaRCNet [39]. The general methods include ContraWR [40]. The foundation methods include BIOT [8] and LaBraM [9]. We would like to clarify that, contrary to the reviewer's statement, these latter models represent state-of-the-art baselines trained using self-supervised learning. Additionally, we have introduced PatchTST [10] as a new baseline. We have reworked Appendix G.2 to summarise the baselines, similar to the following table.

| Model | Pre-training Method | Pre-training Dataset | Domain Adaptation | Architecture |
| --- | --- | --- | --- | --- |
| ST-Transformer [36] | - | Target | Fine-tuning | Transformer |
| CNN-Transformer [37] | - | Target | Fine-tuning | CNN and Transformer |
| FFCL [38] | - | Target | Fine-tuning | CNN and LSTM |
| SPaRCNet [39] | - | Target | Fine-tuning | 1D-CNN |
| ContraWR [40] | CL | Target | Fine-tuning | Transformer |
| PatchTST [10] | MDM | Target | Fine-tuning | Transformer |
| BIOT [8] | MDM | * | Fine-tuning | Transformer |
| LaBraM [9] | MDM | $^+$ | Fine-tuning | Transformer |

* Pre-trained on 6 EEG datasets (totalling 13,000 recording hours), including the target dataset (i.e. TUEV [35])

$^+$ Pre-trained on 16 EEG datasets (totalling 2,500 recording hours), including TUAR [41], TUEP [42], TUSZ [43], and TUSL [44], which are subsets of the TUH corpus [35]. As TUEV [35] is also a subset of TUH [35], there may be potential information leakage through overlapping subjects between subsets.

Review
Rating: 5

This paper presents OTiS for multi-domain time series analysis, building on existing pre-training paradigms for time series. It allocates domain-specific variable embeddings to distinguish the heterogeneity of different variables across domains and enhances the model's ability to learn temporal causal relationships through a dual-masking strategy. Additionally, it introduces NCC loss to capture global patterns. Experimental results demonstrate that the proposed method achieves competitive performance in time series classification, regression, and forecasting tasks across multiple domains compared to SOTA methods. Visualization results further highlight the effectiveness and interpretability of the domain-specific variable embeddings.

Strengths

  1. The paper is well-written, and the method is easy to understand. The authors clearly articulate how they consider the heterogeneity of different domain time series to achieve multi-domain time series forecasting.

  2. This paper focuses on the problem of multi-domain time series analysis, which is crucial for building generalizable foundational models for time series.

  3. The experimental section utilizes a large amount of data, and the model is open-source, contributing particular engineering value to the community.

Weaknesses

  1. The paper mentions that one of the challenges of cross-domain time series models is the significant differences in temporal dynamics and sampling frequencies among different domains. However, the paper uses the same patch size for all domains when dividing patches, failing to accommodate the unique sampling rates of different domains. This oversight means the paper does not sufficiently consider the differences in sampling rates across domains. Additionally, using a shared patch projector to encode the temporal dynamics within each patch does not adequately address the differences in temporal dynamics between domains. While this approach may be common in previous works, it does not consider the temporal heterogeneity among domains.

  2. The method of considering variable heterogeneity through learned variable embeddings is not uncommon. In spatiotemporal prediction, some methods [2][3] have already employed learnable embeddings to explicitly distinguish heterogeneous spatiotemporal patterns by learning time-specific and space-specific parameter spaces.

  3. [1] proposed using textual descriptions to label different time series domains for cross-domain time series forecasting, utilizing a channel-independent strategy. In contrast, the domain-specific variable embeddings in this paper correspond to a channel-mixing strategy. I look forward to seeing a comparison between these two strategies in cross-domain time series.

  4. The experimental section lacks details about the baselines. How were these methods selected? Were they pre-trained and fine-tuned? If so, what data was used for pre-training and fine-tuning?

  5. How does the performance of the proposed method compare to conventional time series classification or forecasting methods trained on a single specific dataset?

[1] UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting, WWW, 2024

[2] Heterogeneity-Informed Meta-Parameter Learning for Spatiotemporal Time Series Forecasting, KDD, 2024

[3] Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting, NeurIPS, 2020

Questions

See the weaknesses

Comment

(4) Clarification of the baseline models

We have reworked the experiments section to clarify the categorisation of the baselines. Additionally, we have included a summary of all baselines, detailing their architectures, pre-training strategies, and domain adaptation techniques, in Appendix B.


(5) Comparison with traditional baselines trained on a single, specific dataset

In extensive benchmarking, we compare our model against multiple domain-specific baselines that are either i) fully supervised or ii) pre-trained and fine-tuned exclusively on the target dataset. These include N-BEATS [15], TimesNet [16], Autoformer [20], DLinear [18], MAE [21], ViT [21], iTransformer [22], CM-AE [19], MMCL [21], and PatchTST [10]. These baselines span all key use cases in time series analysis, providing a comprehensive comparison. The experiments show that OTiS outperforms such domain-specific approaches in 10 out of 15 benchmarks, with inferior performance in only 2 out of 15 benchmarks. We have conducted an additional ablation study to investigate different pre-training strategies for OTiS, as detailed in Appendix G.2, which further underscores the widely reported advantages of general pre-training across domains [4][5][6][7][8][9].

[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2023.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2024.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2024.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[15] Oreshkin, B. et al. "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting." International Conference on Learning Representations (ICLR). 2019.

[16] Wu, H. et al. "TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis." International Conference on Learning Representations (ICLR). 2022.

[18] Zeng, A. et al. "Are transformers effective for time series forecasting?" AAAI Conference on Artificial Intelligence (AAAI). 2023.

[21] Turgut, O. et al. "Unlocking the diagnostic potential of ecg through knowledge transfer from cardiac mri." arXiv preprint arXiv:2308.05764. 2023.

[22] Liu, Y. et al. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting." International Conference on Learning Representations (ICLR). 2023.

Comment

Thank you for your engaging comments and the thorough evaluation. We hope the following clarifications and additional experiments adequately address the points raised.


(1) A patch projector with a unified patch size, shared across all domains, fails to account for unique sampling frequencies and does not adequately address differences in temporal dynamics

Recent state-of-the-art foundation models, such as MOIRAI [7] (2024), introduce multiple projection layers with different patch sizes to handle distinct frequencies. However, the authors of MOIRAI presented a subsequent study [26] (2024) in which they eliminate the dependency on multiple projection layers for different frequencies. Instead, they employ a shared projection layer with a unified patch size across all frequencies (i.e. domains). They argue that frequencies are not a reliable indicator of the underlying patterns in time series, and that human-imposed inductive biases may hinder model generalisability. We agree with the authors and believe that projection layers should be viewed as general feature extractors, independent of frequency, variate, or domain. The extracted features serve as a learned vocabulary, which can then be slightly modulated to the domain and variate through specific positional embeddings, as implemented in OTiS.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[26] Liu, X. et al. "Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts." arXiv preprint arXiv:2410.10469. 2024.


(2) Considering variate heterogeneity through learned embeddings is not uncommon

Prior to the stated works [2][3], learnable 2D positional embeddings were extensively studied by Dosovitskiy et al. [29], and we do not claim this as a novel aspect of our study. Instead, we introduce a unique approach by employing non-learnable temporal embeddings (shared across domains) and learnable variate embeddings (specific to each domain). This special composition of the “2D” positional embeddings represents a novel contribution of our study, which to the best of our knowledge has not been explored in time series analysis before.

[2] Heterogeneity-Informed Meta-Parameter Learning for Spatiotemporal Time Series Forecasting, KDD, 2024

[3] Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting, NeurIPS, 2020

[29] Dosovitskiy, A. et al. "An image is worth 16x16 words: Transformers for image recognition at scale." Advances in Neural Information Processing Systems (NeurIPS). 2020.
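As a sketch of this composition (names and sizes are illustrative, not the exact OTiS implementation): a fixed sinusoidal temporal embedding shared across domains is added to a learnable variate embedding that would be instantiated once per domain.

```python
import torch

def sinusoidal_embedding(num_positions: int, dim: int) -> torch.Tensor:
    """Standard non-learnable sinusoidal embedding, shared across domains."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-torch.log(torch.tensor(10000.0)) / dim))
    emb = torch.zeros(num_positions, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

# Fixed temporal embeddings (shared) plus learnable, domain-specific variate
# embeddings (here freshly initialised) give a "2D" positional embedding.
num_patches, num_variates, dim = 10, 21, 192
temporal = sinusoidal_embedding(num_patches, dim)                    # frozen
variate = torch.nn.Parameter(torch.randn(num_variates, dim) * 0.02)  # learned
pos_2d = temporal[None, :, :] + variate[:, None, :]                  # (V, N, dim)
print(pos_2d.shape)  # torch.Size([21, 10, 192])
```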


(3) Comparison with channel-independent baselines

We have conducted extensive benchmarking to compare our model against several state-of-the-art baselines employing channel-independent strategies, similar to the proposed UniTime [1]. These include N-BEATS [15], TimesNet [16], TF-C [17], DLinear [18], PatchTST [10], CM-AE [19], Time-LLM [4], GPT4TS [5], and MOMENT [6]. Covering all key use cases in time series analysis, such as classification, regression, and forecasting, these baselines provide a comprehensive comparison. Our experiments reveal that OTiS outperforms such channel-independent approaches in 10 out of 15 benchmarks, with inferior performance in only 2 out of 15 benchmarks. These results validate the effectiveness of OTiS’ channel-mixing strategy.

[1] UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting, WWW, 2024

[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2023.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2024.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[15] Oreshkin, B. et al. "N-BEATS: Neural basis expansion analysis for interpretable time series forecasting." International Conference on Learning Representations (ICLR). 2019.

[16] Wu, H. et al. "TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis." International Conference on Learning Representations (ICLR). 2022.

[17] Zhang, X. et al. "Self-supervised contrastive pre-training for time series via time-frequency consistency." Advances in Neural Information Processing Systems (NeurIPS). 2022.

[18] Zeng, A. et al. "Are transformers effective for time series forecasting?" AAAI Conference on Artificial Intelligence (AAAI). 2023.

[19] Radhakrishnan, A. et al. "Cross-modal autoencoder framework learns holistic representations of cardiovascular state." Nature Communications. 2023.

Comment

Thank you for the response. I still have some concerns. The response mentions that the innovative contribution of the paper lies in the special combination of using non-learnable domain-shared embeddings in the time dimension and domain-specific learnable variable embeddings in the variable dimension. However, it seems that there is a lack of empirical evaluation of this design, so it remains unclear whether this approach is truly effective. Furthermore, the revised version still appears to lack a comparison with UniTS, which is a recent method for cross-domain time series modeling that distinguishes different domains through text representations. A comparison with UniTS would help verify the effectiveness of using domain-specific variable embeddings in this paper.

Comment

Thanks a lot for the quick response and the suggestion! We would like to clarify that the suggested UniTS [30] does not utilise any text representations. Instead, it employs learnable domain-agnostic variate embeddings (i.e., learnable embeddings shared across all domains) to implicitly accommodate distinct domains.

To empirically evaluate the effectiveness of our learnable domain-specific variate embeddings, we have conducted an ablation study, as described in the experiments section. In this study, we replaced the learnable domain-specific embeddings with learnable domain-agnostic variate embeddings, similar to the approach in UniTS [30]. The results, presented in Figure 5, highlight two key advantages of domain-specific variate embeddings: (i) enhanced robustness, evidenced by a smaller interquartile range, and (ii) improved downstream performance.

Additionally, in extensive benchmarking experiments we compare our model against several state-of-the-art baselines, including those that leverage text representations to distinguish between domains. For instance, in Time-LLM [4], the authors encode explicit descriptions of both the dataset and the domain as text representations, which are then used with the time series representations to perform forecasting tasks. Our experiments demonstrate that OTiS outperforms such approaches in 4 out of 6 forecasting benchmarks, effectively validating the utility of learnable domain-specific variate embeddings.

We have updated the related works section to include a discussion of UniTS [30] and hope that these experiments and comparisons adequately address the point you raised. If there are any remaining questions or concerns, we would be happy to discuss them further.


[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2023.

[30] Gao, S. et al. "UniTS: A unified multi-task time series model." Advances in Neural Information Processing Systems (NeurIPS). 2024.

Comment

Dear Reviewer yUJh,

Thank you once again for engaging in a discussion.

In your previous response, you raised the concern of whether the domain-specific variate embeddings used in our study are effective in distinguishing different domains. You suggested comparing our model against the UniTS [30] model, which uses domain-agnostic embeddings, and against baselines that use textual representations to differentiate between domains. To address this concern, we have (1) conducted an ablation study to investigate domain-agnostic embeddings, similar to UniTS [30], and (2) compared our model against the Time-LLM [4] model, which uses textual representations for domain differentiation. These experiments demonstrate that our domain-specific variate embeddings are most effective in distinguishing different domains, yielding robust improvements in downstream performance.

We hope these efforts adequately address the remaining point you raised. If you believe so, we would greatly appreciate a final adjustment of your scores to reflect this.

Authors


[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2023.

[30] Gao, S. et al. "UniTS: A unified multi-task time series model." Advances in Neural Information Processing Systems (NeurIPS). 2024.

Review
Rating: 5

The authors spot the important fact that the variate structure is heterogeneous across domains and that this structure may represent more complex relationships. Thus, they propose a time series pre-training pipeline called OTiS. OTiS is composed of a specially designed tokenizer that adds domain-specific signatures to the time series and a novel loss for pre-training.

优点

  • The paper is well-written and easy to follow.

  • Authors spot the important fact that the variate structure is heterogeneous across domains and this structure may represent more complex relationships.

  • The visualization for variate embedding seems to be interesting and insightful.

  • A substantial portion of this research focuses on EEG signals, which presents a novel and promising approach. The authors introduce an innovative method to model a "specific set of systems" that, despite being observed differently—such as TDBrain and SEED with 19 channels versus LEMON with 62 channels—remain comparable.

Weaknesses

  • As noted in the strengths, this work addresses the challenge of generalizing across datasets that contain time series of similar systems but are recorded differently, such as variations in sampling rates and physical values. However, the claims regarding cross-domain generalization may be overstated.

  • From the perspective of generalized time series analysis, the primary contribution of variate-specific embedding may not be effective in other systems where the interrelationships between variates are not as straightforward as their spatial arrangement (e.g., the electrodes in EEG as depicted in Figure 3 of the manuscript). In different physical systems, two variates may exhibit complex computational relationships (e.g., voltage and current as described by Ohm's Law), complicating the direct modeling of variates as embeddings.

Questions

  • How does the domain-specific tokenizer adapt to unseen domains with distinct variate structures?

  • Additionally, how does the domain-specific tokenizer generalize across different systems within the same domain? For instance, while both electrical transformers and power generators belong to the "energy" domain, they exhibit differing properties and produce distinct time series readings. How does the sub-domain adaptation discussed in Section 3.1 address this scenario?

  • A broader question, not specific to this paper: At what level of granularity should we define the domain?

Ethics Concern Details

The authors use a GitHub link to share the code, leading to potential personal information leakage. This may require further investigation.

Comment

Thank you for your thoughtful comments on our work. We hope the following clarifications and additional analyses adequately address the points raised.


(1) Claims regarding the generalisation across domains may be overstated

See (Q1)


(2) Analysis of variate embeddings in domains with complex relationships (i.e. where variates are not as straightforward as their spatial arrangement)

We have analysed further domain-specific variate embeddings in Appendix E.2, showcasing that OTiS is capable of capturing complex inter-variate relationships. We acknowledge the reviewer's concern that high correlations between spatially proximate variates (e.g. EEG electrodes) might facilitate learning these relationships. The reviewer has suggested investigating the relationship between voltage $U$ [V] and current $I$ [A], described by $U = R \cdot I$, where $R$ [Ohm] denotes resistance. However, the linear relationship between these two variates may represent a trivial case, while scenarios involving more complex (i.e. non-linear) inter-variate relationships would offer deeper insight into OTiS’ modelling capabilities. To this end, the Weather dataset [14] provides a more suitable test case, spanning diverse climatological categories such as temperature, humidity, wind, radiation, pressure, and precipitation, which exhibit non-linear relationships. As detailed in Appendix E.2, our exploration of Weather-specific variate embeddings learned during fine-tuning demonstrates that OTiS effectively models such complex relationships.

[14] Max Planck Institute for Biogeochemistry. “Weather station.” 2024. https://www.bgc-jena.mpg.de/wetter/.
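For illustration, a minimal sketch of such an embedding-similarity analysis (pairwise cosine similarity between learned variate embeddings); the variable names and random stand-in data are ours.

```python
import torch
import torch.nn.functional as F

def variate_similarity(variate_emb: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity (V, V) between learned variate embeddings (V, D),
    used to inspect whether related indicators end up close in embedding space."""
    normed = F.normalize(variate_emb, dim=-1)
    return normed @ normed.T

# Example with random stand-ins for 21 Weather indicators, dimension 192.
sim = variate_similarity(torch.randn(21, 192))
print(sim.shape)  # torch.Size([21, 21])
```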


(Q1) How does the domain-specific tokeniser adapt to unseen domains?

Let $S$ denote a previously unseen domain with $V_S$ variates and let $D$ denote the embedding dimension of our model. We randomly initialise variate embeddings $E_S^V \in \mathbb{R}^{V_S \times D}$ and fine-tune them along with the encoder and, if required, the decoder, for the specific application in $S$. To investigate whether adaptation to unseen domains is even necessary for competitive performance, we have conducted additional experiments under zero-shot conditions, as detailed in Appendix F. The zero-shot results in unseen domains, such as EMG, reveal that OTiS outperforms baseline models even without domain-specific fine-tuning, underscoring the generalisability of its extracted time series features. We have included these observations in Appendix F and reworked the experiments section to present the zero-shot results.
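A minimal sketch of this adaptation step, assuming a pre-trained encoder is available; the helper name, the EMG example, and the placeholder encoder are illustrative, not taken from the OTiS codebase.

```python
import torch
import torch.nn as nn

def add_unseen_domain(encoder: nn.Module, num_variates: int, dim: int,
                      freeze_encoder: bool = False) -> nn.Parameter:
    """Create randomly initialised variate embeddings E_S^V in R^{V_S x D}
    for an unseen domain S; optionally freeze the pre-trained encoder so that
    only the new embeddings (and a task head) are fine-tuned."""
    new_emb = nn.Parameter(torch.randn(num_variates, dim) * 0.02)
    if freeze_encoder:
        for p in encoder.parameters():
            p.requires_grad = False
    return new_emb

# Example: a 4-variate EMG-like domain, embedding dimension 192, frozen encoder
# (the encoder here is a generic stand-in for the pre-trained model).
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(192, 4, batch_first=True), 2)
emg_emb = add_unseen_domain(encoder, num_variates=4, dim=192, freeze_encoder=True)
optimizer = torch.optim.AdamW([emg_emb], lr=1e-3)  # only ~0.8k trainable parameters
```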


(Q2) How does the domain-specific tokeniser generalise across different systems within the same domain (e.g. electrical transformers and power generators in an imaginary "Energy" domain)?

Similar to how we separate the EEG and ECG domains in our pre-training corpus, rather than combining them under a broader “Medicine” domain, one could similarly define distinct domains for electrical transformers and power generators. See (Q3) for a more detailed discussion on the definition of domains.


(Q3) At what level of granularity should domains be defined?

In general, the level of granularity at which to define a domain depends on the underlying characteristics of the data. We believe that a domain should be defined at a level where the data shares meaningful patterns, particularly with respect to inter-variate relationships and temporal dynamics. For example, we define the NN5 dataset [27] (daily cash withdrawals) as the ‘Banking’ domain and the FRED-MD dataset [28] (macro-economic indicators) as the ‘Economics’ domain, even though both could broadly fall under a ‘Finance’ domain. However, the Banking domain is characterised by high periodicity and little long-term trend, whereas the Economics domain exhibits the opposite. The key is to balance between a definition that is too broad, which may obscure important patterns, and one that is too narrow, which may limit generalisation. As discussed in our limitations section, automated pipelines that leverage embedding similarities to compare datasets could aid in defining domains, reducing reliance on human-imposed inductive biases.

[27] Taieb, S. et al. "A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition." Expert systems with applications. 2012.

[28] McCracken, M. W. et al. "FRED-MD: A monthly database for macroeconomic research." Journal of Business & Economic Statistics. 2016.


(Ethical concerns) Authors use a GitHub link to share the code, which could lead to personal information leakage and may require further investigation.

Regarding the ethics concerns, we have carefully set up the GitHub repository upon submission, excluding any identifying metadata, commit histories, or personal information. We thus strictly adhere to anonymity guidelines while maintaining reproducibility.

Comment

Sorry for the late reply and welcome back to the party.

I appreciate the authors' effort to address my concerns, and most of them are adequately addressed. After checking Appendix E.2 and Appendix F, I now think this is a work with insights about extracting correlations across channels. Thus, I will bump up my score.

I still have concerns about whether the large-scale pre-training paradigm for time series forecasting works as some researchers claim, since the model may just be memorizing patterns, and some "foundation models" even fail to predict a sine wave correctly, e.g. MOIRAI. However, this may be beyond the scope of the authors' research. So, I will lower my confidence to scale down the weight of my score.

Comment

Thanks a lot for your response, it is great to have you back!

Inspired by your comment, we have prepared a small experiment, which you can find in the README of our anonymous GitHub repository (https://github.com/OTiS-official/OTiS).

Spoiler alert: Contrary to the assumption that our model only learns correlations across variates, new experiments reveal that OTiS also captures the inherent patterns in time series, which generalise well to unseen data.

We conducted novel forecasting experiments on uni-variate sine waves with distinct frequencies, ranging from 2 Hz to 100 Hz. In this uni-variate setting, we ensure that our model does not leverage correlations from other variates. We employed minimal training for these experiments: we freeze the pre-trained OTiS and train only the randomly initialised domain-specific variate embedding (a single embedding for uni-variate sine waves, totalling less than 0.2k trainable parameters). We train solely on uni-variate 50 Hz sine waves. Then, during inference, we perform zero-shot forecasting on unseen uni-variate sine waves with frequencies including 2 Hz, 28 Hz, 60 Hz, and 100 Hz, using the sine-specific variate embedding learned on the 50 Hz sine waves. The results reveal that OTiS is not only capable of capturing inter-variate relationships (i.e. correlations across variates, as described in Appendix E.2), but also of capturing temporal dynamics and patterns of time series, which generalise to unseen data. We have updated our manuscript to include these new findings in Appendix E.
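For reference, a minimal sketch of how the sine-wave data for such an experiment could be generated; the function name and sampling settings are assumptions, and the OTiS-specific training step is only indicated in comments.

```python
import numpy as np

def make_sine(freq_hz: float, duration_s: float = 1.0,
              sampling_hz: float = 1000.0, phase: float = 0.0) -> np.ndarray:
    """Uni-variate sine wave of a given frequency, shape (1, T)."""
    t = np.arange(0, duration_s, 1.0 / sampling_hz)
    return np.sin(2 * np.pi * freq_hz * t + phase)[None, :]

# Training data: 50 Hz sine waves with varying phases.
train_waves = [make_sine(50.0, phase=p) for p in np.linspace(0, 2 * np.pi, 64)]
# Zero-shot targets: unseen frequencies forecast with the 50 Hz embedding.
test_waves = {f: make_sine(f) for f in (2.0, 28.0, 60.0, 100.0)}

# Training idea (in comments only): freeze all pre-trained OTiS weights and
# optimise just the randomly initialised sine-specific variate embedding
# (< 0.2k parameters) on train_waves, then forecast the waves in test_waves.
```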

Thanks again for the very useful hint regarding the sine wave experiments. Your input definitely contributed to a deeper understanding of our model’s capabilities!

Review
Rating: 5

This paper proposes a time series model architecture and a pre-training objective. The key idea is that their architecture acknowledges that training and testing time series may have different sampling rates and variables. The authors propose a straightforward tokenization scheme to learn embeddings for different variables, which can get added onto regular patch and temporal embeddings, thereby conditioning the predictions on the measured variables. They then pre-train their model on a collection of existing datasets, and evaluate its performance by finetuning on new datasets for some forecasting, regression, and classification datasets. They find that finetuning their model on new datasets can outperform other recent methods.

Strengths

This paper has many strengths:

  • The key idea to condition the model on different variables and domains is good. Indeed many related works effectively ignore this information.
  • The paper is overall written quite well and arguments are presented clearly.
  • The experiments investigate multiple axes, including ablation of their method and different dataset and model sizes, and visualizations of the embeddings.
  • The public data and model weights will help the community build on this work.

Weaknesses

This paper has weaknesses to address:

  • The major weakness of this paper is the extremely limited experiments section. There are many experiments, yet almost no explanation of how they're run or interpretation of the results. Most of the results are written like an advertisement, mostly just stating the method outperforms others. This leaves the reader unclear why the performance gains happen. Ultimately it's not clear when/why the findings would generalize. The result is that some claims appear to be quite overstated. For example, L423-L424 states "embeddings of domains with shared high-level semantics cluster together, as depicted in Appendix E.1. For example, embeddings of mono and stereo audio group closely, as do those of banking and economics." But this is cherry-picked---Temperature is way closer to Mono and Stereo Audio than Banking is to Economics.
  • Similarly, many important experimental details are missing or relegated to the Appendix, and the Appendix also includes almost no explanations or interpretations. For example, the PCA experiments in Figures 3, 7, and 8 aren't explained.
  • It's unclear how many variables actually overlap between training/testing, which seems to be a key element to make the model outperform others. Yet this isn't analyzed. Showing that others fail by ignoring other variables should be a key element of the experiments.

Questions

Please feel free to address any misunderstandings I've stated in the weaknesses. Answers to the following questions would help me better calibrate my score:

  1. How long does it take to finetune on new tasks?
  2. How does the finetuned model perform compared to task-specific models? Are these testing datasets really good cases that need pre-trained models?
  3. How do you get the ground truth embeddings in Figure 3?
  4. Is there any intuition around what information could be shared across such different domains to make pre-training on them useful?
Comment

(Q4) Is there any intuition behind what information could be shared across such diverse domains to make pre-training on them useful?

For OTiS, we employ a shared projection layer across all frequencies, variates, and domains, as we view this layer as a general feature extractor. We believe that low-level logic in time series, e.g. periodicity (or more simply, the pattern that a “low” is often followed by a “high”), can be learned across domains. We have analysed the time series of our diverse pre-training corpus at the scale of a single patch (size 24) and found that, visually, they are indistinguishable at this scale: they all exhibit periodicity, regardless of the domain. We hypothesise that OTiS effectively captures such patterns across domains, which can then be leveraged during fine-tuning. This is particularly beneficial for domains with limited data, where the available data is often insufficient for learning such patterns with a randomly initialised model.
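As a purely illustrative, self-contained toy (synthetic data, not our pre-training corpus), this intuition can be probed by slicing series from very different processes into z-scored patches of length 24 and comparing nearest-neighbour distances within and across these mock "domains":

```python
# Toy check: are z-scored length-24 patches from different processes hard to tell apart?
import numpy as np

rng = np.random.default_rng(0)
P, T = 24, 2400

def z_patches(x, p=P):
    """Non-overlapping patches of length p, z-scored per patch."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // p * p].reshape(-1, p)
    return (x - x.mean(1, keepdims=True)) / (x.std(1, keepdims=True) + 1e-8)

ar1 = np.zeros(T)
for t in range(1, T):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

domains = {
    "sine":        np.sin(2 * np.pi * 5 * np.linspace(0, 10, T)),
    "random_walk": np.cumsum(rng.normal(size=T)),
    "ar1":         ar1,
}

def mean_nn_dist(a, b):
    """Mean Euclidean distance from each patch in a to its nearest patch in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean()

for name_a, xa in domains.items():
    pa = z_patches(xa)
    half = len(pa) // 2
    print(f"{name_a:>11} | within: {mean_nn_dist(pa[:half], pa[half:]):.2f} | "
          + " ".join(f"vs {nb}: {mean_nn_dist(pa[:half], z_patches(xb)):.2f}"
                     for nb, xb in domains.items() if nb != name_a))
```

If cross-domain distances turn out comparable to within-domain ones, the patches carry little domain-identifying information at this scale, which is the intuition behind sharing a single projection layer.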

Comment

(Q1) How long does it take to fine-tune on new tasks?

We fine-tune our model on all tasks using a single NVIDIA RTX A6000-48GB GPU and 32 CPUs. With this setup, the training times for the three example tasks are as follows.

Task                      | # Steps | Training Time (s)
Classification (Epilepsy) | 150     | 90
Regression (LVSV)         | 3350    | 8400
Forecasting (ETTm2)       | 1000    | 600

(Q2) How does the fine-tuned model perform compared to task-specific models? Are the test datasets representative of cases that require pre-trained models?

We evaluate our model against current state-of-the-art baselines on the established benchmark datasets in time series analysis. Our experiments confirm the widely reported advantages of pre-training on these very benchmarks [4][5][6][7][8][9][10][11][12], demonstrating that (i) general models (pre-trained on external source data and fine-tuned on the target data) and (ii) foundational models (pre-trained on large corpora and fine-tuned on the target data) outperform (iii) domain-specific models (either fully supervised or pre-trained and fine-tuned exclusively on the target data). The baselines in our experiments include 11 domain-specific models, 6 general models, and 4 foundation models. We have conducted an additional ablation study (a further 4 domain-specific models, 1 general model, and 2 foundation models) to investigate different pre-training strategies for OTiS, as detailed in Appendix G.2, which further supports these findings. In conclusion, we have reworked the experiments section to provide details on the baselines and to highlight the advantages of pre-training.

[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2024.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2023.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2023.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.

[10] Nie, Y. et al. “A time series is worth 64 words: Long-term forecasting with transformers.” International Conference on Learning Representations (ICLR). 2023.

[11] Zhang, X. et al. “Self-supervised contrastive pre-training for time series via time-frequency consistency.” Advances in Neural Information Processing Systems (NeurIPS). 2022.

[12] Dong, J. et al. “SimMTM: A simple pre-training framework for masked time-series modeling.” Advances in Neural Information Processing Systems (NeurIPS). 2023.


(Q3) How is the ground truth obtained in Figure 3?

We have added a detailed explanation of the alignment of the learned EEG-specific variate embeddings with the true electrode layout to Appendix E.2. Note that the electrode placement of all EEG datasets used in our study follows the international 10-20 system for EEG recordings [13]. However, we would like to clarify that the 3D electrode coordinates of the 10-20 EEG system are not used for training. Instead, our model implicitly learns to model the spatial structure solely from the EEG recordings seen during training. The term “ground truth” could thus be misleading, which is why we regard the electrode coordinates only as reference points. To determine how well the learned EEG-specific variate embeddings reflect the true electrode layout of the 10-20 EEG system, we perform the following steps. Assume the 3D electrode coordinates of the 10-20 EEG system are defined in the Euclidean space $\mathbb{E}_Y^3$. We first project the EEG-specific variate embeddings into the Euclidean space $\mathbb{E}_X^3$, then align them with the 3D electrode coordinates of the 10-20 EEG system in $\mathbb{E}_Y^3$ through multivariate linear regression, and finally quantify their agreement via the coefficient of determination $R^2$.

[13] Homan, R. et al. "Cerebral location of international 10–20 system electrode placement." Electroencephalography and clinical neurophysiology. 1987.
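For illustration, a hedged code sketch of this alignment procedure is given below. The embedding and coordinate arrays are random placeholders for the learned EEG-specific variate embeddings and the 10-20 reference coordinates, and the 3D projection is assumed here to be done via PCA.

```python
# Hedged sketch of aligning learned variate embeddings with reference electrode coordinates.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
C, d = 19, 192                                  # e.g. 19 channels of the 10-20 system
variate_embeddings = rng.normal(size=(C, d))    # placeholder for the learned embeddings
electrode_xyz = rng.normal(size=(C, 3))         # placeholder for the 10-20 coordinates

# 1) Project the learned embeddings into a 3D Euclidean space E_X^3 via PCA.
emb_3d = PCA(n_components=3).fit_transform(variate_embeddings)

# 2) Align E_X^3 with the reference coordinates in E_Y^3 through multivariate
#    linear regression (an affine map fitted jointly for x, y, z).
reg = LinearRegression().fit(emb_3d, electrode_xyz)
aligned = reg.predict(emb_3d)

# 3) Quantify the agreement with the coefficient of determination R^2.
print(f"R^2 between aligned embeddings and electrode layout: "
      f"{r2_score(electrode_xyz, aligned):.3f}")
```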

Comment

Thank you for your careful and thorough evaluation of our work. We hope the following clarifications and additional experiments adequately address the points raised.


(1) Explanation and interpretation of the results concerning generalisability and performance gains

We have reworked the experiments section and added discussions to Appendix F and G, explaining the results of our study in more detail. To summarise, we observe that domain-specific models (either fully supervised or pre-trained and fine-tuned exclusively on the target data) are inferior to general models (pre-trained on external source data and fine-tuned on the target data) and foundational models (pre-trained on large corpora and fine-tuned on the target data), as discussed in the experiments section and Appendix G.2. Moreover, we have conducted additional zero-shot experiments, demonstrating that our model is able to extract distinct representations for different inputs, as discussed in Appendix F. These distinct representations can be observed across domains and tasks, suggesting that the time series features extracted by OTiS are generalisable. Our experiments further show that adaptation to the specific task through fine-tuning generally boosts downstream performance.


(2) Clarification of the principal component analysis

We have reworked the results section to clarify the principal component analysis. In particular, we have added a detailed explanation on the alignment of the EEG-specific variate embeddings with the 3D electrode coordinates of the international 10-20 system for EEG recordings to Appendix E.2. See (Q3).


(3) Analysis of the overlap between training and testing variables

We are not entirely sure what the reviewer means by “variables”, but we have extensively analysed the effects of domain-specific variate embeddings in full training and zero-shot settings in the experiments section and the Appendix. We believe this analysis shows why our method outperforms the baselines, but if the reviewer has a specific aspect they would like to discuss in further detail, we would happily engage.

Comment

Dear Reviewer pz7i,

We would like to follow up on the discussion period to ensure that all of your points have been addressed by our rebuttal. In particular, we have worked to provide detailed responses to your questions regarding the (1) training time during fine-tuning, (2) downstream performance of models trained exclusively on the target data, (3) principal component analysis of the EEG-specific variate embeddings, and the (4) intuition behind the information shared across domains that makes pre-training on them beneficial.

We hope our rebuttal, along with the discussion involving Reviewers UJog and tXLU, has adequately addressed the points you raised. If you believe this to be the case, we would greatly appreciate a final adjustment of your scores to reflect this.

Thanks again for your evaluation of our work and your valuable feedback.

Authors

Review
6

This paper presents OTiS, a foundation model pre-trained on large-scale time series data to support multiple tasks across domains. Extensive experiments are conducted to demonstrate the strong performance of the foundation model. The paper is prepared to a high standard and can be accepted.

Strengths

  1. This paper targets a very important research problem, the time series foundation model. Time series data have very high variance across different domains and tasks, so how to integrate them and train one foundation model remains challenging.
  2. This paper has a very high quality of preparation. The writing, organization, and figures are prepared nicely and with sufficient detail.
  3. The results shown in Tables 1, 2, and 3 are competitive compared to the baseline TS models.

Weaknesses

  1. Add a subsection showing which categories of baselines are compared, for example, traditional TS models, deep learning models, TS foundation models, etc.
  2. I expect a comparison with some SOTA TS foundation models, for example, https://arxiv.org/abs/2405.02358. If this part is added, that would be great.
  3. Currently, the authors use fine-tuning to adapt the pre-trained model to various downstream tasks. Can you also add one more subsection testing prompting on this TS foundation model? That would be another great point.

Questions

See details in weakness.

Details of Ethics Concerns

N/A

Comment

Thank you for your constructive comments and thorough evaluation. We hope the following clarifications and additional experiments adequately address the points raised.


(1) Clarification of the baseline models

We have added a subsection to the experiments section, discussing the categories of the baselines. Moreover, we have included a summary of all baselines, detailing their architectures, pre-training strategies, and domain adaptation techniques, in Appendix B.


(2) Comparison with state-of-the-art foundation models for time series analysis

We have compared our approach against six state-of-the-art foundation models, including Time-LLM [4], GPT4TS [5], MOMENT [6], MOIRAI [7], BIOT [8], and LaBraM [9], across classification and forecasting tasks, as discussed in the experiments section and Appendix G.2.

[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2024.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2023.

[6] Goswami, M. et al. "MOMENT: A Family of Open Time-series Foundation Models." International Conference on Machine Learning (ICML). 2024.

[7] Woo, G. et al. "Unified Training of Universal Time Series Forecasting Transformers." International Conference on Machine Learning (ICML). 2024.

[8] Yang, C. et al. “Biot: Biosignal transformer for cross-data learning in the wild.” Advances in Neural Information Processing Systems (NeurIPS). 2023.

[9] Jiang, W. et al. “Large brain model for learning generic representations with tremendous EEG data in BCI.” International Conference on Learning Representations (ICLR). 2024.


(3) Evaluation of the prompting (i.e. zero-shot performance)

We have conducted further experiments to investigate the quality of the time series features extracted by our frozen model. These include zero-shot experiments for classification, linear probing for regression, and minimal tuning (< 1k trainable parameters) for forecasting, as detailed in the experiments section and Appendix F.
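As a generic illustration of the linear-probing setup mentioned above (not our exact code), the recipe is to freeze the backbone and train only a linear head on the extracted features; all modules and dimensions below are placeholders.

```python
# Hedged sketch of linear probing on frozen features (generic recipe, placeholder modules).
import torch
import torch.nn as nn

frozen_encoder = nn.Sequential(nn.Linear(240, 192), nn.GELU())  # stand-in for the frozen backbone
for p in frozen_encoder.parameters():
    p.requires_grad = False

probe = nn.Linear(192, 1)                           # the only trainable parameters
optim = torch.optim.AdamW(probe.parameters(), lr=1e-3)

x, y = torch.randn(64, 240), torch.randn(64, 1)     # placeholder regression data
for _ in range(100):
    with torch.no_grad():
        feats = frozen_encoder(x)                   # features from the frozen backbone
    loss = nn.functional.mse_loss(probe(feats), y)
    optim.zero_grad(); loss.backward(); optim.step()
```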

Comment

Dear Reviewer Aff6,

Thank you once again for your positive feedback on our study.

We would like to follow up on the discussion period to ensure that all your points have been addressed in our rebuttal. Specifically, we have (1) added a subsection to categorise all baseline models, (2) provided comparisons with state-of-the-art time series foundation models [45], including Time-LLM [4] and GPT4TS [5], and (3) included a new subsection to evaluate our model in zero-shot settings.

We hope these updates to the manuscript have adequately addressed the points you raised. If you agree, we would greatly appreciate a final adjustment of your scores to further support our study.

Authors


[4] Jin, M. et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." International Conference on Learning Representations (ICLR). 2024.

[5] Zhou, T. et al. "One fits all: Power general time series analysis by pretrained lm." Advances in Neural Information Processing Systems (NeurIPS). 2023.

[45] Ye, J. et al. "A Survey of Time Series Foundation Models: Generalizing Time Series Representation with Large Language Model." arXiv preprint arXiv:2405.02358. 2024.

Comment

We thank all reviewers for their constructive and insightful efforts in evaluating this work; we truly believe our work has improved as a result of your suggestions. We have uploaded revised files with several modifications:

  1. more experiments, including new baselines and zero-shot evaluations;
  2. revised experiments section and Appendix, to provide details on the baselines, experimental setup, and results;
  3. more ablation studies, regarding the composition of the dual masking strategy and pre-training strategies;
  4. more visualisations and examples in the Appendix, including domain signature analysis and latent space analysis.

The above points are marked in red (in both main paper and Appendix). Besides these points, we have also revised figure captions, formulations, and other minor points in the manuscript. We have addressed the reviewers’ comments on a point-by-point basis below.

Thank you again for the constructive efforts in the comments and reviews,

Authors

P.S.: We are a bit late to the party, but hope for an active and lively discussion with the reviewers.

Comment

Dear Reviewers,

We know you might have other plans on a Sunday; however, we would greatly value your feedback on our rebuttal.

In response to your insightful comments and questions, we have conducted additional experiments and provided detailed analyses in our rebuttal. Although these experiments required a significant part of the discussion period, we believe they have substantially improved the manuscript and we would welcome any further comments or discussion. If you believe that we have adequately addressed the points you raised, we would greatly appreciate an appropriate adjustment to your scores to reflect this.

Thank you once again for your constructive comments, which have truly enhanced the quality of our study.

Authors

Comment

Dear reviewers,

Could you please take a look at the author responses and let the authors know whether your concerns have been addressed? Thank you very much!

Best regards,

AC

Comment

Dear Reviewers,

We know you have already put a lot of time and effort into evaluating multiple manuscripts, including our study. As the journey is almost over, we would like to take this final opportunity to ask whether you have enjoyed:

Reviewer Aff6: The detailed summary of the baselines, the comparison with SOTA foundation models, or the additional zero-shot evaluations?

Reviewer pz7i: The more careful explanations of the results, the clarification of the principal component analysis, or our extensive comparison with domain-specific models?

Reviewer UJog: The additional experiments on generalisability across domains, the analysis of Weather-specific variate embeddings, or the open discussion on the definition of domains?

If so, we would greatly appreciate a final adjustment of your scores to reflect this. Of course, we are happy to discuss any further comments or concerns, just as we have been engaging with Reviewer yUJh and Reviewer tXLU.

Thanks once again for your constructive feedback and contributions, which have truly improved the quality of this study.

Authors

AC Meta-Review

This paper proposes OTiS, a pre-trained foundation model for multi-domain time series analysis, designed to handle the heterogeneity of variables and temporal dynamics across domains. The key contributions include a domain-specific tokenizer, a dual-masking strategy, and a novel loss function (NCC). Experimental results are presented across multiple tasks, including classification, regression, and forecasting, and the authors provide visualizations to highlight the interpretability of the learned embeddings.

This paper works on a very challenging task: cross-domain time series analysis. The novelty of this work is good. The manuscript is generally well written, and the proposed methodology is easy to follow. After the rebuttal, the evaluation was also strengthened. However, the major remaining concern is the experimental part. For example, reviewers pointed out that the experiments lack comparisons with other key methods such as UniTS. Experimental results are presented without sufficient explanation or analysis, and important experimental details are missing. In addition, multiple reviewers have concerns about the model design, e.g., the shared patch size may not be able to capture domain-specific temporal dynamics. Given these limitations, I am inclined to recommend rejecting this paper.

Additional Comments on Reviewer Discussion

During the rebuttal, 3 out of 5 reviewers responded to the authors’ replies. Reviewer UJog increased the score to 5 as most of the concerns were addressed during the rebuttal. Reviewer yUJh kept the original score as the author responses did not adequately address the concerns about the experimental results. Reviewer tXLU also kept the score due to concerns about the experimental results. Overall, I agree with the reviewers and share the same concerns about the experimental part, and thus I recommend rejecting the paper.

Final Decision

Reject