PaperHub
6.6 / 10
Poster · 4 reviewers
Scores: 3, 3, 4, 4 (min 3, max 4, std 0.5)
ICML 2025

KAN-AD: Time Series Anomaly Detection with Kolmogorov–Arnold Networks

OpenReview · PDF
Submitted: 2025-01-10 · Updated: 2025-07-24

Abstract

Keywords
anomaly detection, time series analysis

Reviews and Discussion

Review (Rating: 3)

The paper introduces KAN-AD, a novel approach to time series anomaly detection (TSAD) based on Kolmogorov–Arnold Networks (KANs). The motivation for this work stems from the limitations of existing TSAD methods, particularly those relying on forecasting models, which often overfit to local fluctuations and fail to generalize well in the presence of noise. The authors argue that effective TSAD should prioritize modeling "normal" behavior using smooth local patterns rather than attempting to capture every minor variation in the data.

To address this issue, the authors reformulate time series modeling by approximating the series with smooth univariate functions. They observe that KAN, in its original form, is susceptible to localized disturbances due to its reliance on B-spline functions. To overcome this limitation, they propose KAN-AD, which replaces B-splines with truncated Fourier expansions. This modification enhances robustness against local fluctuations while preserving the ability to capture global patterns.

The core methodological contributions include the reformulation of time series modeling using Fourier series for improved smoothness, the development of a novel coefficient-learning mechanism based on one-dimensional convolutional neural networks. The authors validate their approach through empirical analysis, demonstrating its effectiveness across various datasets, including those with noisy training data.
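To make the core idea concrete, here is a minimal sketch of a truncated-Fourier univariate function of the kind described (our own illustration, not the authors' implementation; the function names and the choice N=2 are assumptions):

```python
import torch

def fourier_basis(x: torch.Tensor, N: int = 2) -> torch.Tensor:
    """Truncated Fourier features [sin(kx), cos(kx)] for k = 1..N."""
    k = torch.arange(1, N + 1, device=x.device, dtype=x.dtype)
    ang = x.unsqueeze(-1) * k                                   # (..., N)
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)  # (..., 2N)

# A KAN-style edge function phi(x) = sum_k a_k sin(kx) + b_k cos(kx):
# smooth by construction, unlike a B-spline, which can bend locally
# to chase a single disturbed point.
coeffs = torch.nn.Parameter(torch.randn(4))    # 2N coefficients for N = 2
x = torch.linspace(0, 2 * torch.pi, 96)
phi = fourier_basis(x) @ coeffs                # one smooth univariate function
```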

after rebuttal

I changed my score from reject to weak accept. The authors did address my questions; however, I'm still not certain about the authors' choice of evaluation metric. If you look at their tables from the second round, the real performances of the detectors are not as high as they claim with F1_e.

Questions For Authors

I've put them in the other sections to stay consistent with my thought process. This layout of reviewing doesn't allow for cross-referencing. I'm sorry if I scrambled the questions above. Alas, this is my way of reviewing papers, as I feel it's more organic to put questions on each of the sections above.

Here are some more, due to the character limitation in Sec. Experimental Designs Or Analyses.

  1. Figure 9 doesn't actually tell me, most of the time, that Fourier is the best. The most indicative example is UCR, but the others don't show much of a difference. So why do you opt for Fourier? Also, what are the F1_d and the AUPRC for these cases? I'd like to see AUPRC in the rebuttal, please.
  2. Why aren't the trends in Figure 10 non-increasing? This doesn't make sense, especially for SoTA forecasting methods. The more the training is tainted, the lower the performances are going to be. It's very weird, especially for the lower portion of the figure.

Claims And Evidence

The authors make claims that raise some questions. Here, I'm reporting some of them:

  1. In the introduction the authors state that normal sequences exhibit greater local smoothness than abnormal ones. Isn't this claim just a corollary of reconstruction-based approaches? In other words, in reconstruction-based detectors, the anomalies have higher reconstruction error than normal points, since the detectors (usually autoencoder-based) are trained on normal sequences only. How do the authors justify this?

Methods And Evaluation Criteria

The benchmarks are fine; however, the evaluation metrics don't seem to make sense (see the next section for my doubts, please). Also, the authors fail to show whether they do any sort of cross-validation after they split the series into windows, or whether they use the time component to define the train, validation, and test sets. In other words, are you using time, say the first 10 windows for training, the 11th and 12th windows for validation, and from the 13th window onwards for testing? Or are you mixing these windows, say 80% of all windows go to training, 2% to validation, and the rest to test? If the former is the case, then how many folds of k-fold cross-validation are you doing?
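To make the two options concrete, a hypothetical sketch of both splitting strategies (the window size and split ratios here are illustrative, not taken from the paper):

```python
import numpy as np

series = np.arange(1000, dtype=float)    # toy univariate series
win = 96
windows = np.lib.stride_tricks.sliding_window_view(series, win)
n = len(windows)

# Time-wise split: earliest windows train, later ones validate/test.
train = windows[: int(0.8 * n)]
val   = windows[int(0.8 * n) : int(0.9 * n)]
test  = windows[int(0.9 * n) :]

# Mixed split: shuffling windows leaks future values into training,
# since consecutive windows overlap by win - 1 points.
rng = np.random.default_rng(0)
mixed = windows[rng.permutation(n)]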

Theoretical Claims

The authors don't make any theoretical claims that haven't been shown/proven before: i.e., using Fourier series instead of B-splines when wanting to capture periodic patterns, or having more local smoothness.

Experimental Designs Or Analyses

I don't know why the authors are proposing different evaluation strategies. All of the F1 variants adopted seem to misrepresent the performances of KAN-AD. Sure, the delay penalizes KAN-AD, but why do you need to penalize it? Here's an example.

| | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| GT | | | | × | × | × | × | | | × |
| Detection | | | | | | | T | T | | T |
| Timestep-based | TN | TN | TN | FN | FN | FN | TP | FP | TN | TP |
| 3-Delay PA | TN | TN | TN | FN | FN | FN | FN | FP | TN | TP |
| Point-wise PA | TN | TN | TN | TP | TP | TP | TP | FP | TN | TP |

× = true anomaly; T = detected anomaly; TP = True Positive; FN = False Negative; TN = True Negative; FP = False Positive.
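To ground the numbers discussed below, a small sketch (using sklearn, an assumption on my part; this is not the authors' code) that reproduces the timestep-based F1 on this example and shows how point-adjust (PA) inflates it:

```python
import numpy as np
from sklearn.metrics import f1_score

# Example from the table above: GT anomalies at t_4-t_7 and t_10,
# detections at t_7, t_8, and t_10.
gt   = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])
pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])

# Timestep-based F1: every timestamp scored independently.
print(f1_score(gt, pred))                      # 0.5, i.e., F1_t

# Point-wise point-adjust (PA): if any point of a GT segment is hit,
# the whole segment counts as detected before scoring.
adj = pred.copy()
for seg in np.split(np.arange(len(gt)), np.where(np.diff(gt))[0] + 1):
    if gt[seg[0]] == 1 and adj[seg].any():
        adj[seg] = 1
print(f1_score(gt, adj))                       # ~0.909: PA inflates the score
```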

So, my doubt is on the evaluation itself. Since the authors treat univariate series only, this is even clearer. Each x_i at any timestamp t_i can be either normal or anomalous. Therefore, the predictor can be evaluated at each t_i as shown above: i.e., timestep-based. In these conditions, the authors report F1_d = 0.2857 and F1_e = 0.5714 - the latter considers neither the length of the detected anomalies nor false positives. Nevertheless, the real F1 score should be computed for each time step: i.e., F1_t = 0.5. The authors claim that they use the event-based calculation - i.e., F1_e - which smears false negatives as being true positives (see the point-wise PA row and Figure 4 in the paper). The authors were self-critical about the delay in the beginning and then decided to use the event-based one. It's very confusing, and the reason the authors provide is "For the sake of convenience [...] use Event F1 [...] as it is more alignment (typo) with the need for real-time anomaly detection in real-world situations." Who says that event F1 is more aligned with real-time anomaly detection? Is there any other work that argues this? If not, this statement is unfounded, and actually harmful, since, again, event F1 obfuscates false negatives, a critical aspect in critical domains, especially healthcare: e.g., if you miss an anomaly in neurodegenerative patients, they might die.

If we take the example in Figure 4 of the paper, we have F1_d = 0.1429 and F1_e = 0.5714, which is an overestimate of the true performance according to the timestep-based approach proposed above, F1_t = 0.2667. Therefore, F1_t emulates quite closely the "self-harming" F1_d. Again, F1_e is an overestimate of the true performance.

The only sound metric here is AUPRC, which doesn't actually show that KAN-AD is way better than the second-best (i.e., KAN in KPI and WSD). So, in KPI you're comparable to KAN (a difference of .029, which is statistically insignificant) and in WSD you underperform by .013 (which again is statistically insignificant). Let's agree that in 2/4 datasets you're comparable to SoTA, and you win in the 2 others. Therefore, across the board KAN-AD is not the best. Moreover, as per my comment in Sec. Methods And Evaluation Criteria, we can't even be sure that these metrics make sense, since we don't know the number of folds you tested on and what train-validation-test split you used.

Moreover, the average F1_e reported in Table 2 isn't indicative of KAN-AD's overall performance. There are statistical tests that showcase how much better one method is compared to others. For example, the Friedman test with a post-hoc Bonferroni-Dunn could actually show whether KAN-AD is better than KAN in the previous two edge cases, which I doubt it would. However, these tests work only if you have multiple runs/folds for each of the detectors compared.

In Table 3, why isn't SAND's resource consumption reported? It has the second-best F1_e score according to Table 2! Even though you can execute it on CPU, the reader should still have these values reported. What does the justification you provided, "SAND's CPU-only execution requirement and SubLOF's limited multi-core utilization capabilities preclude fair comparison in modern hardware acceleration contexts", mean? You're running most of the other methods on GPU, which actually accelerates them. What fair comparison are you doing here by excluding SAND or SubLOF? Also, how come KAN-AD has a lower execution time on CPU rather than on GPU? Are you doing a lot of I/O transfer operations? Actually, this happens for all methods with 1k or fewer parameters.

Supplementary Material

I checked all of it. It could've been omitted as far as I'm concerned. It doesn't provide anything interesting to support the main paper. The code is there and I checked it. I tried to run it as described in the README file. It doesn't seem to work when building the environment. Here's the error:

```
Could not solve for environment specs

The following packages are incompatible

├─ _libgcc_mutex ==0.1 conda_forge does not exist (perhaps a typo or a missing channel);
├─ _openmp_mutex ==4.5 2_kmp_llvm does not exist (perhaps a typo or a missing channel);
├─ aiohttp ==3.9.5 py310h2372a71_0 does not exist (perhaps a typo or a missing channel);
├─ alsa-lib ==1.2.10 hd590300_0 does not exist (perhaps a typo or a missing channel);
├─ argon2-cffi-bindings ==21.2.0 py310h2372a71_4 does not exist (perhaps a typo or a missing channel);
├─ attr ==2.5.1 h166bdaf_1 does not exist (perhaps a typo or a missing channel);
├─ aws-c-auth ==0.7.16 h70caa3e_0 does not exist (perhaps a typo or a missing channel);
├─ aws-c-cal ==0.6.9 h14ec70c_3 does not exist (perhaps a typo or a missing channel);
├─ aws-c-common ==0.9.12 hd590300_0 does not exist (perhaps a typo or a missing channel);
├─ aws-c-compression ==0.2.17 h572eabf_8 does not exist (perhaps a typo or a missing channel);
├─ aws-c-event-stream ==0.4.2 h17cd1f3_0 does not exist (perhaps a typo or a missing channel);
├─ aws-c-http ==0.8.0 hc6da83f_5 does not exist (perhaps a typo or a missing channel);
├─ aws-c-io ==0.14.3 h3c8c088_1 does not exist (perhaps a typo or a missing channel);
├─ aws-c-mqtt ==0.10.2 h0ef3971_0 does not exist (perhaps a typo or a missing channel);
├─ aws-c-s3 ==0.5.1 h2910485_1 does not exist (perhaps a typo or a missing channel);
├─ aws-c-sdkutils ==0.1.14 h572eabf_0 does not exist (perhaps a typo or a missing channel);
├─ aws-checksums ==0.1.17 h572eabf_7 does not exist (perhaps a typo or a missing channel);
├─ aws-crt-cpp ==0.26.2 ha623a59_3 does not exist (perhaps a typo or a missing channel);
├─ aws-sdk-cpp ==1.11.267 h0bb408c_0 does not exist (perhaps a typo or a missing channel);
├─ binutils_impl_linux-64 ==2.40 hf600244_0 does not exist (perhaps a typo or a missing channel);
├─ binutils_linux-64 ==2.40 hbdbef99_2 does not exist (perhaps a typo or a missing channel);
├─ blas-devel ==3.9.0 20_linux64_openblas does not exist (perhaps a typo or a missing channel);
├─ blas ==2.120 openblas is requested and can be installed;
├─ blessed ==1.19.1 pyhe4f9e05_2 is not installable because it requires
│  └─ __unix, which is missing on the system;
```

The stacktrace is huge; the above is a snippet. Other reviewers should try to build the environment and see if this occurs for them as well.

Also, I've noticed in the run.py file that there are more datasets the authors could've tested: e.g., Yahoo, NAB, AIOPS. How come the authors decided not to?

In method/kanad/config.toml there is a window size of 96 that performs a sliding/slicing window over the time series. Why isn't this discussed in the hyperparameters (column 1, lines 270-272)?

Relation To Broader Scientific Literature

The scope of the paper seems shortsighted. Nowadays, models, especially foundational ones [1], treat multivariate time series [2]. KAN-AD treats only univariate ones. Real-world scenarios usually have multiple variables, which makes detecting anomalies more complicated, especially if one wants to produce an explanation/interpretation of why the anomaly happened. I'm sceptical about KAN-AD's usage in broader TSAD. The overall experiments (especially the evaluation) raise a lot of questions about KAN-AD's utility (see Experimental Designs Or Analyses).

[1] Gao et al. UniTS: A unified multi-task time series model. NeurIPS'25

[2] Flaborea et al. Are we certain it's anomalous?. Workshops CVPR'23.

Essential References Not Discussed

Why are the authors only considering forecasting approaches? AD also has reconstruction-based approaches. This omission hinders the generalizability of KAN-AD. Therefore, the authors have missed a lot of important competitors in TSAD. Here are a few that should be discussed and, perhaps, compared against. Most importantly, the comparison against UniTS [11] is paramount, since it is SoTA in all time-series problems. I'm aware that some of these are >5 years old; however, [1,9,11] should definitely be considered in the experiments.

[1] Flaborea et al. Are we certain it's anomalous?. Workshops CVPR'23.

[2] Audibert et al. Usad: Unsupervised anomaly detection on multivariate time series. KDD'20

[3] Bieber et al. Low sampling rate for physical activity recognition. PETRA'14

[4] Geiger et al. Tadgan: Time series anomaly detection using generative adversarial networks. Big Data'20

[5] Hundman et al. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. KDD'18

[6] Li et al: Mad-gan: Multivariate anomaly detection for time series data with generative adversarial networks. ICANN'19

[7] Su et al. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. KDD'19

[8] Zhang et al. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. AAAI'19

[9] Zhang et al. Unsupervised deep anomaly detection for multi-sensor time-series signals. IEEE TKDE'21

[10] Zhao et al. Multivariate time-series anomaly detection via graph attention network. ICDM'20

[11] Gao et al. UniTS: A unified multi-task time series model. NeurIPS'25 (however, the arXiv version has been public since February 2024, so this cannot be considered a simultaneous and concurrent paper, and the authors should've included it in their discussion, especially since it can also be used in forecasting modality to detect anomalies). I'm mostly interested to see KAN-AD compared against UniTS, the latter being a foundational model for time-series problems. I suggest the authors look at [12] to see how to adapt it for anomaly detection.

[12] Gabrielli et al. Seamless Monitoring of Stress Levels Leveraging a Universal Model for Time Sequences. arXiv preprint arXiv:2407.03821. 2024 Jul 4.

Other Strengths And Weaknesses

Weakness

I can't seem to find how the anomaly is detected. After the projection component of KAN-AD, you have a forecast of the time series. Then you have a ground-truth (GT) series corresponding to the same window. How do you compare the forecast and the GT? Nowhere in the paper is this written. Or are you predicting the classes c_i for each input value at timestamp t_i? This is confusing.

Strength (but not so much)

The authors rely on the EasyTSAD framework (they probably forgot to cite the project on GitHub) to implement their KAN-AD class. This framework provides a straightforward evaluation pipeline which also contains the F1_d the authors show in the paper, among other things. However, although using an already-built framework makes it easier to prototype and evaluate things, the authors refrain from implementing the works I listed in Essential References Not Discussed. All the reported SoTA methods were already implemented in EasyTSAD, which is ok, but doesn't justify the authors' choice of not discussing the rest, because "probably it's a hassle to port them to EasyTSAD's design patterns".

Other Comments Or Suggestions

  1. Why is the first figure on the paper labelled as figure 2? I can't seem to find figure 1.
  2. Figure 7 should rather be a 2d plot where the AUPRC axis can be transformed into a heatmap/colorbar. This improves readability of the plot. 3d-plots are often discouraged, but this is a minor suggestion.
  3. Why is Table 2 the first table in the paper? Please check your labelling system in LaTeX.
  4. Figure 5 isn't very useful. I notice that you provide anomaly scores for each method, but the reader would much appreciate seeing which anomalies are detected. Here you might want to change the GT anomaly color to green, and the detected ones to red. Only in this way can one appreciate the detection capabilities of KAN-AD vs. the rest. How the figure is shown now doesn't make much sense. Anomaly errors don't tell me anything.

Ethics Review

The authors missed the impact statement which is mandatory according to the submission guidelines at https://icml.cc/Conferences/2025/CallForPapers.

Is this paper a desk reject?

Author Response

Response to reviewer UWPv

We thank the reviewer for the constructive suggestions and will further revise the manuscript accordingly. For KAN-AD's Performance on MTS, please refer to our response to reviewer HB8y.

Clarification on the Claim of Local Smoothness in Normal Curves: The smoothness of normal patterns is a fact, independent of the type of model or training method used. The Anomaly Transformer [1] also leverages this observation to design its algorithm. Moreover, TSAD methods are not trained solely on data containing normal patterns but are trained on data that mixes both normal and anomalous data. In Table 1 (220-226, Column 2), we show that the training sets for KPI, TODS, and WSD datasets contain a certain proportion of anomalous data.

Data Splitting: We have detailed data splitting in 4.1.2 (270-274 col 1).

Evaluation Metrics: We are glad that you mention the healthcare example; what you mentioned illustrates that each evaluation metric has its appropriate use case. For the healthcare example, we should use F1 delay, as it allows us to select methods that detect anomalies as early as possible. However, for internet service systems, some anomalies last longer (as shown in Figure 6, where the anomaly duration can exceed 300 points). Using only point-wise PA or AUPRC in such cases may inflate the F1 score, since an algorithm that detects only one point in a long anomaly segment will be considered to have successfully detected the entire segment. Event F1 is designed to overcome this issue by measuring whether the algorithm can detect more anomaly events, which is more appropriate and aligns with findings in [2].

Inference Time of SAND and SubLOF: We tested the inference time under the same experimental conditions as in the paper.

| | CPU Time | UCR F1_e |
|---|---|---|
| SAND | 5637s | 0.5108 |
| SubLOF | 299s | 0.4772 |
| KAN-AD | 36s | 0.5335 |

Faster CPU Inference Time: Your observation is correct. CPU inference time can be faster because the GPUs cannot fully leverage parallel processing for models with fewer than 1k parameters.

Reproducibility Environment: As indicated in the first line of the list of incompatible dependencies you provided, the issue arose because the conda-forge channel was not configured on your local machine. We understand that no channel can guarantee 100% coverage, so to further facilitate reproducibility, we have provided a Dockerfile in the original repo to assist with this.

Yahoo and NAB Datasets: These datasets were considered flawed in [3], and we have adopted their suggestions by not using these datasets in the main experiments. In fact, our algorithm performs even better on these two datasets, especially Yahoo. The AUPRC results for these datasets are shown below. Additionally, the KPI dataset is also known as the AIOps dataset, and we refer to it as AIOps in the code because the KPI dataset was originally used in the AIOps challenge.

| | NAB | Yahoo |
|---|---|---|
| SRCNN | 0.8561 | 0.1459 |
| SAND | 0.6595 | 0.5412 |
| Anomaly Transformer | 0.9624 | 0.1109 |
| TranAD | 0.9965 | 0.568 |
| SubLOF | 0.9582 | 0.5222 |
| TimesNet | 0.9842 | 0.4736 |
| FITS | 0.9916 | 0.7803 |
| OFA | 0.9947 | 0.7777 |
| FCVAE | 0.9861 | 0.7049 |
| LSTMAD | 0.9932 | 0.5655 |
| KAN | 0.9976 | 0.7142 |
| KAN-AD | 0.9918 | 0.9547 |

Discussion on Hyperparameters: We have discussed hyperparameter settings in 4.3 (342-344, col 2).

Reconstruction-based Baselines: As outlined in Sec. 5, col 1, forecasting methods can be divided into reconstruction and prediction categories. For the reviewer's interest in reconstruction methods, we have selected popular SOTA methods: Anomaly Transformer (ICLR 2022), TranAD (VLDB 2022), TimesNet (ICLR 2023), FITS (ICLR 2024), OFA (NeurIPS 2023), and FCVAE (WWW 2024) as our baselines.

AUPRC of Figure 9: Due to space constraints, we will only show the AUPRC, and if necessary we will show the F1_d in round 2.

| | KPI | TODS | WSD | UCR |
|---|---|---|---|---|
| Taylor | 0.9411 | 0.9572 | 0.9757 | 0.6904 |
| CI | 0.9585 | 0.9714 | 0.9832 | 0.7911 |
| CII | 0.9616 | 0.9584 | 0.9819 | 0.7835 |
| Fourier | 0.9693 | 0.9716 | 0.9868 | 0.8188 |

Explanation of Figure 10: The x-axis represents the anomaly ratio in the training set. Since real-world anomaly detection always involves some proportion of anomalies in the training data, this figure aims to evaluate algorithm robustness across different anomaly ratios. Performance degrades as anomaly ratios rise, since models are increasingly influenced by anomalous data. This reduces normal-pattern accuracy, causing more false positives in detection. The methods in the lower part of the figure show more sensitivity to training anomalies. Even at low ratios, they fail to maintain consistent normal-pattern accuracy, resulting in unpredictable performance curves.

[1] Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. ICLR 2022.

[2] An Evaluation of Anomaly Detection and Diagnosis in Multivariate Time Series. TNNLS 2021.

[3] Current Time Series Anomaly Detection Benchmarks Are Flawed and Are Creating the Illusion of Progress. TKDE 2021.

Reviewer Comment

Train-test split: Thanks for the pointer. Do you do any cross-validation over the split, or are the splits time-wise? For example, when employing a 4:1:5 split for train-val-test, do you reserve the first 40% of the series for training, the next 10% for validation, and so on? Or is it mixed?

Eval metrics: In the table I provided, I wanted to propose a metric that just considers the point detected as anomaly and not the entire segment. I believe the F1 should be point-based and not segment nor event based. I think you should still report the timestep-based version. I think putting this in the appendix is a good compromise between us.

SAND CPU runtime: Can you include this table at camera ready, please? You can maybe merge it with the other where you show the runtime in GPU.

UniTS as baseline: I still think that you should include UniTS as a comparison for anomaly detection. You can use it in either reconstruction or forecasting mode. Its inclusion would definitely make your claims to be SoTA stronger. Again, its arXiv version has been public since February 2024. It's now officially a NeurIPS 2025 paper.

NAB and YAHOO: This is cool. I missed [3] during my experiments. I'm going to refer it in my next paper. Thanks for the pointer :-)

Reconstruction-based Baselines: Do you mean here that anomaly detection methods can be divided into reconstruction and prediction approaches?

Hyperparams: Thanks, I missed the window_size=96 in the main paper. So you fine-tune KAN-AD on UCR and then use these hyperparams for the other datasets?

Fig. 9 Fourier: Besides UCR, I feel like the other AUPRCs are statistically insignificant. Can you do a one-way ANOVA test with a post-hoc Tukey test with maybe p=.05?

Fig. 10 expected non-increasing trends: Let's take SRCNN. Why does it have a jump in performance when passing from 25% to 30% training-set taint? How come the lower portion of the plot has similar F1_e at 10% and at 40%? Can you measure the timestep-based F1 here and maybe give standard deviations, say for 5-fold cross-validation?

Author Comment

Thank you for your response. We will address your concerns point by point.

Train-test split: The KPI, TODS, and WSD datasets are split time-wise. Following your suggestion, we conducted experiments with reversed splits for KPI/TODS/WSD (original test→train/val, original train/val→test). UCR remains unmodified (its anomalies appear in the test set only). The resulting F1_e scores are as follows:

| | KPI | TODS | WSD | Avg |
|---|---|---|---|---|
| SubLOF | 0.29 | 0.44 | 0.65 | 0.46 |
| SRCNN | 0.08 | 0.21 | 0.22 | 0.17 |
| OFA | 0.61 | 0.57 | 0.80 | 0.66 |
| FITS | 0.64 | 0.52 | 0.79 | 0.65 |
| SAND | 0.02 | 0.19 | 0.08 | 0.10 |
| AnomalyTransformer | 0.24 | 0.18 | 0.24 | 0.22 |
| TimesNet | 0.61 | 0.32 | 0.83 | 0.59 |
| LSTMAD | 0.81 | 0.60 | 0.84 | 0.75 |
| FCVAE | 0.75 | 0.70 | 0.80 | 0.75 |
| KAN | 0.78 | 0.69 | 0.83 | 0.77 |
| KAN-AD | 0.81 | 0.90 | 0.84 | 0.85 |

Eval metrics: We performed timestep-based F1 evaluation (results in table below) and will include full results in the appendix. However, to avoid misleading readers, we will clearly indicate that these metrics may not fully align with practical application scenarios. For example, a method specialized in detecting one anomaly type may outperform a generalist approach if that anomaly persists for long periods. As discussed in [1] (WWW 2018, 1068 citations):

> In real applications, the human operators generally do not care about the point-wise metrics. It is acceptable for an algorithm to trigger an alert for any point in a contiguous anomaly segment, if the delay is not too long.

| | KPI | TODS | WSD | UCR |
|---|---|---|---|---|
| SubLOF | 0.09 | 0.07 | 0.29 | 0.13 |
| FITS | 0.06 | 0.10 | 0.21 | 0.04 |
| OFA | 0.06 | 0.11 | 0.15 | 0.03 |
| LSTMAD | 0.10 | 0.22 | 0.27 | 0.06 |
| FCVAE | 0.09 | 0.19 | 0.20 | 0.08 |
| KAN | 0.09 | 0.16 | 0.22 | 0.06 |
| KAN-AD | 0.09 | 0.18 | 0.22 | 0.10 |

UniTS as baseline: We appreciate your mention of UniTS. UniTS presents an innovative approach to unifying various TS tasks, which we find particularly insightful. We have extended our evaluation to UTS and MTS anomaly detection (results in the tables below). We will add a discussion and citation of UniTS in our paper and include it as a baseline method.

| UTS | KPI | TODS | WSD | UCR |
|---|---|---|---|---|
| UniTS | 0.61 | 0.49 | 0.39 | 0.33 |
| KAN-AD | 0.94 | 0.94 | 0.99 | 0.86 |

| MTS | SMD | MSL | SMAP | SWaT | PSM | Avg | Params@MSL |
|---|---|---|---|---|---|---|---|
| UniTS | 0.88 | 0.84 | 0.84 | 0.93 | 0.97 | 0.89 | 8,066,376 |
| KAN-AD | 0.84 | 0.85 | 0.95 | 0.94 | 0.97 | 0.91 | **4,491** |

Recon-based Baselines: As noted in Related Work, TSAD methods include forecasting-based approaches (split into recon- and pred-based) and pattern change detection.

Hyperparams: For fair comparison, we optimized hyperparameters based on overall dataset performance and kept them fixed throughout the experiments. Due to space limitations, a detailed sensitivity analysis is provided only for the UCR dataset. Other datasets showed similar sensitivity trends, which are omitted for brevity.

Fig 9: We examined the assumptions for one-way ANOVA [2], including independence, normality, and homogeneity of variance. Levene's test [3] confirmed homogeneity (p=0.96>0.05), but Shapiro-Wilk tests [4] indicated non-normality (Taylor: 0.02, CI: 0.03, CII: 0.03, Fourier: 0.02, all p<0.05). Since normality was violated, the ANOVA results would be unreliable, and consequently the Tukey test cannot be used. We agree that the significance experiments are important. Therefore, we used the Friedman test [5] (p=0.01<0.05), showing significant differences. Cliff's Delta analysis [6] revealed Taylor has moderate negative effects vs. Fourier, while the Chebyshev variants show smaller negative effects, supporting Fourier (KAN-AD) as the optimal choice.

Pairwise Cliff's Delta:
T vs {CI,CII,F} = -0.375 
CI vs CII = 0.000 
{CI,CII} vs F = -0.250
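For reference, a sketch of this testing sequence in Python (scipy assumed; the inputs are the Fig. 9 AUPRC values from the earlier table, and the exact p-values may differ slightly from those quoted above):

```python
from scipy import stats

# AUPRC per dataset (KPI, TODS, WSD, UCR) from the Fig. 9 table above.
taylor  = [0.9411, 0.9572, 0.9757, 0.6904]
ci      = [0.9585, 0.9714, 0.9832, 0.7911]
cii     = [0.9616, 0.9584, 0.9819, 0.7835]
fourier = [0.9693, 0.9716, 0.9868, 0.8188]

print(stats.levene(taylor, ci, cii, fourier))              # variance homogeneity
print([round(stats.shapiro(g).pvalue, 3)
       for g in (taylor, ci, cii, fourier)])               # normality per group
print(stats.friedmanchisquare(taylor, ci, cii, fourier))   # non-parametric omnibus
```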

Fig 10: Timestamp-based F1 mean and variance results:

| Ano% in train | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|
| SRCNN | 0.017±0.018 | 0.036±0.021 | 0.028±0.013 | 0.002±0.001 | 0.025±0.012 | 0.002±0.001 | 0.018±0.008 |

SRCNN exhibits high anomaly sensitivity but lower precision and large performance fluctuations. This instability arises from its difficulty handling high anomaly proportions in training data, leading to near-random classification and irregular trends.

The above is our experimental findings and the response to your second round of comments. We hope this addresses your concerns. We will incorporate the supplementary experiments from both rounds into the main text or appendix of the paper, along with citations to outstanding methods such as UniTS. If our response has resolved your questions, we would sincerely appreciate it if you could reconsider your evaluation score. Once again, thank you for your valuable feedback and support for our work.

[1] Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications.

[2] Statistical methods for psychology.

[3] Robust tests for equality of variances.

[4] An analysis of variance test for normality.

[5] The use of ranks to avoid the assumption of normality implicit in the analysis of variance.

[6] Cliff's Delta Calculator: A non-parametric effect size program for two groups of observations.

Review (Rating: 3)

The paper discusses the problem that most TSAD methods using forecasting models tend to overfit to local fluctuations, and reformulates time series modeling by approximating the series with smooth univariate functions. The paper adopts a KAN backbone for TSAD, while replacing the B-spline functions with Fourier series for local smoothness. The proposed method, KAN-AD, is composed of a mapping phase, where the raw time series is converted into multiple univariate functions; a reducing phase, where the coefficients of the univariate functions are obtained; and a projection phase, where the coefficients are aggregated into the normal pattern. Experimental results show that KAN-AD achieves state-of-the-art performance on univariate time series datasets, showcasing its effectiveness.

Questions For Authors

  1. I wonder if KAN-AD works with multivariate datasets, since the absence of results on multivariate datasets is a critical limitation of the paper. I am willing to change my evaluation of the paper if KAN-AD proves to be effective on multivariate datasets.
  2. Why is using N=2 the best? I think using a bigger N enables more accurate modeling of the time window as a Fourier series, and is more helpful for the task. It would be helpful if the authors provided an analysis of it.

Claims And Evidence

The claim that existing time series forecasting methods tend to overfit to local fluctuations is well illustrated with qualitative results.

Methods And Evaluation Criteria

The proposed method is designed only for univariate time series, and is experimented with only on univariate datasets, making it impossible to use on multivariate time series, which is the common case in real life. I assume that the proposed method can be extended to multivariate time series, and I wonder if there is any specific reason KAN-AD is not applied to multivariate time series in the paper.

Theoretical Claims

The paper does not contain any theoretical claims.

Experimental Designs Or Analyses

The paper does not clarify how the TODS benchmark is synthesized, which can strongly affect performance. The authors should clarify detailed information about the benchmark for its validity.

Supplementary Material

I reviewed the supplementary material covering the information about the datasets and the baselines.

Relation To Broader Scientific Literature

The key contribution of the paper, adopting a KAN backbone for efficient TSAD, is related to the recent advent of KAN networks. The paper adequately applies the Kolmogorov-Arnold representation theorem, which decomposes a multivariate continuous function into a finite sum of univariate functions, to time series data.

Essential References Not Discussed

The paper cited the related works appropriately.

Other Strengths And Weaknesses

The attempt to integrate Kolmogorov-Arnold Networks into TSAD is original, demonstrating further expandability of KAN into time series analysis in the future. In addition, the design of KAN-AD is clearly described, making its key components easy to understand. However, it was only applied to univariate time series, covering a very limited scope of time series data.

Other Comments Or Suggestions

There is no other comment.

Author Response

Response to reviewer HB8y

Synthetic Method of the TODS Dataset: We used the synthetic TODS dataset from [1], which includes all five anomaly types with diverse durations and the non-trivial characteristics introduced by TODS. This dataset is publicly available in their repository.

KAN-AD's Performance on MTS Data: We can reshape the original multivariate input (batch, window, n_multivariate) into (batch * n_multivariate, window), which effectively supports multivariate time-series data. We implemented KAN-AD in a popular third-party Times-Series-Library [2] and trained it using the same seed (seed=2021) as the SOTA methods. We compared KAN-AD trained in this way with SOTA multivariate anomaly detection algorithms across several datasets, with the following results:

| Methods | SMD | MSL | SMAP | SWaT | PSM | Avg | Parameters@MSL |
|---|---|---|---|---|---|---|---|
| Informer | 81.65 | 84.06 | 69.92 | 81.43 | 77.10 | 78.83 | 504,174 |
| Anomaly Transformer | 85.49 | 83.31 | 71.18 | 83.10 | 79.40 | 80.50 | 4,863,055 |
| DLinear | 77.10 | 84.88 | 69.26 | 87.52 | 93.55 | 82.46 | 20,200 |
| Autoformer | 85.11 | 79.05 | 71.12 | 92.74 | 93.29 | 84.26 | 325,431 |
| FEDformer | 85.08 | 78.57 | 70.76 | 93.19 | 97.23 | 84.97 | 1,119,982 |
| TimesNet | 84.62 | 81.80 | 69.50 | 93.00 | 97.38 | 85.26 | 75,223 |
| UniTS | 88.09 | 83.46 | 83.80 | 93.26 | 97.43 | 89.21 | 8,066,376 |
| KAN-AD | 84.29 | 85.01 | 94.50 | 93.50 | 96.50 | 90.76 | **4,491** |

As seen from the results, KAN-AD outperforms the SOTA methods on average, while using only 0.05% of the parameters compared to UniTS. We have made the MTS version of KAN-AD available in an anonymous repository for further review: https://anonymous.4open.science/r/TSL-C6AC
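A minimal sketch of the channel-independent reshape described above (shapes are illustrative, and kan_ad is a placeholder for the actual model call):

```python
import torch

batch, window, n_vars = 32, 96, 7
x = torch.randn(batch, window, n_vars)           # multivariate input

# Treat each variable as its own univariate series.
x_uts = x.permute(0, 2, 1).reshape(batch * n_vars, window)

scores_uts = x_uts                               # placeholder for kan_ad(x_uts)
scores = scores_uts.reshape(batch, n_vars, window).permute(0, 2, 1)
```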

Impact of Parameter N on Model Performance: Your suggestion is great: a larger N allows for more precise modeling of the time-series data, but it also comes with the risk of "overfitting" to the anomaly patterns within the time window. When N becomes too large, the model produces an accurate prediction of the anomaly pattern, which leads to very small anomaly scores when the anomaly occurs. As a result, the model might fail to detect anomalies, leading to performance degradation.
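This effect can be reproduced with a toy least-squares Fourier fit (our own illustration, not the authors' code): as N grows, the fit absorbs the spike, so its residual, i.e., its anomaly score, shrinks.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 96, endpoint=False)
x = np.sin(t)
x[48] += 3.0                      # normal sine plus one spike anomaly

def fourier_fit(x, N):
    # Least-squares fit of a truncated Fourier series with N harmonics.
    A = np.column_stack([np.ones_like(t)] +
                        [f(k * t) for k in range(1, N + 1)
                                  for f in (np.sin, np.cos)])
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ coef

for N in (2, 30):
    resid = np.abs(x - fourier_fit(x, N))
    print(N, resid[48])           # spike residual shrinks as N grows
```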

[1] Si H, et al. Timeseriesbench: An industrial-grade benchmark for time series anomaly detection models. ISSRE 2024.

[2] https://github.com/thuml/Time-Series-Library Machine Learning Group of Tsinghua University.

Reviewer Comment

Thank you for the informative rebuttal. The authors' responses resolve most of my concerns, so I raise the final score.

Review (Rating: 4)

This paper focuses on time series anomaly detection (TSAD), and proposes KAN-AD, which models "normal" behavior of time series using smooth functions. The paper addresses the shortcoming of existing methods that tend to overfit local variances in time series data. Proposed KAN-AD is a clever and novel approach -- introduces Fourier expansion within the KAN network -- which shows its effectiveness on benchmark time series datasets.

update after rebuttal

I appreciate the authors' efforts in addressing my comments. I think including the rebuttal responses makes the paper stronger. Achieving competitive/superior performance on the MTS task with only a fraction of the parameters points to the superior architecture KAN-AD offers for the task.

My other comment was about testing the model's performance on real use-case datasets outside of the benchmarks, which often contain synthetic anomalies. Nevertheless, I appreciate your efforts in providing detailed responses to my comments. I already liked this work, and will keep my score as is.

Questions For Authors

In the discussion of future work, the paper mentions exploring whether normal patterns in time series can be represented more efficiently by leveraging additional data. What types of additional data could be considered and could it be integrated into the KAN-AD framework?

As stated earlier, are there specific types of time series data or anomaly patterns where KAN-AD might be expected to struggle?

Claims And Evidence

I think the claims made — better detection, faster inference, robustness to noise — are supported through empirical results.

Methods And Evaluation Criteria

KAN-AD definitely makes sense. With respect to evaluation, the metrics Event F1 and Delay F1 align with real-world use cases of detecting sustained as well as varied-length anomalous segments in the time series.

Theoretical Claims

The paper doesn’t introduce a new theory/theorems, instead relies on the Kolmogorov-Arnold representation theorem as the theoretical foundation. Though I haven’t verified it, I think the correctness of the Kolmogorov-Arnold theorem is well established in the literature.

Experimental Designs Or Analyses

I’d answer yes. Experimental setup is clearly described, with details on datasets, training and testing setup, baselines and evaluation metrics as well as ablation studies (Section 4 as well as appendix C). I do not find issues with designs as such.

Supplementary Material

Yes, I reviewed Appendix A – datasets, B – baselines, and C – ablation, they are pretty compact.

Relation To Broader Scientific Literature

The paper draws on KAN networks, and clearly points to overfitting phenomena in deep anomaly detection models. I actually like the paper, and find that it provides a new perspective on TSAD approach, overcomes the overfitting issues of existing methods, and cleverly uses Fourier transform to improve current KANs for time series anomaly detection, which is an extremely relevant problem to study.

Essential References Not Discussed

I think the paper references related literature.

Other Strengths And Weaknesses

++ The paper presents an original approach to time series anomaly detection by reformulating the problem and leveraging KANs. The modifications made to the original KAN architecture, including the use of Fourier series and the lightweight learning mechanism, are also original and well-motivated.

++ The paper studies TSAD, which has significant practical applications. Importantly, the proposed model is nimble, with only about 300 parameters, making it suitable for real-world environments with improvements in detection accuracy and inference speed.

++ The paper is well-written, organized, and is easy to read.

– Not really a weakness, but something that I find worth mentioning. Most of the real world time series datasets, are naturally multivariate, while the paper focuses on univariate series. How hard would it be to extend the framework to multivariate setup? How will it affect the design choices in the proposed KAN?

– The experiments cover a wide range of time series datasets; however, these are mostly well-curated datasets, with many containing synthetic anomalies. I think the empirical study could be stronger if the method were applied to a real-world dataset. It would be particularly interesting to see how well it detects anomalies around the 2008 recession, COVID-19, etc.

– Discussion of why KAN-AD is merely comparable to other methods on some datasets while clearly outperforming them on others (UCR vs. WSD) would be welcome. What are the failure cases or scenarios of KAN-AD? What dataset properties support or oppose KAN-AD?

Other Comments Or Suggestions

Please double-check the references; I believe there are duplicates.

Author Response

Response to reviewer HZbR

We appreciate the constructive suggestions provided by the reviewer and will incorporate improvements in the subsequent version of the paper.

Performance of KAN-AD on MTS: We can reshape the original multivariate input (batch, window, n_multivariate) into (batch * n_multivariate, window), which effectively supports multivariate time-series data. We implemented KAN-AD in a popular third-party Times-Series-Library [2] and trained it using the same seed (seed=2021) as the SOTA methods. We compared KAN-AD trained in this way with SOTA multivariate anomaly detection algorithms across several datasets, with the following results:

| Methods | SMD | MSL | SMAP | SWaT | PSM | Avg | Parameters@MSL |
|---|---|---|---|---|---|---|---|
| Informer | 81.65 | 84.06 | 69.92 | 81.43 | 77.10 | 78.83 | 504,174 |
| Anomaly Transformer | 85.49 | 83.31 | 71.18 | 83.10 | 79.40 | 80.50 | 4,863,055 |
| DLinear | 77.10 | 84.88 | 69.26 | 87.52 | 93.55 | 82.46 | 20,200 |
| Autoformer | 85.11 | 79.05 | 71.12 | 92.74 | 93.29 | 84.26 | 325,431 |
| FEDformer | 85.08 | 78.57 | 70.76 | 93.19 | 97.23 | 84.97 | 1,119,982 |
| TimesNet | 84.62 | 81.80 | 69.50 | 93.00 | 97.38 | 85.26 | 75,223 |
| UniTS | 88.09 | 83.46 | 83.80 | 93.26 | 97.43 | 89.21 | 8,066,376 |
| KAN-AD | 84.29 | 85.01 | 94.50 | 93.50 | 96.50 | 90.76 | **4,491** |

As seen from the results, KAN-AD outperforms the SOTA methods on average, while using only 0.05% of the parameters compared to UniTS. We have made the MTS version of KAN-AD available in an anonymous repository for further review: https://anonymous.4open.science/r/TSL-C6AC

Impact of data characteristics on performance: In our paper (Column 2, Lines 292-295), we briefly discuss how different types of data across datasets influence the final performance. We will further elaborate on this in the next version.

Additional data: Thanks to KAN-AD's parameter efficiency, we believe incorporating textual information (such as metric name) could enhance its performance. Furthermore, leveraging a Large Language Model to route multiple downstream KAN-AD models could further boost overall performance. Since KAN-AD is a very lightweight approach, this integration will be much easier.

Challenges with rapidly oscillating data: For EPG-type time-series data in the UCR dataset, the detection accuracy of both KAN-AD and other baselines remains relatively low due to the highly volatile nature of these signals, even within very small time windows.

Review (Rating: 4)

This paper proposes a method for univariate time series anomaly detection. In particular, they aim to approximate the time series using smooth univariate functions. They build upon a method that uses Kolmogorov-Arnold Networks to approximate the time series, by replacing B-spline functions with Fourier series. They demonstrate that the new model is more robust to anomalies in the training set and therefore tends to perform better on the test set, especially, not surprisingly, where the training set has a larger percentage of anomalies. The models are also smaller than state-of-the-art anomaly detection algorithms in terms of the number of parameters, and also have lower running times.

update after rebuttal

I had minor questions, which the authors answered. For this reason, I will keep my review as it is.

Questions For Authors

  1. The main question that I have is about differences in the types of anomalies detected by the different methods, going beyond simply performance measures. This likely will not shift my decision by a full level given that there are five choices, but this is nevertheless an important question to answer in general, and one that is only occasionally answered in Machine Learning papers.

Claims And Evidence

The algorithmic claims made by the paper are clearly true. The algorithm is clearly explained, and performance metrics (F1-based metrics) and running times and model sizes are given. The latter two are particularly notable, since Machine Learning papers rarely state their running times and model sizes, even when these are presented as advantages of the proposed methods. I'm glad the authors are bucking this disturbing tendency.

The number of datasets and types of datasets on which the algorithm is tested is insufficient. The percentage of anomalies in the training and test sets ranges from zero to 7%, which is low, but I think that this is okay, since the proposed method is aimed at addressing robustness to anomalies, so having a relatively low fraction of anomalies is actually more difficult for the method. However, the range of types of anomalies and the range of normal behavior seems to need broadening---amplitude, frequency, combinations of these.

Methods And Evaluation Criteria

While the range of benchmark datasets needs broadening, as I mentioned in the claims and evidence section, the metrics that they use for evaluation are quite appropriate.

Theoretical Claims

There are no proofs provided, which is reasonable given the nature of this paper, which is more algorithmic and experimental.

Experimental Designs Or Analyses

I did check the soundness of the experimental designs and analysis. The analysis provided is fairly thorough and appropriate. I have a few questions/comments on this:

  1. In figure 4, it appears that the 5-delay PA and Point-wise PA rows are swapped.
  2. What are the differences in the types of anomalies detected by the different methods?
  3. In 4.4.2, the authors state "In contrast, Taylor series exhibited persistent bias due to non-zero function values in most cases, hindering optimal model performance." Is this true specifically in the context of this ablation study? I can't see this being an issue in general, since Taylor series, like the other methods, allows for constant term elimination.

Supplementary Material

I reviewed the appendices, which were helpful in understanding the paper.

Relation To Broader Scientific Literature

This paper yields an anomaly detection method that has greater robustness to anomalies in the training set, is smaller in terms of the number of parameters, and runs faster, relative to recent methods that have a large number of parameters.

Essential References Not Discussed

I cannot think of any essential references that were not discussed in this paper.

Other Strengths And Weaknesses

I have no other notable strengths and weaknesses to bring up beyond what I wrote in other sections.

Other Comments Or Suggestions

  1. In figure 2, the anomalies in the bottom row appear to be more difficult to detect than the anomalies in the top row, which seems to better explain the differences in performance than the claimed difference, which is the level of noise in the training sample.

Author Response

Response to reviewer GaXC

We appreciate the constructive comments provided by the reviewer and will incorporate improvements in the subsequent version of the paper.

Comparative analysis of algorithm strengths: Indeed, different methods are good at detecting specific types of anomalies. For example, frequency-domain-based methods (FCVAE [1], FITS [2], etc.) excel at detecting periodic anomalies, methods with differential modules [3] are more adept at handling spike-type anomalies, and methods with shape clustering modules (e.g., SAND [4]) are more effective at detecting shapelet anomalies.

Bias in the Taylor series: The Taylor series can keep the value to be fitted around 0 through a constant term elimination mechanism. However, based on the mathematical properties of the Taylor series, when using a finite number of terms to fit curves with large fluctuations in absolute values, there will always be substantial residual errors (this is what we refer to as "bias" in the original paper) [5].

[1] Z Wang, et al. Revisiting VAE for Unsupervised Time Series Anomaly Detection: A Frequency Perspective. WWW 2024.

[2] Z Xu, et al. FITS: Modeling Time Series with 10k Parameters. ICLR 2024.

[3] Wu R, et al. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. TKDE 2021.

[4] P Boniol, et al. Sand: streaming subsequence anomaly detection. VLDB 2021.

[5] Rudin W. Principles of Mathematical Analysis. 2021.

Reviewer Comment

In response to your comment under "comparative analysis of algorithm strengths," thank you for this discussion. However, my question was on how well your proposed method detects anomalies of the types that you mentioned and others.

Author Comment

Response to Reviewer GaXC

Comparative Analysis of Algorithm Strengths: Thank you for your kind reminder. In our first-round response, we mentioned three types of anomalies, and KAN-AD demonstrates strong performance in detecting all of them. Specifically, for periodicity-related anomalies, KAN-AD benefits from the introduced Fourier univariate functions, enabling precise identification of low-frequency variations, as illustrated in Figure 2 of the paper. Regarding spike-type anomalies, the constant term elimination module in KAN-AD enhances the significance of spikes, making them easier to detect. Finally, for shape-related anomalies, KAN-AD combines univariate functions in both the time and frequency domains, effectively capturing amplitude and periodic characteristics, leading to improved detection performance, as shown in Figure 5 of the paper.

Final Decision

There is a clear consensus in the PC that this is a strong paper that should be accepted. Congratulations to the authors! I am requesting the authors to carefully incorporate the suggestions provided by the reviews in the camera-ready version.