PaperHub
Rating: 6.4 / 10 · Poster · NeurIPS 2025
4 reviewers, scores: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.8 · Novelty: 2.5 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8

Fortifying Time Series: DTW-Certified Robust Anomaly Detection

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose the first certified defense tailored for time-series data under Dynamic Time Warping (DTW) distance.

Abstract

Time-series anomaly detection is critical for ensuring safety in high-stakes applications, where robustness is a fundamental requirement rather than a mere performance metric. Addressing the vulnerability of these systems to adversarial manipulation is therefore essential. Existing defenses are largely heuristic or provide certified robustness only under $\ell_p$-norm constraints, which are incompatible with time-series data. In particular, $\ell_p$-norm fails to capture the intrinsic temporal structure in time series, causing small temporal distortions to significantly alter the $\ell_p$-norm measures. Instead, the similarity metric Dynamic Time Warping (DTW) is more suitable and widely adopted in the time-series domain, as DTW accounts for temporal alignment and remains robust to temporal variations. To date, however, there has been no certifiable robustness result in this metric that provides guarantees. In this work, we introduce the first DTW-certified robust defense in time-series anomaly detection by adapting the randomized smoothing paradigm. We develop this certificate by bridging the $\ell_p$-norm to DTW distance through a lower-bound transformation. Extensive experiments across various datasets and models validate the effectiveness and practicality of our theoretical approach. Results demonstrate significantly improved performance, e.g., up to 18.7% in F1-score under DTW-based adversarial attacks compared to traditional certified models.
Keywords
Certified Robustness · Anomaly Detection · Time Series

Reviews and Discussion

Review
Rating: 4

The paper proposes a novel DTW-certified robustness framework by adapting the randomized smoothing approach. The proposed method bridges the $\ell_p$-norm to the DTW distance via a lower-bound transformation. Extensive experiments on various baselines and datasets are conducted to showcase the advantages of the proposed method.

Strengths and Weaknesses

Strengths:

S1. The paper provides a novel theoretical framework for bridging the $\ell_p$-norm and the DTW distance.
S2. The paper proposes a robust and general defense mechanism that is model-agnostic.
S3. The paper includes extensive experiments on various baselines and datasets.

Weaknesses:

W1. Lacks efficiency studies: no time-cost analysis or empirical studies.
W2. Lacks large-scale datasets, i.e., million-scale.
W3. Lacks more comprehensive hyperparameter studies, e.g., studies on the number of noisy samples and the percentile p.
W4. Figure 1 could be improved with legends to indicate the adversarial changes.

Questions

Q1. What is the time cost of the proposed method? (W1)
Q2. Can the proposed method scale to million-scale datasets? (W2)
Q3. What are the impacts of different numbers of noisy samples and the percentile p? (W3)

Limitations

yes

Formatting Issues

no major issues

Author Response

We sincerely thank the reviewer for the valuable feedback provided. Below, we clarify and address each concern in detail:

Q1 As shown in Figure 2, the DTW-based certification process uses an analytical solution that introduces no computational overhead. The observed slowdown arises from the Monte Carlo sampling step, which increases inference time roughly in proportion to the number of noisy samples $n$ (e.g., $n = 1000$). The parameter $n$ is user-configurable: larger values yield more precise estimates and typically result in a larger certified radius.

We would like to note that such overhead is intrinsic to all randomized smoothing methods [32][18][55], reflecting a fundamental trade-off for significantly stronger robustness guarantees. Nevertheless, randomized smoothing is one of the most efficient and scalable approaches compared to other certified robustness methods [18]. Our method does not incur any additional cost beyond that of standard randomized smoothing, and reducing sample complexity remains an active research direction. In practice, the overhead can be substantially reduced through parallelization using multiple copies of the anomaly detector.

We provide additional results on running time overhead, evaluated in the settings as described in Section 4.

  • Normal inference: 1/498 s per batch (batch size 64)
  • Inference and Certification: 1.83 s per batch (batch size 64)
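For intuition on where this factor comes from, below is a minimal sketch of the percentile-smoothing inference step, assuming a generic `score_fn` and Gaussian noise (the names `score_fn`, `sigma`, and `q` are illustrative placeholders, not the paper's implementation):

```python
import numpy as np

def smoothed_score(score_fn, x, sigma=0.5, n=1000, q=0.5, seed=0):
    """Monte Carlo estimate of a percentile-smoothed anomaly score.

    score_fn : base detector mapping an array of shape (L, C) to a scalar score
    sigma    : standard deviation of the Gaussian smoothing noise
    n        : number of noisy samples; inference cost grows linearly in n
    q        : percentile in [0, 1] (q = 0.5 gives median smoothing)
    """
    rng = np.random.default_rng(seed)
    noisy = x[None, ...] + rng.normal(0.0, sigma, size=(n, *x.shape))
    scores = np.array([score_fn(s) for s in noisy])  # n forward passes dominate the cost
    return float(np.quantile(scores, q))
```

The $n$ forward passes are mutually independent, which is why running multiple copies of the detector in parallel, as noted above, reduces wall-clock overhead without changing the estimate.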

Q2 Our method does not require any modification to the training procedure and therefore imposes no computational overhead during training.

Furthermore, we believe our approach can scale to million-scale testing datasets. As evidence, the SMD dataset evaluated in our experiments contains approximately 0.7 million ($7 \times 10^5$) testing timesteps with 25 channels, and required 3:36:19 (h:m:s) to complete both inference and certification. The computational overhead at testing time grows linearly with the dataset size, maintaining an inference time complexity of $O(n)$, equivalent to that of a standard model.

Q3 The impact of the number of noisy samples is addressed in Q1. The percentile parameter $p$ has little impact on results as long as it reflects the majority distribution and is not overly influenced by outliers (i.e., extreme values $p = 0.0$ or $p = 1.0$). We will include a more comprehensive ablation study in the camera-ready version to further examine its effect.

We sincerely thank the reviewer again for their time and effort. We hope that our responses have clarified the issues raised.

Comment

Thanks for the explanation. I will keep the score.

Review
Rating: 4

Existing defenses against adversarial samples are either heuristic or provide certified robustness only under the $\ell_p$-norm. Time-series data has a unique characteristic whereby a small temporal shift may result in a much larger $\ell_p$-norm distance. This paper integrates the DTW (Dynamic Time Warping) similarity metric with the $\ell_p$-norm guarantee and proposes the first DTW-certified robust defense for time-series anomaly detection.

Strengths and Weaknesses

Strengths:

This paper addresses a meaningful problem where the existing $\ell_p$-norm fails to preserve semantic meaning in temporal data, and is the first work to do so.

Theoretical proofs are rigorous and solid.

Evaluation results are comprehensive and demonstrate model-agnostic applicability.

Weaknesses:

Although the computational overhead of the Monte Carlo approach is mentioned as a known limitation, a runtime comparison would give a sense of how large this overhead can be.

A discussion of the various attacks that may or may not bypass the proposed defense would also help potential users decide whether the method is applicable to their systems.

Questions

How would an attack evolve to bypass this DTW-certified mechanism?

In general, how big is the test-time overhead compared to baseline methods? Would it be a concern for adoption?

Limitations

yes

Formatting Issues

None

Author Response

We sincerely thank the reviewer for the valuable feedback provided. Below, we clarify and address each concern in detail:

Computation overhead: The Monte Carlo sampling introduces computational overhead, slowing inference by approximately a factor of $n$ (the number of noisy samples). The value of $n$ can be user-specified, where a larger $n$ gives a more precise estimation and usually a larger certified radius.

We would like to note that such overhead is intrinsic to all randomized smoothing methods [32][18][55], reflecting a fundamental trade-off for significantly stronger robustness guarantees. Nevertheless, randomized smoothing is still one of the most efficient and scalable approaches compared to other certified robustness methods [18].

Our method does not incur any additional cost beyond that of standard randomized smoothing, and reducing sample complexity remains an active research direction. In practice, the overhead can be substantially reduced through parallelization using multiple copies of the anomaly detector.

We provide additional results on running time overhead, evaluated in the settings as described in Section 4.

  • Normal inference: 1/498 s per batch (batch size 64)
  • Inference and Certification: 1.83 s per batch (batch size 64)

Attack to bypass this defense: As a certified defense method, we theoretically guarantee that no adversarial attack can bypass our approach as long as the adversarial example $x'$ lies within the certified DTW radius $e$, i.e., $\mathrm{DTW}(x, x') < e$ (see Section 3.1, Threat Model and Certified Defense Goal). An attacker can only compromise the defense by introducing a perturbation exceeding the certified radius.
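To make this guarantee concrete, the following hypothetical check illustrates it with a textbook dynamic-programming DTW (squared local costs; the paper's exact DTW variant and radius computation may differ):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def covered_by_certificate(x, x_adv, e):
    """The guarantee covers every x_adv with DTW(x, x_adv) < e."""
    return dtw_distance(x, x_adv) < e
```

Any attack strong enough to flip the smoothed prediction must therefore move the input outside the certified DTW ball of radius $e$.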

We sincerely thank the reviewer again for their time and effort. We hope that our responses have clarified the issues raised.

Comment

We sincerely appreciate the reviewer’s time and effort, and we hope our rebuttal has addressed the concerns raised. Please let us know if there are any remaining questions or if further clarification would be helpful.

Review
Rating: 4

This paper aims to provide provable robustness for time-series anomaly detection under the Dynamic Time Warping (DTW) distance. The authors argue that existing $\ell_p$-norm-based certification methods are unsuitable for time-series data due to their sensitivity to temporal distortions. Building upon this observation, they propose a novel certification approach that bridges the $\ell_p$-norm to the DTW distance through randomized smoothing and a lower-bound transformation. The proposed method is model-agnostic and has been experimentally validated across multiple datasets.

Strengths and Weaknesses

Strengths:

  1. The research problem addressed in this paper is meaningful. Identifying more appropriate robustness metrics (e.g., DTW) for time-series data compared to conventional $\ell_p$-norm measures represents a valuable research direction.
  2. The model-agnostic nature of the proposed methodology substantially enhances its potential applicability across diverse domains.
  3. The authors propose a novel certification approach that bridges the $\ell_p$-norm to the DTW distance through randomized smoothing and a lower-bound transformation, supported by a strong theoretical framework.

Weaknesses:

  1. Although the paper compares with $\ell_p$-norm certification, this comparison may not be entirely fair. The DTW attack itself is specifically designed to maximize the DTW distance, so it is expected that $\ell_p$-norm certification would perform poorly under such attacks.
  2. The results (Table 1) show that on some datasets (such as SMAP and NIPS-TS-SWAN), the mean certifiable radius is very small (e.g., 0.037 and 0.022) and the certified proportion is not high. This indicates that the method's practical protection on these datasets is quite limited. Although the paper's explanation (high dimensionality and large data variance) is reasonable, it also exposes a vulnerability of the method.
  3. What is the difference between the smoothed anomaly score function and the median smoothing in [14]? Why is only the sup function used to represent this, and not the inf function?
  4. A literature citation is required when Dynamic Time Warping first appears in the paper.
  5. $X = \mathbb{R}^{L \times C}$ should be changed to $X \in \mathbb{R}^{L \times C}$.
  6. The contributions stated in the abstract and introduction are weak. In the method section, this work simply combines randomized smoothing with DTW, and the claimed contribution is not fully reflected there. In the experimental section, the feasibility of the theory is not validated on more methods, and the performance improvement brought by the method is not uniform.

Questions

See my comments in the weaknesses.

Limitations

While the paper introduces a theoretically sound DTW-based certification approach with broad applicability, several weaknesses limit its impact. The comparison with $\ell_p$-norm certification is arguably unfair, as DTW-specific attacks inherently favor DTW metrics. Practical effectiveness is questionable, with extremely small verifiable radii (e.g., 0.022) and low certified proportions on key datasets (SMAP, NIPS-TS-SWAN), revealing vulnerability to high-dimensional, high-variance data. The method's novelty is unclear: it combines randomized smoothing and DTW without sufficiently differentiating itself from prior work (e.g., median smoothing in [14]) or justifying design choices (e.g., exclusive use of the sup function). Presentation issues (missing citations, unclear notation) and weakly articulated contributions further undermine the work. Most critically, experiments lack validation across diverse methods, and performance gains are inconsistent, leaving the method's added value unconvincing.

Final Justification

My comments are well addressed, and I would like to raise my score.

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for the valuable feedback provided. Below, we clarify and address each concern in detail:

Concern in Weakness 1 "Comparison with $\ell_p$-Norm Defenses": As discussed in the Abstract and Introduction (lines 30-49), the inadequacy of existing $\ell_p$-norm certification methods for time-series data, their poor performance under DTW-based attacks, and the complete absence of any prior certified defense against DTW attacks were the primary motivations for developing our proposed DTW-certified defense. The performance gap between $\ell_p$-norm certification and our DTW-certified defense validates our assumption and clearly demonstrates the superiority of the proposed approach under realistic DTW-based adversarial scenarios.

Concern in Weakness 2 "Performance Differences Across Datasets": We acknowledge that our certified defense does not achieve identical performance across all datasets. However, it is important to note that no single method/setting achieves optimal performance for all data, and this is a well-known phenomenon in existing certified robustness literature [32][18][55].

As discussed in lines 265-267 & 292-298, the variation arises from differences in dataset characteristics and model architectures, particularly their sensitivity to injected noise. Consequently, tuning the noise level is often necessary to achieve optimal certification performance in each setting. For example, on SMAP (COUTA), the mean certified radius is 0.037 with a certified proportion of 51.06% at noise level 0.5 (Table 1), but improves to a mean radius of 0.202 and certified proportion of 72.24% at noise level 2.0 (Table 2).

Since the primary focus of this work is to introduce a novel DTW-based certified defense, we deliberately reported results at a fixed noise level (Table 1) rather than cherry-picking and tuning hyperparameters for each dataset. Table 2 further demonstrates that, with minor tuning, our method can achieve consistently high certified radii and proportions in most settings.

Concern in Weakness 3 & 6 "Contribution and Difference with [14]": We agree with the first point in Strengths that our main contribution lies in providing a theorem for a more appropriate certified robustness metric, DTW, as formalized in Lemma 3.2 and Theorem 3.3 (Section 3.2). In contrast, the method in [14] applies exclusively to $\ell_p$-norm metrics.

While we incorporate median smoothing from [14] as part of our anomaly score construction, the certification theorem we propose for DTW is non-trivial. It is the first in the literature to provide a certified guarantee in DTW distance. Importantly, a DTW radius generally encompasses more adversarial examples than an $\ell_p$-norm radius of the same size, meaning that our DTW certificate cannot be derived through a simple adaptation or trivial combination of existing $\ell_p$-norm results with the DTW definition.
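For readers unfamiliar with the lower-bound transformation that connects the two radii, below is a textbook sketch of the Keogh lower bound (LB_Keogh) with a Sakoe-Chiba warping window `w`; this is the standard bound from the literature, not the authors' code:

```python
import numpy as np

def lb_keogh(q, c, w):
    """Keogh lower bound: LB_Keogh(q, c) <= DTW(q, c) under warping window w."""
    lb_sq = 0.0
    for i in range(len(q)):
        lo, hi = max(0, i - w), min(len(q), i + w + 1)
        upper, lower = q[lo:hi].max(), q[lo:hi].min()  # envelope of q around step i
        if c[i] > upper:        # only points escaping the envelope contribute
            lb_sq += (c[i] - upper) ** 2
        elif c[i] < lower:
            lb_sq += (c[i] - lower) ** 2
    return float(np.sqrt(lb_sq))
```

Because LB_Keogh never exceeds the true DTW distance, any perturbation with small DTW distance also has a small envelope deviation, and the latter is measured in $\ell_2$ terms; this is the direction that lets an $\ell_2$-norm certificate be translated into a DTW-distance guarantee.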

Concern in Weakness 3 "no inf function": In applying median smoothing, we note that the $\sup$ and $\inf$ functions in [14], Definition 1, are equivalent for continuous distributions, as stated in [14]: "The inf and sup are equivalent for continuous distributions; the distinction is needed to handle edge cases with discrete distributions."

Concern in Weakness 4 & 5 "Notation and Citations": We will add the corresponding citations and update the notation in the camera-ready version.

Concern in Limitation "Experimental Validation Across Diverse Methods": We evaluated our method on three representative state-of-the-art (SOTA) anomaly detection models across eight widely used time-series anomaly detection datasets as reported in Section 4. The experimental setup is consistent with best practices in both the time-series anomaly detection [4][58][59] and certified robustness [18][55] communities.

We sincerely thank the reviewer again for their time and effort. We hope that our responses have clarified the issues raised and kindly request the reviewer's consideration.

Comment

We sincerely appreciate the reviewer’s time and effort, and we hope our rebuttal has addressed the concerns raised. Please let us know if there are any remaining questions or if further clarification would be helpful.

Review
Rating: 4

The paper introduces the first certified‐robust defense for time-series anomaly detection measured in Dynamic Time Warping (DTW) distance. It adapts randomized smoothing to obtain an ℓ₂-norm certificate and then analytically translates this guarantee to DTW by exploiting the Keogh lower bound, yielding a closed-form certified radius that can be reported at inference time. Because the smoothing wrapper sits on top of any pre-trained detector, the method is model-agnostic and requires no retraining. Empirical tests on seven benchmark datasets and three modern detectors show that the approach maintains high detection accuracy while improving F1 scores against strong DTW adversaries, substantiating both its practicality and theoretical soundness.

Strengths and Weaknesses

Strengths

  1. The paper closes a gap by deriving a formal certificate in Dynamic Time Warping (DTW) distance rather than the usual $\ell_p$ norms, which are known to mis-characterise temporal distortions. This is positioned as the first such result in this domain.

  2. Because the defense is applied post-training (it wraps a pre-trained detector with a smoothing/denoising layer), practitioners can add robustness onto existing systems without retraining.

  3. By proving Lemma 3.2 and Theorem 3.3, the authors map any $\ell_2$-certified radius obtained via randomized smoothing to an explicit DTW radius through the Keogh lower bound. This connection is practically computable.

Weaknesses

  1. The paper's robustness guarantee quietly depends on the anomaly-score function changing smoothly whenever the input is nudged a tiny bit. The key lemma assumes that adding a small amount of Gaussian noise cannot reorder which inputs look more or less anomalous, something that holds only when the score has no sudden jumps. The three neural network detectors tested in the experiments behave this way, so the proof is valid for them. But any detector that uses hard thresholds, coarse rounding, or aggressive max-pool operations could let its score flip abruptly under minuscule input changes, breaking the guarantee. Therefore, the defense is provably safe only for detectors that produce smooth, gently varying scores, which narrows its general usefulness.

  2. The paper's promise that its method can be extended to any distance norm "with minimal modification" is not backed up by the proofs. Every step of the theoretical argument is built around ordinary Euclidean distance: the randomized-smoothing lemma protects the model only against perturbations measured in that specific way, and the later conversion from Euclidean radius to a Dynamic-Time-Warping radius depends on algebra and geometric facts that hold only under Euclidean geometry. The authors never supply a new noise-adding scheme, inequality, or optimization path that would make the same guarantee work for other norms, so the statement in lines 124-126 that such a change would be easy remains an unproven claim.

  3. Certified prediction relies on huge amounts of Monte-Carlo samples, adding tangible latency that could hinder real-time monitoring.

Questions

  1. Generality Beyond Euclidean Norms. The paper states that the proposed certificate “readily generalizes to arbitrary norm orders p” (line 125). However, both the theoretical development and the experiments appear tailored to the Euclidean norm. Could the authors provide a concrete derivation for at least one non-Euclidean case (e.g., p=1) and report certified radii on the same datasets?

  2. Smoothness Assumption in Lemma 3.1. Lemma 3.1 hinges on smooth anomaly scores, yet the manuscript does not explicitly state any regularity conditions on the anomaly score function $f$. I encourage the authors to formally state the smoothness or continuity assumptions required for the lemma to hold, and possibly include a brief proof sketch showing that the percentile smoothing preserves these properties.

  3. Inference Efficiency and Latency. The proposed method relies on large numbers of Monte Carlo samples (e.g., $n = 1000$ in experiments), which may cause considerable inference latency. It would be helpful if the authors could include measurements of wall-clock latency or throughput, particularly for typical values of $n$.

Limitations

Yes

Final Justification

My concerns have been thoroughly addressed through the rebuttal. I have updated my score accordingly.

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for the valuable feedback provided. Below, we clarify and address each concern in detail:

1 Generality to Arbitrary Norm Orders $p$: Our method builds upon randomized smoothing, which has been generalized to arbitrary norm orders $p$ in the cited work [73]. In summary, different norm orders can be achieved by adding appropriate noise distributions:

  • Gaussian noise for $p = 2$
  • Laplace noise for $p = 1$
  • Uniform noise for $p = \infty$

The proposed DTW certificate can likewise be generalized to any norm order $p$ (e.g., $p = 1$) by replacing Gaussian noise with the corresponding distribution (e.g., Laplace noise). Regarding the certification process, Lemma 3.2 and its proof remain valid for arbitrary $p$, and Theorem 3.3 requires the following modifications (a code sketch of the noise swap follows the list):

  1. Redefine the certified radius $r$ from norm-2 to norm-1.
  2. Redefine $R$ from norm-2 to norm-1 as $R = \sum \|\delta_i\|$.
  3. In Eq. (15), replace the norm-2 expansion with the norm-1 expansion: $r = \sum \|\delta_i\| + \|d\|$.
  4. Following the same proof steps, the expression of $e$ in Eq. (10) becomes $e = r - R$.
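As referenced above, a minimal sketch of the noise swap (parameter names are hypothetical, and scale parametrizations as well as the resulting radius formulas differ across papers; see [73] for the exact statements):

```python
import numpy as np

def smoothing_noise(shape, scale, p, rng=None):
    """Sample smoothing noise matching the target l_p certificate."""
    if rng is None:
        rng = np.random.default_rng()
    if p == 2:
        return rng.normal(0.0, scale, size=shape)      # Gaussian -> l_2 radius
    if p == 1:
        return rng.laplace(0.0, scale, size=shape)     # Laplace  -> l_1 radius
    if p == float("inf"):
        return rng.uniform(-scale, scale, size=shape)  # Uniform  -> l_inf radius
    raise ValueError(f"unsupported norm order p={p}")
```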

Given the limited time, we provide initial results for the norm-1 certification of the COUTA model here (aligned with Table 1, last 6 columns), and will expand on this discussion in greater detail in Sec. 4 of the camera-ready version.

| Dataset | F1-score | ROC AUC | Radii Mean | Radii Max | Radii Std. | Certified Prop. |
| --- | --- | --- | --- | --- | --- | --- |
| SMAP | 0.959 | 0.998 | 0.058 | 0.444 | 0.078 | 47.23% |
| SMD | 0.574 | 0.937 | 0.028 | 0.207 | 0.041 | 62.83% |
| MSL | 0.818 | 0.970 | 0.027 | 0.496 | 0.075 | 44.23% |
| NIPS-TS-SWAN | 0.738 | 0.795 | 0.012 | 0.432 | 0.032 | 12.97% |
| NIPS-TS-CREDITCARD | 0.652 | 0.919 | 0.054 | 0.322 | 0.073 | 74.34% |
| NIPS-TS-WATER | 0.445 | 0.969 | 0.083 | 0.217 | 0.021 | 84.36% |
| UCR-1 | 0.974 | 0.999 | 0.040 | 0.300 | 0.018 | 57.25% |
| UCR-2 | 0.688 | 0.972 | 0.013 | 0.094 | 0.079 | 19.47% |

2 Applicability to Arbitrary Anomaly Score Functions: Our method and Lemma 3.1 impose no smoothness or continuity assumptions on the anomaly score function $f: \mathcal{X} \to \mathbb{R}$. The proposed approach is applicable to any anomaly score function used as described in Section 3.1 (Para. Time-series Anomaly Detector), where an anomaly is identified when the anomaly score exceeds the threshold. This includes score functions with sudden jumps, such as those involving hard thresholds, coarse rounding, or aggressive max-pool operations. This generality follows from our use of a randomized smoothing approach, which can be applied to any function as one of its key advantages, as established in [17, 14].
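As a one-dimensional illustration of why no regularity is needed (our notation, not the paper's): take a hard-threshold score $f(x) = \mathbf{1}\{x > t\}$, which is discontinuous at $t$. Its Gaussian-smoothed counterpart is

$$g(x) = \mathbb{E}_{\delta \sim \mathcal{N}(0,\sigma^2)}\left[\mathbf{1}\{x + \delta > t\}\right] = \Phi\!\left(\frac{x - t}{\sigma}\right),$$

where $\Phi$ is the standard normal CDF. The smoothed score is infinitely differentiable in $x$ even though $f$ is not: the required smoothness is supplied by the noise, not assumed of the base detector.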

3 Computational Overhead: We acknowledge that Monte Carlo sampling introduces computational overhead, slowing inference by approximately a factor of $n$ (the number of noisy samples). The value of $n$ can be user-specified, where a larger $n$ gives a more precise estimation and usually a larger certified radius.

We would like to note that such overhead is intrinsic to all randomized smoothing methods [32][18][55], reflecting a fundamental trade-off for significantly stronger robustness guarantees. Nevertheless, randomized smoothing is still one of the most efficient and scalable approaches compared to other certified robustness methods [18].

Our method does not incur any additional cost beyond that of standard randomized smoothing, and reducing sample complexity remains an active research direction. In practice, the overhead can be substantially reduced through parallelization using multiple copies of the anomaly detector.

We provide additional results on running time overhead, evaluated in the settings as described in Section 4.

  • Normal inference: 1/498 s per batch (batch size 64)
  • Inference and Certification: 1.83 s per batch (batch size 64)

We sincerely thank the reviewer again for their time and effort. We hope that our responses have clarified the issues raised and kindly request the reviewer's consideration.

Comment

We sincerely appreciate the reviewer’s time and effort, and we hope our rebuttal has addressed the concerns raised. Please let us know if there are any remaining questions or if further clarification would be helpful.

Comment

Thank you for the detailed response. My concerns regarding generalization to arbitrary norm orders, applicability to anomaly score functions, and computational overhead have been addressed. The theoretical clarifications and additional results are convincing.

I will raise my score and recommend incorporating these insights into the final version to further strengthen the paper.

Final Decision

This paper introduces a DTW-based certified defense for time-series anomaly detection, offering a solid theoretical foundation and demonstrating model-agnostic applicability across diverse datasets. Its novelty and rigor make it a valuable contribution to the field. The main limitations concern the reliance on smoothness assumptions, computational overhead, and relatively limited evaluation scope, which constrain its demonstrated practicality. Nonetheless, all reviewers gave positive feedback, and the rebuttal effectively addressed most concerns. Overall, I recommend acceptance, with the expectation that the authors incorporate reviewer suggestions in the final version to strengthen clarity, efficiency analysis, and broader validation.