PaperHub
Score: 7.8 / 10
Decision: Poster · 4 reviewers
Ratings: 4, 6, 4, 5 (min 4, max 6, std. dev. 0.8)
Confidence: 4.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection

OpenReview · PDF
Submitted: 2025-05-05 · Updated: 2025-10-29
TL;DR

ScatterAD leverages representation scattering as an inductive signal to jointly model temporal and topological patterns for effective multivariate time series anomaly detection.

Abstract

Keywords
Time series anomaly detection, representation learning, mutual information, graph attention network

Reviews and Discussion

Review
Rating: 4

This paper introduces a novel anomaly detection framework, ScatterAD, which jointly enhances feature discriminability and temporal consistency by using a temporal-topological scattering mechanism. Specifically, ScatterAD employs a dual-encoder architecture, consisting of an online encoder and a target encoder, where the online encoder is updated via backpropagation and the target encoder is updated using an exponential moving average (EMA) of the online encoder's parameters. The online encoder enforces temporal consistency across adjacent time steps, while the target encoder captures compact and consistent scattering representations in a projected hypersphere space. Moreover, a contrastive fusion mechanism is employed to align temporal and topological representations through a cross-view contrastive loss. The effectiveness of the method is validated through extensive experiments.
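As a concrete illustration of the EMA update described in this summary, here is a minimal PyTorch sketch; the encoder module and decay value are illustrative stand-ins, not the paper's actual implementation.

```python
import copy
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, decay: float = 0.99) -> None:
    """Update the target encoder as an exponential moving average of the
    online encoder; the target is never updated by backpropagation."""
    for p_tgt, p_onl in zip(target.parameters(), online.parameters()):
        p_tgt.mul_(decay).add_(p_onl, alpha=1.0 - decay)

# Usage: the target starts as a frozen copy of the online encoder.
online_encoder = torch.nn.Linear(8, 8)           # stand-in for the real encoder
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)
ema_update(online_encoder, target_encoder, decay=0.99)
```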

Strengths and Weaknesses

Strengths: The paper formalizes the dispersion phenomenon of normal and anomalous samples in high-dimensional space as scattering, quantified by the mean pairwise distance among sample representations. The authors further leverage this insight as an inductive signal to enhance spatio-temporal anomaly detection and propose a novel anomaly detection framework, ScatterAD, providing valuable insights for multivariate time series anomaly detection. The overall paper is well-structured and easy to follow. The motivation of each component in the proposed framework is well explained. The effectiveness of the proposed framework is validated through both theoretical analysis and extensive experiments across diverse datasets.

Weaknesses: While the paper provides empirical observations that the scattering of anomalous samples tends to be more pronounced in high-dimensional space, it remains unclear whether this phenomenon generalizes across datasets with different anomaly types or varying noise levels. The paper could benefit from systematically examining the scattering phenomenon across more datasets. Additionally, the experimental results should report means and standard deviations over multiple repetitions to ensure statistical reliability.

Questions

(a) Could the authors provide further justification or quantitative analysis to confirm that the observed scattering behavior is consistent and not occasional?

(b) Could the authors clarify whether the reported experimental results are based on a single run or averaged over multiple runs?

(c) The global scattering center c in the paper is randomly initialized inside the unit ball. Could the authors clarify whether this random initialization significantly affects model performance or stability? Have you considered alternative strategies?

Limitations

Yes.

Final Justification

The authors have addressed my concerns; however, the novelty and technical quality still do not warrant a higher score, so I am maintaining my original rating.

Formatting Issues

No formatting issues.

Author Response

We sincerely thank Reviewer jjez for the detailed comments and insightful questions.

Q1: Analysis of the universality of the scattering phenomenon.

To address your concerns about the universality of the scattering phenomenon, we present quantitative results on the scattering phenomenon for all six benchmark datasets used in our study. Furthermore, to evaluate the model's robustness, we inject additive Gaussian noise at three intensities (σ = 0.5, 1.0, 2.0) into the normalized test data, with σ = 0.0 as the no-noise baseline. We quantify the impact of noise on the model's internal discriminative power by calculating the average scattering score for normal and anomalous samples, which we define as the mean Euclidean distance of samples to their respective class center, and by analyzing their separation ratio.
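For reference, a minimal NumPy sketch of how such a scattering score and separation ratio can be computed. The representations and anomaly labels here are synthetic stand-ins (in the actual experiment, noise is added to the test inputs before encoding), so the printed numbers are illustrative only; the real results are in the table below.

```python
import numpy as np

def scattering_scores(z: np.ndarray, labels: np.ndarray) -> dict:
    """Mean Euclidean distance of each class's representations to that
    class's own center, plus the anomalous/normal separation ratio."""
    scores = {}
    for name, cls in (("normal", 0), ("anomalous", 1)):
        z_cls = z[labels == cls]
        center = z_cls.mean(axis=0)
        scores[name] = float(np.linalg.norm(z_cls - center, axis=1).mean())
    scores["separation_ratio"] = scores["anomalous"] / scores["normal"]
    return scores

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 32))                  # stand-in representations
labels = (rng.random(1000) < 0.1).astype(int)    # ~10% "anomalies"
for sigma in (0.0, 0.5, 1.0, 2.0):               # additive Gaussian noise levels
    noisy = z + rng.normal(scale=sigma, size=z.shape)
    print(sigma, scattering_scores(noisy, labels))
```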

| Dataset | Noise Level (σ) | Scattering Score (Normal) | Scattering Score (Anomalous) | Separation Ratio (Anomalous/Normal) |
|---|---|---|---|---|
| MSL | 0.0 | 16.0 | 18.2 | 1.14 |
| MSL | 0.5 | 21.6 | 31.6 | 1.46 |
| MSL | 1.0 | 36.0 | 49.3 | 1.37 |
| MSL | 2.0 | 65.9 | 84.6 | 1.28 |
| PSM | 0.0 | 31.5 | 37.8 | 1.20 |
| PSM | 0.5 | 33.1 | 40.3 | 1.22 |
| PSM | 1.0 | 41.2 | 47.4 | 1.15 |
| PSM | 2.0 | 61.1 | 65.7 | 1.08 |
| SWaT | 0.0 | 73.10 | 94.09 | 1.29 |
| SWaT | 0.5 | 75.63 | 95.71 | 1.27 |
| SWaT | 1.0 | 79.22 | 98.03 | 1.24 |
| SWaT | 2.0 | 84.50 | 102.11 | 1.21 |
| WADI | 0.0 | 0.78 | 1.16 | 1.49 |
| WADI | 0.5 | 34.9 | 43.4 | 1.24 |
| WADI | 1.0 | 87.1 | 72.4 | 0.83 |
| WADI | 2.0 | 166.7 | 137.0 | 0.82 |
| N-T-S | 0.0 | 20.2 | 28.4 | 1.41 |
| N-T-S | 0.5 | 21.8 | 32.4 | 1.49 |
| N-T-S | 1.0 | 27.7 | 42.2 | 1.52 |
| N-T-S | 2.0 | 42.1 | 65.5 | 1.56 |
| N-T-W | 0.0 | 21.60 | 95.97 | 4.44 |
| N-T-W | 0.5 | 26.10 | 105.33 | 4.04 |
| N-T-W | 1.0 | 40.70 | 41.40 | 1.02 |
| N-T-W | 2.0 | 68.60 | 197.30 | 2.88 |

The results in the table above clearly reveal a consistent anomaly scattering pattern across six different real-world datasets. In addition to this quantitative evidence, we would like to draw the reviewer's attention to Figure 3 in our paper, which qualitatively supports the consistency of the scattering behavior from another perspective. Furthermore, we wish to highlight the case of the WADI dataset, where under medium-to-high noise, the scattering score of normal samples surpasses that of anomalies. We attribute this phenomenon primarily to WADI's high dimensionality (123 dimensions), where strong noise may have a compounding effect and lead to complex changes in inter-variable correlations. This also indicates that while ScatterAD is robust in various noisy environments, its performance can be sensitive to strong noise when dealing with extremely high-dimensional datasets with subtle anomaly patterns. Thank you again for this insightful question, which helps us to more comprehensively characterize the performance boundaries of our method.

Q2: Clarification on the reproducibility of experimental results.

Thank you for this question, which is crucial for ensuring the rigor and reproducibility of our work. The experimental results reported in our paper, including the baseline evaluations, are based on the average of three independent runs. We will add standard deviations for all core experimental results in the revised version to more comprehensively demonstrate the stability and statistical reliability of our findings.

Q3: Regarding the random initialization strategy for the scattering center $c$.

We sincerely thank you for your question regarding the initialization strategy for the global scattering center $c$. This insightful question was also raised by Reviewer WWNK, and we kindly invite you to refer to our detailed response in Section W1. In brief, our experiments show that the random initialization of $c$ is a simple yet effective strategy. For a detailed analysis, including stability tests, please see Section W1.

Comment

We appreciate your acknowledgement of our core contributions, as well as the important questions you raised regarding the universality of the scattering phenomenon. Thank you again for helping us improve our work.

Comment

Thank you for the detailed and thoughtful responses. I appreciate the additional quantitative analysis on the universality of the scattering phenomenon, as well as the clarification on experimental reproducibility and initialization strategies.

These responses have satisfactorily addressed my concerns, and I believe the authors have made a strong effort to characterize both the strengths and limitations of their approach. Given the current quality and novelty of the work, I am happy to maintain my original evaluation.

Review
Rating: 6

The authors insightfully propose ScatterAD, a spatiotemporal anomaly detection framework for industrial IoT, introducing representation dispersion as an inductive bias to model the scattered nature of anomalies in high-dimensional space. ScatterAD combines a topological encoder and a temporal encoder to extract complementary features, and detects anomalies via scattering deviation and temporal inconsistency. The authors also prove that maximizing conditional mutual information is equivalent to a contrastive loss objective. Experiments on six benchmarks confirm its effectiveness through strong performance, visual analysis, and ablation studies.

Strengths and Weaknesses

Strengths:

The strengths of the work can be summarized as follows:

  1. Well written and clear. The paper introduces representation scattering as a novel perspective for high-dimensional anomaly detection, with a logical flow from dispersion to feature separation and detection.
  2. The paper presents a novel MTSAD approach inspired by the scattering phenomenon in industrial data, demonstrating clear insight.
  3. The effectiveness of the method is demonstrated on multiple MTSAD benchmarks with comprehensive evaluations.
  4. The authors’ theory unifies contrastive learning and conditional mutual information to enhance spatiotemporal representation complementarity.
  5. The authors provide the code implementation in the supplementary materials, which facilitates reproducibility.

Weaknesses:

  1. A potential weakness lies in the initialization of the scattering center $c$ in Equation (6), which is randomly sampled within the unit sphere. This strategy may lead to suboptimal local minima and instability during training. For instance, poor initialization is a well-documented problem in machine learning, often causing slower convergence or trapping the model in unfavorable local optima, which degrades performance [1, 2].
  2. The scattering mechanism fundamentally depends on cosine similarity between samples and a reference center. However, the manuscript does not discuss the well-documented limitations of cosine similarity in high-dimensional settings, which may affect the robustness of the method.

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010.

[2] Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." International conference on machine learning. PMLR, 2013.

Questions

Please see weakness.

Limitations

The authors have sufficiently addressed the limitations of their study. I have no additional comments.

Final Justification

I carefully read the author's reply and was pleased to find that the author addressed my concerns. Therefore, I am happy to further raise my score. I have no other additional questions.

Formatting Issues

No.

Author Response

We greatly appreciate Reviewer WWNK for the comprehensive and insightful comments.

W1: Regarding the random initialization strategy for the scattering center $c$.

Your concerns about training stability and suboptimal solutions are entirely valid, and these are key considerations in our model design.

First, we wish to clarify the core role of $c$. In our framework, $c$ is initialized only once at the beginning of each training run and remains fixed throughout the training process. It is not a parameter learned via backpropagation. Its purpose is twofold:

  • (1) Provide a fixed anchor point: $c$ offers a convergent target direction for the representations of normal samples, enabling the model to learn to map these representations into a compact region on the hypersphere.
  • (2) Break symmetry and regularize: If we fix $c$ to a constant (e.g., [1, 0, ..., 0]), it might introduce an undesirable bias, inducing the model to learn a trivial solution where all normal samples map to a specific axis. Randomly initializing $c$ avoids this risk, compelling the model to learn more generalizable representations. This can be viewed as a form of implicit stochastic regularization, preventing the model from overfitting to a pre-defined direction in the latent space.

To empirically address your concerns about instability and suboptimal solutions, we conduct a new experiment: we repeat our full training and evaluation process with 5 different random seeds on the MSL and PSM datasets. A different random seed results in a different scattering center $c$. We report the mean and standard deviation of the Aff-F1 and A-ROC metrics.

| Dataset | Metric | Original | Mean | Std. Dev. |
|---|---|---|---|---|
| MSL | Aff-F1 | 0.865 | 0.865 | ± 0.003 |
| MSL | A-ROC | 0.986 | 0.986 | ± 0.001 |
| PSM | Aff-F1 | 0.797 | 0.797 | ± 0.002 |
| PSM | A-ROC | 0.986 | 0.986 | ± 0.003 |

Furthermore, we design several variants to compare the performance of different center initialization strategies, including:

  • (1) Zero: Initialize the center at the origin (zero vector) to test the simplest symmetric starting point.
  • (2) Fixed-Radius: Initialize the center on a hypersphere with a fixed radius to test the model's sensitivity to the initial magnitude.
  • (3) Multi-center: Use multiple independent scattering centers to test the model's ability to capture potentially multi-modal normal data.

In the table below, Num Centers denotes the number of scattering centers used, and Radius specifies the initial magnitude for the fixed_radius strategy. We compare these strategies with the Random-in-Ball (randomly selecting a point within the unit hypersphere) strategy proposed in our paper.

| Dataset | Center Strategy | Num Centers | Radius | Aff-F1 | AUC-ROC |
|---|---|---|---|---|---|
| MSL | random_in_ball (Ours) | 1 | N/A | 0.867 | 0.986 |
| MSL | zero | 1 | 0.0 | 0.866 | 0.986 |
| MSL | fixed_radius | 1 | 0.3 | 0.866 | 0.986 |
| MSL | fixed_radius | 1 | 0.7 | 0.864 | 0.986 |
| MSL | Multi-center | 3 | N/A | 0.869 | 0.986 |
| PSM | random_in_ball (Ours) | 1 | N/A | 0.797 | 0.986 |
| PSM | zero | 1 | 0.0 | 0.797 | 0.986 |
| PSM | fixed_radius | 1 | 0.3 | 0.797 | 0.986 |
| PSM | fixed_radius | 1 | 0.7 | 0.796 | 0.986 |
| PSM | Multi-center | 3 | N/A | 0.792 | 0.980 |
| SWaT | random_in_ball (Ours) | 1 | N/A | 0.704 | 0.982 |
| SWaT | zero | 1 | 0.0 | 0.704 | 0.982 |
| SWaT | fixed_radius | 1 | 0.3 | 0.698 | 0.977 |
| SWaT | fixed_radius | 1 | 0.7 | 0.702 | 0.979 |
| SWaT | Multi-center | 3 | N/A | 0.702 | 0.980 |

The results confirm that for most datasets, ScatterAD is not sensitive to the specific initialisation strategy of $c$, and the random_in_ball strategy proposed in our paper is a simple and effective choice.

W2: Discussion on the effectiveness of cosine similarity in high-dimensional representation space.

Your point that distances between random vectors in high-dimensional space tend to converge is entirely valid for unprocessed or randomly distributed high-dimensional data. However, the core of our method is to avoid processing such data directly. Instead, the basic task of ScatterAD is to learn a nonlinear mapping from the original high-dimensional space to another well-structured high-dimensional representation space. In this learned space, the distribution of vectors is not random, specifically:

  • (1) $L_{scatter}$: Actively pulls the representations of normal samples towards the same fixed center $c$, forming a compact cluster locally.
  • (2) $L_{time}$: Enforces temporal continuity in the representation space by minimising the distance between representations of adjacent timesteps.
  • (3) $L_{contrast}$: Ensures that the representations from different views (online and target) remain structurally consistent, further reinforcing the geometry of the representation space.

Therefore, we are not relying on the properties of cosine similarity in a random high-dimensional space; rather, we leverage our model to create a representation space that possesses both spatial structure and temporal continuity. Furthermore, we provide a comparison of our anomaly scoring strategy with two variants based on classic distance metrics:

| Dataset | Metric | ScatterAD (Euclidean, ours) | ScatterAD (Mahalanobis) | ScatterAD (KL Divergence) |
|---|---|---|---|---|
| MSL | Aff-F1 | 0.867 | 0.851 | 0.789 |
| MSL | A-ROC | 0.986 | 0.986 | 0.972 |
| PSM | Aff-F1 | 0.797 | 0.792 | 0.129 |
| PSM | A-ROC | 0.986 | 0.980 | 0.793 |

The results show that our simple inductive bias based on Euclidean distance is more effective than methods that attempt to fit complex correlations (Mahalanobis distance) or probability distributions (KL Divergence). This also serves as evidence that our model successfully learns a well-structured representation space.

Comment

I carefully read the author's reply and was pleased to find that the author addressed my concerns. Therefore, I am happy to further raise my score. I have no other additional questions.

Comment

Thank you for your positive feedback and for taking the time to review our rebuttal. We are delighted to hear that our response addressed your concerns. Thank you very much for your support and for raising your score.

Review
Rating: 4

This paper proposes ScatterAD, a novel approach for multivariate time series anomaly detection that leverages temporal-topological scattering mechanisms. The method combines a temporal encoder and a topological encoder with contrastive fusion to learn complementary spatio-temporal representations. The core insight is that anomalous samples exhibit more pronounced scattering patterns in high-dimensional space compared to normal samples. The authors provide theoretical justification using information bottleneck theory and demonstrate state-of-the-art performance across six real-world datasets.

Strengths and Weaknesses

Strengths

(1) The paper introduces conditional mutual information maximization based on information bottleneck theory into multivariate time series anomaly detection, providing theoretical proof that maximizing $I(Z_T; Z_G \mid G)$ can effectively enhance cross-view consistency and improve representation discriminability.

(2) The method jointly considers temporal features and topological features, significantly enhancing the model's ability to capture complex spatio-temporal dependencies and anomaly patterns.

(3) The paper employs multiple optimization objectives, including the scattering loss $L_{scatter}$, the temporal consistency loss $L_{time}$, and the contrastive fusion loss $L_{contrast}$. The contrastive fusion design can effectively prevent over-scattering of representations while maintaining temporal structural consistency, simultaneously improving feature discriminability and representation stability.

Weaknesses

(1) The paper states that $h_i$ "represent the input feature vectors" in Equation (2), but subsequent text uses the identical notation for a different representation, $h_i = h'_i + h^{(l)}$. This creates ambiguity regarding what specific data each variable represents.

(2) Section 3.1 introduces the online encoder $f_\theta(\cdot)$ and target encoder $f_\phi(\cdot)$ without providing concrete model architecture details. The paper lacks a clear explanation of how $h_{online,t}$ and $h_{target,t}$ are obtained, making it difficult to understand the specific implementation and reproduce the results.

(3) Numerous formulas contain unexplained parameters, including $\tau$ in Section 3.2, $l, k$ in Equation (1), and $\xi$ in Equation (8). Additionally, critical parameter values, such as the range or default value of $l$ and $\tau$, are not specified.

(4) Section 3.2 states "the node features are aggregated with topological features and temporal features, yielding $h_i = h'_i + h^{(l)}$". This description lacks clarity regarding how many values are being aggregated and what specific representations correspond to "node features," "topological features," and "temporal features," respectively.

(5) Section 3.3 "Temporal Topological Scattering Representation Learning" presents $c = \text{Randn}(\cdot)$, $c = \frac{c}{\|c\|} \cdot \epsilon$, which appear to be two equations without a clear description of their sequential relationship or dependencies.

(6) The experimental comparison could benefit from including more classic and recent methods such as ModernTCN [Ref1], iTransformer [Ref2], and DCDetector [Ref3].

  • [Ref1] ModernTCN: A modern pure convolution structure for general time series analysis. ICLR 2024.

  • [Ref2] iTransformer: Inverted transformers are effective for time series forecasting. ICLR 2024.

  • [Ref3] DCdetector: Dual attention contrastive representation learning for time series anomaly detection. KDD 2023.

Questions

(1) Could you provide detailed architectural specifications for the online encoder $f_\theta(\cdot)$ and target encoder $f_\phi(\cdot)$? Additionally, please clarify how $h_{online}$ and $h_{target}$ are obtained from these encoders: are they direct outputs or derived through additional processing steps?

(2) Many critical parameters lack proper definition, including $l, k$ in Section 3.2, $\tau$ in Equation (1), and $\xi$ in Equation (8). Could you provide precise definitions of these parameters and their typical value ranges? Furthermore, what are the specific values of parameters such as $l$ used in your experiments?

(3) How does the theoretical result $I(Z_T; Z_G \mid G)$ connect to your actual implementation? What specific graph structure $G$ is used in practice, and how does it relate to the conditional mutual information maximization in your theoretical framework?

Limitations

Yes

Final Justification

This paper provides a comprehensive clarification during rebuttal that addresses most of my concerns.

Formatting Issues

None.

Author Response

We sincerely thank Reviewer 9v8x for the insightful comments and valuable suggestions. We will address your concerns one by one below and will integrate these clarifications into the final version of our paper.

W1: Ambiguity of the notation $h_i$.

Thank you for pointing out this ambiguity. We will use $z_i = h_t^t + h_i^{(l)}$ to denote the final representation and will update this notation consistently throughout the rest of the revised manuscript.

W2: Lack of specific model architecture details.

To clarify, the online encoder $f_\theta(\cdot)$ and the target encoder $f_\phi(\cdot)$ share the exact same backbone architecture, which consists of a causal convolution module (Eq. 1 in our paper) followed sequentially by a Graph Attention Network (GAT) layer (Eqs. 2 and 3 in our paper). However, the outputs of these two encoders undergo asymmetric processing steps before being used in the final loss computation, as detailed below:

| Processing Stage | Online Path | Target Path |
|---|---|---|
| 1. Encoding | $h_{pred} = f_\theta(X)$ | $h_{target} = f_\phi(X)$ |
| 2. Asymmetric Processing | $h_{online} = \text{Predictor}(h_{pred})$ | $h'_{target} = \text{L2Normalize}(h_{target})$ |
| 3. Loss Contribution | Used to compute $L_{time}$ and $L_{contrast}$ | Used to compute $L_{scatter}$ and $L_{contrast}$ |

This asymmetric design is a well-established and highly effective strategy in momentum contrastive learning frameworks (He et al., 2020; Lee and Lee, 2023). It prevents the model from collapsing by breaking the symmetry through an additional learning task (predicting the target representation). We will update the methods section in the final version of our paper with a detailed process description and diagram to ensure the clarity and full reproducibility of our work.
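A minimal PyTorch sketch of this asymmetric two-path computation, with a linear layer standing in for the causal-conv + GAT backbone; the predictor architecture and all dimensions are illustrative assumptions, not the paper's actual configuration.

```python
import copy
import torch
import torch.nn.functional as F

# Illustrative stand-ins: the real f_theta / f_phi are the causal-conv + GAT backbone.
encoder_online = torch.nn.Linear(32, 64)
encoder_target = copy.deepcopy(encoder_online)   # EMA-updated, never backpropagated
predictor = torch.nn.Sequential(                 # online-only prediction head
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))

x = torch.randn(16, 32)                          # one batch of input windows
h_online = predictor(encoder_online(x))          # -> used in L_time and L_contrast
with torch.no_grad():                            # target path carries no gradient
    h_target = F.normalize(encoder_target(x), dim=-1)  # L2-normalized -> L_scatter, L_contrast
```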

W3: Unexplained parameters.

As per your suggestion, we will define all parameters upon their first appearance in the revised manuscript and include their default values. We provide the clarifications here:

  • (1) $\tau$ in Section 3.2: This is the look-back window size of the temporal graph, which defines neighbourhood relations as $(u_{t-k}, u_t)$, where $k \in [1, \tau]$. In our experiments, we use $\tau = 2$.
  • (2) $l$, $k$ in Equation (1): $l$ represents the layer index of the causal convolution, and $k$ is the kernel size. We use a two-layer ($l = 2$) causal convolution with a kernel size of $k = 3$.
  • (3) $\xi$ in Equation (8): This represents the parameters of the target encoder $f_\phi(\cdot)$. We use $\xi$ to maintain consistency with the momentum update literature, where $\theta$ is often used for the online encoder. For clarity, we will unify the notation to use $\phi$.

W4: Ambiguity of feature aggregation.

It is important to note that the aggregation operation $h_i = h_t^t + h_i^{(l)}$ is a simple element-wise addition. To be precise:

  • (1) $h_t^t$ ("temporal features") is the output of the initial causal convolution encoder (from Eq. 1).
  • (2) $h_i^{(l)}$ ("topological features") are the attention-weighted features learned from neighbouring nodes via the GAT layer (from Eq. 3).
  • (3) $h_i$ ("node features") is the final representation obtained by the element-wise addition of these two components.

We will revise the relevant text in Section 3.2 of the revised manuscript, following your suggestion, to explicitly state that this is an element-wise addition and to clarify the origin of each term. Furthermore, as per your suggestion in W1, we will resolve the ambiguity caused by the notation $h_i$.
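To make the aggregation concrete, here is a self-contained toy sketch: a left-padded causal convolution produces the temporal features, a single-head attention over a fixed adjacency stands in for the paper's GAT layer, and the two are summed element-wise. All sizes and the attention form are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

dim, n = 16, 8                                        # toy sizes

# Temporal features h'_i: causal convolution (left padding so step t sees only <= t).
conv = torch.nn.Conv1d(dim, dim, kernel_size=3)
x = torch.randn(1, n, dim)                            # (batch, time, dim)
h_temporal = conv(F.pad(x.transpose(1, 2), (2, 0))).transpose(1, 2).squeeze(0)

# Topological features h_i^(l): attention-weighted neighbour aggregation over a
# fixed temporal adjacency (self-loops plus edges from t-1 to t).
adj = torch.eye(n) + torch.diag(torch.ones(n - 1), -1)
scores = (h_temporal @ h_temporal.t()) / dim ** 0.5
scores = scores.masked_fill(adj == 0, float("-inf"))
h_topological = F.softmax(scores, dim=-1) @ h_temporal

# Final node features: element-wise addition, h_i = h'_i + h_i^(l).
h = h_temporal + h_topological
```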


W5: Relationship between the formulas for $c$.

Thank you for requesting clarification on this. The two parts of this formula describe a two-step sequential process for initializing the scattering center $c$.

  • (1) Step 1: $c' = \text{Randn}(\cdot)$: First, a vector $c'$ is sampled from a standard normal distribution.
  • (2) Step 2: $c = \frac{c'}{\|c'\|} \cdot \epsilon$: The sampled vector $c'$ is then L2-normalized, projecting it onto the unit hypersphere, and subsequently multiplied by a random scalar $\epsilon \in [0, 1)$. This ensures that $c$ is strictly located inside the unit ball (see the code sketch below).
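A minimal PyTorch sketch of this two-step procedure; the function name and dimension are illustrative.

```python
import torch

def init_scatter_center(dim: int) -> torch.Tensor:
    """Two-step random-in-ball initialization of the fixed center c."""
    c = torch.randn(dim)          # Step 1: sample c' from a standard normal
    c = c / c.norm()              # Step 2a: L2-normalize onto the unit sphere
    eps = torch.rand(1)           # Step 2b: random scalar epsilon in [0, 1)
    return c * eps                # strictly inside the unit ball

c = init_scatter_center(64)       # initialized once, then kept fixed for the run
```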

W6: Comparison with classic baselines.

Following your suggestion, we provide extensive comparisons with classic baselines such as ModernTCN, iTransformer, and DCDetector. The results for all baseline models are reproduced by strictly following the configurations and guidelines from their official open-source codebases. Due to space constraints, we present four key scores here; the complete results will be updated in the revised manuscript.

| Dataset | Metric | ScatterAD | DCDetector | ModernTCN | iTransformer |
|---|---|---|---|---|---|
| MSL | Aff-F | 0.867 | 0.674 | 0.709 | 0.652 |
| MSL | PA-F | 0.964 | 0.957 | 0.807 | 0.659 |
| MSL | A-ROC | 0.986 | 0.961 | 0.627 | 0.604 |
| MSL | A-PR | 0.932 | 0.891 | 0.739 | 0.721 |
| PSM | Aff-F | 0.797 | 0.653 | 0.701 | 0.652 |
| PSM | PA-F | 0.981 | 0.977 | 0.965 | 0.926 |
| PSM | A-ROC | 0.986 | 0.948 | 0.581 | 0.687 |
| PSM | A-PR | 0.969 | 0.866 | 0.629 | 0.751 |
| SWaT | Aff-F | 0.704 | 0.687 | 0.685 | 0.716 |
| SWaT | PA-F | 0.951 | 0.939 | 0.884 | 0.916 |
| SWaT | A-ROC | 0.982 | 0.876 | 0.675 | 0.662 |
| SWaT | A-PR | 0.909 | 0.824 | 0.592 | 0.619 |
| WADI | Aff-F | 0.605 | 0.725 | 0.558 | 0.671 |
| WADI | PA-F | 0.862 | 0.731 | 0.766 | 0.754 |
| WADI | A-ROC | 0.960 | 0.829 | 0.831 | 0.814 |
| WADI | A-PR | 0.765 | 0.651 | 0.697 | 0.627 |
| NIPS-TS-SWAN | Aff-F | 0.038 | 0.484 | 0.533 | 0.507 |
| NIPS-TS-SWAN | PA-F | 0.736 | 0.733 | 0.731 | 0.729 |
| NIPS-TS-SWAN | A-ROC | 0.792 | 0.655 | 0.562 | 0.537 |
| NIPS-TS-SWAN | A-PR | 0.716 | 0.626 | 0.545 | 0.531 |
| NIPS-TS-GECCO | Aff-F | 0.825 | 0.648 | 0.446 | 0.435 |
| NIPS-TS-GECCO | PA-F | 0.784 | 0.472 | 0.357 | 0.381 |
| NIPS-TS-GECCO | A-ROC | 0.969 | 0.715 | 0.946 | 0.817 |
| NIPS-TS-GECCO | A-PR | 0.633 | 0.523 | 0.615 | 0.474 |

Q1: Encoder architecture specifications and the origin of $h_{online}$ and $h_{target}$.

We provide detailed specifications in our response to W2. We will also explicitly state these details in Section 3 of the revised manuscript to improve clarity and reproducibility.

Q2: Definitions and values of key parameters.

Please see our response to W3. We will ensure that all parameters are clearly defined upon their first appearance and will add a table of key hyperparameters in the appendix of the revised manuscript.

Q3: Connection between the theoretical result $I(Z_T; Z_G \mid G)$ and the practical implementation.

Thank you for this crucial question regarding the connection between our mutual information theory and its practical implementation. The link is as follows:

(1) Theoretical Goal: Our theory (Appendix A) shows that maximizing the conditional mutual information $I(Z_T; Z_G \mid G)$ is a principled way to learn complementary representations from the temporal ($Z_T$) and topological ($Z_G$) views, given the graph $G$.

(2) Practical Implementation: The contrastive fusion loss $L_{contrast}$ (Eq. 7) serves as a tractable, differentiable lower bound for this mutual information. By minimising $L_{contrast}$, we effectively maximise this lower bound, thus driving the model to achieve the theoretical objective. In our implementation, the graph structure $G$ is the temporal graph described in Section 3.2, where nodes are time points at timestep $t$, and edges connect temporally adjacent nodes (e.g., $(t-1, t)$). This graph $G$ constrains the entire process, defining the neighbourhood for the GAT's aggregation operation and providing positive pairs for the contrastive loss.
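For illustration, here is a generic InfoNCE-style cross-view contrastive loss of the kind commonly used as a tractable lower bound on mutual information; this is a sketch of the general technique, not necessarily the paper's exact Eq. (7).

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(z_t: torch.Tensor, z_g: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE between temporal (z_t) and topological (z_g) views: matching
    rows are positives, all other pairs in the batch are negatives.
    Minimizing this maximizes a lower bound on the mutual information."""
    z_t = F.normalize(z_t, dim=-1)
    z_g = F.normalize(z_g, dim=-1)
    logits = z_t @ z_g.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z_t.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = cross_view_infonce(torch.randn(128, 64), torch.randn(128, 64))
```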

References

He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

Lee, Byeongchan, and Sehyun Lee. "Implicit contrastive representation learning with guided stop-gradient." Advances in Neural Information Processing Systems 36 (2023): 30885-30897.

Comment

I appreciate the clarification from the authors. Now my concerns have been addressed, and I would like to increase the rating.

Comment

We sincerely thank the reviewer for reviewing our rebuttal and for raising your score. Thank you once again for your thoughtful and detailed engagement with our work, which has significantly improved our submission.

Review
Rating: 5

This paper presents ScatterAD, a novel anomaly detection framework for multivariate time series in industrial IoT, which links representation learning to the information bottleneck principle, proving that maximizing conditional mutual information between temporal and topological views improves anomaly discrimination. ScatterAD jointly models temporal dynamics and topological structure via a scattering loss, a temporal constraint, and contrastive fusion. Extensive experiments on six benchmarks show state-of-the-art results for ScatterAD.

Strengths and Weaknesses

Strengths: This paper addresses a critical gap in spatio-temporal anomaly detection by unifying temporal consistency and topological dispersion. It achieves impressive results across diverse datasets. It is novel to formalize scattering as a learning signal and to integrate information bottleneck theory into multivariate time series anomaly detection. The paper is well-structured, with clear method descriptions and precise theoretical proofs in the Appendix. The authors provide a technically rigorous study with thorough ablation studies, sensitivity analysis, and visualizations. The proposed scattering mechanism is empirically validated via feature-space analysis and training dynamics.

Weaknesses: While inference speed is competitive, the need for careful hyperparameter tuning may hinder plug-and-play adoption in other testing domains. The mutual information bound relies on strong independence assumptions, which may not hold if the graph structure correlates with anomalies. The assumption of a static graph structure seems to limit applicability to systems with evolving dependencies.

Questions

  1. The temporal graph is fixed during training/inference. I wonder how the performance of ScatterAD would be if the topology were to change suddenly.
  2. Using mean pairwise Euclidean distance instead of dispersion metrics like Mahalanobis distance or others needs clarification. Does this choice affect sensitivity to feature scaling?
  3. The method's PA-F1 was high, but its Aff-F1 on NIPS-TS-SWAN dropped sharply, although the authors state possible reasons in Sec. 4.2. More analysis could reveal the method's limitations. Or does scattering fail for bursty patterns?

Limitations

  1. Although the authors acknowledge the static topology and the lack of causality modeling in Appendix H, these aspects need more expansion.
  2. More testing of the proposed method on real-world noisy data would help explore its robustness.

Final Justification

The concerns have been mostly addressed by the authors.

Formatting Issues

None

Author Response

We sincerely thank Reviewer xfeK for the detailed and constructive comments, which are instrumental in further improving our paper.

W: Regarding hyperparameter sensitivity, theoretical assumptions, and their impact on generalization.

To systematically evaluate the robustness of our model, we conduct a detailed sensitivity analysis in Appendix E of our paper on key hyperparameters, including the time window size, number of encoder layers, hidden dimension, EMA decay rate, and the anomaly threshold $\delta$. The results show that ScatterAD maintains stable performance across a wide range of parameter settings. We detail the impact of changes in graph structure in Q1.

Q1: Regarding the performance of a static graph when the topology changes.

To simulate dynamically changing graph topologies, we extend our data loading process to generate different graph structures for each input time window during training and inference. We design two dynamic topology generation strategies:

(1) Random Topology: For each time window, we randomly establish connections between nodes with a certain probability (edge_prob=0.3). To ensure basic graph connectivity, we preserve the connections between adjacent nodes. This strategy simulates scenarios where connections are irregular and appear randomly.

(2) K-Nearest Neighbours (KNN) Topology: For the topology constructed within each time window, we compute the Euclidean distance between each node and all other nodes, connecting each node to its k nearest neighbours ($k = 3$); a code sketch of this construction follows below.
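A minimal PyTorch sketch of this per-window KNN graph construction; function and variable names are illustrative.

```python
import torch

def knn_adjacency(window: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Connect each node to its k nearest neighbours (Euclidean distance)
    within one time window; window has shape (num_nodes, feature_dim)."""
    dist = torch.cdist(window, window)            # pairwise distances
    dist.fill_diagonal_(float("inf"))             # exclude self-matches
    knn = dist.topk(k, largest=False).indices     # k nearest per node
    adj = torch.zeros_like(dist)
    adj.scatter_(1, knn, 1.0)                     # directed k-NN edges
    return adj

adj = knn_adjacency(torch.randn(10, 32), k=3)     # 10 nodes, 32-dim features
```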

| Dataset | Topology | Aff-F1 | PA-F | A-ROC | A-PR | Throughput (samples/s) | Avg. Inference Time (ms) |
|---|---|---|---|---|---|---|---|
| MSL | Original (Ours) | 0.866 | 0.964 | 0.986 | 0.932 | 60.35 | 16.571 |
| MSL | Random Topology | 0.867 | 0.967 | 0.987 | 0.937 | 38.33 | 26.091 |
| MSL | KNN Topology | 0.862 | 0.961 | 0.984 | 0.927 | 49.17 | 20.338 |
| PSM | Original (Ours) | 0.797 | 0.981 | 0.986 | 0.969 | 62.17 | 16.085 |
| PSM | Random Topology | 0.791 | 0.975 | 0.980 | 0.961 | 31.77 | 31.472 |
| PSM | KNN Topology | 0.803 | 0.966 | 0.971 | 0.948 | 25.66 | 38.976 |

The results indicate that while dynamic topology strategies lead to marginal performance improvements in some cases (e.g., Random Topology on MSL and KNN Topology on PSM), they introduce additional computational overhead. Therefore, ScatterAD exhibits strong robustness to graph topology, and employing a static graph strikes a good balance between efficiency and effectiveness.

Q2: Clarification on the choice of Euclidean distance and its sensitivity to feature scaling.

Our goal is to find a simple yet effective inductive bias. Compared to complex methods that require covariance estimation and rely on distributional assumptions, Euclidean distance, being a computationally simple and assumption-free geometric metric, is a more direct and robust choice for achieving our objective. To address the potential issue of feature scaling, we first apply standard Z-score normalisation to all input data. Second, we perform L2 normalisation on the target representation $h_{target}$ before calculating the scatter loss $L_{scatter}$ (Equation 5), ensuring robustness to feature scales.
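A minimal PyTorch sketch of a scatter loss with the L2 normalisation described above; whether the paper's Eq. (5) uses squared or plain Euclidean distance is not restated here, so the squared form is an assumption.

```python
import torch
import torch.nn.functional as F

def scatter_loss(h_target: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Pull L2-normalized target representations toward the fixed center c;
    normalization makes the loss invariant to feature scale."""
    z = F.normalize(h_target, dim=-1)
    return (z - c).pow(2).sum(dim=-1).mean()      # mean squared Euclidean distance

c = F.normalize(torch.randn(64), dim=0) * 0.5     # example center inside the unit ball
loss = scatter_loss(torch.randn(128, 64), c)
```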

To address the concerns regarding different distance metrics, we conduct a new experiment comparing our method with two other classic distance metric variants: Mahalanobis distance and Kullback-Leibler (KL) Divergence. In the Mahalanobis distance variant, we estimate the global covariance matrix from the representations of the training set, following standard practice. We report the performance on three representative datasets below.

| Dataset | Metric | ScatterAD (Euclidean, ours) | ScatterAD (Mahalanobis) | ScatterAD (KL Divergence) |
|---|---|---|---|---|
| MSL | Aff-F1 | 0.867 | 0.851 | 0.789 |
| MSL | PA-F | 0.964 | 0.956 | 0.928 |
| MSL | A-ROC | 0.986 | 0.986 | 0.972 |
| MSL | A-PR | 0.932 | 0.934 | 0.866 |
| PSM | Aff-F1 | 0.797 | 0.792 | 0.129 |
| PSM | PA-F | 0.981 | 0.976 | 0.739 |
| PSM | A-ROC | 0.986 | 0.980 | 0.793 |
| PSM | A-PR | 0.969 | 0.962 | 0.701 |

Q3: In-depth analysis of performance discrepancy on the NIPS-TS-SWAN dataset.

First, we would like to clarify that ScatterAD successfully generates separable anomaly scores for the NIPS-TS-SWAN dataset. The high PA-F1 score (0.736) demonstrates that anomalous events are assigned higher scores than normal ones. Furthermore, to objectively analyse the characteristics of the Aff-F1 and Point-wise F1 metrics, we design a new experiment to simulate the metric sensitivity, which is independent of our model and aims to test the robustness of the Aff-F1 metric itself on the NIPS-TS-SWAN dataset. We assume a model whose predictions can perfectly match the ground truth. We then simulate a minor, realistic localisation error by shifting the entire prediction sequence by nn timesteps and observe how the Aff-F1 and Point-wise F1 scores change.

| Dataset | Localization Error (Shift) | Aff-F1 (Segment-level) | Point-wise F1 (Point-level) |
|---|---|---|---|
| NIPS-TS-SWAN | 0 steps | 1.000 | 1.000 |
| NIPS-TS-SWAN | 1 step | 0.065 (93.5% ↓) | 0.686 |
| NIPS-TS-SWAN | 2 steps | 0.174 | 0.697 |
| NIPS-TS-SWAN | 5 steps | 0.223 | 0.704 |
| NIPS-TS-SWAN | 10 steps | 0.226 | 0.709 |
| MSL | 0 steps | 1.000 | 1.000 |
| MSL | 1 step | 1.000 | 0.995 |
| MSL | 2 steps | 1.000 | 0.991 |
| MSL | 5 steps | 0.972 (2.8% ↓) | 0.977 |
| MSL | 10 steps | 0.889 | 0.954 |

The results show that for the NIPS-TS-SWAN dataset, a localisation shift of just one timestep can cause the Aff-F1 score of the model to catastrophically collapse from 1.0 to 0.065. This explains why many other SOTA models (as shown in Table 1 of our paper) also perform poorly on this metric. In contrast, the more traditional Point-wise F1 score shows only a modest decline. This is why we report both scores.

To further validate our model's detection capabilities at a macro level, we compiled overall prediction statistics for ScatterAD on the NIPS-TS-SWAN dataset.

| Statistic | Ground Truth | Our Prediction |
|---|---|---|
| Total Anomaly Ratio | 32.6% | 32.1% |
| Number of Anomaly Segments | 551,937 | 843,596 |

The statistics show that the total proportion of anomalies predicted by ScatterAD (32.1%) closely aligns with the ground truth (32.6%), demonstrating its accurate detection capability. However, the higher number of predicted segments suggests that the model may tend to split a single continuous anomaly into several smaller predicted segments. Consequently, its performance is underestimated by the Aff-F1 metric under these specific data conditions due to such minor, unavoidable localisation offsets.

Comment

Thank you for your detailed review and constructive engagement. We greatly appreciate you taking the time to read our rebuttal and consider our responses.

Comment

Dear Authors,

Thanks for your rebuttal. Most of my questions have been addressed. The additional experiments show the effectiveness of the proposed method. I highly encourage the authors to integrate these modifications into the revised manuscript. I will maintain my positive score.

Best,

Reviewer

Comment

We are grateful for your time and constructive guidance throughout the review process. We will ensure that all modifications and additional results are integrated into the revised manuscript as you suggested.

Final Decision

This paper makes solid contributions to multivariate time series anomaly detection through: (1) a novel and well-motivated approach using representation scattering, (2) theoretical grounding through information bottleneck theory, (3) strong empirical results with comprehensive evaluation, and (4) clear practical applicability to industrial IoT scenarios. While initial presentation issues were raised, the authors demonstrated commitment to addressing all concerns in the final version. The work represents a meaningful advance in the field with both theoretical insights and practical impact.

The paper received positive evaluations from all four reviewers with final ratings of 5 (Accept), 4 (Borderline Accept), 6 (Strong Accept), and 4 (Borderline Accept). The reviewers appreciated the novel approach of leveraging representation scattering as an inductive signal for multivariate time series anomaly detection, the theoretical grounding through information bottleneck theory, and the comprehensive experimental validation.

While the work is technically solid with good experimental results, it represents an incremental advance in anomaly detection methodology rather than a breakthrough warranting oral/spotlight presentation. The contributions are more suited for detailed technical discussions at a poster session.