Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection
We introduce a novel multivariate time-series anomaly detection pipeline that incorporates the notion of causality into contrastive learning.
Abstract
Reviews and Discussion
This paper proposes a causality-aware contrastive learning method for time-series anomaly detection. Experiments on five real-world and two synthetic datasets validate that the integration of causal relationships improves the anomaly detection capabilities.
Questions for Authors
Time series anomalies can arise from various sources, such as evolving underlying processes, external events, or sensor transmission errors. In many cases, while the time series may exhibit abnormal values, they can still adhere to the underlying causal relationships. For instance, external events might disrupt overall sensor readings and push them into abnormal ranges; however, the fundamental causal processes within the system may remain unchanged. I recommend that the authors discuss the specific types of time series anomalies their method is designed to address and identify scenarios in which their approach might fail.
The proposed method appears to be highly complex and likely computationally intensive. Considering that changes in causal relationships may not be the sole factor contributing to time series anomalies, I recommend that the author explore more practical and efficient methods for time series anomaly detection.
I recommend that the authors also report the number of sensors included in the discovered causal model/graph in the experiment. If not all sensors are included in the causal graph, how can abnormal behaviors from sensors that fall outside the model's coverage be detected?
Some important related work is missing, for example https://arxiv.org/pdf/2206.15033 and https://ieeexplore.ieee.org/document/6413806
There are also studies that utilize the correlation of time series data for anomaly detection. For example, https://arxiv.org/abs/2307.08390 https://onlinelibrary.wiley.com/doi/10.1155/2022/4756480 It is highly recommended that the authors discuss the advantages and limitations of both correlation-based and causality-based approaches for anomaly detection to provide a more comprehensive perspective.
Claims and Evidence
Most of the claims in the paper are clear, except for the following concerns.
Time series anomalies can arise from various sources, such as evolving underlying processes, external events, or sensor transmission errors. In many cases, while the time series may exhibit abnormal values, they can still adhere to the underlying causal relationships. For instance, external events might disrupt overall sensor readings and push them into abnormal ranges; however, the fundamental causal processes within the system may remain unchanged. I recommend that the authors discuss the specific types of time series anomalies their method is designed to address and identify scenarios in which their approach might fail.
The proposed method appears to be highly complex and likely computationally intensive. Considering that changes in causal relationships may not be the sole factor contributing to time series anomalies, I recommend that the author explore more practical and efficient methods for time series anomaly detection.
Methods and Evaluation Criteria
I recommend that the authors also report the number of sensors included in the discovered causal model/graph in the experiment. If not all sensors are included in the causal graph, how can abnormal behaviors from sensors that fall outside the model's coverage be detected?
Theoretical Claims
I didn't see any issues.
Experimental Design and Analysis
I recommend that the authors also report the number of sensors included in the discovered causal model/graph in the experiment. If not all sensors are included in the causal graph, how can abnormal behaviors from sensors that fall outside the model's coverage be detected?
Supplementary Material
Yes, all.
Relation to Prior Literature
This work contributes findings to the time-series anomaly detection community.
Missing Important References
Some important related work is missing, for example https://arxiv.org/pdf/2206.15033 and https://ieeexplore.ieee.org/document/6413806
Other Strengths and Weaknesses
There are also studies that utilize the correlation of time series data for anomaly detection. For example, https://arxiv.org/abs/2307.08390 https://onlinelibrary.wiley.com/doi/10.1155/2022/4756480 It is highly recommended that the authors discuss the advantages and limitations of both correlation-based and causality-based approaches for anomaly detection to provide a more comprehensive perspective.
Other Comments or Suggestions
No
We would like to thank Reviewer J2ZH for the insightful and constructive comments. This response presents additional experiments and discussion to address the reviewer’s concerns, all of which will be integrated into the main paper.
Time series anomalies can arise from various sources… discuss specific types of time series anomalies their method is designed to address and identify scenarios in which their approach might fail.
As the reviewer notes, certain events may produce anomalous values while leaving the causal structure intact. CAROTS is designed to detect anomalies that violate inter-variable causal dependencies, including the following scenarios:
- A variable behaves inconsistently with its known causal parents.
- Structural dynamics deviate from the learned causal graph.
- Temporal or multi-variable patterns break causal relationships.
We humbly acknowledge that CAROTS may be less sensitive to anomalies that lie within the causal structure; however, our method consistently demonstrates strong anomaly detection performance across a wide range of datasets, which suggests that CAROTS remains effective in practice.
Computational cost of CAROTS
Even with its causal modules, CAROTS is efficient thanks to its lightweight one-layer LSTM backbone. We compare the total training time of the studied methods on SWaT:
Train Time (min):
| Method | Time |
|---|---|
| CAROTS | 25 |
| AnomalyTransformer | 12 |
| TimesNet | 56 |
| USAD / SimCLR / SSD | 6 |
| CSI / CTAD | 10 |
The training time of CAROTS is comparable to that of the baselines and lower than that of heavier models like TimesNet, indicating that leveraging causal structure can reduce reliance on deeper architectures.
We also profile the per-iteration time on MSL_P-14:
Per-Iter Time (seconds):
| Component | Time | Ratio |
|---|---|---|
| CPA | 0.0382 | 78% |
| CDA | 0.0017 | 3% |
| Loss | 0.0017 | 3% |
| Others | 0.0072 | 15% |
| Total | 0.0488 | 100% |
Each iteration takes less than 0.05 sec, and even the heaviest component (CPA) is lightweight.
# of sensors included in the discovered causal model/graph
The discovered causal graphs include all sensors, although they may contain disjoint subgraphs and isolated nodes. Some sensors appear as isolated nodes, which is expected in complex systems with partially independent components.
CPA and CDA ensure that every sensor is incorporated in training and synthetic outlier generation. CDA randomly selects a node and performs DFS over the causal graph to extract a local subgraph. If the selected node is isolated, it forms a single-node subgraph, and bias is directly injected into its values. This allows us to synthesize abnormal behavior for all sensors, regardless of connectivity. As node selection is random and repeated, all variables have an equal chance of being selected and perturbed during training.
CPA similarly applies to all sensors, including isolated nodes. CPA perturbs parents of a selected variable and forecasts the target with the causal forecaster. For isolated nodes, we consider temporal self-dependence as the causal link. The node is perturbed through its past values, and forecasting is performed accordingly, enabling CPA to handle the absence of graph edges.
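A minimal sketch of the CDA subgraph selection described above (our reconstruction, not the authors' released code); isolated nodes naturally yield single-node subgraphs:

```python
import random

def extract_local_subgraph(adj, start=None, rng=None):
    """Pick a node (randomly if `start` is None) and DFS over the directed
    causal graph to collect a local subgraph. An isolated node has no
    outgoing edges, so DFS returns a singleton set for it."""
    if start is None:
        start = (rng or random).choice(sorted(adj))
    stack, visited = [start], set()
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            stack.extend(adj.get(node, ()))
    return visited

# Toy causal graph: 0 -> 1 -> 2 form a chain; node 3 is isolated.
graph = {0: [1], 1: [2], 2: [], 3: []}
```

For example, starting DFS from node 0 yields the subgraph {0, 1, 2}, while the isolated node 3 yields the single-node subgraph {3}, into which bias would be injected directly.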
We also report the number of non-trivial subgraphs (excluding isolated nodes) and isolated nodes for each dataset:
| Dataset | #Vars | #Subgraphs | #Isolated Nodes |
|---|---|---|---|
| SWaT | 51 | 1 | 13 |
| WADI | 123 | 1 | 32 |
| PSM | 25 | 1 | 2 |
| SMD_2-1 | 38 | 1 | 14 |
| SMD_3-7 | 38 | 2 | 12 |
| MSL_P-14 | 55 | 1 | 53 |
| MSL_P-15 | 55 | 1 | 47 |
| Lorenz96 | 128 | 1 | 0 |
| VAR | 128 | 3 | 121 |
CAROTS’ strong performance even on datasets with a large number of isolated nodes and fragmented subgraphs indicates that it can handle complex graph structures and partially disconnected systems.
More related works
- Related works will be updated to include discussion of [1, 2], relevant early works connecting causality and anomaly detection.
- Correlation vs. Causality: Correlation-based methods, such as [3, 4], model dependencies with co-activation patterns across variables. While these methods are effective at capturing immediate statistical associations, they may fail to distinguish true dependencies from spurious correlations, particularly under distribution shifts or external interventions. In contrast, CAROTS is grounded in the causal perspective. It explicitly models directional relationships by learning a causal graph from the training data using a causal discovery method. This enables our method to simulate both causality-preserving and causality-breaking augmentations, which serve as the foundation for contrastive learning. Anomalies are then interpreted as deviations from learned causal relationships, making CAROTS more robust to superficial variations that do not reflect structural disruptions.
[1] Yang et al., A Causal Approach to Detecting Multivariate Time-series Anomalies and Root Causes, 2022.
[2] Qiu et al., Granger Causality for Time-Series Anomaly Detection, 2012.
[3] Zheng et al., Correlation-aware Spatial-Temporal Graph Learning, 2023.
[4] Wang et al., Correlation-Based Anomaly Detection Method for Multi-sensor System, 2022.
This paper proposes a new anomaly detection method called CAROTS, tailored for multivariate time-series data. Its central idea is to leverage stable causal relationships among variables discovered through a forecasting-based causal model. These discovered relationships guide two specialized data-augmentation “augmenters”: one generates variations that preserve the typical causal structures, while the other simulates anomalies by breaking them. A contrastive-learning framework is then trained to distinguish these “causality-preserving” and “causality-disturbing” samples, thereby learning a representation space where typical (standard) patterns and anomalous (disturbed) patterns are well separated.
CAROTS combines two scores to detect anomalies at test time. First, it measures a sample’s distance from a centroid of “causality-preserving” training samples in the learned embedding space. Second, it computes a forecasting error with the original causal discovery model, because true anomalies are harder to predict under the learned causal relationships. Experiments on both real-world and synthetic datasets show consistently strong performance for CAROTS, emphasizing that explicitly modeling and preserving causal relationships can enhance the robustness and accuracy of anomaly detection.
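The two-part scoring described above can be sketched as follows; the equal weighting `alpha` and the exact error metrics are our assumptions for illustration, not details from the paper:

```python
import numpy as np

def anomaly_score(z, centroid, x_true, x_pred, alpha=0.5):
    """Combine (i) the distance of embedding z from the centroid of
    causality-preserving training samples and (ii) the forecasting error
    of the causal model. `alpha` balances the two terms (our assumption)."""
    a_cl = np.linalg.norm(z - centroid, axis=-1)      # embedding distance score
    a_cd = np.abs(x_true - x_pred).mean(axis=-1)      # causal forecast error score
    return alpha * a_cl + (1.0 - alpha) * a_cd
```

A sample far from the normal centroid, or one that the causal forecaster predicts poorly, receives a higher score and is flagged as anomalous.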
Update after rebuttal
The authors covered most of my concerns in the rebuttal, so I kept the original positive rating.
Questions for Authors
- Handling Imperfect Causal Graphs: How sensitive is CAROTS if the learned causal structure is partially incorrect or the training set contains mild anomalies? If it severely degrades performance, clarifying mitigation strategies (e.g., robust training or iterative graph refinement) would increase my confidence in real-world applicability.
- Threshold Tuning: The paper employs a dynamically adjusted similarity filter (0.5→0.9). Could you elaborate on how this threshold is chosen or adapted for different datasets? If there’s a systematic tuning procedure, it would clarify reproducibility and broader applicability.
- Computational Overhead: Generating causality-preserving/-disturbing augmentations repeatedly might be costly. Is there a significant runtime impact compared to simpler contrastive approaches, and have you explored approximate techniques to reduce cost?
- Reusable code: Could you consider making reproducible code publicly available to enhance the persuasiveness of the proposed method?
Claims and Evidence
Overall, the paper’s central claims—including that (i) incorporating causal discovery leads to more robust anomaly detection, (ii) contrastive learning can separate samples based on whether their causal structures are preserved or disrupted, and (iii) the combined distance-and-forecasting anomaly score outperforms standard baselines—are supported by results on multiple datasets (including both real-world and synthetic scenarios). Notably, the authors demonstrate that existing approaches struggle more than CAROTS on anomalies that stem from “broken” causal relationships, thereby lending convincing evidence to the core claim that integrating a causal model helps.
That said, a few points merit caution: the paper relies heavily on the assumption of correct (or near-correct) causal discovery in standard data. While the authors test different hyperparameters and show consistent performance, it would help to see explicit empirical checks on how inaccuracies in the learned causal graph impact final performance. Also, while they provide ablations (removing specific components and comparing results), the paper could explore real-world complications like partial anomalies in the “normal” training set in more detail. Still, these caveats do not significantly detract from the main results that the authors present.
Methods and Evaluation Criteria
The paper’s use of public, well-known benchmarks (SWaT, WADI, PSM, SMD, MSL) and synthetic datasets (VAR and Lorenz96) aligns well with the anomaly detection context, as each dataset is commonly used to test multivariate time-series methods. Likewise, the evaluation metrics (AUROC, AUPRC, and F1) are standard and appropriate for anomaly detection, capturing different aspects of precision, recall, and overall ranking performance. By showing strong results across these diverse benchmarks, the authors demonstrate that the approach is suitable for real-world scenarios and that the chosen evaluation pipeline genuinely assesses detection performance.
Theoretical Claims
The paper does not formally present (or prove) any strong theoretical claims that typically require rigorous mathematical proofs (e.g., convergence guarantees or asymptotic optimality). Instead, the authors rely on conceptual justifications—particularly around the plausibility that “causality-preserving” vs. “causality-disturbing” samples guide a practical contrastive learning objective—and empirical evidence across multiple benchmark datasets. Hence, there were no formal proofs to check in the text, and all theoretical underpinnings (e.g., why preserving causal relationships should help anomaly detection) are primarily described at a high level rather than as fully proved theorems.
Experimental Design and Analysis
The experimental design aligns with standard anomaly-detection practices:
- Data Splits: The paper uses a portion of the normal training data for validation and then evaluates on a test set containing anomalies.
- Metrics: AUROC, AUPRC, and F1 are all standard and appropriate.
- Comparisons: The authors test against reconstruction-based and contrastive-based methods, offering thorough performance comparisons.
- Ablations: They turn off individual components (causality-preserving or disturbing augmentations, similarity filtering) to highlight each element’s contribution.
No significant flaws stand out. While it assumes predominantly normal training data, this is common in unsupervised detection research. Overall, the experiment design and analyses appear valid and consistent with standard practice.
Supplementary Material
The authors did not provide supplementary materials, so this item is not applicable.
Relation to Prior Literature
They extend two critical lines of multivariate time-series anomaly detection research:
- Contrastive Learning Approaches: Similar to techniques (e.g., CSI, CTAD) that generate synthetic anomalies, they introduce “causality-preserving” and “causality-disturbing” samples, making their contrastive training explicitly reflect causal structures.
- Causal Discovery: Previous works (e.g., CUTS+, causal formers) show that learning inter-variable causal graphs can improve forecasting. The authors directly embed such causal insights into anomaly detection, bridging causal discovery and self-supervised representation learning.
Missing Important References
The paper cites primary contrastive anomaly-detection methods (e.g., CSI, CTAD) and noteworthy causal-discovery tools (e.g., CUTS+). However, it could reference earlier neural causal discovery work—like Neural Granger Causality [1] —as an additional example that merges neural networks with causal inference to further illustrate the lineage of ideas leading to the proposed CAROTS framework.
[1] Tank, Alex, et al. "Neural granger causality." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4267-4279.
Other Strengths and Weaknesses
Other Strengths
- Originality: Although each element (causal discovery, contrastive learning) has been studied, combining them into a coherent anomaly-detection pipeline is creative.
- Application Potential: The method’s strong performance on real industrial datasets (e.g., SWaT, WADI) hints at practical significance.
Other Weaknesses
- Clarity in Hyperparameters: The paper could further clarify how thresholds, temperature, or other tunings might generalize across domains.
- Interpretability: While causal relationships are central, it would be valuable to see deeper interpretability analyses linking detection results to specific causal graphs or disruptions.
Other Comments or Suggestions
- Writing Style: The manuscript reads well overall, but some sections would benefit from tighter phrasing (e.g., focusing on the essential motivations and the underlying intuition of causality-based data augmentation).
- Minor Edits: The text occasionally uses broad statements like “overlook inter-variable causal relationships” without citations. Cite or clarify specific methods as examples.
- Discussion of Negative Results: Any cases where CAROTS fails or underperforms (e.g., if causal graphs are partially wrong) would further enrich the discussion.
We are grateful for Reviewer Xmpv’s detailed yet positive comments. Overall, the reviewer believes that “the paper’s claims are supported by results on multiple datasets,” which “makes it suitable for real-world scenarios.” This response includes additional experiments and discussion to consolidate our contributions. We would be happy to answer further questions during the discussion period. Lastly, the “Interpretability analysis” will be updated later in the discussion period.
How inaccuracies in the learned causal graph impact CAROTS
To assess the robustness of CAROTS to inaccuracies in the causal graph, we study the performance of CAROTS on the SWaT dataset as diverse perturbations are introduced to the learned causal structure:
- random init: causal edges randomly initialized
- zero init: no causal edges (fully disconnected graph)
- flipped: cause-effect directions reversed
- noisy: Gaussian noise added to the causal matrix
- original: learned causal structure without perturbation
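The five settings above can be sketched as simple operations on a weighted causal adjacency matrix; the noise scale in the "noisy" setting is our assumption for illustration:

```python
import numpy as np

def perturb_causal_matrix(A, mode, rng):
    """Apply one of the robustness-study perturbations to a weighted
    causal adjacency matrix A (a sketch, not the authors' exact code)."""
    if mode == "random":
        return rng.random(A.shape)              # random init: edges re-drawn uniformly
    if mode == "zero":
        return np.zeros_like(A)                 # zero init: fully disconnected graph
    if mode == "flip":
        return A.T.copy()                       # flipped: cause-effect directions reversed
    if mode == "noisy":
        return A + rng.normal(0.0, 0.1, A.shape)  # noise scale 0.1 is our assumption
    return A.copy()                             # original: learned structure unchanged
```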
We report AUROC (mean ± std over 3 seeds) below:
| Perturbation | CAROTS | w/o CPA | w/o A_CD | w/o CPA, A_CD |
|---|---|---|---|---|
| random | 0.841±0.004 | 0.833±0.002 | 0.844±0.004 | 0.833±0.004 |
| zero | 0.848±0.005 | 0.826±0.016 | 0.835±0.030 | 0.616±0.121 |
| flip | 0.831±0.009 | 0.836±0.004 | 0.836±0.010 | 0.836±0.004 |
| noisy | 0.839±0.004 | 0.837±0.004 | 0.842±0.005 | 0.840±0.005 |
| orig | 0.852±0.008 | 0.850±0.005 | 0.861±0.005 | 0.849±0.004 |
These results show that CAROTS is robust under moderate graph perturbations. Notably, combining CPA and A_CD helps preserve performance even when the causal structure is partially inaccurate.
Results when partial anomalies are in the training set
To evaluate the robustness of CAROTS in more realistic settings, we conduct additional experiments where synthetic anomalies are injected into the training set at varying ratios (0% to 20%). Synthetic anomalies are generated by injecting point-level global anomalies (following the same outlier synthesis process as in the main paper).
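A sketch of such a contamination protocol, assuming point-level global anomalies are out-of-range spikes in single variables; the spike magnitude `scale` (in per-dimension standard deviations) is our assumption:

```python
import numpy as np

def inject_point_global_anomalies(x, ratio, scale=5.0, seed=0):
    """Contaminate a (T, D) training series by replacing a fraction
    `ratio` of time steps with an out-of-range value in one randomly
    chosen dimension each (a sketch of the injection protocol)."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    n = int(round(ratio * len(x)))
    if n > 0:
        t = rng.choice(len(x), size=n, replace=False)   # distinct time steps
        d = rng.integers(0, x.shape[1], size=n)         # one dimension per step
        x[t, d] = x.mean(axis=0)[d] + scale * x.std(axis=0)[d]
    return x
```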
The results below (AUROC, mean ± std over 3 seeds on SWaT) show that CAROTS maintains strong performance even when the train data is partially contaminated.
| Ratio | AUROC |
|---|---|
| 0% | 0.861±0.003 |
| 0.1% | 0.856±0.006 |
| 1% | 0.845±0.002 |
| 3% | 0.852±0.003 |
| 5% | 0.848±0.006 |
| 10% | 0.856±0.001 |
| 20% | 0.847±0.001 |
Discussion of neural causal discovery works
While our method builds on recent causal discovery tools like CUTS+, we acknowledge that referencing earlier approaches like Neural Granger Causality [1] would help contextualize the development of CAROTS. We will revise the related work section to include and cite this line of research.
[1] Tank et al., Neural Granger Causality, 2021.
Hyperparameter settings
The dynamic similarity threshold (0.5 → 0.9) follows a fixed schedule that is kept constant across all datasets, with no dataset-specific tuning. Likewise, other hyperparameters such as the temperature are selected based on standard practices from prior contrastive learning literature [2] and kept fixed throughout all experiments.
[2] Kim et al., Contrastive Time-series Anomaly Detection, 2024.
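The fixed threshold schedule can be sketched as follows; the linear interpolation shape is our assumption, since the paper only specifies the 0.5 → 0.9 range:

```python
def similarity_threshold(step, total_steps, start=0.5, end=0.9):
    """Return the similarity-filter threshold at a given training step,
    linearly interpolated from `start` to `end` (shape is our assumption)
    and clamped so the threshold never leaves the [start, end] range."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac
```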
Interpretability analysis
CAROTS is inherently interpretable, as it leverages explicitly learned causal graphs and performs anomaly detection by identifying violations of these relationships. We are currently conducting following interpretability analyses:
- Forecasting-based attribution: identifying variables with high causal forecasting error and tracing them back to their parents in the graph to localize disrupted relationships.
- CDA attribution: comparing real anomalies with synthetic ones generated via CDA to identify which causal subgraphs were likely disturbed.
Other comments
- Tighter phrasing and broad statements: We will surely improve the writing of our paper and clarify broader statements in the revised version.
- “How inaccuracies in the learned causal graph impact CAROTS” presents results under inaccurate causal graphs (where CAROTS may underperform).
Questions
- Handling Imperfect Causal Graphs: included in “How inaccuracies in the learned causal graph impact CAROTS.”
- Similarity filter threshold: included in “Hyperparameter settings.”
- Computational Overhead: Due to the character limit, we would greatly appreciate it if the reviewer could refer to the “Computational cost of CAROTS” section in our response to Reviewer J2ZH.
- Reusable Code: We fully agree that releasing code enhances reproducibility and transparency. We plan to make the implementation publicly available upon acceptance.
The authors' answers resolved my doubts to some extent, so I kept the original positive rating.
Thank you again for your valuable suggestion. In our initial response, we noted that CAROTS is inherently interpretable due to its use of explicitly learned causal graphs, and that we were in the process of conducting further analyses to strengthen this claim.
We are happy to report that we have completed the proposed interpretability experiments. Specifically, we performed a forecasting-based attribution analysis using the Lorenz96 synthetic dataset, where the ground truth anomalous variables are known by construction. Each synthetic anomaly involves injecting abnormal values into 10 randomly chosen variables among 128, allowing us to evaluate whether CAROTS can correctly identify the sources of anomaly.
We use the forecasting error from CAROTS’s causality-conditioned forecaster as a proxy for variable-level anomaly attribution. By computing per-variable errors and comparing them to the true perturbed variables using AUROC, we quantify the model’s ability to localize causal disruptions. The results are as follows:
| Anomaly Type | AUROC |
|---|---|
| Point Global | 0.917 |
| Point Contextual | 0.874 |
| Collective Trend | 0.844 |
| Collective Global | 0.691 |
These results demonstrate that CAROTS can meaningfully identify the anomalous variables, particularly for point-level anomalies that involve localized causal violations. While performance is slightly lower for collective anomalies, the attribution signal remains useful.
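The variable-level attribution evaluation described above can be expressed as a rank-based AUROC over per-variable forecasting errors. Below is a dependency-free sketch of that metric (it assumes no tied error values; the paper does not specify the exact AUROC implementation used):

```python
import numpy as np

def attribution_auroc(errors, anomalous):
    """AUROC of per-variable forecasting errors against ground-truth
    anomalous variables, via the rank-sum (Mann-Whitney) identity."""
    errors = np.asarray(errors, dtype=float)
    pos = np.asarray(anomalous, dtype=bool)
    ranks = np.empty(len(errors))
    ranks[np.argsort(errors)] = np.arange(1, len(errors) + 1)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```

An AUROC of 1.0 means every truly perturbed variable has a higher forecasting error than every unperturbed one, i.e., perfect localization.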
This paper addresses the problem of Multivariate Time-Series Anomaly Detection (MTSAD) by incorporating causality relationships. The authors propose novel data augmentation methods, CPA and CDA, which generate samples by leveraging causality learned with existing causality learning approaches. Furthermore, they propose a novel loss term, the Similarity-filtered One-class Contrastive loss, which enables the model to capture semantic diversity. Finally, CAROTS calculates the anomaly score based not only on distance (A_CL) but also on causality preservation (A_CD). The results demonstrate that the proposed method outperforms existing approaches and exhibits robustness across diverse datasets.
Questions for Authors
In the experiments on VAR, although CAROTS is competitive, other baselines achieve better performance. I believe this trend differs from the results on other datasets and is likely related to dataset characteristics. I request that the authors provide additional analysis on this point.
Claims and Evidence
It is intuitively convincing that leveraging causality relationships helps distinguish anomalies from normal operation. To this end, the authors propose new augmenters to enable the model to capture these relationships. However, it remains unclear how to ensure that causality relationships in normal multivariate time-series remain consistent over time. This is a critical and fundamental assumption of the proposed approach, but there is no theoretical or experimental support for it. I believe this is a major weakness of the paper and strongly recommend that the authors provide some evidence or discussion to address this issue.
Methods and Evaluation Criteria
There are no issues regarding the evaluation criteria in this paper. As for the proposed method, it sounds convincing, but the authors need to provide additional evidence or justification to support their assertions.
Theoretical Claims
There are no theoretical claims in this paper. Please check the comment in “Claims And Evidence”.
Experimental Design and Analysis
Despite their extensive experiments, the paper lacks comparisons with state-of-the-art methods, such as CARLA [1]. For a fair evaluation, I strongly recommend considering more recent and relevant works. Additionally, an ablation study on σ is required: if it is too large, the samples generated by CPA may become anomalous rather than representing normal data. Lastly, the performance of the proposed method is highly dependent on the causal discovery method. CAROTS uses CUTS+, but there is no ablation with other causal discovery methods. I am curious how the performance of CAROTS varies with the choice of causal discovery method.
[1] Darban, Zahra Zamanzadeh, et al. "CARLA: Self-supervised contrastive representation learning for time series anomaly detection." Pattern Recognition 157 (2025): 110874.
Supplementary Material
I also reviewed the supplementary material; in particular, I checked it to find the standard deviations for the main table.
Relation to Prior Literature
The key contribution of the paper is leveraging causality relationships to discriminate between normal and anomalous behavior. Although additional justifications are needed, the proposed approach is convincing.
Missing Important References
This paper cited the related works appropriately.
Other Strengths and Weaknesses
The figures in the paper are well structured and help a lot to understand the methods and process.
Other Comments or Suggestions
No other comments or suggestions.
We would like to thank Reviewer aXkV for the helpful comments, which we believe will enrich the depth of our work. We are delighted that the reviewer finds the proposed method intuitively convincing and notes that it outperforms existing approaches and exhibits robustness across diverse datasets. We tried our best to answer all of the questions during the initial response period, and we are happy to engage in further discussion during the author-reviewer discussion period. Also, the “results of using other causal discovery methods” will be updated later in the discussion period.
Do causality relationships in normal multivariate time-series remain consistent over time?
While we do not assume strict stationarity, previous works in both classical [1, 2] and deep learning-based causal discovery [3] have observed that causal structures often remain stable over time in real-world time-series.
To empirically assess whether this statement holds in our setting, we further analyze the evolution of causal structures in three benchmark datasets: SWaT, WADI, and PSM. For each dataset, we split the normal training data into four disjoint, time-ordered segments (Quarter 1 to 4), train a causal discovery model on each, and compute pairwise cosine similarities between the resulting graphs.
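This consistency check can be sketched as follows; the lag-1 cross-correlation `discover` stand-in is our placeholder for illustration (the paper uses CUTS+ as the actual causal discovery model):

```python
import numpy as np

def segment_consistency(series, n_segments=4, discover=None):
    """Split a (T, D) normal series into time-ordered segments, run a
    causal discovery routine on each, and compare the resulting causal
    matrices pairwise via cosine similarity of their flattened entries."""
    if discover is None:
        def discover(x):
            # Placeholder: lag-1 cross-correlation between x_t and x_{t+1}.
            d = x.shape[1]
            return np.corrcoef(x[:-1].T, x[1:].T)[:d, d:]
    mats = [discover(s).ravel() for s in np.array_split(series, n_segments)]
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {f"Q{i + 1}vsQ{j + 1}": cos(mats[i], mats[j])
            for i in range(n_segments) for j in range(i + 1, n_segments)}
```

With four segments this yields the six pairwise similarities (Q1vsQ2 through Q3vsQ4) reported in the table below.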
Causality Matrix Consistency across Time Segments (Cosine Similarity by Dataset)
| Quarters | SWaT | WADI | PSM |
|---|---|---|---|
| Q1vsQ2 | 0.911 | 0.965 | 0.955 |
| Q1vsQ3 | 0.923 | 0.966 | 0.953 |
| Q1vsQ4 | 0.928 | 0.959 | 0.918 |
| Q2vsQ3 | 0.978 | 0.973 | 0.952 |
| Q2vsQ4 | 0.978 | 0.964 | 0.898 |
| Q3vsQ4 | 0.981 | 0.963 | 0.915 |
The consistently high similarity indicates that the learned causal relationships remain stable across time segments, supporting the validity of our approach.
[1] Spirtes et al., Causation, Prediction, and Search, 2000.
[2] Peters et al., Elements of Causal Inference, 2017.
[3] Kong et al., CausalFormer:..., 2024.
Comparison with CARLA
According to the reviewer’s suggestion, we compare CAROTS with CARLA [4] under the same settings and datasets and report the mean ± std results over three seeds. The results below show that CAROTS outperforms CARLA on most datasets and metrics.
| Dataset | Metric | CARLA | CAROTS |
|---|---|---|---|
| SWaT | AUROC | 0.807±0.034 | 0.852±0.008 |
| | AUPRC | 0.691±0.015 | 0.764±0.003 |
| | F1 | 0.742±0.022 | 0.791±0.008 |
| WADI | AUROC | 0.533±0.056 | 0.622±0.042 |
| | AUPRC | 0.103±0.047 | 0.260±0.021 |
| | F1 | 0.175±0.058 | 0.391±0.076 |
| PSM | AUROC | 0.445±0.041 | 0.783±0.008 |
| | AUPRC | 0.257±0.012 | 0.595±0.007 |
| | F1 | 0.444±0.001 | 0.603±0.011 |
| SMD_2-1 | AUROC | 0.546±0.157 | 0.726±0.023 |
| | AUPRC | 0.156±0.078 | 0.193±0.018 |
| | F1 | 0.202±0.087 | 0.299±0.026 |
| SMD_3-7 | AUROC | 0.483±0.075 | 0.769±0.011 |
| | AUPRC | 0.171±0.050 | 0.430±0.015 |
| | F1 | 0.254±0.069 | 0.564±0.011 |
| MSL_P-14 | AUROC | 0.712±0.165 | 0.782±0.028 |
| | AUPRC | 0.521±0.154 | 0.449±0.030 |
| | F1 | 0.639±0.113 | 0.599±0.051 |
| MSL_P-15 | AUROC | 0.571±0.117 | 0.701±0.008 |
| | AUPRC | 0.150±0.115 | 0.022±0.001 |
| | F1 | 0.272±0.136 | 0.087±0.004 |
[4] Darban et al., CARLA:..., 2025
Ablation study on sigma
Results of ablation study on σ in CPA over a wide range (0 to 0.4) are presented below (SWaT dataset; mean ± std over 3 seeds):
| σ | AUROC | AUPRC | F1 |
|---|---|---|---|
| 0 | 0.850±0.001 | 0.761±0.002 | 0.798±0.001 |
| 0.05 | 0.853±0.003 | 0.762±0.002 | 0.797±0.004 |
| 0.1 | 0.852±0.008 | 0.764±0.003 | 0.791±0.008 |
| 0.2 | 0.849±0.007 | 0.759±0.009 | 0.795±0.000 |
| 0.4 | 0.848±0.002 | 0.762±0.007 | 0.792±0.001 |
Performance remains stable across different values of σ, indicating that the generated samples do not degrade model quality, even at higher noise levels. These results suggest that CPA is robust to the choice of σ within a reasonable range.
Results of using other causal discovery methods
We agree that studying the impact of different causal discovery methods is important for assessing the generality of CAROTS. We are currently running experiments with alternative causal discovery methods, and we will upload the results as soon as the experiments are completed.
Explanation for the VAR results
CAROTS behaves differently on the VAR dataset because VAR has different characteristics from other datasets.
VAR is synthetically generated using a linear autoregressive process, where all variable relationships are linear and stable over time. As a result, methods that rely on modeling correlations or co-occurrence patterns (like TimesNet or USAD) are naturally well-suited for this setting.
In contrast, CAROTS is designed to detect anomalies that disrupt more complex or directional causal relationships, especially in non-linear or dynamic systems. This is why it performs particularly well on datasets like Lorenz96 or SWaT, which reflect those properties.
That said, CAROTS still achieves strong results on VAR for certain anomaly types, such as Point Contextual and Collective Global, where detecting multi-variable inconsistency is important. We believe this suggests that CAROTS complements correlation-based approaches and is especially useful when anomalies reflect deeper structural disruptions.
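For reference, the linear autoregressive generation described above can be sketched as follows (the coefficient matrix, dimension, and noise scale are illustrative, not the paper's exact setup):

```python
import numpy as np

def generate_var1(A, T, noise_std=0.1, seed=0):
    """Simulate a VAR(1) process x_t = A @ x_{t-1} + eps_t.
    All inter-variable relationships are linear and time-invariant,
    which naturally favors correlation-based detectors."""
    rng = np.random.default_rng(seed)
    x = np.zeros((T, A.shape[0]))
    for t in range(1, T):
        x[t] = A @ x[t - 1] + rng.normal(0.0, noise_std, A.shape[0])
    return x

# a coefficient matrix with spectral radius < 1 keeps the process stable
A = np.array([[0.5, 0.2, 0.0],
              [0.0, 0.4, 0.3],
              [0.1, 0.0, 0.5]])
series = generate_var1(A, T=500)
print(series.shape)  # (500, 3)
```

Because every dependence here is a fixed linear weight, co-occurrence statistics already capture the full structure, which is why correlation-based methods are well-suited to this setting.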
Thank you for your informative rebuttal. The authors' responses address most of my concerns, but I hope to see the comparison and analysis of other causal discovery methods. Nonetheless, I have kept the original positive score.
Thank you again for your thoughtful comments and for maintaining a positive score. As suggested, we conducted additional experiments to examine how CAROTS performs under different causal discovery methods. Specifically, we evaluated CAROTS using Neural Granger Causality (NGC) [1], CUTS [2], and CUTS+ [3] across six datasets. Below are the results (AUROC, mean ± std):
| Method | SWaT | WADI | SMD_2-1 | SMD_3-7 | MSL_P-14 | MSL_P-15 |
|---|---|---|---|---|---|---|
| NGC | 0.852±0.007 | 0.485±0.011 | 0.684±0.007 | 0.691±0.011 | 0.764±0.001 | 0.758±0.018 |
| CUTS | 0.855±0.007 | 0.490±0.014 | 0.726±0.077 | 0.694±0.033 | 0.764±0.000 | 0.662±0.003 |
| CUTS+ | 0.852±0.008 | 0.502±0.007 | 0.703±0.021 | 0.769±0.011 | 0.764±0.000 | 0.701±0.008 |
We find that CAROTS consistently performs well across all causal discovery methods, with only modest performance variation. While each method performs best on different datasets (e.g., CUTS+ on WADI and SMD_3-7; CUTS on SWaT and SMD_2-1; NGC on MSL_P-15), the overall performance remains robust and competitive. This indicates that CAROTS does not overly depend on a particular discovery algorithm or exact causal graph structure.
Instead of relying solely on the causal graph produced by a causal discoverer for anomaly detection, CAROTS uses the causal graph as a guide for generating semantically meaningful causality-preserving or causality-disturbing augmentations for contrastive learning. Our additional results confirm that CAROTS's causality-aware contrastive learning, enabled by these causality-informed augmentations, generalizes robustly across discovery methods and datasets.
We appreciate your suggestion, which helped strengthen our empirical validation. We will include this analysis and discussion in the revised version of the paper.
[1] Tank et al., Neural Granger Causality, TPAMI, 2021.
[2] Cheng et al., CUTS: Neural Causal Discovery from Irregular Time-series Data, ICLR, 2023.
[3] Cheng et al., CUTS+: High-Dimensional Causal Discovery from Irregular Time-Series, AAAI, 2024.
The paper proposes a way to detect anomalies from multivariate time-series data using causality. The proposed method employs two data augmentors to obtain causality-preserving and causality-disturbing samples, respectively. Afterwards, regarding those samples as positive and negative samples, contrastive learning is performed to train the encoder of the anomaly detector. Experiments on five real-world and two synthetic datasets validate the effectiveness of the proposed method.
Update after rebuttal
Authors covered most of my concerns in the rebuttal, so I will increase my rating.
Questions for Authors
- In Table 4, why are other combinations such as "w/o A_CL and A_CD" not considered?
- Can the ablation study on SWaT represent the tendency on the other datasets? Why was the ablation study not conducted on other datasets?
- What motivates the authors to use both A_CL and A_CD in scoring anomalies?
- How and why are CPA and CDA more effective than conventional data augmentation methods?
Claims and Evidence
-
The use of two data augmentors to generate positive and negative examples for contrastive learning. In addition, contrastive learning is applied to train the encoder of the anomaly detector, achieving causality-aware anomaly detection.
-
Similarity-filtered one-class contrastive loss (SOC) is further proposed to incorporate hard samples during the training process.
Methods and Evaluation Criteria
- The method looks rather simple: it involves two types of data augmentation and applies existing contrastive learning to the augmented samples.
- There seems to be little theoretical validation of the reasons for proposing each module.
Theoretical Claims
- There is not much theoretical claim or analysis.
Experimental Design and Analyses
- The ablation study in Table 4 should be more complete:
(1) It should be conducted on all 7 datasets used in the overall experiments. It was originally conducted on only 1 dataset, which makes the effect hard to capture. I think one more ablation study on the same datasets as Table 2 is essential to clearly see the effectiveness of the proposed method.
(2) Also, I request that the authors report the baseline performance without data augmentation and contrastive learning. Showing this baseline would help in judging the effectiveness of the method.
(3) The reason for using two types of anomaly score is not yet clear, since there is no comparison to a "w/o A_CL and A_CD" setting that uses neither score.
Overall, I am not fully convinced of the effectiveness of each module, due to the incomplete experimental settings.
Supplementary Material
No supplementary material was submitted.
Relation to Prior Work
The paper has the potential to impact various domains that involve multivariate time-series data.
Important Missing References
The references look rather complete.
Other Strengths and Weaknesses
I think the most critical weakness of the paper lies in the experimental settings: the experiments, especially the ablation studies, are not yet complete. The authors need to design experiments to validate (1) the effectiveness of each module on all of the benchmarks used, and (2) the gap between the baseline and the proposed method.
Also, there is little theoretical validation of the effectiveness of each module.
Other Comments or Suggestions
I suggest that the authors improve the experimental design along the lines mentioned in the previous sections. Also, the authors could include more theoretical analysis of each module.
We thank Reviewer vFwM for the constructive comments. We are encouraged that the reviewer acknowledges our work's potential to impact various domains that involve multivariate time-series data. We hope our response addresses your concerns. Should the reviewer have further follow-up questions, we would be happy to answer them during the discussion period.
Extended ablation study from Table 4
The table below presents the extended version of Table 4 on 7 real datasets and 2 synthetic datasets. For Lorenz96 and VAR, the reported values represent the average performance across the four different synthetic anomaly types. In summary, the extended results are generally consistent with Table 4 in the main paper.
| Config | SWaT | WADI | PSM | SMD_2-1 | SMD_3-7 | MSL_P-14 | MSL_P-15 | Lorenz96 | VAR |
|---|---|---|---|---|---|---|---|---|---|
| w/o CPA | 0.850 | 0.486 | 0.786 | 0.700 | 0.756 | 0.764 | 0.740 | 0.918 | 0.767 |
| w/o CDA | 0.842 | 0.488 | 0.789 | 0.623 | 0.732 | 0.764 | 0.609 | 0.917 | 0.785 |
| w/o SOC | 0.819 | 0.493 | 0.706 | 0.758 | 0.719 | 0.764 | 0.719 | 0.923 | 0.765 |
| w/o A_CL | 0.814 | 0.494 | 0.778 | 0.602 | 0.701 | 0.768 | 0.694 | 0.943 | 0.732 |
| CAROTS† (w/o A_CD) | 0.861 | 0.622 | 0.729 | 0.726 | 0.779 | 0.782 | 0.683 | 0.919 | 0.769 |
| CAROTS | 0.852 | 0.502 | 0.783 | 0.703 | 0.769 | 0.764 | 0.701 | 0.909 | 0.805 |
-
The setting without data augmentation and contrastive learning corresponds to using only A_CD, the causal forecasting-based anomaly score. This result, labeled w/o A_CL, is included both in Table 4 and in the table above. While A_CD alone performs competitively, the full CAROTS, which includes CPA, CDA, SOC, and A_CL, further improves the detection results.
-
Anomaly detection without both A_CL and A_CD is infeasible because, by definition, every detection method requires at least one anomaly score to score samples. Instead, in Table 4 and the extended table above, we report the results of using only one of the two scores (w/o A_CL and w/o A_CD) to demonstrate the efficacy of each score. CAROTS† (w/o A_CD) achieves the highest detection performance on 6 datasets, which highlights the effectiveness of the proposed causality-driven contrastive learning and A_CL. CAROTS, which additionally uses A_CD, further improves CAROTS† on 3 datasets, indicating that the auxiliary A_CD score offers complementary signals for anomaly detection.
-
Table 4 and the extended table demonstrate that CDA and CPA are more effective than conventional data augmentation methods because replacing either one with a conventional method (w/o CPA & w/o CDA) results in a performance drop. More explanation on their effectiveness is detailed in the section below.
Validation for the effectiveness of each module
We hope our extended ablation study, which empirically evidences the effectiveness of each module, alleviates the reviewer's concern about the lack of theoretical validation. Below, we clarify the intuitive justification behind the design of each module.
-
Novel Data Augmentation (CPA + CDA)
- CPA (Causality-Preserving Augmentor) generates causality-preserving variations by perturbing causing variables and using the causal forecaster to reconstruct affected variables, ensuring that augmented samples retain the normal causal behavior. This encourages the model to learn representations that belong to diverse yet causally consistent normal patterns.
- CDA (Causality-Disturbing Augmentor) breaks causal relationships to synthesize anomalies. By injecting perturbations into randomly extracted subgraphs of the causal graph, CDA produces samples that simulate how real anomalies disrupt inter-variable dynamics—something conventional augmentations cannot replicate.
- Compared to conventional time-series augmentation methods that rely on surface-level distortions (e.g., time warping, noise injection), CPA and CDA leverage the underlying causal structure from the training data to obtain more semantically meaningful and task-relevant augmentations. Together, augmentations from CPA and CDA enable the contrastive learning process to discriminate samples based on causal consistency rather than superficial similarity.
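The contrast between the two augmentors can be illustrated on a toy two-variable system where x1 causes x2 at lag 1 (our own simplification, not the paper's implementation): a CPA-style augmentation perturbs the cause and regenerates the effect through the causal rule, while a CDA-style augmentation severs the lagged dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
x1 = rng.normal(size=T)  # cause
x2 = np.empty(T)
x2[0] = 0.0
x2[1:] = 0.8 * x1[:-1] + 0.1 * rng.normal(size=T - 1)  # effect: x2_t = 0.8 x1_{t-1} + noise

def lagged_corr(a, b):
    """Correlation between a_t and b_{t+1}, i.e., the lag-1 dependence."""
    return float(np.corrcoef(a[:-1], b[1:])[0, 1])

# CPA-style: perturb the cause, then regenerate the effect via the causal
# rule, so the lagged dependence x1 -> x2 survives in the augmented sample
x1_cpa = x1 + rng.normal(0.0, 0.1, T)
x2_cpa = np.empty(T)
x2_cpa[0] = 0.0
x2_cpa[1:] = 0.8 * x1_cpa[:-1] + 0.1 * rng.normal(size=T - 1)

# CDA-style: shuffle the effect in time, severing the lagged dependence
x2_cda = rng.permutation(x2)

print(round(lagged_corr(x1, x2), 2))          # close to 1: causal link intact
print(round(lagged_corr(x1_cpa, x2_cpa), 2))  # still close to 1: link preserved
print(round(lagged_corr(x1, x2_cda), 2))      # near 0: link disturbed
```

Surface-level augmentations such as jitter would leave both kinds of sample near the original, whereas this pair explicitly encodes "causally consistent" versus "causally broken".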
-
Novel SOC Loss guides contrastive learning to respect the semantic diversity within normal data by filtering out low-similarity positives early in training. In effect, it yields a more structured embedding space, grouped by semantic diversity.
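A minimal sketch of such a similarity-filtered one-class contrastive loss follows (our own simplification: the temperature, threshold, and the schedule that applies the filter only early in training are assumptions):

```python
import numpy as np

def soc_loss(anchor, positives, negatives, tau=0.1, sim_thresh=0.5):
    """Similarity-filtered one-class contrastive loss (sketch): positives
    whose cosine similarity to the anchor falls below sim_thresh are
    dropped, so semantically distant normal samples are not forcibly
    pulled together; kept positives are attracted, negatives repelled."""
    def cos(a, B):
        return B @ a / (np.linalg.norm(a) * np.linalg.norm(B, axis=1) + 1e-8)
    kept = positives[cos(anchor, positives) >= sim_thresh]
    if len(kept) == 0:
        return 0.0  # all positives filtered out for this anchor
    pos = np.exp(cos(anchor, kept) / tau)
    neg = np.exp(cos(anchor, negatives) / tau)
    # InfoNCE-style objective over the surviving positives
    return float(-np.log(pos / (pos.sum() + neg.sum())).mean())

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)
close_pos = anchor + 0.1 * rng.normal(size=(4, 16))  # high similarity -> kept
far_neg = -anchor + 0.1 * rng.normal(size=(4, 16))   # causality-disturbed negatives
print(round(soc_loss(anchor, close_pos, far_neg), 3))
```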
-
Novel Anomaly Scores (A_CL and A_CD): Once contrastive learning with CPA, CDA, and SOC yields a causality-informed embedding space, A_CL detects anomalies by measuring how much a test sample deviates from the causality-preserving embedding space. In addition, A_CD utilizes the causal forecaster to obtain an auxiliary causality-driven signal for anomaly detection; if the causal forecaster yields a high forecasting error, the sample is more likely to be an anomaly.
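The two scores can be sketched as follows (the centroid-distance form of A_CL, the MSE form of A_CD, and the `alpha`-weighted combination are our assumptions for illustration; the paper's exact formulations may differ):

```python
import numpy as np

def a_cl(z_test, z_normal):
    """Embedding-based score: distance of the test embedding from the
    centroid of causality-preserving (normal) training embeddings."""
    return float(np.linalg.norm(z_test - z_normal.mean(axis=0)))

def a_cd(x_true, x_forecast):
    """Forecasting-based score: error of the causal forecaster; a large
    error suggests the causal relations no longer explain the sample."""
    return float(np.mean((x_true - x_forecast) ** 2))

def anomaly_score(z_test, z_normal, x_true, x_forecast, alpha=0.5):
    # hypothetical convex combination of the two complementary signals
    return alpha * a_cl(z_test, z_normal) + (1 - alpha) * a_cd(x_true, x_forecast)

rng = np.random.default_rng(0)
z_normal = rng.normal(0.0, 0.1, size=(64, 8))  # embeddings of normal windows
score_norm = anomaly_score(np.zeros(8), z_normal, np.ones(10), np.ones(10))
score_anom = anomaly_score(np.full(8, 3.0), z_normal, np.ones(10), np.zeros(10))
print(score_anom > score_norm)  # True: the anomalous pair scores higher
```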
Ethics review flag
We noticed that an ethical review flag was raised for our paper. To our understanding, our work does not involve any ethical concerns, but we would be grateful for any clarification to help us address this appropriately.
[1] The paper proposes a way to detect anomalies from multivariate time-series data using causality, employing two data augmentors to obtain causality-preserving and causality-disturbing samples, respectively. Afterwards, regarding those samples as positive and negative samples, contrastive learning is performed to train the encoder of the anomaly detector. Experiments on five real-world and two synthetic datasets validate the effectiveness of the proposed method. [Reviewer vFwM, Reviewer J2ZH] [2] Although each element (causal discovery, contrastive learning) has been studied, combining them into a coherent anomaly-detection pipeline is creative. [Reviewer Xmpv] [3] This paper proposes a new anomaly detection method called CAROTS, tailored for multivariate time-series data. [Reviewer Xmpv] [4] The authors demonstrate that existing approaches struggle more than CAROTS on anomalies that stem from “broken” causal relationships, thereby lending convincing evidence to the core claim that integrating a causal model helps. [Reviewer Xmpv]
[1] There seems to be little theoretical validation of the reasons for proposing each module. [Reviewer vFwM] [2] Instead, the authors rely on conceptual justifications—particularly around the plausibility that “causality-preserving” vs. “causality-disturbing” samples guide a practical contrastive learning objective—and empirical evidence across multiple benchmark datasets. Hence, there were no formal proofs to check in the text, and all theoretical underpinnings (e.g., why preserving causal relationships should help anomaly detection) are primarily described at a high level rather than as fully proved theorems. [Reviewer Xmpv] [3] We humbly acknowledge that CAROTS may be less sensitive to anomalies that lie within the causal structure. [Authors' response to Reviewer J2ZH]