The Case for Learned Provenance-based System Behavior Baseline
Summary
Reviews and Discussion
This paper proposes a new ML method for anomaly detection in provenance graphs. The results demonstrate the effectiveness of the proposed method.
Questions for Authors
N/A
Claims and Evidence
The claims and assumptions as well as the evaluation metrics are reasonable to me.
Methods and Evaluation Criteria
The proposed method is well presented.
Theoretical Claims
The paper does not have a theoretical claim.
Experimental Design and Analysis
The experiment design is comprehensive in general and the results are impressive. I would suggest adding experiments on transformers if possible.
Supplementary Material
I read through it.
Relation to Prior Work
This paper advances the field of ML-based anomaly detection, an important and practical domain.
Important Missing References
I would suggest the authors discuss more existing works on anomaly detection, e.g., CADE [USENIX'21].
Other Strengths and Weaknesses
The paper is well written.
Other Comments or Suggestions
N/A
Thank you for your review efforts and insightful comments. We provide our responses to each specific issue in order.
Q1: I would suggest adding experiments on transformers if possible.
A1: Intrusion detection and threat analysis are computationally intensive tasks with stringent real-time requirements. Due to the complexity of attacks and their spatiotemporal persistence, detection must operate over large-scale provenance graphs.
We have tried Transformers such as BERT and TinyBERT. Due to the large dataset size (millions of unique system events), Transformers exhibited significantly higher processing times: as shown in Table 2, their event processing time is tens to hundreds of times that of traditional NLP models. Given the real-time requirements of the detection process, we abandoned these more advanced models due to computational and time constraints. Following existing works such as Flash [S&P'24], which uses Word2Vec for node encoding, and ProvDetector [NDSS'20], which employs Doc2Vec, we adopted similar NLP models and configurations, given the need for efficient large-scale graph encoding.
Moreover, our experimental results (Table 4) indicate that the Transformer model fails to enhance detection performance in our task, primarily due to the unique characteristics of the provenance graph intrusion detection scenario. While the Transformer architecture surpasses traditional NLP models in contextual modeling capability, the long-term and distributed nature of network attacks necessitates a broader analytical scope beyond a limited n-hop neighborhood. In this context, directly applying SOTA models proves ineffective and raises scalability concerns.
Q2: I would suggest the authors discuss more existing works on anomaly detection, e.g., CADE [USENIX'21].
A2: We have added a table of experimental results comparing against SOTA approaches, highlighting our advantages in accuracy and false-alarm reduction. Since the detection granularity of our approach (path-level) differs from that of some SOTA works (graph-level or node-level), in the revised manuscript we have standardized the detection and comparison at the node level. The results demonstrate the advantages of our approach in terms of accuracy and the reduction of false positives.
Dataset: E3-CADETS
| Method | TPs | FNs | FPs | F1-score |
|---|---|---|---|---|
| Ours | 32 | 1 | 69 | 0.4776 |
| Nodoze | 31 | 2 | 667 | 0.0885 |
| ProvDetector | 20 | 13 | 77 | 0.3636 |
| Flash | 7 | 25 | 6996 | 0.0019 |
| Kairos | 23 | 10 | 119878 | 0.0003 |
Dataset: E3-THEIA
| Method | TPs | FNs | FPs | F1-score |
|---|---|---|---|---|
| Ours | 12 | 5 | 46 | 0.3200 |
| Nodoze | 12 | 5 | 105 | 0.1791 |
| ProvDetector | 0 | 17 | 91 | 0.0000 |
| Flash | 2 | 15 | 23330 | 0.0002 |
| Kairos | 11 | 6 | 10219 | 0.0021 |
Dataset: E3-TRACE
| Method | TPs | FNs | FPs | F1-score |
|---|---|---|---|---|
| Ours | 17 | 1 | 1448 | 0.0229 |
| Nodoze | 17 | 1 | 2689 | 0.0125 |
| ProvDetector | 5 | 13 | 44 | 0.1493 |
| Flash | 4 | 14 | 60484 | 0.0001 |
| Kairos | N/A | N/A | N/A | N/A |
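For reference, node-level F1 can be recomputed from the raw counts with the standard formula F1 = 2*TP / (2*TP + FP + FN). A minimal sketch (the helper name is ours), which reproduces the Ours row on E3-CADETS:

```python
def f1_from_counts(tp: int, fn: int, fp: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), equivalent to 2PR/(P+R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Ours on E3-CADETS: TP=32, FN=1, FP=69
print(round(f1_from_counts(32, 1, 69), 4))  # 0.4776
```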
Thank you for your thoughtful review. We will carefully refine our paper to ensure that its format, content, and presentation remain objective, accurate, and well-reasoned.
This paper proposes a learning-based anomaly detection method for provenance graphs, which are critical for cybersecurity. The approach decouples provenance graphs into system events, encodes them adaptively to handle out-of-vocabulary (OOV) elements and normality shifts, and trains lightweight regression models (MLP, LSTM, CNN) to predict event regularity scores. The experiment demonstrates high accuracy on DARPA datasets.
Update after rebuttal
Thanks for the authors' rebuttal. I have raised my score since most of my concerns have been addressed. However, I'm still concerned about some of the new results presented during the rebuttal: specifically, the practical usability given the low F1-score (0.4776), and how the 69 false positives are handled. I hope the authors can address these points in the revised paper.
Questions for Authors
- Could the authors provide sufficient comparison experiments with other SOTA approaches?
- Why are GNN-based methods excluded from comparisons? Could they be used as baselines for evaluation?
- Could the authors elaborate on how the approach scales with increasing data rates and graph sizes in a production environment?
- How sensitive is the performance of your method to the choice of hyperparameters in the embedding and regression models and decay factor?
- Can the approach handle gradual shifts (not really unseen, but changing) in benign behaviors (e.g., software updates)?
Claims and Evidence
Most of the claims are well-supported.
Methods and Evaluation Criteria
The methodology is clearly motivated by the challenges inherent in processing large-scale and dynamic provenance graphs. The authors explore multiple embedding models and regression models to determine the best combination for predicting event "regular scores" that indicate normal system behavior. The evaluation is performed on realistically simulated DARPA datasets, which is a conventional choice in this field. The authors use accuracy metrics, which are also standard for anomaly detection.
Theoretical Claims
The paper does not focus on deep theoretical derivations.
Experimental Design and Analysis
The paper reports detailed timing and accuracy metrics, which answers most of the research questions. I also like that the paper presents impressive ablation studies on embedding methods and OOV handling approaches.
However, it severely lacks comparison with other state-of-the-art provenance-based intrusion detection methods, such as the most related works the authors discuss. Besides, despite citing the scalability limitations of GNN-based methods, more efficient GNNs exist (e.g., GAT, GraphSAGE), making it strange to exclude them arbitrarily from the comparison. In addition, the choice of detection thresholds is also not evaluated.
Supplementary Material
I have read the appendix, which contains the details of tag-propagation and additional results.
Relation to Prior Work
The work builds on frequency-based anomaly detection (Hassan et al., 2019) and tag propagation (Li et al., 2024). It specifically addresses gaps in handling OOV elements and dynamic system behaviors, which are underexplored in prior PIDS literature.
Important Missing References
- GNN-based anomaly detection: The paper critiques GNNs’ computational overhead but omits comparisons with scalable GNN variants like GraphSAGE (Hamilton et al., 2017) or GAT (Veličković et al., 2018).
- Temporal graph learning: Methods like TGAT (Xu et al., 2020) could strengthen the handling of temporal information in provenance graphs (C3).
Other Strengths and Weaknesses
Strengths:
- The paper addresses a critical problem in real-time intrusion detection.
- The paper is well-written and well-structured; I enjoy reading it most of the time.
- It has a practical focus on real-time detection and storage efficiency, which is important to real-world use.
Weaknesses:
- Some parts of the paper would benefit from a clearer exposition, particularly regarding the selection and tuning of hyperparameters for the various embedding and regression models.
Other Comments or Suggestions
Please refer to the questions below.
Thanks for your review efforts and insightful comments. We provide responses to each specific issue.
Q1: GNN-based anomaly detection & Temporal graph learning
A1: GraphSAGE samples k-hop neighbors instead of traversing the entire graph. APT attacks, however, exhibit complex and prolonged spatiotemporal characteristics, requiring a comprehensive provenance graph collected over an extended period to fully capture them. Moreover, provenance graphs are inherently complex, with numerous nodes and high average degrees, causing exponential information growth during propagation.
GAT employs a global attention mechanism, making it more suitable for small graphs rather than large-scale provenance analysis. Our datasets are large: Cadets (309k entities, 5,509k events), Trace (505k entities, 910k events), and Theia (104k entities, 1,420k events). Cadets generates 2.6GB of audit logs daily over two weeks, while Theia spans 85GB in the same period.
Temporal graph learning models like TGAT or TGN encounter similar issues. While real-time detection is critical, GNNs struggle with streaming data, as time-windowed processing introduces latency. APT attacks spanning disjoint time windows further hinder GNNs from capturing the full attack chain.
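To make the fan-out argument above concrete, here is a back-of-envelope sketch of how an uncapped k-hop neighborhood grows with average degree; the degree and hop values are purely illustrative, not measured from the datasets:

```python
# With average degree d, an uncapped k-hop neighborhood contains
# roughly d + d^2 + ... + d^k candidate neighbors.
def khop_size(avg_degree: int, hops: int) -> int:
    return sum(avg_degree ** h for h in range(1, hops + 1))

print(khop_size(20, 2))  # 420
print(khop_size(20, 4))  # 168420 -- growth is exponential in hops
```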
Q2: The selection and tuning of hyperparameters for the various embedding and regression models
A2: We present the results of hyperparameter tuning for the kernel regularizer below. We observed similar patterns when tuning other hyperparameters (e.g., learning rates and loss functions), finding that they had minimal impact on the model's regression performance. For embedding models, we used the default hyperparameters, adjusting only the embedding dimensions, with results in Figure 3.
| Regularizer | Training Time (s) | Accuracy 1 (%) | Accuracy 2 (%) |
|---|---|---|---|
| L2(0.0001) | 176.51 | 90.59 | 100.00 |
| L2(0.001) | 179.33 | 90.59 | 100.00 |
| L2(0.01) | 229.91 | 90.59 | 100.00 |
| L1(0.0001) | 174.01 | 90.59 | 100.00 |
| L1(0.001) | 278.39 | 90.59 | 100.00 |
| L1(0.01) | 123.69 | 35.86 | 0.00 |
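For readers unfamiliar with the notation, L2(0.0001) and similar entries denote Keras-style kernel-regularizer penalties added to the training loss. A minimal numpy sketch of the penalty term (function and variable names are ours, not from the paper):

```python
import numpy as np

def regularized_loss(base_loss: float, weights: np.ndarray,
                     kind: str = "l2", lam: float = 1e-4) -> float:
    """Add an L1 or L2 kernel penalty to a base loss, mirroring
    Keras-style kernel_regularizer settings such as L2(0.0001)."""
    if kind == "l2":
        penalty = lam * float(np.sum(weights ** 2))
    else:  # "l1"
        penalty = lam * float(np.sum(np.abs(weights)))
    return base_loss + penalty

w = np.array([0.5, -1.0, 2.0])
print(round(regularized_loss(0.1, w, "l2", 0.01), 4))  # 0.1525
```

Larger lam values shrink weights more aggressively, which matches the table: L1(0.01) over-penalizes and collapses accuracy.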
Q3: Sufficient comparison experiments with other SOTA approaches
A3: We provide a comparison table against SOTA approaches on the Cadets dataset, showing that our approach effectively reduces false positives and minimizes manual analysis efforts. We observe comparable results on the Theia and Trace datasets. In the revised manuscript, we standardize the differing detection granularities at the node level.
| Method | TPs | FNs | FPs | F1-score |
|---|---|---|---|---|
| Ours | 32 | 1 | 69 | 0.4776 |
| Nodoze | 31 | 2 | 667 | 0.0885 |
| ProvDetector | 20 | 13 | 77 | 0.3636 |
| Flash | 7 | 25 | 6996 | 0.0019 |
| Kairos | 23 | 10 | 119878 | 0.0003 |
Q4: Why are GNN-based methods excluded from comparisons? Could they be used as baseline for evaluation?
A4: In the preceding table, Flash utilizes GraphSAGE, while Kairos employs TGN. Flash reduces training overhead by limiting dataset size and graph traversal, while Kairos uses a time-window approach for real-time detection. Besides the delay introduced by collecting the first time window, attacks may extend across multiple non-contiguous time windows, resulting in incomplete detection of the attack chain.
Q5: How the approach scales with increasing data rates and graph sizes in a production environment?
A5: For real-time data streams, the model assigns a regular score immediately upon event arrival, maximizing detection efficiency. The initial tags propagate and aggregate along graph edges as described in Section 3.2. Memory usage grows linearly, and the required cache is minimal. To prevent dependency explosions, we implement tag-removal conditions that limit the number of cached tags, effectively controlling memory overhead.
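A minimal sketch of the streaming behavior described above: each event is scored on arrival, and per-node tags live in a bounded cache. The class name, the LRU eviction policy, and the decay value are illustrative stand-ins, not the paper's exact tag-removal conditions:

```python
from collections import OrderedDict

class TagCache:
    """Bounded cache of per-node tags for streaming detection.
    Eviction here (LRU with a hard cap) is an illustrative
    stand-in for the paper's tag-removal conditions."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.tags = OrderedDict()

    def update(self, node: str, score: float, decay: float = 0.9):
        # Aggregate the incoming event score with any existing tag.
        prev = self.tags.pop(node, score)
        self.tags[node] = decay * prev + (1 - decay) * score
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)  # drop the stalest tag

cache = TagCache(capacity=2)
cache.update("procA", 0.9)
cache.update("fileB", 0.8)
cache.update("sockC", 0.2)   # exceeds capacity, evicts procA
print(sorted(cache.tags))    # ['fileB', 'sockC']
```

Because the cache size is capped, memory stays bounded regardless of how many events stream in.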
Q6: How sensitive is the performance of your method to the choice of hyperparameters in the embedding and regression models and decay factor?
A6: The results for embedding dimensions are presented in Figure 3 (c&d). Our answer to Q2 above presents experiments on hyperparameter adjustments for the regression models. We manually fine-tuned the decay factor to achieve the best detection results. As it is not a major contribution, further details are omitted from the main paper.
Q7: Can the approach handle gradual shifts in benign behaviors?
A7: Our approach effectively adapts to gradual changes in benign behaviors. Since gradual shifts may introduce OOV words that are not present in the training set, we encode them as zero vectors to ensure all anomalies are detected. Benign nodes generally share more common elements with benign behaviors in the training set, whereas malicious nodes deviate more significantly from benign behavioral patterns. Since our unit of processing is an event, the regression model assigns benign-like regular scores to events with normal nodes and much lower scores to events related to malicious nodes. These regular scores serve as initial tags for the tag propagation algorithm, propagating and aggregating along edges to compute path-level regular scores. Alerts are triggered only for sequences of multiple suspicious events.
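A minimal sketch of the OOV handling described above, where words unseen during training contribute zero vectors to the event embedding; the vocabulary, dimensions, and event tokens are hypothetical:

```python
import numpy as np

def encode_event(words, vocab, dim=4):
    """Average word vectors; words unseen in training (OOV)
    contribute zero vectors, pulling the embedding toward zero
    the more OOV content an event contains."""
    vecs = [vocab.get(w, np.zeros(dim)) for w in words]
    return np.mean(vecs, axis=0)

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["sshd", "read", "/etc/passwd"]}

seen = encode_event(["sshd", "read", "/etc/passwd"], vocab)
oov = encode_event(["malw0rm", "exec", "/tmp/x"], vocab)  # all OOV
print(np.allclose(oov, 0))  # True: fully-OOV event maps to zero
```

Events mixing seen and unseen words land between the two extremes, which is what lets the regression model separate benign drift from genuinely novel behavior.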
This paper proposes a novel learning-based anomaly detection method that effectively embeds and analyzes large-scale provenance graphs. The approach integrates dynamic graph processing with adaptive encoding mechanisms, which facilitates compact embeddings, effectively addresses out-of-vocabulary (OOV) elements, and adapts to normal variations in dynamic real-world environments. The enhanced baseline is incorporated into a label propagation framework to enable real-time detection. The main contributions are threefold: (1) An adaptive event encoding mechanism is designed to dynamically generate event vectors by analyzing frequency characteristics in system logs; (2) A lightweight regression model is developed to capture normal event distribution patterns with low computational overhead; (3) An integrated framework combining offline learning and online detection is established, where the offline phase constructs behavioral baselines while the online phase performs anomaly path detection through direct comparison with real-time data streams.
Questions for Authors
NA
Claims and Evidence
This paper validates its claims through a comparative analysis of various embedding models in prediction tasks and anomaly path mining.
Methods and Evaluation Criteria
The proposed evaluation metrics – particularly prediction accuracy and precision/recall/F1-scores for fault detection – demonstrate practical significance for the addressed problem.
Theoretical Claims
The theoretical framework and mathematical formulations are fundamentally sound, offering valuable references for this research domain. However, the manuscript would benefit from explicit clarification of variable definitions in the equations, particularly regarding the specification of L and its operational semantics.
Experimental Design and Analysis
A comparative analysis of multiple embedding models in terms of performance is conducted, with a detailed examination of their respective advantages, disadvantages, and underlying reasons.
Supplementary Material
A review of previous contributions in this field is conducted, including the definitions of different embedding models as well as studies on network attacks and defenses.
Relation to Prior Work
This paper provides a detailed review of previous contributions, highlighting their advantages and limitations. Based on the identified limitations, corresponding experiments are designed to address the existing issues.
Important Missing References
The selection of learning models does not include a discussion on Transformer or GRU. It would be beneficial to discuss and compare more recent embedding models.
Other Strengths and Weaknesses
The experiments in the paper are comprehensive, with a strong emphasis on comparative studies. The proposed approach is tested on multiple models and datasets.
It effectively integrates graph networks with adaptive encoding, achieving compact embeddings of elements.
However, it does not include a comparative study with some of the latest neural networks or embedding models.
Other Comments or Suggestions
There is a spelling error in the first paragraph of Section 3.2. Certain definitions need to be clarified, such as the notation L in the text. Additionally, the formulas for some embedding models should be properly introduced.
Ethics Review Issues
NA
Thank you for your review efforts and insightful comments. We provide our responses to each specific issue in order.
Q1: The selection of learning models does not include a discussion on Transformer or GRU. It would be beneficial to discuss and compare more recent embedding models.
A1: Intrusion detection and threat analysis are computationally intensive tasks with stringent real-time requirements. Due to the complexity of attacks and their spatiotemporal persistence, detection must operate over large-scale provenance graphs.
We have tried Transformers such as BERT and TinyBERT. Due to the large dataset size (millions of unique system events), Transformers exhibited significantly higher processing times: as shown in Table 2, their event processing time is tens to hundreds of times that of traditional NLP models. Given the real-time requirements of the detection process, we abandoned these more advanced models due to computational and time constraints.
Moreover, our experimental results (Table 4) indicate that the Transformer model fails to enhance detection performance in our task, primarily due to the unique characteristics of the provenance graph intrusion detection scenario. While the Transformer architecture surpasses traditional NLP models in contextual modeling capability, the long-term and distributed nature of network attacks necessitates a broader analytical scope beyond a limited n-hop neighborhood. In this context, directly applying SOTA models proves ineffective and raises scalability concerns.
Q2: It does not include a comparative study with some of the latest neural networks or embedding models.
A2: The rationale for not utilizing the latest embedding models aligns with our previous discussion. Given the unique characteristics of the provenance-based intrusion detection task and the prolonged, distributed nature of APT attacks, a broader graph structure must be collected and processed as contextual information. In this scenario, directly applying latest models often incurs substantial computational overhead, resulting in suboptimal efficiency and scalability. Instead, we adopt a tag propagation algorithm, which efficiently aggregates contextual information with minimal storage for tags, thereby facilitating computationally efficient detection.
As system logs accumulate over time, provenance graphs inherently grow to a large scale. The Cadets dataset comprises 309k entities and 5,509k events, the Trace dataset includes 505k entities and 910k events, and the Theia dataset contains 104k entities and 1,420k events. Specifically, the Cadets dataset spans approximately two weeks, generating 2.6GB of audit logs per day, whereas the Theia dataset, also covering two weeks, amounts to a total of 85GB. These figures underscore the substantial scale of provenance graphs. Given the current state of technology, analyzing the log data from even a single office computer may necessitate a high-performance server, posing significant challenges for real-world deployment.
GNN-based models or other latest neural networks primarily concentrate on small graphs or local n-hop neighbors. However, due to the prolonged spatiotemporal nature of attacks, attack chains typically span beyond the n-hop range. As a result, analyzing a larger graph structure becomes necessary, which increases the computational burden and leads to excessive resource consumption. Some methods, such as Flash [S&P'24] and Kairos [S&P'24], have employed advanced neural network models, including GNN and TGN, respectively. However, these models still face challenges in effectively solving the issues due to computational overhead. They either reduce the training dataset or overlook critical graph information, which results in suboptimal training performance and poor detection outcomes. Given these considerations, we did not employ the latest neural networks or embedding models for experiments, recognizing their inherent unsuitability for this task.
Q3: There is a spelling error in the first paragraph of Section 3.2. Certain definitions need to be clarified, such as the notation L in the text. Additionally, the formulas for some embedding models should be properly introduced.
A3: Thank you for your thoughtful review. We will carefully refine our paper to ensure that its format, content, and presentation remain objective, accurate, and well-reasoned. We will also add the necessary formulas to improve our paper.
The paper proposes a learning-based anomaly detection workflow for large-scale provenance graphs, addressing challenges like out-of-vocabulary elements and normality shifts in dynamic environments. It integrates dynamic graph processing with adaptive encoding to create compact embeddings, improving anomaly detection accuracy and adaptability. This approach is further enhanced with a tag-propagation framework for real-time anomaly path mining, significantly advancing provenance graph analysis for intrusion detection.
Questions for Authors
Please refer to Weaknesses.
Claims and Evidence
The proposed model claims to handle the real-time dynamic graph problem (OOV words), and it does solve it well through several techniques (treating OOV words as all-zero vectors or using Doc2Vec).
Methods and Evaluation Criteria
The experimental settings, including datasets and evaluation criteria, are fair. The method consists of existing works, but the workflows make sense.
Theoretical Claims
There’s no theoretical claim.
Experimental Design and Analysis
The experimental settings, including datasets and evaluation criteria, are fair.
Supplementary Material
I have reviewed all appendices of this paper.
Relation to Prior Work
The contributions lie in the application perspective, all used modules come from existing works.
Important Missing References
The references covering contributions are comprehensive and well-explained.
Other Strengths and Weaknesses
Strengths:
- The paper is well-written.
- The experimental settings are fair. The whole benchmark is relatively comprehensive.
- The whole paper is well-organized and easy to follow.
Weaknesses:
- The paper feels more like a benchmark for a new application scenario, with limited methodological innovation.
- I wonder how many anomalous edges in the experiments are associated with OOV nodes. This may help to explain why treating OOV words as zero vectors significantly boosts detection performance. Could the authors provide the ratio of anomalous nodes to normal nodes among all OOV nodes? Additionally, if the majority of OOV nodes exhibit normal behavior (in terms of edges), how would this affect performance?
- I suggest renaming the Ablation Study section to Parameter Sensitivity, as Figure 3 doesn’t seem to effectively demonstrate an ablation study in this context.
Other Comments or Suggestions
There are some inconsistent format errors, like the title of Appendix A.2.
Thank you for your review efforts and insightful comments. We provide our responses to each specific issue in order.
Q1: The paper feels more like a benchmark for a new application scenario, with limited methodological innovation.
A1: This manuscript works toward addressing an important and open research problem: performing real-time provenance-based intrusion detection in an adaptive and scalable manner.
Specifically, the continuously generated audit logs in complex modern operating systems lead to the constant emergence of new entities and relationships, which causes the out-of-vocabulary (OOV) problem. The presence of OOV words poses challenges to the adaptability of the model. Furthermore, the complexity of attacks and their spatiotemporal persistence necessitate detection over large-scale provenance graphs, leading to significant computational and memory overhead. These challenges remain open problems in the field of cybersecurity, and existing works lack a general solution to address them. In this paper, we propose an adaptive and scalable approach that reduces false positives and conducts efficient provenance-based intrusion detection.
Q2: How many anomalous edges in the experiments are associated with OOV nodes? Could authors provide the ratio of anomalous nodes to normal nodes among all OOV nodes? If the majority of OOV nodes exhibit normal behavior, how would this affect performance?
A2: We first refine the definition of OOV nodes mentioned in the comment. Each node may contain multiple words, and when a new node appears, it should be referred to as a node containing OOV words rather than simply an OOV node.
Anomalous edges fall into three categories: (1) both nodes are seen, but the edge represents an anomalous access; (2) one of the nodes contains OOV words; (3) both nodes contain OOV words.
In the provenance graphs constructed from the ta1-cadets-e3-official.json.1 log of the CADETS dataset, there are a total of 2,163,141 edges, among which 351 are anomalous. Specifically, there are 2 edges of the first type, 30 edges of the second type, and 319 edges of the third type.
In the provenance graphs generated from the ta1-cadets-e3-official.json.1 log of the CADETS dataset, there are a total of 114,969 nodes. Among them, 2,372 nodes contain OOV words, accounting for 2.06% of all nodes. Among these OOV-containing nodes, 833 are anomalous nodes, while 1,539 are normal nodes, with anomalous nodes comprising 35.11% of the total OOV-containing nodes.
Indeed, most OOV-containing nodes exhibit normal behavior. However, these nodes generally share more common elements with the training dataset, whereas malicious nodes deviate more significantly from the behavioral patterns in the training dataset. Since our unit of processing is an event (edge), the regression model assigns benign-like regular scores to events with normal nodes and much lower scores to events related to malicious nodes in most situations. During real-time detection, these regular scores serve as initial tags for the tag propagation algorithm, propagating and aggregating along information streams to compute path-level regular scores. Alerts are triggered only for sequences of multiple suspicious events.
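A minimal sketch of the path-level aggregation and alerting logic described above; the decay factor, threshold, and the requirement of multiple suspicious events are illustrative values, not the paper's tuned parameters:

```python
def path_score(event_scores, decay=0.8):
    """Aggregate per-event regular scores along a path:
    s_path = decay * s_path + (1 - decay) * s_event."""
    s = 1.0  # start fully benign
    for e in event_scores:
        s = decay * s + (1 - decay) * e
    return s

def alert(event_scores, threshold=0.5, min_suspicious=2):
    # Require a run of multiple low-score events, not a single outlier.
    suspicious = sum(1 for e in event_scores if e < threshold)
    return suspicious >= min_suspicious and path_score(event_scores) < threshold

print(alert([0.9, 0.2, 0.9, 0.9]))       # False: one low event is not enough
print(alert([0.1, 0.1, 0.1, 0.1, 0.1]))  # True: sustained suspicious sequence
```

This is why a single OOV-containing but benign event does not raise an alert: its low score is diluted along an otherwise benign path.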
Q3: I suggest renaming the Ablation Study section to Parameter Sensitivity.
A3: We agree that Figure 3 (a/b/c/d) addresses parameter sensitivity, but we believe that Figure 3 (d/f) demonstrates how our approach to processing provenance graphs significantly improves performance in terms of storage and computational resources compared to traditional frequency databases. We provide a comparison table against SOTA approaches on the Cadets dataset, showing that our approach effectively reduces false positives and minimizes manual analysis efforts. We observe comparable results on the Theia and Trace datasets.
| Method | TPs | FNs | FPs | F1-score |
|---|---|---|---|---|
| Ours | 32 | 1 | 69 | 0.4776 |
| Nodoze | 31 | 2 | 667 | 0.0885 |
| ProvDetector | 20 | 13 | 77 | 0.3636 |
| Flash | 7 | 25 | 6996 | 0.0019 |
| Kairos | 23 | 10 | 119878 | 0.0003 |
In the revised manuscript, since the detection granularity of our approach (path-level) differs from that of some SOTA works (graph-level or node-level), we have standardized the detection and comparison at the node level. The results demonstrate the advantages of our approach in terms of accuracy and the reduction of false positives. Nodoze [NDSS'19] and ProvDetector [NDSS'20] are based on frequency databases, and the results highlight that our approach to processing and embedding provenance graphs improves detection performance compared to traditional methods.
Q4: There are some inconsistent format errors.
A4: Thank you for your thoughtful review. We will carefully refine our paper to ensure that its format, content, and presentation remain objective, accurate, and well-reasoned.
The authors propose a learning-based approach for efficiently embedding and analyzing large-scale provenance graphs. Their method integrates dynamic graph processing with adaptive encoding techniques to generate compact embeddings capable of handling out-of-vocabulary (OOV) elements and distributional shifts in dynamic environments.
One potential concern is the role of OOV nodes in the observed performance gains. Specifically, it would be helpful to know how many anomalous edges in the experiments are associated with OOV nodes. This insight could shed light on why representing OOV elements as zero vectors significantly enhances detection accuracy, and addressing it would further improve the paper.