Association-Focused Path Aggregation for Graph Fraud Detection
Summary
Review and Discussion
This paper addresses the challenges of high-order interactions, intricate dependencies, and the sparsity of fraudulent entities in graph-based fraud detection. The authors point out that existing mainstream fraud detectors suffer from narrow receptive fields, focusing mainly on local neighborhoods and neglecting long-range structural associations. To address these limitations, the paper proposes a novel Graph Path Aggregation (GPA) method that leverages variable-length path sampling, behavior-associated path encoding, path interaction, and aggregation for enhanced fraud detection. Experimental results demonstrate that GPA significantly outperforms mainstream methods, shows superior robustness to noisy labels, and offers stronger interpretability. Additionally, the authors introduce G-Internet, a new synthetic benchmark dataset designed to facilitate research on interpretable fraud detection.
Strengths and Weaknesses
Strengths:
- The proposed approach utilizes variable-length path sampling, behavior-associated path encoding, path interaction, and aggregation-enhanced fraud detection, resulting in a much larger receptive field compared to conventional methods.
- The method demonstrates stronger robustness to noisy labels, as evidenced by experiments.
- The model provides excellent interpretability, allowing for a better understanding of the detected fraudulent patterns.
- The introduction of the G-Internet synthetic dataset is a valuable contribution to the community, enabling further research and benchmarking in interpretable fraud detection.
Weaknesses:
- The construction of the synthetic dataset is not sufficiently justified. It remains unclear how well the fixed-rule simulated anomalies reflect real-world fraudulent behaviors, raising concerns about the real-world applicability and generalizability of the results.
- The reliance on LLM-generated pseudo-labels for website attributes introduces potential noise and bias. The paper does not assess the reliability of these pseudo-labels or their possible effects on model performance.
- The dataset lacks transparency: key details such as user attribute distributions, representative behavioral patterns, and concrete case studies (for both normal and anomalous users) are missing, limiting the ability of the community to fully evaluate the data quality and the scope of the results.
- The evaluation lacks statistical rigor. Given the class imbalance, reporting only mean values (e.g., AP) without standard deviations, confidence intervals, or significance testing makes it difficult to judge the robustness and reproducibility of the reported gains.
- The manuscript does not provide an analysis of the boundary conditions for the proposed method.
Questions
- I am concerned about the dataset construction. While defining anomalous behavior via fixed rules is understandable, it is unclear how the simulated anomalies compare to real-world fraudulent activities. The authors should clarify the representativeness and limitations of the synthetic behaviors.
- The paper mentions using an LLM to generate keywords for each website, which essentially acts as pseudo-labeling. The reliability of LLM-generated pseudo-labels should be discussed, and their potential impact on downstream performance should be evaluated.
- More details about the dataset are necessary. For example, the distribution of user attributes, typical behavioral examples for individual users, and more case studies of both normal and anomalous instances would greatly help the community assess the dataset quality and method effectiveness.
- The reported metrics (especially AP) are sensitive to class imbalance, yet only mean values are presented. The paper lacks standard deviations, confidence intervals, or statistical significance analysis.
Limitations
The paper does not analyze the boundary conditions under which the proposed method works or fails. I recommend that the authors explicitly discuss the assumptions, potential failure cases, and the method’s robustness to violations of key assumptions.
Justification for Final Rating
The authors have provided satisfactory responses to my questions, particularly regarding the motivation behind dataset construction and the discussion on boundary conditions. As I have already given positive feedback in the previous discussion, I choose to maintain my original score.
Formatting Issues
None
W1: The construction of the synthetic dataset is not sufficiently justified. It remains unclear how well the fixed-rule simulated anomalies reflect real-world fraudulent behaviors, raising concerns about the real-world applicability and generalizability of the results.
Q1: I am concerned about the dataset construction. While defining anomalous behavior via fixed rules is understandable, it is unclear how the simulated anomalies compare to real-world fraudulent activities. The authors should clarify the representativeness and limitations of the synthetic behaviors.
We sincerely appreciate your valuable feedback.
In fact, the 12 empirical fraud rules used to construct the G-Internet dataset were not invented by us; they are patterns that our collaborating institutions had already extracted from real-world data. As we note in the "Limitation and Discussion" section, privacy restrictions prevented these institutions from sharing real-world data directly, so they provided us with the 12 empirical fraud rules instead, and we reverse-engineered the dataset from those rules. In other words, we obtained real-world patterns indirectly, as a necessary workaround for the privacy restrictions. While reverse-engineering the dataset from these real-world rules, we added substantial detail and did our best to faithfully reflect real-world conditions.
Besides, the proposed GPA consistently outperforms baselines on the five other real-world datasets (Elliptic, T-Finance, T-Social, YelpChi, and Amazon), particularly achieving a +9.2% improvement in AP on the T-Social dataset, which has demonstrated its transferability to real-world patterns.
We hope our explanation can address any doubts or concerns you may have. If necessary, we are more than willing to provide further elaboration.
W2: The reliance on LLM-generated pseudo-labels for website attributes introduces potential noise and bias. The paper does not assess the reliability of these pseudo-labels or their possible effects on model performance.
Q2: The paper mentions using an LLM to generate keywords for each website, which essentially acts as pseudo-labeling. (a) The reliability of LLM-generated pseudo-labels should be discussed, (b) and their potential impact on downstream performance should be evaluated.
Your constructive feedback is much appreciated.
(a) Regarding the reliability of LLM-generated pseudo-labels:
In constructing the G-Internet dataset, website keywords generated by LLMs are primarily used to simulate annotation information that is difficult to obtain in real-world scenarios. Their reliability depends on the strong representational capability of LLMs in capturing semantic associations. Although LLM-generated content carries some degree of uncertainty, we mitigate potential risks through the following measures:
- Manual rule constraints: keyword generation is strictly limited to predefined website categories (e.g., banking, gambling) to avoid the semantic drift caused by open-domain generation.
- Posterior verification: we manually sample and review the generated keywords to ensure consistency with the website category. For example, when sampling from the keyword library of the "APK download" website category, strongly associated words such as "installation package" and "cracked version" are deemed reliable.

This design essentially follows common practice in simulation-based datasets, where the goal is not to pursue perfect annotations but to construct a structured benchmark that can be flexibly adjusted.
(b) Regarding the pseudo-labels' potential impact on downstream performance:
Firstly, it is worth noting that we allow for the presence of noise (i.e., some unreliable or mislabelled keywords), as it is also inevitable in real-world data.
Furthermore, we conducted an in-depth analysis of how noise in LLM-generated features affects downstream performance:
- Noise sensitivity test on website labels: we randomly alter a portion of the website labels to simulate noise from incorrect website keywords. The results under different noise ratios are shown below. Performance degrades as the noise ratio increases, which indirectly indicates that the constructed dataset itself contains relatively little noise. Moreover, the diminishing marginal effect of additional website-label noise on downstream performance suggests that the method possesses a certain degree of noise resistance.
| Model | Metric | Current | Current+20% | Current+40% | Current+80% |
|---|---|---|---|---|---|
| GAT | AUC | 94.2 | 87.2 | 88.6 | 84.7 |
| GAT | AP | 57.9 | 36.7 | 33.0 | 30.7 |
| BWGNN | AUC | 95.8 | 92.3 | 92.4 | 92.1 |
| BWGNN | AP | 72.7 | 57.2 | 47.7 | 45.6 |
| GPA | AUC | 99.8 | 98.0 | 95.2 | 94.8 |
| GPA | AP | 97.6 | 82.2 | 65.3 | 61.3 |
- Noise sensitivity test on user labels: As evidenced in Table 3 of the paper, we have specifically validated GPA's robustness against user label noise, where it significantly outperforms baseline methods under both asymmetric and symmetric noise conditions.
- Cross-dataset generalization validation: As shown in Table 2, GPA significantly outperforms baselines across multiple real-world datasets (Elliptic, T-Finance, T-Social, YelpChi, and Amazon), demonstrating that the model's core capabilities are independent of the specific characteristics of G-Internet.
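The website-label noise injection used in the sensitivity test above can be sketched as follows. This is a minimal illustration under our own naming assumptions (`inject_label_noise`, `num_classes`), not the authors' exact implementation; it flips a fixed fraction of labels to a different class, which is the standard way to simulate symmetric annotation noise.

```python
import random

def inject_label_noise(labels, noise_ratio, num_classes, seed=0):
    """Randomly reassign a fraction of labels to a *different* class,
    simulating noise from incorrect website keywords/labels."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(len(noisy) * noise_ratio)
    for i in rng.sample(range(len(noisy)), n_flip):
        # Exclude the current class so every selected label actually changes.
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

Re-training the detector on labels corrupted at increasing ratios (e.g., 20%, 40%, 80%) then yields a sensitivity curve like the one in the table above.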
The combination of rigorous experimental validation and targeted bias mitigation strategies helps ensure that LLM-generated attributes do not compromise GPA's reliability or performance. We appreciate the reviewer for highlighting this important discussion and will further clarify these points in the revised version.
W3: The dataset lacks transparency: key details such as user attribute distributions, representative behavioral patterns, and concrete case studies (for both normal and anomalous users) are missing, limiting the ability of the community to fully evaluate the data quality and the scope of the results.
Q3: More details about the dataset are necessary. For example, the distribution of user attributes, typical behavioral examples for individual users, and more case studies of both normal and anomalous instances would greatly help the community assess the dataset quality and method effectiveness.
We sincerely appreciate your valuable feedback. The most essential details have been provided in Appendix A.1 "Detailed Dataset Construction". Since the details in the dataset construction process are numerous and intricate, it is not possible to show all of them in the paper. We appreciate your understanding.
All original materials and source code used for dataset construction are preserved by us and will be made public after the paper is published.
W4: The evaluation lacks statistical rigor. Given the class imbalance, reporting only mean values (e.g., AP) without standard deviations, confidence intervals, or significance testing makes it difficult to judge the robustness and reproducibility of the reported gains.
Q4: The reported metrics (especially AP) are sensitive to class imbalance, yet only mean values are presented. The paper lacks standard deviations, confidence intervals, or statistical significance analysis.
Thank you for your constructive suggestion. Here, we report the numerical results (mean ± std) for GPA across all datasets, demonstrating its robust numerical stability.
| Dataset | AUC | AP |
|---|---|---|
| G-Internet | 99.8±0.04 | 97.6±0.26 |
| Elliptic | 91.1±0.54 | 75.5±0.59 |
| T-Finance | 97.2±0.07 | 89.6±0.13 |
| T-Social | 99.6±0.02 | 96.0±0.15 |
| YelpChi | 91.9±0.18 | 73.6±0.47 |
| Amazon | 98.1±0.06 | 92.4±0.12 |
We will include the numerical results with standard deviations in the revised version.
W5: The manuscript does not provide an analysis of the boundary conditions for the proposed method.
L1: The paper does not analyze the boundary conditions under which the proposed method works or fails. I recommend that the authors explicitly discuss the assumptions, potential failure cases, and the method’s robustness to violations of key assumptions.
We greatly value these constructive suggestions. Your perspective highlights several critical considerations that we should address in our ongoing work.
The proposed GPA method is designed to detect fraud by leveraging long-range structural associations, and its effectiveness relies on the assumption that fraudulent patterns can be captured through path-based interactions. Potential failure cases may arise when fraudulent behaviors lack discernible path-level patterns or when the graph structure is overly noisy, obscuring meaningful associations. Additionally, the method assumes that the sampled paths are representative of the underlying fraud mechanisms. If this condition is violated, such as due to insufficient path diversity or biased sampling, performance may degrade. To enhance robustness, future work could explore adaptive sampling strategies and incorporate mechanisms to handle highly noisy or sparse graphs.
Regarding further analysis of boundary conditions, we will include a dedicated discussion in the revised manuscript to address these points, including an examination of scenarios where the method might underperform, such as when fraudulent entities exhibit minimal connectivity or when the graph lacks sufficient labeled data for training. We believe this addition will give readers a clearer understanding of the method's limitations and practical applicability.
Thank you once again for your insightful suggestion!
I appreciate the authors’ feedback. I now understand the motivation behind the dataset construction, specifically the goal of interpretable path-level analysis. Given that the revised version emphasizes the study of boundary conditions and provides further details about the dataset, I have no additional concerns at this time.
Dear Reviewer AdLw,
Thank you once again for your time and effort in reviewing our paper. As the discussion period draws to a close, we would like to know if our previous rebuttal has addressed your concerns. If you have further questions, please do not hesitate to let us know.
We are eagerly looking forward to your response.
Sincerely,
Authors of Paper #9464
Dear Reviewer AdLw,
We are delighted to hear that our clarifications regarding the motivation for dataset construction (particularly the focus on interpretable path-level analysis) and the added details on boundary conditions have adequately addressed your concerns.
Once again, we deeply appreciate your acknowledgment of our work and are truly grateful for the time and effort you have dedicated to reviewing our paper!
Best regards,
Authors of Paper #9464
This paper introduces a novel fraud detection method called Graph Path Aggregation (GPA), which addresses the limitations of existing graph-based fraud detectors by capturing long-range structural associations between entities. The authors propose a framework that includes variable-length path sampling, behavior-associated path encoding, path interaction, and aggregation-enhanced fraud detection. Additionally, they synthesize G-Internet, the first benchmark dataset for internet fraud detection, to facilitate interpretable association analysis. Experiments across multiple fraud scenarios demonstrate that GPA outperforms mainstream methods by up to +15% in Average Precision (AP) and exhibits robustness to noisy labels.
Strengths and Weaknesses
Strengths:
- The methodology is well-designed, with clear steps from path sampling to fraud detection. The ablation study and robustness tests (e.g., noisy labels) are thorough.
- The paper is well-structured, with detailed descriptions of each component (e.g., path encoding, interaction). Figures and tables effectively support the narrative.
- Fraud detection is a critical problem, and the proposed method addresses key challenges (e.g., long-range dependencies, interpretability).
- The focus on path-level associations and the integration of behavior features are novel. The approach diverges from traditional GNNs by emphasizing global patterns.
Weaknesses:
- The G-Internet dataset is synthetic, which may limit the generalizability of results. Real-world validation is missing.
- While GPA performs well, the computational cost of path sampling/interaction is not deeply analyzed. Scalability for very large graphs (e.g., T-Social) is unclear.
- The paper could better differentiate GPA from prior path-based GNNs (e.g., PathNet) and justify the need for variable-length paths.
Questions
- G-Internet is synthetic. How do you plan to validate GPA on real-world data, given privacy constraints? Could you simulate noise/artifacts to better mimic real data?
- Path sampling/interaction seems expensive, especially for large graphs like T-Social. How does GPA scale compared to baselines (e.g., training time per batch)?
- The paper notes that very long paths hurt performance due to "information redundancy". What is the optimal range, and how is it determined? Is there a theoretical justification?
- How does GPA fundamentally differ from PathNet or RAW-GNN, which also use paths?
Limitations
Yes, the authors discuss limitations (e.g., synthetic data, privacy constraints) in Appendix A.9.
Justification for Final Rating
Based on the authors' rebuttal and discussion, I keep the previous score unchanged.
Formatting Issues
There are no such concerns.
W1: (a) The G-Internet dataset is synthetic, which may limit the generalizability of results. Real-world validation is missing.
Q1: G-Internet is synthetic. (b) How do you plan to validate GPA on real-world data, given privacy constraints? (c) Could you simulate noise/artifacts to better mimic real data?
Thank you for your valuable comments. These are important considerations for our work.
(a) The other five datasets used in the paper (Elliptic, T-Finance, T-Social, YelpChi, and Amazon) are real-world datasets on which GPA consistently outperforms baselines, notably achieving a +9.2% AP improvement on T-Social, which demonstrates its transferability to real-world patterns.
(b) In the near future, one of the authors will intern at a collaborating institution with access to real-world data, where direct validation can be conducted. We very much look forward to it.
(c) Yes. In fact, we have added substantial details that simulate noise/artifacts to better mimic real data, including but not limited to:
- We randomly mask some user attributes to increase the challenge.
- We randomly mask some elements of the website features to increase the challenge.
- Anomalous users may also visit a large number of normal websites in addition to the websites involved in the fraud rules.
- Benign users mostly visit normal websites, but they may also visit fraud-related websites.
- A large number of benign users satisfy the majority of the conditions in at least one fraud rule.
We have also made a very detailed division of the distribution of user attributes. All original materials and source code used for dataset construction are preserved by us and will be made public after the paper is published.
W2: While GPA performs well, the computational cost of path sampling/interaction is not deeply analyzed. Scalability for very large graphs (e.g., T-Social) is unclear.
Q2: Path sampling/interaction seems expensive, especially for large graphs like T-Social. How does GPA scale compared to baselines (e.g., training time per batch)?
We sincerely appreciate your insightful feedback.
Computational costs
Theoretically, the time complexity and space complexity of path sampling, path encoding, and path interaction are as follows:
| | Path Sampling | Path Encoding | Path Interaction |
|---|---|---|---|
| Time Complexity | $O(N)$ | $O(Nd)$ | $O(N^2 d)$ |
| Space Complexity | $O(N)$ | $O(Nd)$ | $O(N^2)$ |

where $N$ denotes the number of nodes/paths and $d$ denotes the dimension of the path embedding.
In practice, thanks to PyTorch's low-level optimizations, the time required for path interaction does not increase significantly even as the number of paths grows substantially. The average inference time taken by the model to process a batch of data from G-Internet and T-Social is as follows:
| Inference time per batch | Path Sampling | Path Encoding | Path Interaction | Path Aggregation | Fraud Detection | Total |
|---|---|---|---|---|---|---|
| G-Internet | 8.731ms | 14.025ms | 2.502ms | 0.034ms | 0.110ms | 25.402ms |
| T-Social | 4.345ms | 1.974ms | 1.745ms | 0.019ms | 0.079ms | 8.162ms |
For G-Internet, the main computational overhead stems from path encoding, followed by path sampling and path interaction; for T-Social, it stems from path sampling, followed by path encoding and path interaction. This difference is primarily due to the different path encoding approaches: G-Internet encodes paths based on atomic events (Eq. (5)), whereas the other five anonymized datasets adopt a simpler alternative (Lines 159-160) due to the lack of node and edge information, resulting in lower time consumption during the path encoding phase. Overall, both path sampling and path encoding are fast, while path interaction and path aggregation are even faster.
The overall efficiency of the proposed GPA can also be demonstrated through comparisons with other baselines. We have also measured the model's running time. For training, we record the average time taken by the model to process a batch of data (including forward pass, backward pass, and gradient descent). For inference, we record the average time taken by the model to process a batch of data (forward pass only). The dataset used is G-Internet. The model settings used for measuring running time are kept the same as those in the main paper. All experiments use a batch size of 1024 and are conducted on a single NVIDIA GeForce RTX 3090 GPU. The results are as follows:
| G-Internet | CARE-GNN | GDN | GPA |
|---|---|---|---|
| Training time per batch | 821.5ms | 340.3ms | 95.4ms |
| Inference time per batch | 688.3ms | 119.3ms | 25.4ms |
Scalability for very large graphs (e.g., T-Social)
The proposed GPA can easily handle large-scale graphs. In each iteration, a subset of nodes is first sampled, and then several paths are sampled with these nodes as centers. This means that each iteration does not require the entire graph as input and instead only a small portion of the graph is extracted for processing, similar to the concept of minibatch or subgraph sampling, resulting in low GPU memory usage. The processes of path sampling, encoding, interaction, and aggregation are fast. We tested the model's running time on the large-scale graph fraud dataset T-Social (containing 5.78 million nodes and 146 million edges) using the same configuration as the aforementioned test on the G-Internet dataset, and obtained the following results, which further demonstrate the scalability of GPA to large-scale graphs.
| T-Social | CARE-GNN | GDN | GPA |
|---|---|---|---|
| Training time per batch | 601.4ms | 164.2ms | 50.8ms |
| Inference time per batch | 189.6ms | 30.5ms | 8.2ms |
Q3: The paper notes that very long paths hurt performance due to "information redundancy" What is the optimal range, and how is it determined? Is there a theoretical justification?
Thank you for your insightful comment. The optimal range of path length depends on specific fraud scenarios. For complex fraud scenarios, longer paths would help extract more comprehensive fraudulent patterns, while for simpler fraud cases, shorter paths would be more advantageous. Currently, as the complex nature of fraud graphs makes theoretical derivation challenging, we approach the determination of optimal path length as a hyperparameter optimization problem. However, we consistently observe that excessively long paths inevitably have negative effects, so we hypothesize that this is due to information redundancy. The longer the path, the more severe the information loss after path encoding.
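Treating the path length as a hyperparameter, as described above, can be sketched as follows. This is a toy illustration under our own naming assumptions: `train_and_eval` is a stand-in for training the detector at a given maximum path length and returning validation AP; the surrogate lambda below merely mimics the observed pattern that overly long paths degrade performance.

```python
def select_path_length(candidate_lengths, train_and_eval):
    """Grid-search the maximum path length, keeping the one with the
    best validation score (e.g., AP)."""
    scores = {L: train_and_eval(L) for L in candidate_lengths}
    best = max(scores, key=scores.get)
    return best, scores

# Toy surrogate peaking at a moderate length, echoing the reported
# degradation from information redundancy at long lengths.
best, scores = select_path_length(
    [4, 8, 16, 32, 64],
    train_and_eval=lambda L: 1.0 - abs(L - 16) / 64,
)
```

In practice the candidate set and the validation metric would match the fraud scenario at hand, since the optimal range is scenario-dependent.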
W3: The paper could better differentiate GPA from prior path-based GNNs (e.g., PathNet) and justify the need for variable-length paths.
Q4: How does GPA fundamentally differ from PathNet or RAW-GNN, which also use paths?
Thank you for your constructive suggestion. Although PathNet and RAW-GNN also use paths in their frameworks, their primary objective is to address the heterophily issue in graphs, where the neighbor aggregation mechanisms of traditional GNNs may inappropriately mix information between heterogeneous nodes. In contrast, our GPA method is designed to address the long-range dependency issue of fraudulent entities in graphs, so it differs fundamentally from PathNet and RAW-GNN from the very starting point. Moreover, PathNet and RAW-GNN typically only utilize short paths of 2–7 hops and do not involve path interactions, indicating that they still operate within the framework of "models with local perception".
Our GPA method leverages paths to address the long-range dependency issue of fraudulent entities in graphs, where traditional GNNs with only "local perception" lack a global perspective on high-order, complex, sparse, yet concealed fraudulent patterns. Besides, our approach can also unveil hidden fraudulent patterns across a broader scope.
On the need for variable-length paths: Variable-length paths are essential because real-world fraud patterns exhibit diverse structural spans. Fixed-length paths either truncate critical long-range associations or introduce noise from irrelevant nodes in shorter fraud scenarios. Our variable-length sampling adapts to this heterogeneity by dynamically capturing paths of contextually relevant lengths. As demonstrated by the ablation study in Section 5.3, variable-length paths can yield performance improvements, highlighting their critical value.
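Variable-length sampling of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: `adj` is an adjacency-list stand-in for the graph, each walk's length is drawn uniformly from a range rather than fixed, and walks stop early at dead ends.

```python
import random

def sample_variable_length_paths(adj, start_nodes, min_len, max_len, seed=0):
    """Sample one random-walk path per start node, with the walk length
    drawn uniformly from [min_len, max_len]. `adj` maps a node id to a
    list of neighbor ids; walks stop early at nodes with no neighbors."""
    rng = random.Random(seed)
    paths = []
    for v in start_nodes:
        length = rng.randint(min_len, max_len)
        path = [v]
        for _ in range(length):
            nbrs = adj.get(path[-1], [])
            if not nbrs:
                break  # dead end: keep the partial path
            path.append(rng.choice(nbrs))
        paths.append(path)
    return paths
```

Drawing the length per walk is what lets the sampler cover both short local fraud patterns and long-range associations within one batch.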
Thanks for your thorough rebuttal. At this point, I don't have any more concerns. I'll stick with my original score, and I'm now inclined to accept this paper.
Dear Reviewer Q88F,
We appreciate your positive assessment and are glad the paper meets your expectations. Thank you once again for your valuable insights, which have undoubtedly strengthened our work.
Best regards,
Authors of Paper #9464
Dear Reviewer Q88F,
Thank you once again for your time and effort in reviewing our paper. As the discussion period draws to a close, we would like to know if our previous rebuttal has addressed your concerns. If you have further questions, please do not hesitate to let us know.
We are eagerly looking forward to your response.
Sincerely,
Authors of Paper #9464
The paper presents a graph-based path sampling and aggregation algorithm to detect fraudulent entities online. The authors argue that path sampling is necessary to capture long-range structural dependencies between entities and unveil fraud patterns. A new dataset is also proposed to simulate the online fraud pattern. Experiments on various datasets demonstrate that the proposed method is better than baselines.
Strengths and Weaknesses
Strengths
- Path sampling and aggregation is a legitimate solution for graph-based fraud detection, and it is more flexible than GNNs
- The dataset and baseline selection are comprehensive
- Interpretability analysis demonstrates the advantage of the proposed GPA method
Weaknesses
- The newly curated dataset is poorly described. GPA's AUC already hitting 99.8% leaves no improvement space for follow-up work
- The definitions and equations in Sec. 3 are introduced based on the proposed Internet Fraud dataset, while GPA has been tested on five other datasets as well. It's unclear how GPA generalizes to those graph datasets that do not have the same node/edge types and features
Questions
- What is the longest path that the GPA is encoded in the G-Internet dataset?
- Does GPA distinguish between behavior feature and node feature in datasets other than G-Internet?
- According to Table 4, the behavior feature plays a very critical role in boosting fraud detection performance. I am wondering if other baselines also use the behavior feature as part of the node feature?
Limitations
- The Sec. 3 should be generalizable to all graph fraud datasets instead of G-Internet only
- The dataset introduction in Sec. 4 is confusing; I need to read the Appendix to fully understand the dataset construction process
- I appreciate the author's effort in building a new fraud dataset simulating real-world fraud patterns, but the near-perfect performance makes the dataset almost useless
Justification for Final Rating
The authors' rebuttal and final remark explain most of my concerns.
In terms of the contribution of the curated datasets, the authors explain how the dataset facilitates the validation of the proposed method, which makes sense to me. I persist that the new dataset may not be used as a benchmark for fraud detection since its performance is saturated, but the way the authors built the graphs and the subsequent analysis have their merit.
Considering all the factors above, I've updated my rating to borderline accept to reflect changes.
Formatting Issues
No.
W1: (a) The newly curated dataset is poorly described. (b) GPA's AUC already hitting 99.8% leaves no improvement space for follow-up work.
We sincerely apologize for any confusion caused.
(a) The most essential details have been provided in Appendix A.1 "Detailed Dataset Construction". Since the dataset construction process is intricate, it is not possible to show all of them in the paper. We appreciate your understanding. All original materials and source code used for dataset construction are preserved by us and will be made public after the paper is published.
(b) It is important to emphasize that the core purpose of the constructed G-Internet dataset is to facilitate interpretable association analysis (as highlighted in the paper's Lines 10-11, 54-56, 67-68, etc.), and its value goes far beyond detection accuracy. Existing fraud datasets are anonymized, obscuring entity semantics and hindering the analysis of model interpretability. Therefore, we constructed G-Internet as a transparent benchmark whose nodes (users/websites) and edges (user-visits-website / user-to-user) carry clear real-world semantics, enabling path-level analysis of fraudulent patterns. For example, the attention maps in Fig. 2 expose path-level reasoning, explaining how the proposed method flags fraud.
It is also worth noting that the difficulty of the constructed dataset can be adjusted by modifying the noise ratio (see Table 3, the dataset's performance can be arbitrarily adjusted by the noise ratio), positive/negative ratio, etc. The results presented in this work are based on one specific configuration. Upon paper's publication, we will release all construction materials and source code, enabling follow-up studies to freely modify these settings for different experimental scenarios.
Moreover, the high performance (AUC 99.8, AP 97.6) of our GPA method is achieved under circumstances where the baseline models perform poorly. The poor performance of baseline models indicates the challenges of our constructed dataset, yet our method excels in addressing these challenges. Our method also achieves near-perfect results on larger public benchmarks like T-Social (AUC 99.6, AP 96.0), highlighting its robustness and generalizability.
We sincerely hope these explanations will help you better assess the contribution of our work.
W2: The definitions and equations in Sec. 3 are introduced based on the proposed Internet Fraud dataset, while GPA has been tested on five other datasets as well. It's unclear how GPA generalizes to those graph datasets that do not have the same node/edge types and features.
We are sincerely sorry for the confusion. The proposed GPA method is generally applicable to all graph fraud datasets in the paper.
Since the other datasets are anonymized, with node/edge/relation types unknown, we could not use them to introduce our path-based method from an interpretable view. For this purpose, we specifically constructed the G-Internet dataset, with transparent data and structure, as a demonstration carrier for our method. This design helps readers understand the methodology more intuitively (as explained in Line 121-123). In fact, the method pipeline in Sec. 3 is designed for general graph structures, so it also applies to graph datasets that do not share the same node/edge types and features. Our approach, to some extent, "homogenizes" heterogeneous graphs, adopting a unified processing paradigm across different node/edge types and features. Additionally, under the overall framework of path interaction and aggregation, the specific implementation of each module can be flexible, since the diversity of graph data necessitates case-by-case analysis.
Below, we elaborate on how the proposed GPA adapts to all graph fraud datasets:
Path Sampling: When sampling paths via random walk, only the ID sequences of the nodes and traversed edges need to be recorded. These nodes and edges can be of any type; the sampling does not depend on any specific semantic information.
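As a concrete illustration, this sampling step can be sketched as a plain random walk that records only IDs. The adjacency format, function name, and variable-length policy below are our illustrative assumptions, not the paper's exact implementation:

```python
import random

def sample_paths(adj, center, num_paths, max_len):
    """Sample variable-length paths from `center` via random walk.

    adj: dict mapping node id -> list of (neighbor id, edge id) pairs.
    Only ID sequences are recorded, so node/edge types do not matter.
    Hypothetical sketch, not the paper's implementation.
    """
    paths = []
    for _ in range(num_paths):
        node, path = center, [center]
        length = random.randint(1, max_len)  # variable-length walk
        for _ in range(length):
            if not adj.get(node):
                break  # dead end: keep the shorter path
            nxt, edge = random.choice(adj[node])
            path.extend([edge, nxt])  # record both edge id and node id
            node = nxt
        paths.append(path)
    return paths
```

Each returned path alternates node and edge IDs, so heterogeneous types are handled uniformly without any semantic lookup.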
Path Encoding: Essentially, this involves combining the features of nodes and edges (if any) along the path. For datasets with known node/edge meanings, they can be processed similarly to G-Internet. If there are other special physical meanings in nodes/edges, custom processing can be applied. Since the other five anonymized datasets cannot provide sufficient node/edge meanings, we took an approximate alternative approach, as we explained in the paper:
Line 159-160: For other datasets with the meaning of relations/events unknown, just simply concatenate the node features along the path ...
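This concatenation fallback can be sketched as follows; the function name `encode_path`, the zero-padding to a fixed maximum length, and the feature layout are our illustrative assumptions:

```python
import numpy as np

def encode_path(node_feats, path_nodes, max_len):
    """Encode a path as the concatenation of its node features,
    zero-padded so every path maps to a fixed-length vector.
    Illustrative sketch of the fallback for anonymized datasets."""
    d = node_feats.shape[1]
    vec = np.zeros(max_len * d)
    feats = node_feats[path_nodes[:max_len]].reshape(-1)  # concat along path
    vec[:feats.size] = feats                              # pad the remainder
    return vec
```

Fixed-length outputs are what allow the later interaction and aggregation stages to treat all paths identically.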
For behavior feature design, different datasets may adopt different approaches according to their own conditions. There is no fixed standard, as we explained in the paper:
Line 173-174: For different scenarios, unique behavior feature can be designed on a case-by-case basis.
For the other five anonymized datasets with unclear node/edge meanings, we designed specific behavior features for them. For details, please refer to our response to Q2 below.
Path Interaction and Aggregation & Fraud Detection: All paths are uniformly encoded into fixed-length vectors, so the processing flow is identical across datasets.
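A minimal sketch of this stage, assuming single-head scaled dot-product self-attention with identity projections and mean-pooling aggregation (the trained model would learn projection weights; this is not the paper's exact architecture):

```python
import numpy as np

def interact_and_aggregate(P):
    """P: (num_paths, d) matrix of fixed-length path embeddings.
    Paths attend to each other, then are mean-pooled into one
    vector for the center node. Minimal illustrative sketch."""
    d = P.shape[1]
    scores = P @ P.T / np.sqrt(d)                 # pairwise path affinities
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # each row sums to 1
    H = attn @ P                                  # every path sees all paths
    return H.mean(axis=0)                         # aggregate to node embedding
```

Because `P` already has a fixed width regardless of path length or node/edge types, this stage is dataset-agnostic, which is why the processing flow is identical across datasets.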
We hope our explanations can adequately address your concerns.
Q1: What is the longest path that the GPA is encoded in the G-Internet dataset?
Thank you for your comment. Theoretically, GPA can encode paths of unlimited length (Eq. (3)(4)). In practice, however, overly long paths may introduce information redundancy and cause information loss after path encoding.
Q2: Does GPA distinguish between behavior feature and node feature in datasets other than G-Internet?
We are glad to clarify this aspect of work. Yes, GPA also distinguishes between behavior feature and node feature in datasets other than G-Internet. The behavior feature is a highly customizable part, as we explained in the paper:
Line 173-174: For different scenarios, unique behavior feature can be designed on a case-by-case basis.
However, the other five datasets used in the paper are anonymized, making it impossible to discern the specific meanings of entities. Therefore, we assumed all nodes were of the same type and adopted a unified approach: decision-tree binning encoding, which uses decision trees to split attribute values into exclusive bins and takes the bin index as the discretized feature. Relative relations between a node's attribute and those of its neighbors are then derived to characterize behavior.
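A hedged sketch of this idea: below, fixed thresholds stand in for the decision-tree-learned split points, and the specific relative statistics (same-bin and lower-bin neighbor shares) are our illustrative choices:

```python
import numpy as np

def behavior_feature(attr, neighbors, thresholds):
    """Sketch of the behavior feature for anonymized datasets.
    attr: (num_nodes,) attribute values; neighbors: dict mapping
    node id -> list of neighbor ids. The paper learns bin thresholds
    with decision trees; fixed `thresholds` stand in here."""
    bins = np.digitize(attr, thresholds)  # discretized attribute per node
    feats = []
    for v, nbrs in sorted(neighbors.items()):
        nb = bins[nbrs] if len(nbrs) else np.array([bins[v]])
        feats.append([
            float(bins[v]),                 # own bin index
            float((nb == bins[v]).mean()),  # share of same-bin neighbors
            float((nb < bins[v]).mean()),   # share of lower-bin neighbors
        ])
    return np.array(feats)
```

The relative statistics capture a node's environment rather than its intrinsic attributes, which is the distinction drawn in our response to Q3 below.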
Experiments on T-Social confirm that the behavior feature designed for anonymized datasets remains effective.
| T-Social | AUC | AP |
|---|---|---|
| w/o behav. feat. | 99.0 | 91.1 |
| w/ behav. feat. | 99.6 | 96.0 |
Q3: According to Table 4, the behavior feature plays a very critical role in boosting fraud detection performance. I am wondering if other baselines also use the behavior feature as part of the node feature?
Thank you for your insightful comment. No, the baseline models do not incorporate behavior feature as part of their node feature. The design and integration of behavior feature constitute a novel contribution of our method.
Existing baselines primarily focus on refining the aggregation mechanisms for neighbor nodes, while our behavior features capture environmental properties of nodes, which differ from the nodes' inherent attributes. This environmental context remains unexplored in prior works.
We also validate the applicability of the designed behavior feature on other fraud detectors as follows, which shows that on G-Internet and T-Social, behavior features provide modest performance gains but are far from dominant. Furthermore, ablation study (Table 4 in the paper) confirms that GPA's performance primarily comes from its path-level interaction for fraudulent pattern uncovering, while behavior features only serve as auxiliary enhancements.
| G-Internet | Metric | CARE-GNN | PC-GNN | GDN | GPA |
|---|---|---|---|---|---|
| w/o behav. feat. | AUC | 70.2 | 79.6 | 90.7 | 99.3 |
| w/o behav. feat. | AP | 15.5 | 22.4 | 45.7 | 90.4 |
| w/ behav. feat. | AUC | 72.4 | 83.4 | 92.9 | 99.8 |
| w/ behav. feat. | AP | 18.6 | 36.9 | 54.8 | 97.6 |
| T-Social | Metric | CARE-GNN | PC-GNN | GDN | GPA |
|---|---|---|---|---|---|
| w/o behav. feat. | AUC | 78.3 | 96.9 | 88.4 | 99.0 |
| w/o behav. feat. | AP | 41.2 | 80.3 | 52.3 | 91.1 |
| w/ behav. feat. | AUC | 78.9 | 97.1 | 89.5 | 99.6 |
| w/ behav. feat. | AP | 43.8 | 81.5 | 55.6 | 96.0 |
L1: The Sec. 3 should be generalizable to all graph fraud datasets instead of G-Internet only.
We apologize again for the confusion. Sec. 3 is generally applicable to all graph fraud datasets in the paper.
Please refer to our response to W2 above.
L2: The dataset introduction in Sec. 4 is confusing; I need to read the Appendix to fully understand the dataset construction process.
We sincerely apologize for the confusion. Due to space constraints, we had to provide only an overview of the dataset construction process in the main text (Sec. 4) and place the detailed construction process in Appendix A.1. As the dataset construction involves an extensive amount of detail, we deeply appreciate your time in reviewing the Appendix and regret any inconvenience caused by this structural limitation.
L3: I appreciate the author's effort in building a new fraud dataset simulating real-world fraud patterns, but the near-perfect performance makes the dataset almost useless.
We apologize again for the confusion. The core purpose of the constructed G-Internet dataset is to facilitate interpretable association analysis (as highlighted in the paper's Line 10-11, 54-56, 67-68, etc.), whose value is far beyond the detection accuracy.
Please refer to our response to W1 (b) above.
Finally, thank you once again for your meticulous review. If you have any other unclear parts, we welcome your corrections and look forward to more discussions with you!
I appreciate the comprehensive responses from the authors. The authors have clarified my questions regarding the behavior feature.
However, my biggest concern is still the contribution and utility of the G-Internet dataset. If the primary contribution of the new dataset is interpretability, only a small portion of the experiments is devoted to interpretability, which is not enough. As the authors claim G-Internet is the first benchmark dataset in the field of internet fraud detection, they should clearly state how this dataset could be used in the future.
Since the proposed methods have decent performance on other benchmarks and the interpretability analysis makes sense, I would raise my rating to weak reject.
Dear Reviewer nY14,
Thank you once again for your time and effort in reviewing our paper. As the discussion period draws to a close, we would like to know if our previous rebuttal has addressed your concerns. If you have further questions, please do not hesitate to let us know.
We are eagerly looking forward to your response.
Sincerely,
Authors of Paper #9464
We sincerely appreciate your consideration in raising your rating. This is a great encouragement for us to continue enhancing our work.
Regarding your unresolved concerns, we would like to provide additional clarification. We are eager to earn your recognition.
Reviewer's new comment 1: If the primary contribution of the new dataset is interpretability, there is only small portion of the experiment for the interpretability, which is not enough.
The interpretability brought by the new dataset (G-Internet) lies not only in the interpretable experimental results (Sec. 5.4), but also in almost every aspect of the paper, such as the dataset itself and the methodology, which is impossible with other anonymized datasets:
- Interpretable Dataset: The constructed G-Internet dataset is transparent, with clear meanings of nodes, edges, and relations, making it inherently interpretable.
- Interpretable Methodology: Taking this transparent dataset as a demonstration carrier, we could introduce our path-based method from an interpretable view, enabling intuitive understanding for readers and demonstrating the proposed method's interpretable nature.
- Interpretable Results: The interpretability analysis (Sec. 5.4) aligns with the primary goal (uncovering common fraud patterns) of our method and the fraud rules inherent in the dataset.
"Interpretable Dataset → Interpretable Methodology → Interpretable Results" forms a coherent and self-consistent framework. Hence, this is the complete essence of interpretability brought by the new dataset, not merely the interpretable results.
Additionally, we would like to clarify the following:
- What Sec. 5.4 "Interpretability" presents are the most critical interpretability results. This reflects our primary goal: to reveal common fraud patterns at the path level, as shown in Fig. 2 & 3.
- Sec. 5.4 could be further expanded with more detailed annotations. For instance, in Fig. 2, when a user falls victim to a brushing scam, the path-level attention map highlights strong correlations among the paths going through zoom-like, meiqia-like, and banking websites. Many such examples exist. Since our method aims to find fraud-related websites and does not require more fine-grained results, we did not provide more detailed annotations in Fig. 2. We will include more attention map illustrations in the revised version, explicitly annotating fraud activity types and high-attention website categories.
Reviewer's new comment 2: As the authors claim G-Internet as the first benchmark dataset in the field of internet fraud detection, authors should clearly state how this dataset could be used in the future.
Future utility of the G-Internet dataset includes but is not limited to:
- Simulating real-world Internet fraud scenarios: Emerging Internet fraud is rapidly evolving and widespread with concealed tactics, posing great harm to society. As the first benchmark dataset in Internet fraud detection, G-Internet serves as a viable alternative when real-world fraud data is unavailable. It is fully transparent and adjustable.
- Evaluating method interpretability: The dataset focuses on Internet fraud, where fraudulent users exhibit strong associations with fraud-related websites while also visiting numerous benign websites. Given its transparent structure, future researchers can use it to qualitatively or quantitatively evaluate whether their methods can accurately identify key fraud-related websites from a user's browsing history (similar to Fig. 2 in the paper).
- Investigating noise robustness: Real-world data is inherently noisy, requiring fraud detectors to be noise-resistant. The constructed G-Internet enables flexible noise definition and adjustment. We examined user label noise in Sec. 5.2 and website label noise in response to Reviewer #AdLw. These analyses provide valuable references for future work, encouraging the development of more robust methods to tackle the challenges of real-world noise.
- Enabling advanced research directions: The transparent dataset enables researchers to explore more challenging tasks. For example, researchers could design methods to extract fraud rules (more concrete than the uncovered fraud patterns in our work) from the graph data. In real-world applications, these refined rules could replace empirical rules (Table 3) and then be deployed on servers to advance fraud detection technologies.
All original materials and source code used for dataset construction will be made public to facilitate follow-up studies across diverse experimental scenarios.
Thank you once again for your time and effort in reviewing our paper. The Author-Reviewer discussion has been extended till Aug 8, 11.59pm AoE. If you have any further questions or comments, please do not hesitate to let us know. We truly value your insights and look forward to continuing this constructive discussion.
Thanks for the quick follow-up regarding my concerns on the utility of the proposed dataset and the question about the interpretability. I will take them into consideration during the reviewer-AC discussion phase.
Dear Reviewer nY14,
Thank you for your thoughtful review and for considering our responses during the discussion phase. We believe our constructed G-Internet dataset plays an important and positive role in terms of: (1) its own interpretability compared to other anonymized datasets, (2) assisting the interpretable description of the proposed path-based method (the main contribution of our paper), (3) enabling the presentation of the final interpretable analysis results for hidden fraud pattern uncovering (the primary goal of our paper), as well as (4) its potential applications in future usage.
Once again, we sincerely appreciate the time and effort you've dedicated to evaluating our work. Thank you!
Sincerely,
Authors of Paper #9464
This paper proposes a novel method called Graph Path Aggregation (GPA) to overcome the “receptive field” limitations commonly observed in existing GNN-based fraud detection models. GPA samples variable-length paths from each user node, encodes these paths by incorporating behavioral features, and leverages self-attention mechanisms to model interactions among paths, thereby capturing long-range structural associations. The aggregated path-level information is then used to predict whether a user is fraudulent or benign. Additionally, the authors introduce G-Internet, a new benchmark dataset specifically designed for internet fraud detection. Through extensive experiments across multiple domains—including finance, online reviews, and social networks—the proposed method demonstrates up to 15% improvement in Average Precision (AP) compared to existing approaches.
Strengths and Weaknesses
Strengths:
- Novel path-based approach: Unlike conventional GNNs that primarily rely on 1-hop neighbor information, the proposed GPA method captures long-range structural dependencies by aggregating path-level information combined with behavioral features. This represents a significant modeling innovation.
- Strong performance and robustness: GPA consistently outperforms existing methods across multiple domains. Notably, it maintains a high AP of 83.8% even under 80% label noise, demonstrating strong robustness to noisy annotations.
- Interpretability contribution: The analysis of attention maps (Figures 2 and 3) reveals that the model assigns higher attention weights to fraud-related paths, offering a visually intuitive explanation that aligns well with domain knowledge and enhances interpretability.
- Contribution of the G-Internet dataset: Existing fraud detection datasets are often anonymized, making interpretability difficult. G-Internet, constructed based on well-defined fraud rules, provides a transparent, interpretable benchmark that facilitates the evaluation of association-focused detection methods.
Weaknesses:
- Limitations of G-Internet as a simulation-based dataset: Since G-Internet is synthetically generated based on 12 predefined rules rather than real-world data, there is a risk that the model may overfit to these handcrafted patterns. As such, it may fail to generalize to the complexity and evolving nature of actual fraud behavior.
- Limited path lengths in certain datasets: Although the paper emphasizes capturing long-range dependencies, the maximum path length is only 5 in some datasets (e.g., Elliptic, T-Finance). This narrow scope may weaken the claim of capturing truly long-range associations and reduce the distinctiveness compared to traditional GNNs.
- Lack of computational complexity and scalability analysis: The GPA framework involves intensive operations such as path sampling, behavior feature construction, and attention-based interaction. However, the paper does not provide an analysis of computational complexity (e.g., O-notation) or discuss the scalability of the approach to large-scale graphs.
- Disparity in baseline performance: In Table 2, many baseline models perform extremely poorly on G-Internet, while GPA achieves substantially higher scores. This raises concerns that the dataset might be tailored to favor the proposed method.
- Limited evaluation metrics: In fraud detection, Recall is often more critical than Precision. However, the paper evaluates performance using only AUC and Average Precision (AP). Including metrics like Recall or F1-score would provide a more comprehensive and practically relevant evaluation.
Questions
1. The G-Internet dataset is simulation-based and may not fully capture the dynamic and evolving patterns of real-world fraud. Do the authors have any plans to validate the model on real datasets or to collaborate with external partners for real-world evaluation?
2. For datasets where the maximum path length is limited to 5 (e.g., Elliptic or T-Finance), it is unclear whether GPA captures fundamentally different information compared to standard GNNs like GCN or GraphSAGE. Do the authors have plans to provide more experimental evidence to support the distinctiveness of GPA under such settings?
3. Given the computational complexity of GPA (including path sampling, behavior encoding, and attention-based interactions), it would be helpful to include an analysis of time and space complexity. Can the authors provide further discussion on the scalability of GPA to large-scale graphs (e.g., with tens of millions of nodes)?
4. In fraud detection tasks, Recall is often more critical than Precision. Beyond Average Precision (AP), could the authors also report metrics such as Recall@K or F1-score to more comprehensively demonstrate the effectiveness of GPA?
Limitations
The authors have not adequately addressed the limitations or potential negative societal impacts of their work. The paper lacks discussion on key limitations such as the limited realism of the simulation-based G-Internet dataset, risks of overfitting to predefined fraud patterns, computational scalability, and the narrow scope of evaluation metrics. Moreover, it does not consider potential societal consequences, including deployment risks in high-stakes settings, the possibility of reinforcing biases through behavior-based features, and the dangers of premature adoption due to overclaimed performance.
Justification for Final Rating
The authors have provided satisfactory responses to my questions overall. In particular, I appreciate the inclusion of Recall and F1-score metrics, which are important in fraud detection tasks and show strong results in this case. I believe this addition strengthens the paper. I had previously assigned a positive rating, and based on the rebuttal, I do not find any compelling reasons to either raise or lower it. Therefore, I will maintain my current rating.
Formatting Issues
W1: (a) Limitations of G-Internet as a simulation-based dataset: Since G-Internet is synthetically generated based on 12 predefined rules rather than real-world data, there is a risk that the model may overfit to these handcrafted patterns. As such, it may fail to generalize to the complexity and evolving nature of actual fraud behavior.
Q1: (b) The G-Internet dataset is simulation-based and may not fully capture the dynamic and evolving patterns of real-world fraud. (c) Do the authors have any plans to validate the model on real datasets or to collaborate with external partners for real-world evaluation?
We appreciate the reviewer for raising these points.
(a) In fact, the 12 empirical fraud rules used to construct the G-Internet dataset were not invented by us; they are patterns that our collaborating institutions had already extracted from real-world data. As we mentioned in the "Limitation and Discussion" section, privacy restrictions prevented our collaborating institutions from providing real-world data directly, so they provided these 12 empirical fraud rules instead, from which we reverse-engineered realistic data. In other words, we adopted an indirect approach to approximate real-world data, a necessary workaround for the privacy restrictions. While reconstructing the dataset from these real-world rules, we added substantial detail and tried our very best to mirror the real world.
(b) The dataset can support the dynamic expansion of fraud rules and custom noise injection to simulate the evolution characteristics of fraudulent activities. Since existing public graph fraud datasets are all static, to maintain consistency in the research, the experimental analysis in this paper is conducted under static data scenarios as well.
GPA consistently outperforms baselines on the five other real-world datasets (Elliptic, T-Finance, T-Social, YelpChi, and Amazon), notably achieving a +9.2% improvement in AP on T-Social, demonstrating its transferability to real-world patterns.
(c) Yes, in the near future, I will be doing an internship at a collaborating institution, where I will have access to real-world data and can conduct validation. I am very much looking forward to it.
Q2/W2: For datasets where the maximum path length is limited to 5 (e.g., Elliptic or T-Finance), it is unclear whether GPA captures fundamentally different information compared to standard GNNs like GCN or GraphSAGE. Do the authors have plans to provide more experimental evidence to support the distinctiveness of GPA under such settings?
We appreciate your insightful comment. The path length should be tailored to the specific fraud scenario, as certain fraud patterns are captured more effectively through shorter paths. Path interaction and aggregation effectively preserve the original structural information, whereas the aggregation mechanisms of standard GNNs (such as GCN or GraphSAGE) fuse node information and can therefore lose structural fraud signals (i.e., over-smoothing).
Furthermore, the core differences between our method and standard GNNs are as follows: First, standard GNNs only aggregate information from one-hop neighbors, which differs from our higher-order path-based approach. Second, standard GNNs' aggregation process lacks interaction between neighboring nodes, with weight allocation based solely on the nodes' own features. In contrast, our GPA method, through path interaction, can produce interpretable clustering-like effects (please refer to the visual results in Fig. 2, Sec. 5.4). In future work, we will further validate this through additional visualization analysis.
Q3/W3: Computational complexity and scalability to large-scale graphs.
Your constructive feedback is much appreciated. These suggestions present important opportunities for us to refine and strengthen our work.
Computational complexity
Theoretically, the time complexity and space complexity of path sampling, path encoding, and path interaction are as follows:
| | Path Sampling | Path Encoding | Path Interaction |
|---|---|---|---|
| Time complexity | | | |
| Space complexity | | | |

where $N$ denotes the number of nodes/paths and $d$ denotes the dimension of the path embedding.
In practice, due to PyTorch's low-level optimizations, the time required for path interaction does not increase significantly even as the number of paths grows substantially. The average inference time for the model to process a batch of G-Internet data is as follows:
| GPA | Path Sampling | Path Encoding | Path Interaction | Path Aggregation | Fraud Detection | Total |
|---|---|---|---|---|---|---|
| Inference time per batch | 8.731ms | 14.025ms | 2.502ms | 0.034ms | 0.110ms | 25.402ms |
We have also measured the model's running time. For training, we record the average time taken by the model to process a batch of data (including forward pass, backward pass, and gradient descent). For inference, we record the average time taken by the model to process a batch of data (forward pass only). The dataset used is G-Internet. The model settings used for measuring running time are kept the same as those in the main paper. All experiments use a batch size of 1024 and are conducted on a single NVIDIA GeForce RTX 3090 GPU. The results are as follows:
| G-Internet | CARE-GNN | GDN | GPA |
|---|---|---|---|
| Training time per batch | 821.5ms | 340.3ms | 95.4ms |
| Inference time per batch | 688.3ms | 119.3ms | 25.4ms |
Scalability to large-scale graphs
The proposed GPA can easily handle large-scale graphs. In each iteration, a subset of nodes is first sampled, and then several paths are sampled with these nodes as centers. This means that each iteration does not require the entire graph as input and instead only a small portion of the graph is extracted for processing, similar to the concept of minibatch or subgraph sampling, resulting in low GPU memory usage. The processes of path sampling, encoding, interaction, and aggregation are fast. We tested the model's running time on the large-scale graph fraud dataset T-Social (containing 5.78 million nodes and 146 million edges) using the same configuration as the aforementioned test on the G-Internet dataset, and obtained the following results, which further demonstrate the scalability of GPA to large-scale graphs.
| T-Social | CARE-GNN | GDN | GPA |
|---|---|---|---|
| Training time per batch | 601.4ms | 164.2ms | 50.8ms |
| Inference time per batch | 189.6ms | 30.5ms | 8.2ms |
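The subgraph-style minibatching described above can be sketched as follows; the generator below is a hypothetical illustration of how each iteration touches only a bounded set of center nodes rather than the full graph:

```python
import random

def minibatch_iter(num_nodes, batch_size, seed=0):
    """Yield disjoint batches of center-node ids covering the graph.
    Each iteration then samples paths only around these centers,
    keeping memory bounded by the batch size. Illustrative sketch."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)  # randomize epoch order
    for i in range(0, num_nodes, batch_size):
        yield nodes[i:i + batch_size]
```

Since path sampling, encoding, interaction, and aggregation all operate per batch, graph size affects only the number of iterations, not peak GPU memory.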
Q4/W5: Additional metrics (Recall@K and F1-Score).
Thank you for your constructive suggestion. Here, we report the Recall@K and F1-Score metrics across all datasets. The results indicate that our proposed GPA achieves excellent performance on these metrics compared to baseline models.
| Dataset | Metric | GAT | BWGNN | PC-GNN | GPA |
|---|---|---|---|---|---|
| G-Internet | Recall@K | 58.6 | 64.1 | 26.3 | 93.4 |
| G-Internet | F1-Score | 77.6 | 77.7 | 56.4 | 96.5 |
| Elliptic | Recall@K | 37.9 | 42.5 | 43.8 | 69.9 |
| Elliptic | F1-Score | 57.3 | 59.2 | 62.9 | 81.7 |
| T-Finance | Recall@K | 79.8 | 84.2 | 79.1 | 84.5 |
| T-Finance | F1-Score | 64.9 | 89.1 | 58.2 | 92.2 |
| T-Social | Recall@K | 42.1 | 75.8 | 73.5 | 90.3 |
| T-Social | F1-Score | 63.4 | 85.3 | 48.6 | 95.4 |
| YelpChi | Recall@K | 44.2 | 56.7 | 43.8 | 68.2 |
| YelpChi | F1-Score | 57.1 | 76.6 | 63.7 | 79.0 |
| Amazon | Recall@K | 82.6 | 85.9 | 85.3 | 86.1 |
| Amazon | F1-Score | 70.5 | 91.5 | 89.9 | 92.3 |
W4: Disparity in baseline performance: In Table 2, many baseline models perform extremely poorly on G-Internet, while GPA achieves substantially higher scores. This raises concerns that the dataset might be tailored to favor the proposed method.
We sincerely regret any confusion caused. Existing benchmarks such as T-Social, T-Finance, and Elliptic are mostly homogeneous graphs with simple node and edge/relation types, whereas real-world fraud graphs are highly heterogeneous, with complex node and edge/relation types. Thus, our newly constructed G-Internet is not tailored to favor the proposed method, but to reflect the real world.
Consequently, existing baselines perform extremely poorly on G-Internet because it exhibits high heterogeneity (diverse node and edge types), a challenge that far exceeds that of the other datasets. In contrast, T-Social, T-Finance, and Elliptic are structurally homogeneous graphs, making them simpler for baseline models to fit.
We hope our explanation can address your concerns.
L1: The authors have not adequately addressed the limitations or potential negative societal impacts of their work. The paper lacks discussion on key limitations such as the limited realism of the simulation-based G-Internet dataset, risks of overfitting to predefined fraud patterns, computational scalability, and the narrow scope of evaluation metrics. Moreover, it does not consider potential societal consequences, including deployment risks in high-stakes settings, the possibility of reinforcing biases through behavior-based features, and the dangers of premature adoption due to overclaimed performance.
Many thanks for these helpful suggestions. Regarding the first few points you mentioned, we have touched upon them in our responses to the previous questions, which you can refer to. Regarding your final point about potential societal consequences, we have strengthened our discussion on this issue:
While GPA enhances fraud detection across high-stakes domains, we acknowledge deployment risks and emphasize that real-world implementation requires rigorous safeguards (such as combining manual review to avoid fully automated decision-making) against potential biases. In addition, our performance claims are grounded in multi-scenario benchmarks, and we will provide more rigorous performance claims in the revised version.
Sorry for the late response — I’ve been traveling abroad. Overall, I believe the authors have provided satisfactory answers to my questions. In particular, the inclusion of Recall and F1-score metrics is important, and the results look strong as well. I had already given a positive rating previously, and I don’t see any new reasons to either raise or lower it. Therefore, I will maintain my current rating.
Dear Reviewer AGHk,
Thank you very much for your time and effort in reviewing our paper. We truly appreciate your feedback and are delighted to hear that our rebuttal addressed your questions satisfactorily.
We also wish you a pleasant trip!
Best regards,
Authors of Paper #9464
Dear Reviewer AGHk,
Thank you once again for your time and effort in reviewing our paper. As the discussion period draws to a close, we would like to know if our previous rebuttal has addressed your concerns. If you have further questions, please do not hesitate to let us know.
We are eagerly looking forward to your response.
Sincerely,
Authors of Paper #9464
Dear Reviewers:
As we approach the close of the discussion period, please read the authors' responses and engage in the discussion with the authors.
AC
The proposed Graph Path Aggregation (GPA) model exploits variable-length path sampling, behavior-associated path encoding, path interaction and aggregation, and aggregation-enhanced fraud detection for overcoming the limitation of narrow receptive fields in existing fraud detection methods. Additionally, the G-Internet dataset was constructed to support interpretable analysis of fraudulent behaviors in the field of internet fraud detection. Reviewers gave positive scores on this work in their final reviews and acknowledged its model design, novel ideas, and performance. During the rebuttal and discussion, the authors provided additional clarifications/explanations and additional experiments, which should be incorporated into the final version of the manuscript (e.g., motivation of the new dataset, additional metrics, computational complexity/costs, noise sensitivity tests, etc.).