GraphTrail: Translating GNN Predictions into Human-Interpretable Logical Rules
We generate formula-based global explanations of graph neural networks using symbolic regression over computation trees identified through Shapley values.
Abstract
Reviews and Discussion
The authors introduce a new global explanation generation method for GNNs. Their method uses the fact that message-passing GNNs break a graph into a set of computation trees. They then use Shapley values to compute the influence of each computation tree, which is mapped to a boolean formula over concepts. They note that the search space of computation trees is linear, versus the exponential search space of all possible subgraphs. They construct one computation tree per node, so there are n possible computation trees. After the computation trees are evaluated via Shapley values, the top-k trees are sent to the logical formulator, where logical rules are generated. They then evaluate and compare fidelity scores across several different GNN architectures and datasets. They also evaluate the effect of varying the choice of k.
Strengths
S1. The paper fills a gap in the literature by using Shapley values to generate global explanations over computation trees.
S2. The idea seems sound and successfully reduces the computationally intractable problem of computing Shapley values over all possible subgraphs to just the computation trees of graphs.
S3. The experimental results seem to support some of the authors' claims, and some of the behaviour desirable in interpretable AI can be observed.
Weaknesses
W1. The experimental results are not sufficient to justify this method's superiority as an explanation method. The fidelity of the explanations is compared only to GLG. While it is understandable that, as a global explanation method using logical rules, it should primarily be compared to similar global methods, there should still be some comparison to other explanation methods.
W2. The explanations generated may be more interpretable than those of other global explanation methods. However, the explanations are still not very easy to read and can be very long and complex.
Questions
I suggest the author’s add additional experiments that strengthen this method’s claims. Adding other global explanation methods and even instance-level explanations could demonstrate the strengths of this method.
Please also address all the weaknesses.
Limitations
Yes.
Q1/W1: I suggest the author’s add additional experiments that strengthen this method’s claims. Adding other global explanation methods and even instance-level explanations could demonstrate the strengths of this method.
Answer: We did not compare with any other explainer since none of the existing algorithms generate a logical formula over concepts to explain GNN predictions. Nonetheless, based on this suggestion, we have added three more explainers: GNNExplainer (local) [1], PGExplainer (local) [2] and XGNN (global) [3].
For local explainers, which produce an explanation subgraph for each input graph, we generated explanations for all graphs in the training set. We then formed the logical formula as an OR operation over all these explanation subgraphs. In practice, this means that if any of the local explanations from a graph with a GNN-predicted class Y is present in a test graph, then that test graph is predicted to be of class Y.
XGNN, being a global explainer, operates differently. Instead of extracting an explanation subgraph per input graph, it trains a graph generator that creates graphs maximizing the prediction likelihood of a target class label under the GNN being explained. For our comparison, we generated 3 graphs per class from XGNN and, similar to the local explainers, employed a logical formula that is an OR operation over all these explanations.
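To make this baseline protocol concrete, the following minimal sketch applies the OR rule described above, using networkx for subgraph isomorphism. The function and variable names (`or_rule_predict`, `explanations_by_class`) are illustrative and not taken from any of the explainers' codebases; ties are broken by the first matching class purely for illustration.

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def or_rule_predict(test_graph, explanations_by_class, default_class=0):
    """Assign the first class whose explanation patterns occur inside test_graph.

    explanations_by_class: dict mapping class label -> list of nx.Graph patterns.
    """
    for label, patterns in explanations_by_class.items():
        for pattern in patterns:
            # True if some (node-induced) subgraph of test_graph is isomorphic to pattern.
            if GraphMatcher(test_graph, pattern).subgraph_is_isomorphic():
                return label
    return default_class

# Toy usage: a triangle pattern "explains" class 1; the test graph contains a triangle.
triangle = nx.cycle_graph(3)
test_graph = nx.lollipop_graph(3, 2)  # a triangle (K3) with a 2-node tail
print(or_rule_predict(test_graph, {1: [triangle]}))  # -> 1
```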
The results of this expanded comparison are presented in the table below. Notably, XGNN, despite being a global explainer, shows the weakest performance. This is likely because the graphs it generates are rarely contained (subgraph isomorphic) in any of the test graphs. Local explainers also demonstrate lower efficacy, as their local explanations fail to capture global patterns effectively.
| Algorithm | BAMultiShapes | Mutag | Mutagenicity |
|---|---|---|---|
| GNNExplainer | 0.54 | 0.73 | 0.46 |
| PGExplainer | 0.54 | 0.71 | 0.44 |
| XGNN | 0.45 | 0.30 | 0.43 |
| GLGExplainer | 0.51 | 0.74 | 0.62 |
| GraphTrail | 0.87 | 0.83 | 0.72 |
[1] Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. GNNExplainer: generating explanations for graph neural networks. NeurIPS '19.
[2] Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. 2020. Parameterized explainer for graph neural network. In NeurIPS '20.
[3] Hao Yuan, Jiliang Tang, Xia Hu, and Shuiwang Ji. 2020. XGNN: Towards Model-Level Explanations of Graph Neural Networks. In KDD '20.
W2. The explanations generated may be more interpretable than those of other global explanation methods. However, the explanations are still not very easy to read and can be very long and complex.
Answer: We acknowledge the potential for further enhancing GNN explainability. However, it's crucial to recognize that in datasets where the underlying property is inherently a function of multiple concepts (such as motifs), the ground-truth logical formula itself may necessarily be complex. This is exemplified in the Mutag and Mutagenicity datasets, where the property is attributed to the presence of eight distinct toxicophores (subgraphs) [4]. Similarly, the BAMultishapes dataset, which provides ground-truth logical formulas, exhibits formula sizes (variables + operators) exceeding 10 for both classes.
Given these intrinsic complexities, the challenge lies in striking a balance between interpretability and fidelity. Generating concise, easily interpretable formulas while maintaining high fidelity to the underlying data patterns remains an open research problem.
[4] Debnath AK, Lopez de Compadre RL, Debnath G, Shusterman AJ, Hansch C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem. 1991 Feb;34(2):786-97.
Appeal to the reviewer:
We appreciate the reviewer's constructive feedback on our work. We have incorporated additional baselines as suggested. We would be grateful if the reviewer could reassess our paper in light of these improvements and consider adjusting the rating accordingly.
I would like to keep the rating as it is for the following reasons.
- The additional experiments should have been done in a rigorous manner in the originally submitted paper, rather than added in the feedback in a rather ad-hoc manner.
- For W2, the authors' feedback does not satisfactorily address the weak point.
Dear Reviewer AvDp,
Thank you for your feedback on our rebuttal. We appreciate your input and would like to seek further clarification on two points you raised:
- Experimental Design: You mentioned that our comparison seems ad hoc. Could you please elaborate on which aspects you find ad hoc? We would value your insights on how to design a more robust experiment. As we noted in our rebuttal, GLG is currently the only technique we're aware of that generates a logical formula, which limits our comparison options.
- Interpretability of Formulas: You suggested that existing global explainers might produce more interpretable formulas. Could you provide specific examples of such explainers? To our knowledge, GLG is the only existing baseline in this area. In our analysis, we demonstrated that our approach outperforms GLG in both accuracy and interpretability.
We look forward to your responses, as they will help us improve our work and address any remaining concerns.
Thank you for your time and expertise.
Best regards, The Authors
Dear reviewer,
Please respond to the rebuttal as is expected of you as a NeurIPS reviewer asap! Thanks
This paper proposes a novel method for providing global explanations for GNNs by constructing logical formulas that offer easy-to-understand interpretations for each class. The authors first propose using computation trees instead of subgraphs to construct explanations. They then suggest using Shapley values to evaluate the contribution of each computation tree to the classification results. Finally, they derive logical formulas corresponding to each class through symbolic regression. Compared to traditional instance-level explanation methods, this approach can explain the decision-making process of GNNs from a higher level. Moreover, compared to other model-level explanation methods, this method does not require prior knowledge about the dataset and is easier to understand. Overall, this paper is innovative, has a clear approach, and provides convincing experiments.
Strengths
- Originality
- Compared to previous works that employ concept-based explanations for GNNs (e.g., GLGExplainer), the proposed concept vector form is more comprehensible. This concept vector also addresses a significant challenge in subsequent Shapley value calculations. Additionally, this method is end-to-end, reducing complexity.
- The use of computation trees instead of instance-level subgraph extractors as concepts results in explanations that better reflect the inherent properties of the dataset and are less influenced by the quality of instance-level explainers. The exploration of Shapley values in the context of graphs is relatively novel. The paper analyzes the reasons for the high complexity of Shapley value calculations on graphs and successfully reduces this complexity through a combination of various methods.
- Quality
- This paper starts with the problem of providing global explanations for GNNs, reviewing existing solutions and identifying their shortcomings. It proposes a new approach that uses computation trees and Shapley values to construct concept formulas. The approach is clear, identifying the deficiencies of current methods and proposing solutions. The paper then analyzes the challenges of calculating Shapley values and successfully addresses these challenges using existing methods. This section details the difficulties of computing Shapley values on graphs, providing an effective method for efficient computation.
- Next, the paper uses symbolic regression to generate logical formulas from concept vectors. This process is clear and intuitive. Finally, it tests the proposed method on multiple datasets using GLGExplainer as a baseline, evaluating Fidelity, Robustness, and Data Efficiency. The results demonstrate that this is a reliable and effective method. Overall, the paper presents a clear approach, offering a new method for explaining GNNs using logical formulas and an efficient way to compute Shapley values on graphs, with detailed and comprehensive experiments.
- Clarity
- The paper provides detailed explanations of the problem definition, relevant concepts (such as Shapley value, concept, graph isomorphism, rooted computation tree), and experimental aspects. It clearly describes the task objectives and provides corresponding mathematical formulations, with strict mathematical definitions of the relevant concepts. The experimental section is well-organized, detailing the GNNs used, the process of selecting the hyperparameter k, and the standards used in the explanation process.
- Overall, most of the content in this paper is clearly presented and easily understandable by readers.
- Significance
- The significance of this paper lies primarily in the construction of subgraphs as concepts using computation trees and the efficient computation of Shapley values on graphs.
- First, constructing subgraphs using computation trees greatly reduces the complexity of the search space. Given the message-passing paradigm of GNNs, using computation trees aligns better with the information processing of GNNs compared to other graph structures.
- Second, the paper studies methods for computing Shapley values on graphs, proposing an efficient computation scheme based on existing methods. This contributes to the application of Shapley values in the graph domain.
Weaknesses
- The main weakness of this paper is that it does not adequately describe the specific calculation methods for Shapley values and the symbolic regression methods used. Instead, it merely cites the sources of these methods without providing appropriate descriptions in the text or the appendix. For example, in line 207, it only states that it uses methods from A unified approach to interpreting model predictions without further describing the sampling strategy. Similarly, in lines 247-248, it mentions the multi-population evolutionary algorithm used for symbolic regression but lacks a basic description of this method. In the appendix, lines 521-526 only state that the Depth-First Canonical Form (DFCF) is used to convert computation trees to Canonical Form for efficient tree isomorphism testing, but it also lacks specific descriptions.
- Such a presentation, where the methods are only cited without basic descriptions, makes it difficult for readers to understand the proposed methods. Additionally, Table D in the paper shows the accuracy of the GNNs used in the experiments on various datasets, but the accuracy appears to be lower than that reported in other papers. This may indicate that the GNNs used in the experiments were not sufficiently trained (this issue will be elaborated on in the Questions section).
Questions
- According to the experimental setups in EiG-Search, D4Explainer, and GLGExplainer, the GNN accuracies trained on the BAMultiShapes, Mutag, Mutagenicity, and NCI1 datasets are higher than those presented in Table D of this paper. Were the GNNs used in this paper's experiments sufficiently trained?
- In line 225, the top computation trees with the highest Shapley values are selected. From which dataset or set are these computation trees selected? Is this set composed of computation trees from every node of all graphs in the training set?
Limitations
Apart from the limitations discussed in Section 5 of the paper, in Eq. 7 the GNN may not provide correct classification probabilities for the embedding due to insufficient generalization ability. This can affect the calculation of the Shapley values and, consequently, the overall performance of the method.
W1: The main weakness of this paper is that it does not adequately describe the specific calculation methods for Shapley values and the symbolic regression methods used. Instead, it merely cites the sources of these methods without providing appropriate descriptions in the text or the appendix. For example, in line 207, it only states that it uses methods from A unified approach to interpreting model predictions without further describing the sampling strategy. Similarly, in lines 247-248, it mentions the multi-population evolutionary algorithm used for symbolic regression but lacks a basic description of this method. In the appendix, lines 521-526 only state that the Depth-First Canonical Form (DFCF) is used to convert computation trees to Canonical Form for efficient tree isomorphism testing, but it also lacks specific descriptions.
Answer: Due to space constraints, we omitted the referenced details, as these algorithms serve as tools rather than original contributions of our work. We understand that this decision impacted the comprehension of certain sections. Based on your feedback, we have included more detailed descriptions in the global response above. These descriptions will be added to the appendix, with references from appropriate sections in the main manuscript.
Q1: GNN accuracy in EiG-Search, D4Explainer, and GLGExplainer, are higher. Were the GNNs used in this paper's experiments sufficiently trained?
Answer: GNN accuracies are hard to replicate exactly due to variabilities such as train:test splits and training on a single seed (while we have reported results across three splits and three seeds). Nonetheless, to address this concern, we have taken the exact model weights from EiG-Search and measured the performance of GraphTrail and GLGExplainer. The results are in Table 6 of the pdf attached to the global response. Consistent with previous results, GraphTrail outperforms by similar margins.
| Algorithm (EiG-Search GNN weights) | MUTAG | Mutagenicity | NCI1 |
|---|---|---|---|
| GLGExplainer | 0.56 +- 0.21 | 0.61 +- 0.03 | 0.55 +- 0.01 |
| GraphTrail | 0.86 +- 0.06 | 0.75 +- 0.01 | 0.72 +- 0.01 |
Q2: In line 225, the top computation trees with the highest Shapley values are selected. Is this set composed of computation trees from every node of all graphs in the training set?
Answer: This is correct.
L1: Apart from the limitations discussed in Section 5 of the paper, in Eq. 7 the GNN may not provide correct classification probabilities for the embedding due to insufficient generalization ability. This can affect the calculation of the Shapley values and, consequently, the overall performance of the method.
Answer: As a global GNN explainer, our primary objective is to derive a logical formula that accurately reflects the outputs of the GNN, rather than the ground truth labels. This approach means that even in cases where the GNN's prediction is incorrect due to poor generalizability, our explainer aims to produce a logical formula that predicts the same incorrect label. This fundamental property guides our design choice to compute Shapley values based on the GNN's predictions rather than the ground truth labels. By doing so, we ensure that our explanations faithfully represent the GNN's decision-making process, including any potential errors or biases, thus providing a true reflection of the model's behavior rather than an idealized version of what it should do.
Appeal to the reviewer:
We sincerely thank the reviewer for their thoughtful and constructive feedback on our work. In response to the suggestions provided, we have made the following significant improvements to our manuscript:
- Included detailed descriptions of the additional content recommended by the reviewer.
- Provided further empirical evidence demonstrating the efficacy of GraphTrail on an open-source model.
In light of these improvements, we kindly request the reviewer to reassess our work and consider adjusting the rating to reflect these enhancements. We remain open to any further suggestions or queries you may have.
Dear Reviewer Sg1i,
Thank you for your constructive feedback. We have addressed your comments in our rebuttal. As we are less than a day away from the end of the discussion phase, we would greatly appreciate your feedback on the clarifications provided in our rebuttal.
Regards,
The Authors
Thanks for your response. I will keep the current score.
This paper introduces GraphTrail, an end-to-end global GNN explainer that provides logic formulas over subgraph-level concepts. These concepts are extracted at the subgraph level using Shapley values, and the GNN predictions are then mapped into logic via symbolic regression. Different experiments show that GraphTrail improves over other GNN explainers and has robust and accurate performance.
Strengths
- The problem investigated in the paper is very interesting, the paper is well written and the problem well formulated.
- The way different existing techniques have been combined to design GraphTrail is very interesting.
Weaknesses
- Some aspects of the comparison against GLGExplainer should be substantially clarified. Also, as these two models have such a similar core, more experimental analysis should be done to validate the advantages of GraphTrail with respect to GLGExplainer. Moreover, I had a look at the GLGExplainer paper, and the fidelity values reported there (Tables 2 and 4) and the ones in this paper are very different; hence I'm asking whether Table 2 of the GraphTrail paper is actually reliable or whether there is an explanation for the difference in results reported in the two papers. While here it is claimed that "both GRAPHTRAIL and GRAPHTRAIL-s significantly outperform GLGEXPLAINER", the results in the GLGExplainer paper for fidelity on the two datasets (on the test set) are: BAMultiShapes 0.96 ± 0.03, Mutagenicity 0.81 ± 0.01, while the results reported in the GraphTrail paper for GLGExplainer are: BAMultiShapes 0.51 ± 0.03, Mutagenicity 0.62 ± 0.02. Hence I'm wondering whether the GLGExplainer authors had a mistake in their implementation, whether the authors of GraphTrail were not able to correctly reproduce the GLGExplainer performance, or whether there are some additional hypotheses distinguishing these experiments that I didn't understand. Similarly, the provided explanations on the different datasets (the comparison in Figure 5 here) seem quite different from the ones reported in GLGExplainer. This should be clarified in order to understand whether GraphTrail actually performs better than one of its main competitors.
- Definition 5 is a bit unclear, could you make it more precise? Even if Fig 2 helps a lot to understand the tree construction.
- I see the computational advantage of only considering Rooted Computation Trees as concepts, but I guess it is tricky in practice to find an L that works for all the graphs in D, and part of the complexity of the problem is understanding the minimum L that makes a good explanation.
Minor comments and typos:
- Please split the rigorous definition from the comment in Definition 4. Hence, in practice, the concepts of a graph are all its subgraphs, i.e., this includes the original graph itself, disconnected nodes, and so on, right?
- Curiously, in the abstracts of GLGExplainer and GraphTrail there is this common sentence: "..making GLGExplainer a promising diagnostic tool for learned GNNs." and "..GraphTrail makes it an invaluable diagnostic tool for refining GNNs..". It seems strange that this is a coincidence, as GraphTrail describes the GLGExplainer paper in detail. I'd suggest rephrasing in a more original way (in addition to having changed the adjective "promising" to "invaluable").
Questions
- "The objective of our work is to develop an end-to-end global explainer that (1) mines the subgraph concepts used by a black-box GNN model, and then (2) uncovers the boolean logic used by the GNN over these concepts to make its predictions." If I understand correctly, your model is a post-hoc GNN explainer, even if I think this is never mentioned in the paper. Or am I missing something here?
- What do you mean by "unique" subgraph? This has not been defined as far as I see.
- Obs 1 and 2 are simply theorems (or lemmas) with a proof; the proofs also seem formal to me. So I don't understand why the authors call them observations.
- Is it possible to evaluate the logic rules in terms of accuracy of the prediction? The fidelity is great, but the accuracy of the extracted explanations should also be evaluated to measure the quality of the extracted explanations for a certain task.
Limitations
The model limitations have been correctly described.
W1(a). Comparison vs GLGExplainer
Ans: The algorithmic foundations of GLG and GraphTrail are significantly different (please see Lines 47-63). These include:
- Non-reliance on local explainers: GLG assumes local explanations as an input and then operates over these to identify the logical formula. GraphTrail observes that relying on local explanations creates a disconnect with the objective, as they lack a global understanding of the model. Hence, GraphTrail mines the concepts directly from the training data.
- Concepts: In GLG, each concept in the formula corresponds to a feature vector and not a subgraph. These vectors represent the embedding of a cluster of subgraphs generated by the instance explainer. Hence, in its original form, the formula is not human-interpretable. To convert into a human-interpretable formula, GLG randomly selects a subgraph from the cluster, assuming all subgraphs in a cluster are similar. We show in App. F that the graphs in a cluster are diverse and hence randomly picking one is not a faithful explanation. GraphTrail doesn't suffer from this issue since the concepts map to computation trees.
- Formula construction: While GLG uses an entropy layer [1] to construct the logical formula, GraphTrail uses symbolic regression.
W1(b) Conduct additional experiments.
Ans. We have conducted the following experiments, which are reported in the rebuttal pdf.
- Accuracy of generated formulas against ground-truth (Table 1)
- The statistical significance of the results via paired T-tests (Table 2)
- Precision, recall, and F-score of the generated formulas against GNN output (Tables 3-5)
W1(c). ...the fidelity values in GLGExplainer and the ones in this paper are different...
Ans: Indeed, GLG is not reproducible. We highlight this explicitly in lines 307-308 of our main draft with a pointer to Appendix D, which offers a discussion of the possible causes. As we note in Appendix D, there is definite test-data leakage in the Mutagenicity dataset. We include a screenshot of the corresponding code in Fig G. Specifically, in this dataset, the presence of NO2 and/or NH2 motifs indicates a positive class label (mutagenic). However, a graph may be positive due to other factors (see [2]). GLG alters the train set by only including graphs that contain NO2 and NH2. This selection bias dramatically simplifies the explanation task, as the model is essentially fitted to explain NH2 or NO2 presence. When we include all graphs, the results deteriorate.
In other datasets, GLG has released the instance explanations being used for deriving the global explanation, but not the exact train sets used that generate these instance explanations. When we run the GLG code (released by authors) on raw datasets using the reported hyper-parameters, the produced instance explanations don't match the released instance explanations. Hence, the fidelity numbers are different.
If we use the released explanations, the reported results in GLG match. This isolates the reproducibility issue to the generation of instance explanations, similar to our observation in Mutagenicity.
We had reached out to the authors via email requesting: (i) the pipeline that produces the released explanations, and (ii) a justification for the train set selection in Mutagenicity. We have not received any response.
For transparency, our code and datasets are publicly available. Our results can be verified independently.
W2. Simplify Def. 5
Ans: We'll change it as follows.
A computation tree, also known as a receptive field, is a fundamental concept in GNNs that describes how information propagates through the graph during the neural network's operation. Formally, it is defined as follows.
Given a graph G, a node v, and the number of layers L in the GNN, the computation tree rooted at v is constructed as follows:
- Enumerate all paths (including those with repeated vertices) starting from v and extending up to L hops away.
- Merge these paths into a tree structure where (i) the root is always v, and (ii) two nodes from different paths are merged if: (a) they are at the same depth in their respective paths, and (b) all their ancestors in the paths have already been merged. This tree represents how information flows to node v during L rounds of message passing in the GNN.
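As an illustration of this construction, the short sketch below unrolls the L-hop computation tree from an adjacency list; it mirrors the merge rule above, since each neighbor expands into exactly one child (so nodes may repeat across branches). The names `adj` and `unroll` are ours, not the paper's.

```python
def unroll(adj, root, L):
    """Return a nested (node, children) tuple: each child is the unrolled
    (L-1)-hop computation tree of a neighbor of the current node."""
    if L == 0:
        return (root, [])
    return (root, [unroll(adj, nbr, L - 1) for nbr in adj[root]])

# Toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(unroll(adj, root=0, L=2))
# -> (0, [(1, [(0, []), (2, [])]), (2, [(0, []), (1, []), (3, [])])])
```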
W3: Determining L?
Ans: The value of L is known to us: it is the number of layers in the GNN being explained. Since each node's embedding is determined by its L-hop computation tree, L is not derived; it is given as input.
W4: Split Def. 4 from comment. What is the space of concepts?
Ans. Sure.
Space of concepts: In GLG, the space of all subgraphs includes the graph itself, disconnected nodes, etc. However, GraphTrail notes that a node embedding is solely determined by its L-hop neighborhood. Consequently, this space is limited to only the L-hop computation tree of each node.
W5: similar sentence in abstract.
Ans. We will rephrase.
Q1: ...your model is a post-hoc explainer
Ans. Yes. We'll state this explicitly.
Q2: What's meant by "unique" subgraph?
Ans. A unique subgraph (or tree) refers to a subgraph that is not isomorphic to any other subgraph in the collection extracted from the graph(s).
Q3: Obs 1 & 2 are theorems/lemmas.
Ans. We'll change to Lemmas.
Q4: Accuracy of logical rules.
Ans. Added as discussed in W1(b).
[1] Entropy-based logic explanations of neural networks, AAAI'22
[2] Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem'91
[3] What graph neural networks cannot learn: depth vs width. ICLR'20
Appeal
In light of the additional experiments and clarifications, we kindly request the reviewer to reassess the rating of our work.
I thank the authors for the response and the provided clarifications. As a side note, I would also like to reassure the authors that they do not need to appeal for a reassessment of their work, as done to all the reviewers; the most important aspect of a rebuttal is not increasing the score but clarifying the important aspects of the research and the soundness/originality of the proposal. Moreover, every reviewer knows perfectly well that they can update their score during/after the rebuttal.
I see your points on the bias introduced in the Mutagenicity dataset, but I didn't understand whether this selection regards only the training set or also the test set. Otherwise it would not be clear why enlarging the training set would reduce the GLG performance. Have you tried your model on the "reduced" dataset considered by GLG? I think this would make a fairer comparison with respect to finding the hyperparameters that maximise the performance of both models. Curiously, I saw that in their paper the GLG authors report: "As discussed in Section 3, we used PGExplainer (Luo et al., 2020) as the Local Explainer. However, we modified the procedure for discretizing weighted graphs into a set of disconnected motifs. Indeed, the authors in Luo et al. (2020) limited their analysis to graphs that contained the ground truth motifs and proposed to keep the top-k edges as a rule-of-thumb for visualization purposes. For Mutagenicity, over which PGExplainer was originally evaluated, we simply selected the threshold θ that maximises the F1 score of the local explainer over all graphs, including those that do not contain the ground-truth motif." If I understood correctly, the ground-truth motif refers to NH2 and NO2, hence this seems exactly the opposite of what you found in the code. Could you point out where in the GitHub repository of the GLG paper you found the snippet you indicated in the paper: https://github.com/steveazzolin/gnn_logic_global_expl/tree/master I think this is a crucial point to clarify.
Once this point is clarified, I'll be happy to consider reassessing my score.
We appreciate the reviewer's engagement in this discussion phase and are pleased to provide clarifications to their queries below. Additionally, we would like to express our apologies for requesting a re-assessment of ratings. We appreciate the opportunity to address the concerns and improve our submission.
Clarification 1:
As discussed in Section 3, we used PGExplainer (Luo et al., 2020) as the Local Explainer. However, we modified the procedure for discretizing weighted graphs into a set of disconnected motifs. Indeed, the authors in Luo et al. (2020) limited their analysis to graphs that contained the ground truth motifs and proposed to keep the top-k edges as a rule-of-thumb for visualization purposes. For Mutagenicity, over which PGExplainer was originally evaluated, we simply selected the threshold θ that maximises the F1 score of the local explainer over all graphs, including those that do not contain the ground-truth motif.
If I understood correctly the ground-truth motif refers to NH2 and NO2, hence it seems exactly the opposite of what you found in the code.
Answer: As the GLG authors note in their quoted statement, they only alter the discretization process, which is an operation conducted post-training ("However, we modified the procedure for discretizing weighted graphs into a set of disconnected motifs."). Test-data leakage happens during training due to only including graphs that contain NO2 and NH2 for the mutagenic class. In discretization, where they select the optimal value of θ, no filtering is performed. Let us elaborate.
Given a graph with adjacency matrix A, PGExplainer outputs a continuous-valued mask over the edges. The mask represents the probability of each edge being part of the local explanation. To extract the local explanation, this mask needs to be discretized into a binary matrix, and θ is the threshold used in this conversion of the continuous-valued mask into a binary matrix. The optimal θ is selected by maximizing the F1 score of the local explanations. The set used for optimizing θ is unfiltered.
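For concreteness, the following sketch shows the kind of threshold selection described above: scan candidate values of θ and keep the one maximizing the F1 score of the binarized mask against the ground-truth explanation edges. The function and toy data are illustrative and are not taken from the GLG or PGExplainer code.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(mask, gt_edges, candidates=np.linspace(0.05, 0.95, 19)):
    """mask: edge-probability vector; gt_edges: binary vector of ground-truth edges."""
    scores = [f1_score(gt_edges, (mask >= t).astype(int), zero_division=0)
              for t in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with a 5-edge mask and its ground-truth explanation edges.
mask = np.array([0.9, 0.8, 0.2, 0.1, 0.7])
gt_edges = np.array([1, 1, 0, 0, 1])
theta = best_threshold(mask, gt_edges)
print(theta, (mask >= theta).astype(int))
```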
Clarification 2
Could you point out in the github repository of the GLG paper where you found the snippet?
Answer: The code snippet is from the PGExplainer repo. It can be found in lines 73-90 at https://github.com/flyingdoog/PGExplainer/blob/master/MUTAG.ipynb?short_path=39bc9a0. This is the codebase used by GLG to generate the local explanations.
Subsequently, while loading these explanations, the GLG code again explicitly checks for the presence of NO2 and NH2 in each graph and performs class attribution of molecules solely based on their presence/absence. See lines 269-283 in https://github.com/steveazzolin/gnn_logic_global_expl/blob/master/code/local_explanations.py. The NO2 and NH2 structures are initialized in lines 23-36.
Clarification 3:
I didn't understand if this selection only regards the training set or also the test set. Otherwise it would not be clear why enlarging the training set would reduce the GLG performances. Have you tried your model on the "reduced" dataset considered by GLG?
Answer: The selection process applies exclusively to the training set. However, it is crucial to understand that this is NOT a case of reduced training set. A reduced training set would involve randomly selecting a smaller subset of graphs from both mutagenic and non-mutagenic classes. Instead, what we have here is an engineered training set.
In this engineered set, specifically those graphs from the mutagenic class that contain NO2 and/or NH2 are selected. These groups are known to be part of the ground-truth motifs associated with mutagenicity. Ordinarily, this knowledge should be withheld from the training process to maintain the integrity of the supervised learning paradigm.
By incorporating this label-related information into the training set selection, we are essentially leaking data that should be part of the evaluation criteria into the learning process. This violation undermines a fundamental principle of supervised learning: the separation of training data from the ground truth used for evaluation.
Dear Reviewer x6B5,
As we approach the end of the discussion phase, we are keen to know if the clarifications offered on the non-reproducibility of GLGExplainer address your concerns about the veracity of our claims. Specifically, we highlighted the following aspects:
- We identified the exact lines of code in GLGExplainer and PGExplainer (the local explainer used as input by GLG) that explicitly check for NO2 and NH2 presence during training. This use of ground-truth label data compromises the integrity of supervised learning.
- We clarified that the quoted line from GLGExplainer refers to discretization as a post-training process, while the test-data leakage occurs during training.
- We emphasized that the GLG setup does not involve reduced training data, but rather fits training data to a pre-determined outcome. We also note that the impact of reduced training data is already discussed in Section 4.4 and Figure 4 of our manuscript.
We would appreciate your feedback on whether these clarifications sufficiently address your concerns.
Sincerely,
Authors
Dear Reviewer x6B5,
Since we are less than a day away from the closure of the discussion phase, we would be very grateful to know if our clarifications on the non-reproducibility of GLGExplainer address your concerns. Your response would help us get to closure on this aspect of our work.
regards,
Authors.
Dear authors, sorry for my late reply, but I'm currently out of office with scarce internet connection. Yes, you clarified my doubts, and I'll be happy to increase my score.
Best
Dear Reviewer,
Thank you for your support. We are happy to know that our response has addressed your comments.
Regards,
Authors
The authors propose a novel method to provide global GNN explanations that uncover the combinatorial reasoning learned by a GNN from the training data. They do so by mining discriminative subgraph-level concepts using Shapley values and mapping them to human-interpretable boolean formulas over these concepts through symbolic regression.
Strengths
- the authors describe a novel global explainability method they designed for GNN models
- the authors compare their method against a SOTA technique and outperform it
- the authors assess their method against multiple datasets
Weaknesses
- the real-world datasets the authors use to assess their method are all collections of molecules
- some ablation studies are missing to understand the contribution of particular components of the proposed technique
Questions
We consider the paper interesting and relevant. Nevertheless, we would like to point to the following improvement opportunities:
GENERAL COMMENTS
(1) - did the authors consider trying their method on some datasets from a different domain / not describing collections of molecules?
(2) - did the authors consider performing some ablation studies to understand, e.g., what is the contribution of selecting trees by considering Shapley values vs. a random selection?
(3) - did the authors consider the statistical significance of the results obtained? E.g., the Wilcoxon signed-rank test.
FIGURES
(4) - All figures: consider a color palette that would be friendly for colorblind people. The following link could be helpful in this regard: https://davidmathlogic.com/colorblind/#%23D81B60-%231E88E5-%23FFC107-%23004D40
TABLES
(5) - Table 1: align numeric values to the right to make differences in magnitude evident. Use the same number of decimals across rows.
(6) - Table 2: report the same number of decimals in all cases. Apart from the averages, are the authors reporting the standard deviation?
Limitations
The authors have adequately acknowledged the limitations of their work.
We thank the reviewer for the positive comments on our work. Please find below our responses to the suggestions and concerns raised.
Q1/W1: Did the authors consider trying their method on some datasets from a different domain / not describing collections of molecules?
Answer: We note that BAMultiShapes is not a molecular dataset. In this dataset, the gap between GraphTrail and GLGExplainer is the highest (0.87 vs. 0.51).
W2/Q2: What is the contribution of selecting trees by considering Shapley values vs. a random selection?
Answer: We have conducted this ablation study. As evident from the table below, choosing random trees significantly deteriorates the fidelity, indicating the value of Shapley-based selection.
| Algorithm | BAMultishapes | Mutag | Mutagenicity | NCI1 |
|---|---|---|---|---|
| GraphTrail with Shapley | 0.87 +- 0.01 | 0.83 +- 0.08 | 0.72 +- 0.03 | 0.70 +- 0.03 |
| GraphTrail with random trees | 0.54 +- 0.03 | 0.68 +- 0.04 | 0.55 +- 0.01 | 0.54 +- 0.02 |
Q3 - did the authors consider the statistical significance of the results obtained? E.g., the Wilcoxon signed-rank test.
Answer: We have conducted paired T-tests to measure the statistical significance of the results. The results are presented below, which shows that the improvement obtained by GraphTrail over GLGExplainer is statistically significant.
| BAMultishapes | MUTAG | Mutagenicity | NCI1 |
|---|---|---|---|
| t=58.952, p-value ≈ 0 | t=2.908, p-value = 0.017 | t=23.519, p-value ≈ 0 | t=12.159, p-value ≈ 0 |
Q4 FIGURES (4) - All figures: consider a color palette that would be friendly for colorblind people.
Answer: Sure, we will change the color palette as per the guidelines in the link.
Q(5) - Table 1: align numeric values to the right to make differences in magnitude evident. Use the same number of decimals across rows.
Answer: We will make this correction.
Q(6) - Table 2: report the same number of decimals in all cases. Apart from the averages, are the authors reporting the standard deviation?
Answer: We will modify the table as suggested.
Appeal to the reviewer
With the inclusion of statistical significance tests, ablation study and clarifications, we hope the reviewer finds our manuscript improved. If the reviewer agrees, we would appreciate support for our work by increasing the rating accordingly.
We have read the authors' response and their responses to the rest of the reviewers. Thank you for providing the clarifications. We have no further questions and have decided to keep our positive score.
We sincerely thank the reviewers for their insightful and constructive feedback. Below, we provide a comprehensive point-by-point response to their comments. Additionally, we have attached a PDF document containing various new empirical analyses as suggested by the reviewers. Key revisions include:
- Enhanced Empirical Benchmarking:
- Integration of three new baselines: GNNExplainer, PGExplainer, and XGNN
- Additional assessment metrics (Precision, Recall, F1) to measure the efficacy of explainers.
- Accuracy of the generated formulas against ground-truth labels.
- Statistical Significance of the results with paired T-tests.
- Ablation study on selection of computation trees with Shapley values vs. a random selection.
- Clarifications and presentation improvements:
- Potential reasons behind non-reproducibility in GLGExplainer.
- Additional details on symbolic regression, canonical label construction, and optimizing Shapley computation.
- Various other presentation enhancements as suggested by the reviewers
We hope these revisions have strengthened our manuscript. We kindly request the reviewers to reassess their ratings in light of this rebuttal. We are open to further engagement with the reviewers for any additional queries or suggestions.
Detailed descriptions of Symbolic Regression, Depth-first canonical form and efficient Shapley computation as requested by Reviewer Sg1i
Symbolic Regression:
In symbolic regression, given a set of input-output pairs (x_i, y_i), where x_i is an input vector and y_i is the output value (or label), and a set of operators such as addition, subtraction, etc., the goal is to find a symbolic expression and its corresponding function f such that f(x_i) ≈ y_i, while also reducing the complexity (number of operators and variables) of the formula. The loss function is presented in Eq. 11. Finding the optimal formula is not computationally tractable since the number of formulas grows exponentially with the number of variables and operators. Hence, among the various approximation strategies in the literature [23], we use [6], which leverages an evolutionary algorithm.
The process begins by initializing several populations, each with random expressions of a small initial complexity (3 in our experiments). For each population, a set is created to store the expressions with the smallest loss (Eq. 11) at each complexity level within that population. Additionally, a global set is maintained to store the expressions with the lowest loss at each complexity across all populations.
The algorithm iterates over each population for a fixed number of epochs, allowing them to evolve. Evolution is driven by running tournaments within each population, where the expression with the lowest loss is declared the winner. A copy of the winner is created and chosen for mutation or crossover to form new expressions. Mutating an expression involves repeatedly applying random operations from a fixed set, e.g., replacing or adding operators and variables. Crossover selects the two best expressions from the population and swaps random sub-expressions between them. The newly formed expression(s) replace the oldest expression(s) in the population.
Once evolution within each population is complete, the per-population and global sets are updated with the best expressions. Once the chosen number of epochs is complete, the expression from the global set with minimum loss is returned.
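To make the evolutionary loop concrete, here is a heavily simplified, self-contained sketch (single population, mutation only, tournament selection) of symbolic regression over boolean concept indicators. It is purely illustrative and does not reproduce the multi-population library cited as [6]; all function names and the toy "ground" formula are ours.

```python
import random
import numpy as np

OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b}

def random_expr(n_vars, depth=2):
    """Random boolean expression tree over concept indicators 0..n_vars-1."""
    if depth == 0 or random.random() < 0.3:
        return ("var", random.randrange(n_vars))
    op = random.choice(list(OPS))
    return (op, random_expr(n_vars, depth - 1), random_expr(n_vars, depth - 1))

def evaluate(expr, X):
    """Evaluate an expression tree column-wise on a boolean matrix X (graphs x concepts)."""
    if expr[0] == "var":
        return X[:, expr[1]]
    return OPS[expr[0]](evaluate(expr[1], X), evaluate(expr[2], X))

def size(expr):
    return 1 if expr[0] == "var" else 1 + size(expr[1]) + size(expr[2])

def loss(expr, X, y, alpha=0.01):
    """Misfit to the GNN-predicted labels plus a complexity penalty."""
    return np.mean(evaluate(expr, X) != y) + alpha * size(expr)

def mutate(expr, n_vars):
    """Replace a random subtree with a fresh random expression."""
    if expr[0] == "var" or random.random() < 0.3:
        return random_expr(n_vars, depth=1)
    op, left, right = expr
    if random.random() < 0.5:
        return (op, mutate(left, n_vars), right)
    return (op, left, mutate(right, n_vars))

def fit_formula(X, y, pop_size=200, epochs=2000, seed=0):
    """Tournament evolution: keep the lowest-loss formula seen so far."""
    random.seed(seed)
    n_vars = X.shape[1]
    pop = [random_expr(n_vars) for _ in range(pop_size)]
    best = min(pop, key=lambda e: loss(e, X, y))
    for _ in range(epochs):
        winner = min(random.sample(pop, 5), key=lambda e: loss(e, X, y))  # tournament
        child = mutate(winner, n_vars)              # mutate a copy of the winner
        pop[random.randrange(pop_size)] = child     # replace a member of the population
        if loss(child, X, y) < loss(best, X, y):
            best = child
    return best

# Toy usage: X[i, j] = 1 if concept j occurs in graph i; y mimics GNN-predicted labels.
rng = np.random.default_rng(0)
X = rng.random((200, 4)) > 0.5
y = X[:, 0] & X[:, 2]        # hypothetical "ground" formula: c0 AND c2
print(fit_formula(X, y))     # e.g. ('AND', ('var', 0), ('var', 2))
```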
Depth-first Canonical Form (DFCF):
Canonical Labeling: A canonical label of a graph is a unique representation that remains unchanged under isomorphisms, i.e., two graphs have the same canonical label iff they are isomorphic.
We construct the canonical label of a rooted computation tree through Depth-First Canonical Labeling. From a labelled rooted unordered tree we can derive many labelled rooted ordered trees, as shown in Fig. 1 in the rebuttal pdf. There is a one-to-one correspondence between a labelled rooted ordered tree and its depth-first string encoding. The ordering of the strings orders the trees, and the minimum in the ordering is the canonical label. Each string is a depth-first (preorder) traversal that uses '$' to represent a backtrack and '#' to represent the end of the string encoding. In sorting, '#' is greater than '$', and both these symbols are greater than other labels.
The Depth-First Canonical Form (DFCF) is constructed by sorting the vertices of a rooted unordered tree level by level. At each level, the vertices are sorted first by their labels and then by the ranks of their children at respective levels.
Fig. 2 in the rebuttal pdf illustrates the process.
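A compact way to see why such a canonical form works is the sketch below, which uses a simplified variant of the scheme above: each subtree is canonicalized recursively and the children's encodings are sorted, so isomorphic rooted labelled trees map to the same string. The node labels and helper names are illustrative, not the paper's implementation.

```python
def dfcf(label, children):
    """children: list of (label, children) subtrees; returns the canonical encoding,
    with '$' marking a backtrack after each subtree."""
    encoded = sorted(dfcf(c_label, c_children) for c_label, c_children in children)
    return label + "".join(encoded) + "$"

def canonical_label(tree):
    """Full canonical label of a rooted labelled tree, terminated by '#'."""
    label, children = tree
    return dfcf(label, children) + "#"

# Two isomorphic rooted trees with (illustrative) node labels 'C', 'N', 'O':
t1 = ("C", [("N", []), ("O", [("C", [])])])
t2 = ("C", [("O", [("C", [])]), ("N", [])])
assert canonical_label(t1) == canonical_label(t2)
print(canonical_label(t1))  # -> CN$OC$$$#
```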
Efficient Shapley Computation:
SHAP represents the Shapley value (SV) explanation as an additive feature attribution method and fits a weighted linear regression model to approximate the model's output. The weights in this model correspond to the SVs (contributions of features). It specifies the explanation as g(z') = φ_0 + Σ_{j=1..M} φ_j z'_j, where g is the explanation model, z' ∈ {0,1}^M is the coalition vector, M is the maximum coalition size, and φ_j is the SV for feature j. Kernel-SHAP is a SHAP variant to estimate SVs. It calculates SVs via:
- Coalition Sampling: It randomly selects subsets of features (coalitions z'). It assigns a weight to each coalition using the SHAP kernel: π(z') = (M − 1) / [ C(M, |z'|) · |z'| · (M − |z'|) ], where M is the total number of features and |z'| is the coalition size.
- Predictions: For each coalition, Kernel-SHAP maps the coalition back to the original feature space and obtains a prediction.
- Fitting: Using the calculated weights and predictions, Kernel-SHAP fits a linear model.
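The sketch below illustrates these three steps on a generic black-box scorer f over binary feature vectors: sample coalitions, weight them with the SHAP kernel, and fit a weighted linear model whose coefficients approximate the Shapley values. It is a bare-bones illustration (it omits the efficiency constraint that full Kernel-SHAP enforces via the empty and full coalitions), and all names are ours, not the paper's code.

```python
import numpy as np
from math import comb

def kernel_shap(f, x, baseline, n_samples=2048, seed=0):
    """Approximate Shapley values for the M features of instance x."""
    rng = np.random.default_rng(seed)
    M = len(x)
    Z, w, y = [], [], []
    for _ in range(n_samples):
        z = rng.integers(0, 2, size=M)      # random coalition
        s = int(z.sum())
        if s == 0 or s == M:                # infinite-weight coalitions; skipped in this sketch
            continue
        Z.append(z)
        w.append((M - 1) / (comb(M, s) * s * (M - s)))   # SHAP kernel weight
        y.append(f(np.where(z == 1, x, baseline)))       # map coalition to input, then predict
    Z, w, y = np.array(Z, dtype=float), np.array(w), np.array(y)
    A = np.hstack([np.ones((len(Z), 1)), Z])  # weighted least squares: y ~ phi_0 + Z @ phi
    AtW = A.T * w                             # equivalent to A.T @ diag(w)
    phi = np.linalg.solve(AtW @ A, AtW @ y)
    return phi[1:]                            # Shapley value estimates (phi[0] is the intercept)

# Toy usage: for a linear scorer with a zero baseline, the Shapley values equal its coefficients.
coeffs = np.array([2.0, -1.0, 0.5, 0.0])
f = lambda v: float(v @ coeffs)
print(kernel_shap(f, x=np.ones(4), baseline=np.zeros(4)))  # ~ [ 2. -1.  0.5  0. ]
```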
Please see the attached pdf.
Dear reviewers,
First of all, thank you for your service to the ML community. Writing high-quality reviews and engaging with authors' responses is essential to a healthy ML research community.
It is essential that you read the rebuttals and provide a response and/or follow-up questions within the next few days. This will allow the authors sufficient time to react. While a detailed response addressing all points is not necessary, at a minimum, you should indicate you have read and considered the review and whether you will maintain or revise your score. Please also take the time to read the other reviews. Understanding your fellow reviewers' key arguments can help converge towards more similar ratings in cases of diverging scores.
I want to thank you again, and I look forward to following the discussions here.
Three of the four reviewers agree that the paper should be accepted. The consensus among these reviewers is that the authors describe a novel global explainability method for GNNs, that the experiments are strong, outperforming existing methods, and that the number of datasets in the experiments is extensive. One reviewer tended towards rejecting the paper, but the arguments were short and weak, lacking details. After the rebuttal, said reviewer simply acknowledged the rebuttal and did not change their score. Hence, I weigh their assessment of the paper much lower than those of the positive reviewers.