Enhancing Mutual Information Estimation in Self-Interpretable Graph Neural Networks
A self-interpretable graph learning framework based on the information bottleneck principle with a new scheme of restricting mutul information between graph random variables
摘要
评审与讨论
Existing self-interpretable Graph Neural Networks (GNNs) built upon Graph Information Bottleneck (GIB) suffer from the burdensome of mutual information estimation. To address this issue, this work proposes a novel framework for self-interpretable GNNs with an enhanced technique for mutual information estimation, namely GENIMI. Experiment results indicate the proposed GENIMI enjoys improved predictive and interpretable performance.
优点
This paper is well-written and easy to follow. The motivation for improving the mutual information estimation in the GIB framework is clear and crucial. Empirical results show that the proposed GEMINI enjoys competitive performances of GNN prediction and interpretability.
缺点
However, the reviewer is concerned with some theoretical details.
-
For the predictive term in Eqn. 3, the appropriate formula derivation is: .
-
Does and share the same subgraph generator? If so, what is the intuition behind using to approach ?
问题
The authors are encouraged to address the concerns in Weaknesses.
伦理问题详情
N/A
Q1:For the predictive term in Eq. 3, the appropriate formula derivation is: .
A1: Thanks for your comments. We apologize for the error in Eqn. 3, where the denominator of the first term of line 1 should be instead of . We have revised it in the paper. Please note, however, that this mistake does not impact the following derivations and conclusions presented in the paper.
Q2: Does and share the same subgraph generator? If so, what is the intuition behind using to approach ?
A2: Thanks for your question! is the probability that is sampled by based on a underlying subgraph generation process parameterized by . However, learning is not enough to calculate the exact value of , it should also include a subgraph matching step to compute , leading to a NP-hard complexity. Previous works (e.g., GSAT) made some assumptions to ignore the subgraph matching step. Nonetheless, this has resulted in an imprecise learning of and an inaccurate calculation of .
Instead, we simultaneously learn and approximate by a variational distribution . Based on Eq.7 in the paper, the maximization of is equivalent to minimizing . Thus, we learn in Eq.8 with a fixed . The optimization of is achieved by minimizing the total GIB objective. Please refer to the Common concern 1 for a rigorous analysis.
In the detailed implementation, the subgraph generator is used to instantiate the subgraph generation process, and hence determines the distribution . is implemented as a neural network taking and as inputs and outputs a probability. The network includes two GNNs to extract representations of and respectively and then calcutate a probability. The parameters in and are learned iteratively, as depicted in the pseudo-code in Appendix.
The goal of the paper is to evaluate the mutual information (MI) between an input graph and a key subgraph. To tackle this problem, the authors propose a novel framework called GEMINI, which trains self-interpretable graph models and addresses the challenge of distorted and imprecise estimations in graph MI estimation research. The authors construct a variational distribution over the critical subgraph and create an effective MI upper bound estimator. The proposed method is shown to be effective according to empirical results.
优点
-
This paper is well-organized and well-written. The authors provide sufficient details about their work and easy to understand.
-
Estimating the mutual information between the input graph and the subgraph is both important and challenging.
缺点
-
This work closely follows GSAT[1]. Its main theoretical contribution is the addition of the information bottleneck (IB) upper bound loss to the objective of GSAT[1], which is based on the idea of variational CLUB[2].
-
Does the proposed model's ability to remove the spurious correlation come from the framework of GSAT? Can GEMINI provide a theoretical guarantee for the removal of spurious correlations?
-
Some of the numerical results reported in Table 1 and Table 2 are quite different from those reported in GSAT. The differences are particularly noticeable with the numbers that involve MNIST-75sp in Table 1 and those associated with SPMotif in Table 2, relating to GIN and GSAT. It would be helpful if the authors could provide further details about their implementations and explanations for these differences.
[1] Miao, Siqi, Mia Liu, and Pan Li. "Interpretable and generalizable graph learning via stochastic attention mechanism." In International Conference on Machine Learning, pp. 15524-15543. PMLR, 2022. [2] Cheng, Pengyu, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. "Club: A contrastive log-ratio upper bound of mutual information." In International conference on machine learning, pp. 1779-1788. PMLR, 2020.
问题
-
On page 3, in the last sentence before Eq.3, should it be a “lower” bound of ?
-
I cannot find the curve of GSAT in the second subfigure of Fig. 2(d). Is it missing or unavailable?
Q2: Does the proposed model's ability to remove the spurious correlation come from the framework of GSAT? Can GEMINI provide a theoretical guarantee for the removal of spurious correlations?
A2: Theoretically justifying the capability of the proposed method for removing spurious signals can be challenging, since we need to mathematically formalize the removal of spurious signals. Therefore, we conducted an experiment to empirically compare the resistance to spurious signals between the proposed method and GSAT. We have also added a section in Appendix A.6.
Specifically, we select three checkpoints from SPMotif (0.9) and SPMotif (0.7) for both methods and randomly generate 3000 spurious graphs to evaluate the spurious score and the predictive score. Each spurious graph is generated by randomly selecting a motif graph and a base graph and then assembling the two. For each spurious graph , we denote its label as (determined by its motif graph) and its spurious-aligned class as (determined by its base graph). The spurious-aligned class is the label of motif graph with which a spurious base graph is correlated in the training set SPMotif (0.9) and SPMotif (0.7). For example, in SPMotif (0.9), a spurious base graph tree has a probability of 0.9 to appear with a motif graph house together. Since the label for house is 0, the spurious-aligned class for tree is 0.
In the evaluation dataset, we ensure for each . Referring to the established work (Post hoc explainers may be ineffective for detecting unknown spurious correlation), we define the spurious score as
where is the output probabilities of a model, i.e., a three-dimensional vector in our setting. is the number of evaluation samples, i.e., 3000 in this experiment. The spurious score indicates the average probability of a model mistakenly predicting a graph to its spurious class. Similarly, we calculate the predictive score as
The predictive score indicates the average probability of a model correctly predicting a graph to its ground truth class.
The results are shown in the table:
| SPMotif (0.9) | SPMotif (0.7) | |||
|---|---|---|---|---|
| Spurious Score | Predictive Score | Spurious Score | Predictive Score | |
| GSAT | 0.0985 | 0.7517 | 0.0108 | 0.7711 |
| GEMINI | 0.0708 | 0.7776 | 0.0029 | 0.9123 |
We find that for models trained from both datasets, the spurious score of GEMINI is smaller than that of GSAT. Moreover, the predictive score of GEMINI is significantly better than that of GSAT, especially when the spurious signal is moderate, e.g., SPMotif (0.7). These results verify the superior resistance and robustness against spurious signals compared to GSAT.
Q1: This work closely follows GSAT[1]. Its main theoretical contribution is the addition of the information bottleneck (IB) upper bound loss to the objective of GSAT[1], which is based on the idea of variational CLUB[2].
A1: Thanks for your comments. The novelty of the paper is that we point out the limitation of GSAT and propose a framework to approximate by the variational distribution .
In detail, GSAT assumed the generated subgraph is an edge-weighted variant of , where the nodes in and are one-to-one correspondence, and the edges in are sampled to obtain . However, these works overlook the scenario where the number of nodes in is less than that in . Even if the number of nodes in equates to that of , the correspondence between nodes in and those in may be nonunique. In such cases, it is necessary to perform subgraph matching before calculating the probability . Since subgraph matching is NP-hard, it becomes extremely difficult to accurately compute the value of . Given the learned , previous studies have not been able to compute an effective value of . Moreover, computing the prior distribution over graphs is equally challenging as . GSAT involves the term to compute the upper bound value of . However, the calculation of requires to designate a random graph sampling process and hence encounters similar difficulties as when calculating .
Instead, our proposed method approximates by a variational distribution . Based on Eq.7 in the paper, the maximization of is equivalent to minimizing || . Thus, we learn in Eq.8. In addition, we use the CLUB to calculate the upper bound of , which avoids considering the prior .
Q3: Some of the numerical results reported in Table 1 and Table 2 are quite different from those reported in GSAT. The differences are particularly noticeable with the numbers that involve MNIST-75sp in Table 1 and those associated with SPMotif in Table 2, relating to GIN and GSAT.
A3: Thanks for your comments! For the SPMotif results in Table 2, there is a parameter to control the size of the base (spurious) graph in the generation of SPMotif. In GSAT, the size is relatively larger, e.g., 20 for tree, 60 for wheel, in the test set. We find that a large base graphs prone to result in distribution shift. We slightly adjust the parameter to be smaller, e.g., 15 for tree, 30 for wheel, to relieve the distribution shift and facilitate comparison. Because we mainly aim to focus on the spurious correlation and interpretation performance, instead of distribution shifts. Moreover, we generate more samples for SPMotif compared with the implementation of GSAT. We generate 30000 samples for a SPMotif dataset, while the number is 3000 in GSAT according to their source code. Hence, the predictive performance of GSAT on SMPotifs in Table 2 is better than their originally reported result.
For the MNIST-75sp results in Table 1, We adopt the strictly identical dataset settings as with the GSAT baseline. The difference may come from some model parameters, adopted loss coefficients, and randomness, since the dataset size of MNIST-75sp is only 20000 and we only run three times to obtain the averaged results, which may also affect the evaluation numbers.
Q4: On page 3, in the last sentence before Eq.3, should it be a “lower” bound of ?
A4: Yes, you are correct. We are sorry for this typo and have made a revision.
Q5: I cannot find the curve of GSAT in the second subfigure of Fig. 2(d). Is it missing or unavailable?
A5: The second subfigure Fig. 2(d) compares node probabilities. GSAT solely learns edge probabilities, and hence node probabilities remain 1.0. The node probability for GIN is also 1.0. So the curve of GIN overlaps over the curve of GSAT. We have added a clarification in the caption of Fig. 2(d).
The Graph Information Bottleneck framework significantly enhances the self-interpretability of Graph Neural Networks. However, current approaches in estimating the mutual information between graph explanations and their original forms frequently yield distorted and imprecise estimations, ultimately compromising the effectiveness of the model. In response to these limitations, this paper introduces a novel framework called GEMINI to address these challenges.
优点
- They utilize a MI upper bound estimator based exclusively on the conditional probability distribution.
- They introduce a variational distribution and its suitable instantiation for the conditional probability distribution.
- Extensive experiments demonstrate the effectiveness of the proposed framework.
缺点
- They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.
- The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.
- The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.
问题
Please refer to the weaknesses.
- They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.
- The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.
- The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.
Q1: They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.
A1: Thanks for your comments. The novelty of the paper is that we point out the limitation of previous GIB works and propose a framework to approximate by the variational distribution . The experimental results demonstrate that we can have a better estimation of the GIB term compared with GSAT and DIR.
In detail, previous GIB works think the generated subgraph is an edge-weighted variant of , where the nodes in and are one-to-one correspondence, and the edges in are sampled to obtain . However, these works overlook the scenario where the number of nodes in is less than that in . Even if the number of nodes in equates to that of , the correspondence between nodes in and those in may be nonunique. In such cases, it is necessary to perform subgraph matching before calculating the probability . Since subgraph matching is NP-hard, it becomes extremely difficult to accurately compute the value of . Given the learned , previous studies have not been able to compute an effective value of , which affects the upper bound value of . Instead, our proposed method approximates by a variational distribution with the learned . Based on Eq.7 in the paper, the maximization of is equivalent to minimizing || . Thus, we learn in Eq.8.
For a formal analysis of the limitations of previous works, please refer to Common concern 1.
Q2: The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.
A2: Thank you for your feedback. Indeed, GSAT is the state-of-the-art baseline with both predictive and interpretable capabilities. Additionally, GNNExplainer and PGExplainer are frequently employed as post-hoc explanation methods for GNNs in prior research. As demonstrated in Table 1 & 2, our proposed method maintains comparable predictive performance with mainstream GNNs (e.g., GIN), and improves the interpretation AUC over the second-best baseline across five datasets. The average improvement is about 3% 5% over the state-of-the-art self-interpretable baseline. Thus, we believe our model exhibits a significant advancement in interpretability, contributing to the existing body of research on GNNs.
Q3: The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.
A3: Thanks for your suggestions. We have revised the paper more clearly to distinguish the limitation analysis and our proposed method. We also added the rigorous formulation and limitation analysis in Appendix A.1. Please refer to Section 3, 4, and Appendix A.1 in the revised paper.
In this paper, the authors introduce a novel approach for approximating the Graph Information Bottleneck (GIB). Their method focuses on modeling the distribution of arbitrary subgraphs and graphs, while also bypassing the need to model the prior of subgraphs. Experimental results seem to demonstrate the effectiveness of the proposed method.
优点
-
The paper is well-written, with the authors providing a thorough derivation of the proposed GIB approximation and presenting a clear step-by-step explanation of their method.
-
The authors employ the CLUB technique to circumvent the need for modeling the prior of subgraphs. This approach appears to relax the assumptions made in previous methods, enhancing the flexibility and applicability of the proposed approach.
缺点
-
The model architecture in this paper bears a resemblance to GSAT, and it would be advantageous if the authors could explicitly delineate the key distinctions between the two. Furthermore, the experimental results suggest a notable enhancement over GSAT despite their similar model architectures. Providing inference codes for model reproduction would greatly facilitate the validation of these results and contribute to the paper's overall reproducibility and transparency.
-
The authors assert that the proposed method can generate sparse subgraphs even without the need for sparse regularization. However, it is evident that L_sp is introduced as a subgraph term to regulate graph sparsity. In the ablation study, the authors argue that this term is essential, which appears to be inconsistent with their initial claim in the introduction.
问题
-
See the comments above.
-
What is the benefit of modeling arbitrary subgraphs and graphs? Since G_{sub}^1 should be sampled from G_1 and should not be related to G_2.
-
Why is the MI upper bound approximation proposed in the method better than previous methods?
Q1: The model architecture in this paper bears a resemblance to GSAT, and it would be advantageous if the authors could explicitly delineate the key distinctions between the two. Furthermore, the experimental results suggest a notable enhancement over GSAT despite their similar model architectures. Providing inference codes for model reproduction would greatly facilitate the validation of these results and contribute to the paper's overall reproducibility and transparency.
A1: Thanks for your advice! The key difference between GSAT and our proposed method is that we improve the calculation of and . Actually, is the probability that is a generated subgraph from based on a subgraph generation process parameterized by . However, knowing is not enough to calculate the exact value of , it should also include a subgraph matching step to compute , leading to a NP-hard complexity. GSAT made some assumptions to ignore the subgraph matching step. Nonetheless, this has resulted in an imprecise learning of and an inaccurate calculation of .
Instead, we simultaneously learn and approximate by a variational distribution . Based on Eq.7 in the paper, optimizing the objective in Eq. 8 minimizes the KL divergence between and . The optimization of is achieved by minimizing the total GIB objective.
Please refer to the Common concern 1 for a comprehensive and rigorous analysis. For the inference codes, we have provided the source code in the supplementary material to facilitate the reproduction of our experimental results.
Q2: The authors assert that the proposed method can generate sparse subgraphs even without the need for sparse regularization. However, it is evident that L_sp is introduced as a subgraph term to regulate graph sparsity. In the ablation study, the authors argue that this term is essential, which appears to be inconsistent with their initial claim in the introduction.
A2: Thanks for your comments. In the first section of the ablation study (Effect of the term), we remove the term and rely solely on the regularization. Figure 2(b) demonstrates that could generate sparse graphs, e.g., the average edge/node probability is about . We claim is essential for generating even sparser graphs, e.g., average edge/node probability of . The term is used to provide users with the opportunity to achieve a desired level of subgraph sparsity based on their preferences.
Q3: What is the benefit of modeling arbitrary subgraphs and graphs? Since should be sampled from and should not be related to .
A3: Thanks for your questions! Different from the previous works, we provide a more accurate calculation of and . In detail, we approximate by a variational distribution , and use the CLUB upper bound for .
When adopting our bound to calculate the upper bound of , a subgraph instance of may not be directly sampled from a graph instance of . According to the CLUB bound in Eq. 5 of the paper, for the second term , and come from their own marginal distributions. Hence when we calculate this term, we need to calculate where is sampled from , and is sampled from . may not come from .
Q4: Why is the MI upper bound approximation proposed in the method better than previous methods?
A4: Thanks for your questions! The reason for the utilization of CLUB bound is to avoid the estimation of the prior distribution . Computing over graphs is equally challenging as . For example, GSAT regarded the prior probability of a subgraph instance as the product of edge probabilities on , where the existence of each edge follows an independent Bernoulli distribution. However, is essentially , which relies on the parent graph (as we have stated, in GSAT and other works, must be a weighted version of and the matching between node sets of and is implicitly assumed to be known), and is not a real prior distribution over graphs. A real graph prior should be able to be calculated for an arbitrary graph without knowing its parent graph . Hence, we should utilize an MI bound that does not depend on .
Summary of the revision
We sincerely thank all the reviewers for the valuable comments, which help to improve the quality of our work. We summarize the revision in the updated version as follows:
- We revised section 4 to correct some typos and remove limitation analysis to a new section.
- We added the new section 3 to distinguish the limitation analysis of current graph MI estimations and our proposed method.
- We added a section in the Appendix to rigorously formulate the analysis in section 3.
- We revised the introduction to be consistent with our revised paper.
- We added an experiment to investigate the spurious score of different models.
Common concern 1: Limitations of existing methods when calculating
To rigorously illustrate the difficulty of calculating the exact values of and the necessity of the proposed variational distribution , we need to introduce several definitions first. Note that we use upper case (e.g., , ) to represent graph random variables, and use lower case (e.g., , , , ) to indicate graph instances.
1.1 Subgraph matching
Let's consider two graph instances, and . The node sets of and are represented by and respectively, while their edge sets are denoted by and . We assume that . A valid subgraph matching from to is an injective function that maps the nodes of to nodes of , such that for each edge in , the mapped edge is still in . The set of all possible subgraph matching from to is defined as , where
1.2 The probability distribution over subgraphs
For a graph instance , assuming is the learned edge probability for and is the learned node probability for node . We consider the following procedure to obtain a subgraph instance from :
- Sampling each node basd on a Bernoulli distribution .
- For all sampled nodes in the first step, sampling edges between and based on a Bernoulli distribution . If there is no edge between and on , the is regarded as 0.
In the following, we formally derivate the probability mass function on the subgraph sample space induced by and edge/node probabilities (which is determined by ), and the probability space of random subgraphs sampled from . Note that we mainly focus on the graph topology since the sampling procedure and learned probabilities only involve the graph substructure information.
For a graph instance with nodes, we denote the subgraph space of as , which contains all possible subgraphs from , i.e., where is the set of all possible non-induced subgraphs of with exactly nodes. In other words, for any , .
Based on the sampling (or graph generation) process above, determines a probability space of a random subgraph variable, for which the sample space is , and the probability measure is implicitly determined. In the following, we formalize the probability mass function.
For a specific subgraph matching instance , we define
to indicate the likelihood of nodes in an instance are preserved from under the mapping , where is the learned node probability for node , which belongs to .
Similarly, we define
to indicate the probability of edges on the instance to be preserved from under the mapping , where is the learned edge probability for edge , which belongs to . The is the indicator function. Note that if doesn't exist on , we can regard the probability as 0.
The probability mass function on should be
The summation indicates that all possible subgraph matchings from to should be considered when calculating the probability of being sampled from .
Common concern 2: Core contributions
2.1 Point out limitations of existing methods for graph MI estimation
The key difficulty of exactly calculating relies on finding all subgraph matchings . In previous works, is always regarded as an edge-weighted version of and the number of nodes in equates to that of . In that case, the subgraph matching set actually degrades into a one-element set , where is implicitly assumed to be known, i.e., the one that maps each node of the weighted to the corresponding node of the unweighted . However, this is not true for general subgraphs, which may be smaller than , i.e., . Because for these smaller subgraphs, there may exist multiple subgraph matchings, i.e., . Besides, even if the size of equates to the size of , the subgraph matching can also be non-unique.
Moreover, computing the prior distribution over graphs is equally challenging as . The previous works also involve the term to compute the bounds of graph MI. For example, GSAT regarded the prior probability of a subgraph instance as the product of edge probabilities on , where the existence of each edge follows an independent Bernoulli distribution. However, is essentially , which relies on the parent graph (as we have stated, in GSAT and other works, must be a weighted version of and the matching between node sets of and is implicitly assumed to be known), and is not a real prior distribution over graphs. A real graph prior should be able to be calculated for an arbitrary graph without knowing its parent graph . Hence, we should utilize an MI bound that does not depend on the calculation of .
2.2 Adopt to approximate
In light of the above analysis, we have to either resort to a subgraph matching algorithm to find the exact , or try to approximate the ground truth probability mass function . Since calculating is extremely time-consuming and intractable (NP-hard) for general graphs, we adopt the later scheme, i.e., utilizing a variational distribution (which is implemented by a neural network) to approximate .
Moreover, once is obtained, we need to calculate the upper bound of graph mutual information . Previous works require designating a prior to obtain a graph MI upper bound, while is also difficult to design over the graph domain. We adopt an MI upper bound (CLUB) which solely relies on to circumvent .
Common concern 3: relation with GSAT
Our proposed method GEMINI indeed resembles GSAT in its computational architecture. However, we enhance the estimation of the upper bound of in the graph information bottleneck framework. Specifically,
-
GSAT calculated the probability by multiplying probabilities of the preserved edges in the subgraph without considering subgraph matching. It is problematic as we have analyzed above. Unlike GSAT, we adopt a variational to approximate this conditional probability and develop a learning objective to optimize .
-
GSAT utilized a prior distribution , which is hard to estimate. In our work, we utilize CLUB upper bound that does not depend on , circumventing the difficulty of designating a prior distribution over subgraphs.
Dear Reviewer,
Thank you for taking the time to review our paper. We appreciate your feedback and would like to address your concerns by revising the original paper, adding more experiments, and elaborating more on the technical details. We hope that our revisions address your concerns and provide a more comprehensive and compelling response.
We look forward to your feedback and hope to address any further questions or comments you may have!
Thank you.
Dear Reviewers,
Thank you for dedicating your time to review our work. Your feedback is highly valued, and we hope that our responses have successfully addressed all the issues you raised. In light of this, we humbly request that you reconsider your scoring. Moreover, as the period for open discussion is set to end in several hours, we kindly remind you to share any remaining concerns you might still have. We will be happy to provide further clarification!
Thank you so much for your consideration!
This paper studies interpretable graph networks with improved mutual information estimation. XAI in graph network is an important topic and this paper presents a way to improve current methods. However, the reviewers are concerns about the technical advances of this work. Thus a reject is recommended.
为何不给更高分
The reviewers are concerns about the technical advances of this work.
为何不给更低分
NA
Reject