PaperHub
5.5
/10
Rejected4 位审稿人
最低5最高6标准差0.5
5
6
6
5
3.0
置信度
ICLR 2024

Enhancing Mutual Information Estimation in Self-Interpretable Graph Neural Networks

OpenReviewPDF
提交: 2023-09-22更新: 2024-02-11
TL;DR

A self-interpretable graph learning framework based on the information bottleneck principle with a new scheme of restricting mutul information between graph random variables

摘要

Graph neural networks (GNNs) with self-interpretability are pivotal in various high-stakes and scientific domains. The information bottleneck (IB) principle holds promise to infuse GNNs with inherent interpretability. In particular, the graph information bottleneck (GIB) framework identifies key subgraphs from the input graph $G$ that have high mutual information (MI) with the predictions while maintaining minimum MI with $G$. The major challenge is dealing with irregular graph structures and gauging the conditional probabilities for evaluating MI between these subgraphs and $G$. Existing methods for estimating the MI between graphs often present distorted and loose estimations, thereby undermining model efficacy. In this work, we propose a novel framework GEMINI for training self-interpretable graph models, which tackles the key challenge of graph MI estimations. We construct a variational distribution over critical subgraphs, based on which an efficient MI upper bound estimator for graphs is built. Besides the proposed theoretical framework, we devise a practical instantiation of different modules in GEMINI. We compare GEMINI thoroughly with both self-interpretable GNNs and post-hoc explanation methods on eight datasets with both interpretation and prediction performance metrics. Results reveal that GEMINI outperforms state-of-the-art self-interpretable GNNs on interpretability and achieves comparable prediction performance compared with mainstream GNNs.
关键词
graph neural networksinformation bottleneckinterpretability

评审与讨论

审稿意见
5

Existing self-interpretable Graph Neural Networks (GNNs) built upon Graph Information Bottleneck (GIB) suffer from the burdensome of mutual information estimation. To address this issue, this work proposes a novel framework for self-interpretable GNNs with an enhanced technique for mutual information estimation, namely GENIMI. Experiment results indicate the proposed GENIMI enjoys improved predictive and interpretable performance.

优点

This paper is well-written and easy to follow. The motivation for improving the mutual information estimation in the GIB framework is clear and crucial. Empirical results show that the proposed GEMINI enjoys competitive performances of GNN prediction and interpretability.

缺点

However, the reviewer is concerned with some theoretical details.

  1. For the predictive term in Eqn. 3, the appropriate formula derivation is: I(Gsub;Y)=Ep(Gsub,Y)logp(YGsub)p(Y)Ep(Gsub,Y)qω(YGsub)+H(Y)I(G_{sub};Y)=E_{p(G_{sub},Y)}\log{\frac{p(Y|G_{sub})}{p(Y)}}\geq E_{p(G_{sub},Y)}q_{\omega}(Y|G_{sub})+H(Y).

  2. Does pϕ(GsubG)p_{\phi}(G_{sub}|G) and qθ(GsubG)q_{\theta}(G_{sub}|G) share the same subgraph generator? If so, what is the intuition behind using qθ(GsubG)q_{\theta}(G_{sub}|G) to approach pϕ(GsubG)p_{\phi}(G_{sub}|G)?

问题

The authors are encouraged to address the concerns in Weaknesses.

伦理问题详情

N/A

评论

Q1:For the predictive term in Eq. 3, the appropriate formula derivation is: I(Gsub;Y)=Ep(Gsub,Y)logp(YGsub)p(Y)Ep(Gsub,Y)logqω(YGsub)+H(Y)I(G_{sub};Y) = E_{p(G_{sub},Y)} \log \frac{p(Y|G_{sub})}{p(Y)} \geq E_{p(G_{sub},Y)} \log q_\omega(Y|G_{sub}) + H(Y).

A1: Thanks for your comments. We apologize for the error in Eqn. 3, where the denominator of the first term of line 1 should be p(Y)p(Y) instead of p(Gsub)p(G_{sub}). We have revised it in the paper. Please note, however, that this mistake does not impact the following derivations and conclusions presented in the paper.

Q2: Does pϕ(GsubG)p_\phi(G_{sub}|G) and qθ(GsubG)q_\theta(G_{sub}|G) share the same subgraph generator? If so, what is the intuition behind using qθ(GsubG)q_\theta(G_{sub}|G) to approach pϕ(GsubG)p_\phi(G_{sub}|G)?

A2: Thanks for your question! pϕ(GsubG)p_\phi(G_{sub}|G) is the probability that GsubG_{sub} is sampled by GG based on a underlying subgraph generation process parameterized by ϕ\phi. However, learning ϕ\phi is not enough to calculate the exact value of pϕ(GsubG)p_\phi(G_{sub}|G), it should also include a subgraph matching step to compute pϕ(GsubG)p_\phi(G_{sub}|G), leading to a NP-hard complexity. Previous works (e.g., GSAT) made some assumptions to ignore the subgraph matching step. Nonetheless, this has resulted in an imprecise learning of ϕ\phi and an inaccurate calculation of I(Gsub;G)I(G_{sub};G).

Instead, we simultaneously learn ϕ\phi and approximate pϕp_\phi by a variational distribution qθq_\theta. Based on Eq.7 in the paper, the maximization of Ep(G,Gsub)[logqθ(GsubG)]E_{p(G, G_{\text{sub}})} [ \log q_\theta( G_{\text{sub}}|G) ] is equivalent to minimizing Ep(G)[KL(pϕ(GsubG)qθ(GsubG))]E_{p(G)}[KL(p_\phi(G_{\text{sub}}|G) || q_\theta(G_{\text{sub}}|G))]. Thus, we learn qθq_\theta in Eq.8 with a fixed ϕ\phi. The optimization of ϕ\phi is achieved by minimizing the total GIB objective. Please refer to the Common concern 1 for a rigorous analysis.

In the detailed implementation, the subgraph generator gϕg_\phi is used to instantiate the subgraph generation process, and hence determines the distribution pϕ(GsubG)p_\phi(G_{sub}|G). qθq_\theta is implemented as a neural network taking GsubG_{sub} and GG as inputs and outputs a probability. The network qθq_\theta includes two GNNs to extract representations of GsubG_{sub} and GG respectively and then calcutate a probability. The parameters in gϕg_\phi and qθq_\theta are learned iteratively, as depicted in the pseudo-code in Appendix.

审稿意见
6

The goal of the paper is to evaluate the mutual information (MI) between an input graph and a key subgraph. To tackle this problem, the authors propose a novel framework called GEMINI, which trains self-interpretable graph models and addresses the challenge of distorted and imprecise estimations in graph MI estimation research. The authors construct a variational distribution over the critical subgraph and create an effective MI upper bound estimator. The proposed method is shown to be effective according to empirical results.

优点

  1. This paper is well-organized and well-written. The authors provide sufficient details about their work and easy to understand. 

  2. Estimating the mutual information between the input graph and the subgraph is both important and challenging.

缺点

  1. This work closely follows GSAT[1]. Its main theoretical contribution is the addition of the information bottleneck (IB) upper bound loss LGCLUBL_{GCLUB} to the objective of GSAT[1], which is based on the idea of variational CLUB[2].

  2. Does the proposed model's ability to remove the spurious correlation come from the framework of GSAT? Can GEMINI provide a theoretical guarantee for the removal of spurious correlations?

  3. Some of the numerical results reported in Table 1 and Table 2 are quite different from those reported in GSAT. The differences are particularly noticeable with the numbers that involve MNIST-75sp in Table 1 and those associated with SPMotif in Table 2, relating to GIN and GSAT. It would be helpful if the authors could provide further details about their implementations and explanations for these differences.

[1] Miao, Siqi, Mia Liu, and Pan Li. "Interpretable and generalizable graph learning via stochastic attention mechanism." In International Conference on Machine Learning, pp. 15524-15543. PMLR, 2022. [2] Cheng, Pengyu, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. "Club: A contrastive log-ratio upper bound of mutual information." In International conference on machine learning, pp. 1779-1788. PMLR, 2020.

问题

  1. On page 3, in the last sentence before Eq.3, should it be a “lower” bound of I(Gsub;Y)I(G_{sub};Y)?

  2. I cannot find the curve of GSAT in the second subfigure of Fig. 2(d). Is it missing or unavailable?

评论

Q2: Does the proposed model's ability to remove the spurious correlation come from the framework of GSAT? Can GEMINI provide a theoretical guarantee for the removal of spurious correlations?

A2: Theoretically justifying the capability of the proposed method for removing spurious signals can be challenging, since we need to mathematically formalize the removal of spurious signals. Therefore, we conducted an experiment to empirically compare the resistance to spurious signals between the proposed method and GSAT. We have also added a section in Appendix A.6.

Specifically, we select three checkpoints from SPMotif (0.9) and SPMotif (0.7) for both methods and randomly generate 3000 spurious graphs to evaluate the spurious score and the predictive score. Each spurious graph is generated by randomly selecting a motif graph and a base graph and then assembling the two. For each spurious graph gig_i, we denote its label as yiy_i (determined by its motif graph) and its spurious-aligned class as sis_i (determined by its base graph). The spurious-aligned class is the label of motif graph with which a spurious base graph is correlated in the training set SPMotif (0.9) and SPMotif (0.7). For example, in SPMotif (0.9), a spurious base graph tree has a probability of 0.9 to appear with a motif graph house together. Since the label for house is 0, the spurious-aligned class for tree is 0.

In the evaluation dataset, we ensure yisiy_i \neq s_i for each gig_i. Referring to the established work (Post hoc explainers may be ineffective for detecting unknown spurious correlation), we define the spurious score as

SpurScore=1Ni=1Nf(gi)[si].SpurScore = \frac{1}{N}\sum_{i=1}^{N} f(g_i)[s_i].

where f(gi)f(g_i) is the output probabilities of a model, i.e., a three-dimensional vector in our setting. NN is the number of evaluation samples, i.e., 3000 in this experiment. The spurious score indicates the average probability of a model mistakenly predicting a graph to its spurious class. Similarly, we calculate the predictive score as

PredScore=1Ni=1Nf(gi)[yi].PredScore = \frac{1}{N}\sum_{i=1}^{N} f(g_i)[y_i].

The predictive score indicates the average probability of a model correctly predicting a graph to its ground truth class.

The results are shown in the table:

SPMotif (0.9)SPMotif (0.7)
Spurious ScorePredictive ScoreSpurious ScorePredictive Score
GSAT0.09850.75170.01080.7711
GEMINI0.07080.77760.00290.9123

We find that for models trained from both datasets, the spurious score of GEMINI is smaller than that of GSAT. Moreover, the predictive score of GEMINI is significantly better than that of GSAT, especially when the spurious signal is moderate, e.g., SPMotif (0.7). These results verify the superior resistance and robustness against spurious signals compared to GSAT.

评论

Q1: This work closely follows GSAT[1]. Its main theoretical contribution is the addition of the information bottleneck (IB) upper bound loss IGCLUBI_{GCLUB} to the objective of GSAT[1], which is based on the idea of variational CLUB[2].

A1: Thanks for your comments. The novelty of the paper is that we point out the limitation of GSAT and propose a framework to approximate pϕ(GsubG)p_\phi(G_{sub}|G) by the variational distribution qθ(GsubG)q_\theta(G_{sub}|G).

In detail, GSAT assumed the generated subgraph GsubG_{sub} is an edge-weighted variant of GG, where the nodes in GsubG_{sub} and GG are one-to-one correspondence, and the edges in GG are sampled to obtain GsubG_{sub}. However, these works overlook the scenario where the number of nodes in GsubG_{sub} is less than that in GG. Even if the number of nodes in GsubG_{sub} equates to that of GG, the correspondence between nodes in GsubG_{sub} and those in GG may be nonunique. In such cases, it is necessary to perform subgraph matching before calculating the probability pϕ(GsubG)p_\phi(G_{sub}|G). Since subgraph matching is NP-hard, it becomes extremely difficult to accurately compute the value of pϕ(GsubG)p_\phi(G_{sub}|G). Given the learned ϕ\phi, previous studies have not been able to compute an effective value of pϕ(GsubG)p_\phi(G_{sub}|G). Moreover, computing the prior distribution r(Gsub)r(G_{sub}) over graphs is equally challenging as pϕ(GsubG)p_\phi(G_{sub}|G). GSAT involves the term r(Gsub)r(G_{sub}) to compute the upper bound value of I(Gsub;G)I(G_{sub};G). However, the calculation of r(Gsub)r(G_{sub}) requires to designate a random graph sampling process and hence encounters similar difficulties as when calculating pϕ(GsubG)p_\phi(G_{sub}|G).

Instead, our proposed method approximates pϕp_\phi by a variational distribution qθq_\theta. Based on Eq.7 in the paper, the maximization of Ep(G,Gsub)[logqθ(GsubG)]E_{p(G, G_{\text{sub}})} \left[ \log q_\theta( G_{\text{sub}}|G) \right] is equivalent to minimizing Ep(G)[KL(pϕ(GsubG)E_{p(G)}[KL(p_\phi(G_{\text{sub}}|G) || qθ(GsubG))]q_\theta(G_{\text{sub}}|G))]. Thus, we learn qθq_\theta in Eq.8. In addition, we use the CLUB to calculate the upper bound of I(Gsub;G)I(G_\text{sub};G), which avoids considering the prior r(Gsub)r(G_\text{sub}).

评论

Q3: Some of the numerical results reported in Table 1 and Table 2 are quite different from those reported in GSAT. The differences are particularly noticeable with the numbers that involve MNIST-75sp in Table 1 and those associated with SPMotif in Table 2, relating to GIN and GSAT.

A3: Thanks for your comments! For the SPMotif results in Table 2, there is a parameter to control the size of the base (spurious) graph in the generation of SPMotif. In GSAT, the size is relatively larger, e.g., 20 for tree, 60 for wheel, in the test set. We find that a large base graphs prone to result in distribution shift. We slightly adjust the parameter to be smaller, e.g., 15 for tree, 30 for wheel, to relieve the distribution shift and facilitate comparison. Because we mainly aim to focus on the spurious correlation and interpretation performance, instead of distribution shifts. Moreover, we generate more samples for SPMotif compared with the implementation of GSAT. We generate 30000 samples for a SPMotif dataset, while the number is 3000 in GSAT according to their source code. Hence, the predictive performance of GSAT on SMPotifs in Table 2 is better than their originally reported result.

For the MNIST-75sp results in Table 1, We adopt the strictly identical dataset settings as with the GSAT baseline. The difference may come from some model parameters, adopted loss coefficients, and randomness, since the dataset size of MNIST-75sp is only 20000 and we only run three times to obtain the averaged results, which may also affect the evaluation numbers.

Q4: On page 3, in the last sentence before Eq.3, should it be a “lower” bound of I(Gsub;Y)I(G_{sub};Y)?

A4: Yes, you are correct. We are sorry for this typo and have made a revision.

Q5: I cannot find the curve of GSAT in the second subfigure of Fig. 2(d). Is it missing or unavailable?

A5: The second subfigure Fig. 2(d) compares node probabilities. GSAT solely learns edge probabilities, and hence node probabilities remain 1.0. The node probability for GIN is also 1.0. So the curve of GIN overlaps over the curve of GSAT. We have added a clarification in the caption of Fig. 2(d).

审稿意见
6

The Graph Information Bottleneck framework significantly enhances the self-interpretability of Graph Neural Networks. However, current approaches in estimating the mutual information between graph explanations and their original forms frequently yield distorted and imprecise estimations, ultimately compromising the effectiveness of the model. In response to these limitations, this paper introduces a novel framework called GEMINI to address these challenges.

优点

  • They utilize a MI upper bound estimator based exclusively on the conditional probability distribution.
  • They introduce a variational distribution and its suitable instantiation for the conditional probability distribution.
  • Extensive experiments demonstrate the effectiveness of the proposed framework.

缺点

  • They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.
  • The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.
  • The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.

问题

Please refer to the weaknesses.

  • They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.
  • The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.
  • The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.
评论

Q1: They employ established MI estimator theory, which appears easily extendable to the graph domain. In my view, it seems they have not drawn particularly interesting conclusions or specific designs for graphs. I have some reservations about the novelty of the proposed framework.

A1: Thanks for your comments. The novelty of the paper is that we point out the limitation of previous GIB works and propose a framework to approximate pϕ(GsubG)p_\phi(G_{sub}|G) by the variational distribution qθ(GsubG)q_\theta(G_{sub}|G). The experimental results demonstrate that we can have a better estimation of the GIB term compared with GSAT and DIR.

In detail, previous GIB works think the generated subgraph GsubG_{sub} is an edge-weighted variant of GG, where the nodes in GsubG_{sub} and GG are one-to-one correspondence, and the edges in GG are sampled to obtain GsubG_{sub}. However, these works overlook the scenario where the number of nodes in GsubG_{sub} is less than that in GG. Even if the number of nodes in GsubG_{sub} equates to that of GG, the correspondence between nodes in GsubG_{sub} and those in GG may be nonunique. In such cases, it is necessary to perform subgraph matching before calculating the probability pϕ(GsubG)p_\phi(G_{sub}|G). Since subgraph matching is NP-hard, it becomes extremely difficult to accurately compute the value of pϕ(GsubG)p_\phi(G_{sub}|G). Given the learned ϕ\phi, previous studies have not been able to compute an effective value of pϕ(GsubG)p_\phi(G_{sub}|G), which affects the upper bound value of I(Gsub;G)I(G_{sub};G). Instead, our proposed method approximates pϕp_\phi by a variational distribution qθq_\theta with the learned ϕ\phi. Based on Eq.7 in the paper, the maximization of Ep(G,Gsub)[logqθ(GsubG)]E_{p(G, G_{\text{sub}})} \left[ \log q_\theta( G_{\text{sub}}|G) \right] is equivalent to minimizing Ep(G)[KL(pϕ(GsubG)E_{p(G)}[KL(p_\phi(G_{\text{sub}}|G) || qθ(GsubG))]q_\theta(G_{\text{sub}}|G))]. Thus, we learn qθq_\theta in Eq.8.

For a formal analysis of the limitations of previous works, please refer to Common concern 1.

Q2: The experimental results are not convincing enough. SOTA explainers for GNNs should be set as baselines. Moreover, the proposed model did not exhibit a significant improvement compared to these baselines.

A2: Thank you for your feedback. Indeed, GSAT is the state-of-the-art baseline with both predictive and interpretable capabilities. Additionally, GNNExplainer and PGExplainer are frequently employed as post-hoc explanation methods for GNNs in prior research. As demonstrated in Table 1 & 2, our proposed method maintains comparable predictive performance with mainstream GNNs (e.g., GIN), and improves the interpretation AUC over the second-best baseline across five datasets. The average improvement is about 3%\sim 5% over the state-of-the-art self-interpretable baseline. Thus, we believe our model exhibits a significant advancement in interpretability, contributing to the existing body of research on GNNs.

Q3: The paper's writing and organization require enhancements. For instance, it is challenging for readers to discern the corresponding relationships between the limitations and the contributions.

A3: Thanks for your suggestions. We have revised the paper more clearly to distinguish the limitation analysis and our proposed method. We also added the rigorous formulation and limitation analysis in Appendix A.1. Please refer to Section 3, 4, and Appendix A.1 in the revised paper.

审稿意见
5

In this paper, the authors introduce a novel approach for approximating the Graph Information Bottleneck (GIB). Their method focuses on modeling the distribution of arbitrary subgraphs and graphs, while also bypassing the need to model the prior of subgraphs. Experimental results seem to demonstrate the effectiveness of the proposed method.

优点

  1. The paper is well-written, with the authors providing a thorough derivation of the proposed GIB approximation and presenting a clear step-by-step explanation of their method.

  2. The authors employ the CLUB technique to circumvent the need for modeling the prior of subgraphs. This approach appears to relax the assumptions made in previous methods, enhancing the flexibility and applicability of the proposed approach.

缺点

  1. The model architecture in this paper bears a resemblance to GSAT, and it would be advantageous if the authors could explicitly delineate the key distinctions between the two. Furthermore, the experimental results suggest a notable enhancement over GSAT despite their similar model architectures. Providing inference codes for model reproduction would greatly facilitate the validation of these results and contribute to the paper's overall reproducibility and transparency.

  2. The authors assert that the proposed method can generate sparse subgraphs even without the need for sparse regularization. However, it is evident that L_sp is introduced as a subgraph term to regulate graph sparsity. In the ablation study, the authors argue that this term is essential, which appears to be inconsistent with their initial claim in the introduction.

问题

  1. See the comments above.

  2. What is the benefit of modeling arbitrary subgraphs and graphs? Since G_{sub}^1 should be sampled from G_1 and should not be related to G_2.

  3. Why is the MI upper bound approximation proposed in the method better than previous methods?

评论

Q1: The model architecture in this paper bears a resemblance to GSAT, and it would be advantageous if the authors could explicitly delineate the key distinctions between the two. Furthermore, the experimental results suggest a notable enhancement over GSAT despite their similar model architectures. Providing inference codes for model reproduction would greatly facilitate the validation of these results and contribute to the paper's overall reproducibility and transparency.

A1: Thanks for your advice! The key difference between GSAT and our proposed method is that we improve the calculation of pϕ(GsubG)p_\phi(G_{sub}|G) and I(G;Gsub)I(G; G_{sub}). Actually, pϕ(GsubG)p_\phi(G_{sub}|G) is the probability that GsubG_{sub} is a generated subgraph from GG based on a subgraph generation process parameterized by ϕ\phi. However, knowing ϕ\phi is not enough to calculate the exact value of pϕ(GsubG)p_\phi(G_{sub}|G), it should also include a subgraph matching step to compute pϕ(GsubG)p_\phi(G_{sub}|G), leading to a NP-hard complexity. GSAT made some assumptions to ignore the subgraph matching step. Nonetheless, this has resulted in an imprecise learning of ϕ\phi and an inaccurate calculation of I(Gsub;G)I(G_{sub};G).

Instead, we simultaneously learn ϕ\phi and approximate pϕp_\phi by a variational distribution qθq_\theta. Based on Eq.7 in the paper, optimizing the objective in Eq. 8 minimizes the KL divergence between pϕp_\phi and qθq_\theta. The optimization of ϕ\phi is achieved by minimizing the total GIB objective.

Please refer to the Common concern 1 for a comprehensive and rigorous analysis. For the inference codes, we have provided the source code in the supplementary material to facilitate the reproduction of our experimental results.

Q2: The authors assert that the proposed method can generate sparse subgraphs even without the need for sparse regularization. However, it is evident that L_sp is introduced as a subgraph term to regulate graph sparsity. In the ablation study, the authors argue that this term is essential, which appears to be inconsistent with their initial claim in the introduction.

A2: Thanks for your comments. In the first section of the ablation study (Effect of the LGCLUBL_{GCLUB} term), we remove the LspL_{sp} term and rely solely on the LGCLUBL_{GCLUB} regularization. Figure 2(b) demonstrates that LGCLUBL_{GCLUB} could generate sparse graphs, e.g., the average edge/node probability is about 0.60.70.6\sim 0.7. We claim LspL_{sp} is essential for generating even sparser graphs, e.g., average edge/node probability of 0.30.50.3\sim 0.5. The term LspL_{sp} is used to provide users with the opportunity to achieve a desired level of subgraph sparsity based on their preferences.

Q3: What is the benefit of modeling arbitrary subgraphs and graphs? Since Gsub1G_{sub}^1 should be sampled from G1G_1 and should not be related to G2G_2.

A3: Thanks for your questions! Different from the previous works, we provide a more accurate calculation of pϕ(GsubG)p_\phi(G_{sub}|G) and I(G;Gsub)I(G;G_{sub}). In detail, we approximate pϕ(GsubG)p_\phi(G_{sub}|G) by a variational distribution qθ(GsubG)q_\theta(G_{sub}|G), and use the CLUB upper bound for I(G;Gsub)I(G;G_{sub}).

When adopting our IGCLUBI_{GCLUB} bound to calculate the upper bound of I(G;Gsub)I(G;G_{sub}), a subgraph instance of GsubG_{sub} may not be directly sampled from a graph instance of GG. According to the CLUB bound in Eq. 5 of the paper, for the second term Ep(G)Ep(Gsub)[logpϕ(GsubG)]E_{p(G)} E_{p(G_{sub}) } [ \log p_\phi( G_{\text{sub}}|G )], GsubG_{sub} and GG come from their own marginal distributions. Hence when we calculate this term, we need to calculate pϕ(gsub2g1)p_\phi( g_{\text{sub}}^2 |g^1 ) where g1g^1 is sampled from p(G)p(G), and gsub2g_{\text{sub}}^2 is sampled from p(Gsub)p(G_{\text{sub}}). gsub2g_{\text{sub}}^2 may not come from g1g^1.

Q4: Why is the MI upper bound approximation proposed in the method better than previous methods?

A4: Thanks for your questions! The reason for the utilization of CLUB bound is to avoid the estimation of the prior distribution r(Gsub)r(G_{sub}). Computing r(Gsub)r(G_\text{sub}) over graphs is equally challenging as pϕ(GsubG)p_\phi(G_{sub}|G). For example, GSAT regarded the prior probability of a subgraph instance r(gsub)r(g_{sub}) as the product of edge probabilities on gg, where the existence of each edge follows an independent Bernoulli distribution. However, r(gsub)r(g_{sub}) is essentially r(gsubg)r(g_{sub}|g), which relies on the parent graph gg (as we have stated, in GSAT and other works, gsubg_{sub} must be a weighted version of gg and the matching between node sets of gsubg_{sub} and gg is implicitly assumed to be known), and is not a real prior distribution over graphs. A real graph prior r(gsub)r(g_{sub}) should be able to be calculated for an arbitrary graph gsubg_{sub} without knowing its parent graph gg. Hence, we should utilize an MI bound that does not depend on r(Gsub)r(G_{sub}).

评论

Summary of the revision

We sincerely thank all the reviewers for the valuable comments, which help to improve the quality of our work. We summarize the revision in the updated version as follows:

  • We revised section 4 to correct some typos and remove limitation analysis to a new section.
  • We added the new section 3 to distinguish the limitation analysis of current graph MI estimations and our proposed method.
  • We added a section in the Appendix to rigorously formulate the analysis in section 3.
  • We revised the introduction to be consistent with our revised paper.
  • We added an experiment to investigate the spurious score of different models.

Common concern 1: Limitations of existing methods when calculating pϕ(GsubG)p_\phi(G_{sub}|G)

To rigorously illustrate the difficulty of calculating the exact values of pϕ(GsubG)p_\phi(G_{sub}|G) and the necessity of the proposed variational distribution qθ(GsubG)q_\theta(G_{sub}|G), we need to introduce several definitions first. Note that we use upper case (e.g., GG, GsubG_{sub}) to represent graph random variables, and use lower case (e.g., gsubg_{sub}, gg, g1g_1, g2g_2) to indicate graph instances.

1.1 Subgraph matching

Let's consider two graph instances, g1g_1 and g2g_2. The node sets of g1g_1 and g2g_2 are represented by V(g1)V(g_1) and V(g2)V(g_2) respectively, while their edge sets are denoted by E(g1)E(g_1) and E(g2)E(g_2). We assume that V(g1)V(g2)|V(g_1)| \leq |V(g_2)|. A valid subgraph matching mm from g1g_1 to g2g_2 is an injective function that maps the nodes of g1g_1 to nodes of g2g_2, such that for each edge eu,ve_{u,v} in E(g1)E(g_1), the mapped edge em(u),m(v)e_{m(u),m(v)} is still in E(g2)E(g_2). The set of all possible subgraph matching from g1g_1 to g2g_2 is defined as Mg1g2M_{g_1 \rightarrow g_2}, where

Mg1g2{mfor all (u,v)V(g1)×V(g1),em(u),m(v)E(g2) if eu,vE(g1)}.M_{g_1\rightarrow g_2} \triangleq \{m| \text{for all } (u,v) \in V(g_1) \times V(g_1), e_{m(u),m(v)} \in E(g_2) \text{ if } e_{u,v} \in E(g_1) \}.

1.2 The probability distribution over subgraphs

For a graph instance gg, assuming p(eu,v)p(e_{u,v}) is the learned edge probability for eu,vE(g)e_{u,v}\in E(g) and p(v)p(v) is the learned node probability for node vV(g)v\in V(g). We consider the following procedure to obtain a subgraph instance gsubg_{sub} from gg:

  1. Sampling each node vV(g)v\in V(g) basd on a Bernoulli distribution Bern(p(v))Bern(p(v)).
  2. For all sampled nodes in the first step, sampling edges eu,ve_{u,v} between uu and vv based on a Bernoulli distribution Bern(p(eu,v))Bern(p(e_{u,v})). If there is no edge between uu and vv on gg, the p(eu,v)p(e_{u,v}) is regarded as 0.

In the following, we formally derivate the probability mass function pϕ(gsubg)p_\phi(g_{sub}|g) on the subgraph sample space Ωg\Omega_g induced by gg and edge/node probabilities (which is determined by ϕ\phi), and the probability space of random subgraphs sampled from gg. Note that we mainly focus on the graph topology since the sampling procedure and learned probabilities only involve the graph substructure information.

For a graph instance gg with nn nodes, we denote the subgraph space of gg as Ωg\Omega_g, which contains all possible subgraphs from gg, i.e., Ωg=i=1nΩgi\Omega_{g} = \cup_{i=1}^{n} \Omega^i_g where Ωgi\Omega^i_g is the set of all possible non-induced subgraphs of gg with exactly ii nodes. In other words, for any gsubΩgg_{sub} \in \Omega_g, MgsubgM_{g_{sub}\rightarrow g} \neq \emptyset.

Based on the sampling (or graph generation) process above, gg determines a probability space of a random subgraph variable, for which the sample space is Ωg\Omega_g, and the probability measure is implicitly determined. In the following, we formalize the probability mass function.

For a specific subgraph matching instance mm, we define

PV(gsub,g,m)=vV(gsub)p(m(v))vV(gsub)/V(gsub)(1p(v))P^{V}(g_{sub}, g, m) = \prod_{v\in V(g_{sub})} p(m(v)) \prod_{v\in V(g_{sub})/ V(g_{sub})} (1-p(v))

to indicate the likelihood of nodes in an instance gsubg_{sub} are preserved from gg under the mapping mm, where p(m(v))p(m(v)) is the learned node probability for node m(v)m(v), which belongs to gg.

Similarly, we define

PE(gsub,g,m)=u,vV(gsub)1(u,v)E(gsub)p(em(u),m(v))+(11(u,v)E(gsub))(1p(em(u),m(v)))P^{E}(g_{sub}, g, m) = \prod_{u,v \in V(g_{sub})} 1_{(u, v)\in E(g_{sub})} p(e_{m(u), m(v)}) + (1-1_{(u, v)\in E(g_{sub})}) (1-p(e_{m(u), m(v)}))

to indicate the probability of edges on the instance gsubg_{sub} to be preserved from gg under the mapping mm, where p(em(u),m(v))p(e_{m(u), m(v)}) is the learned edge probability for edge em(u),m(v)e_{m(u), m(v)}, which belongs to gg. The 1condition1_{condition} is the indicator function. Note that if em(u),m(v)e_{m(u), m(v)} doesn't exist on gg, we can regard the probability p(em(u),m(v))p(e_{m(u), m(v)}) as 0.

The probability mass function pϕ(gsubg)p_\phi(g_{sub}|g) on Ωg\Omega_{g} should be

pϕ(gsubg)=mMgsubgPV(gsub,g,m)×PE(gsub,g,m).p_\phi(g_{sub}|g) = \sum_{m \in M_{g_{sub}\rightarrow g}} P^{V}(g_{sub}, g, m)\times P^{E}(g_{sub}, g, m) .

The summation indicates that all possible subgraph matchings from gsubg_{sub} to gg should be considered when calculating the probability of gsubg_{sub} being sampled from gg.

评论

Common concern 2: Core contributions

2.1 Point out limitations of existing methods for graph MI estimation

The key difficulty of exactly calculating pϕ(gsubg)p_\phi(g_{sub}|g) relies on finding all subgraph matchings MgsubgM_{g_{sub}\rightarrow g}. In previous works, gsubg_{sub} is always regarded as an edge-weighted version of gg and the number of nodes in gsubg_{sub} equates to that of gg. In that case, the subgraph matching set MgsubgM_{g_{sub}\rightarrow g} actually degrades into a one-element set {mˉ}\{\bar{m}\}, where mˉ\bar{m} is implicitly assumed to be known, i.e., the one that maps each node of the weighted gsubg_{sub} to the corresponding node of the unweighted gg. However, this is not true for general subgraphs, which may be smaller than gg, i.e., V(gsub)<V(g)|V(g_{sub})|< |V(g)|. Because for these smaller subgraphs, there may exist multiple subgraph matchings, i.e., Mgsubg>1|M_{g_{sub}\rightarrow g}|>1. Besides, even if the size of gsubg_{sub} equates to the size of gg, the subgraph matching can also be non-unique.

Moreover, computing the prior distribution r(Gsub)r(G_\text{sub}) over graphs is equally challenging as pϕ(GsubG)p_\phi(G_{sub}|G). The previous works also involve the term r(Gsub)r(G_\text{sub}) to compute the bounds of graph MI. For example, GSAT regarded the prior probability of a subgraph instance r(gsub)r(g_{sub}) as the product of edge probabilities on gg, where the existence of each edge follows an independent Bernoulli distribution. However, r(gsub)r(g_{sub}) is essentially r(gsubg)r(g_{sub}|g), which relies on the parent graph gg (as we have stated, in GSAT and other works, gsubg_{sub} must be a weighted version of gg and the matching between node sets of gsubg_{sub} and gg is implicitly assumed to be known), and is not a real prior distribution over graphs. A real graph prior r(gsub)r(g_{sub}) should be able to be calculated for an arbitrary graph gsubg_{sub} without knowing its parent graph gg. Hence, we should utilize an MI bound that does not depend on the calculation of r(gsub)r(g_{sub}).

2.2 Adopt qθ(gsubg)q_\theta(g_{sub}|g) to approximate pϕ(gsubg)p_\phi(g_{sub}|g)

In light of the above analysis, we have to either resort to a subgraph matching algorithm to find the exact MgsubgM_{g_{sub}\rightarrow g}, or try to approximate the ground truth probability mass function pϕ(gsubg)p_\phi(g_{sub}|g). Since calculating MgsubgM_{g_{sub}\rightarrow g} is extremely time-consuming and intractable (NP-hard) for general graphs, we adopt the later scheme, i.e., utilizing a variational distribution qθ(gsubg)q_\theta(g_{sub}|g) (which is implemented by a neural network) to approximate pϕ(gsubg)p_\phi(g_{sub}|g).

Moreover, once qθ(gsubg)q_\theta(g_{sub}|g) is obtained, we need to calculate the upper bound of graph mutual information I(G;Gsub)I(G; G_{sub}). Previous works require designating a prior r(Gsub)r(G_{sub}) to obtain a graph MI upper bound, while r(Gsub)r(G_{sub}) is also difficult to design over the graph domain. We adopt an MI upper bound (CLUB) which solely relies on qθ(gsubg)q_\theta(g_{sub}|g) to circumvent r(Gsub)r(G_{sub}).

Common concern 3: relation with GSAT

Our proposed method GEMINI indeed resembles GSAT in its computational architecture. However, we enhance the estimation of the upper bound of I(G;Gsub)I(G; G_{sub}) in the graph information bottleneck framework. Specifically,

  • GSAT calculated the probability pϕ(GsubG)p_\phi(G_{sub}|G) by multiplying probabilities of the preserved edges in the subgraph without considering subgraph matching. It is problematic as we have analyzed above. Unlike GSAT, we adopt a variational qθ(GsubG)q_\theta(G_{sub}|G) to approximate this conditional probability and develop a learning objective to optimize qθq_\theta.

  • GSAT utilized a prior distribution r(Gsub)r(G_{sub}), which is hard to estimate. In our work, we utilize CLUB upper bound that does not depend on r(Gsub)r(G_{sub}), circumventing the difficulty of designating a prior distribution over subgraphs.

评论

Dear Reviewer,

Thank you for taking the time to review our paper. We appreciate your feedback and would like to address your concerns by revising the original paper, adding more experiments, and elaborating more on the technical details. We hope that our revisions address your concerns and provide a more comprehensive and compelling response.

We look forward to your feedback and hope to address any further questions or comments you may have!

Thank you.

评论

Dear Reviewers,

Thank you for dedicating your time to review our work. Your feedback is highly valued, and we hope that our responses have successfully addressed all the issues you raised. In light of this, we humbly request that you reconsider your scoring. Moreover, as the period for open discussion is set to end in several hours, we kindly remind you to share any remaining concerns you might still have. We will be happy to provide further clarification!

Thank you so much for your consideration!

AC 元评审

This paper studies interpretable graph networks with improved mutual information estimation. XAI in graph network is an important topic and this paper presents a way to improve current methods. However, the reviewers are concerns about the technical advances of this work. Thus a reject is recommended.

为何不给更高分

The reviewers are concerns about the technical advances of this work.

为何不给更低分

NA

最终决定

Reject