PaperHub
Average rating: 4.7 / 10 (Rejected, 3 reviewers; min 1, max 8, std 2.9)
Individual ratings: 5, 1, 8
Confidence: 3.3 · Correctness: 2.0 · Contribution: 2.3 · Presentation: 2.7
ICLR 2025

Bayesian Neighborhood Adaptation for Graph Neural Networks

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-05

Abstract

Keywords
Graph Neural Network, Bayesian Inference

Reviews and Discussion

Official Review
Rating: 5

The paper introduces Bayesian Neighborhood Adaptation (BNA), a novel framework for automatically determining neighborhood scopes in Graph Neural Networks (GNNs). Rather than relying on a two-stage optimization to select the best number of hops for message aggregation, BNA employs a non-parametric Bayesian method. Specifically, it models the expansion of the neighborhood scope as a beta process and performs feature selection via a Bernoulli process. The proposed method allows GNNs to adapt their neighborhood scope dynamically, thus improving expressivity and performance on both homophilic and heterophilic graphs. The authors provide theoretical guarantees for improved model expressivity and conduct extensive experiments showing that BNA enhances GNN variants without introducing significant computational overhead.
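For orientation, the construction summarized above can be sketched with the notation that appears in the author responses below; the prior's exact hyper-parameters and truncation level are left unspecified here and should not be read as the paper's precise formulation:

$$\pi_l \sim \text{beta process (hop-wise contribution probability)}, \qquad \mathbf{z}_l \in \{0,1\}^{O}, \quad z_{l,k} \sim \mathrm{Bernoulli}(\pi_l),$$

$$\mathbf{H}_l = \sigma(\hat{\mathbf{A}}\mathbf{H}_{l-1}\mathbf{W}_l) \otimes \mathbf{z}_l + \mathbf{H}_{l-1},$$

where $\otimes$ applies the same length-$O$ feature mask to every node, so hops with small $\pi_l$ contribute few (or no) feature dimensions to the aggregation.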

Strengths

  • Originality: The paper presents an original solution to the problem of manually tuning neighborhood scopes in GNNs by framing the task as a Bayesian inference problem, which is a novel approach. This differs from typical HPO&NAS methods and introduces a probabilistic perspective on neighborhood adaptation.

  • Quality: The theoretical analysis is rigorous, particularly in terms of expressivity, and the claims are backed by mathematical proofs. The experiments are thorough, spanning both homophilic and heterophilic graphs, and demonstrate consistent improvements across various GNN architectures. The use of a variational inference approach to handle the intractable computation of the beta process prior is well-executed, balancing model complexity and computational efficiency.

  • Clarity: The methodology and theoretical sections are detailed, with clear explanations of the model formulation and the beta process. The figures, such as the one illustrating neighborhood adaptation, help clarify key concepts. The empirical results are presented comprehensively, with clear tables and comparisons to baselines, making it easy to follow the improvements brought by BNA.

  • Significance: Automatically determining the neighborhood scope for message passing in GNNs is an important contribution, as it addresses a key bottleneck in GNN design. The potential impact of this method is significant, as it could simplify GNN model tuning, leading to more efficient and scalable applications of GNNs in various domains.

Weaknesses

  • Diversity of baselines: It seems that the proposed method is evaluated as a plug-in trick for improving conventional GNNs. However, there have been lots of previous works that achieve adaptive depth in an efficient way (differentiable in some methods). It is necessary to include such works as the baselines here.

Questions

Could the authors provide more insights into how the framework performs on significantly larger datasets, such as full ogbn-mag or ogbn-papers100M? How would the computational cost and memory usage scale with larger graphs?

What would be the impact of using low-quality or noisy data on the inferred neighborhood scope? Is the beta process robust enough to handle datasets with significant noise or missing edges?

Can the framework be extended to handle dynamic graphs where the structure evolves over time? What challenges might arise in adapting BNA for such scenarios?

Comment

We thank Reviewer j7KP for the time and constructive comments.

"there have been lots of previous works that achieve adaptive depth in an efficient way (differentiable in some methods). It is necessary to include such works as the baselines here."

One related line of research on network architecture adaptation is Neural Architecture Search (NAS). While there has been substantial research on NAS in Deep Learning, limited work has explored it in GNNs. Our method fundamentally differs from those methods in the following aspects:

  • Our method infers neighborhood scope by assigning contribution probabilities ($\pi_l$) to neighboring nodes at each hop. The contribution probabilities allow for customized message aggregation, with neighboring nodes having different levels of contribution. However, NAS-based methods do not allow for such customized aggregation at training time.

  • We have theoretically and empirically demonstrated that our method improves expressivity in GNNs. NAS methods have not been shown to improve expressivity in GNNs.

  • The performance of gradient-based graph architecture search is sensitive to noise [1], and thus also requires optimizing the graph structure together with the network architecture, which adds extra computational overhead. However, we have shown that our Bayesian inference strategy improves the performance of state-of-the-art baselines without the need for graph structure optimization.

  • Application of our method improves uncertainty calibration of the baseline models. NAS-based methods have not been shown to have calibration benefits.

We compare GASSO [1] and our method on small (Cora) and large (ogb-arxiv) graphs, and the results are reported in the table below. Compared to GASSO, our method has comparable performance on the small graph and significantly outperforms it on the large graph.

| Accuracy | Cora | ogb-arxiv |
| --- | --- | --- |
| GASSO | 87.63 ± 0.29 | 70.39 |
| Ours+ResGCN | 86.83 ± 0.13 | 72.79 ± 0.30 |
| Ours+GCNII | 87.26 ± 0.25 | 73.06 ± 0.40 |

We will include this discussion in the final version.

"Could the authors provide more insights into how the framework performs on significantly larger datasets"

We conducted experiments on the ogb-products dataset (~2.5M nodes) using ResGCN, JKNet, and Ours+ResGCN, and the results are presented in the table below. The reported metrics are percentage accuracy, time (for 100 epochs), and memory consumption (per epoch). The results demonstrate that incorporating our framework enhances performance. Additionally, the time and space usage align with the analysis in Section 5.7, confirming the scalability of our framework to significantly larger datasets.

| | Accuracy | Time (seconds) | Memory (GB) |
| --- | --- | --- | --- |
| ResGCN | 76.06 ± 0.05 | 17.2 | 3.10 |
| JKNet | 76.01 ± 0.16 | 15.2 | 3.38 |
| Ours+ResGCN | 76.65 ± 0.08 | 56 | 4.05 |

"Is the beta process robust enough to handle datasets with significant noise or missing edges?"

Bayesian methods are generally recognized for their robustness to noisy data. In our work, we have demonstrated that using the beta process for neighborhood scope inference achieves comparable or superior performance to state-of-the-art baselines on very small datasets, such as Cornell and Texas, which have fewer than 200 nodes. Since the challenges associated with small datasets, such as limited signal and susceptibility to overfitting, overlap with those posed by noisy data, our method offers a promising avenue for further exploration in scenarios involving noisy data.

"Can the framework be extended to handle dynamic graphs where the structure evolves over time?"

It would be an interesting direction for our future work to integrate our nonparametric Bayesian method with sequence models like RNNs, to capture the temporal evolution of both the structure and node features in a graph. In dynamic scenarios, the relevant neighborhood scope for a graph may vary across timesteps, which may pose a challenge to the convergence of the beta process. Addressing this would require developing mechanisms to adapt the beta process in dynamic scenarios.

[1] Qin et al. Graph Differentiable Architecture Search with Structure Learning. NeurIPS, 2021.

Comment

Dear Reviewer J7KP,

As the discussion period is ending soon, we would appreciate your feedback on whether our responses have adequately addressed your concerns. If you have any additional comments or require further clarifications, please let us know at your earliest convenience.

Thank you for your time and input!

Comment

We have made every effort to address the concerns and questions raised. As the discussion period is nearing its end, we kindly ask Reviewer J7KP to consider updating the score. If there are any further questions or matters to discuss, please don't hesitate to let us know.

Official Review
Rating: 1

This paper proposes a message aggregation strategy based on Bayesian inference that adaptively selects multi-hop neighbor ranges and estimates the contributions of different nodes.

Strengths

The method proposed in this paper can adaptively apply to different base GNN models to capture long-range relationships.

Weaknesses

The performance improvement is quite modest compared to the computational cost.

Questions

no further concerns

Details of Ethics Concerns

potential dual submission

Comment

We thank Reviewer jLYF for the time and constructive comments.

Performance and computational cost:

For the results reported in Tables 1 and 4, we conducted a hyperparameter search over 5 different values of the number of layers L (neighborhood scope) for the baselines, specifically over the range [2,4,6,8,10] (as mentioned in Appendix F.2.1). Table 5 shows that the time required for tuning the layer depth across five different settings is comparable to or greater than that of our method when S=5. Therefore, the results in Tables 1 and 4 are comparisons under the same time budget for the overall training cycle.

Under this same time budget, applying our method to baselines results in comparable or significantly better performance.

Comment

While Submission 3862 also proposes a Bayesian inference GNN method, our approaches differ significantly in four respects.

Inferring the Model Architecture vs. Inferring a Subgraph

  • Our work proposes to infer GNN architectures for neighborhood scope determination. Specifically, we adopt a beta process to model the depth of the GNN backbone networks. Since GNN depths correspond to the hops of the neighborhood, this determines the most plausible neighborhood scope for message aggregation warranted by the data. We then mask the node features (i.e., neurons) at each layer by element-wise multiplication of the conjugate Bernoulli vectors $\mathbf{z}_l$ with the neuron outputs:

    $\mathbf{H}_l = \sigma(\hat{\mathbf{A}}\mathbf{H}_{l-1}\mathbf{W}_l) \otimes \mathbf{z}_l + \mathbf{H}_{l-1}, \quad \mathbf{z}_l \in \{0,1\}^{O}$, for node feature sparsification.

  • In contrast, Submission 3862 proposes to directly apply a beta-Bernoulli process to sample a subgraph by identifying important edges. In practice, they mask the graph edges by multiplying $\mathbf{z}_l$ with the adjacency matrix of the graph instead (a NumPy sketch contrasting the two updates follows this list):

    $\mathbf{H}_l = \sigma((\hat{\mathbf{A}} \otimes \mathbf{z}_l)\mathbf{H}_{l-1}\mathbf{W}_l) + \mathbf{H}_{l-1}, \quad \mathbf{z}_l \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$
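For illustration only, the two update rules above can be contrasted in a few lines of NumPy; this is a paraphrase of the equations rather than code from either submission, and the shapes ($N$ nodes, $O$ feature dimensions) follow the notation used in this thread.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Shapes: A_hat is (N, N), H is (N, O), W is (O, O).
def feature_masked_layer(A_hat, H, W, z_feat):
    """This work: mask neuron outputs with a length-O Bernoulli vector; all edges are kept."""
    return relu(A_hat @ H @ W) * z_feat + H      # z_feat (O,) is broadcast over the N nodes

def edge_masked_layer(A_hat, H, W, Z_edge):
    """Submission 3862 (as characterized above): mask adjacency entries before aggregating."""
    return relu((A_hat * Z_edge) @ H @ W) + H    # Z_edge has shape (N, N)
```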

Aggregation over the Entire Graph vs. Aggregation over a Subgraph

  • Since we propose to model GNN architectures instead of sampling subgraphs, every message aggregation is conducted through the whole graph with different contribution probabilities $\pi_l$.

  • Submission 3862, instead, only aggregates information within the inferred subgraphs by masking out edges.

Dropout vs. DropEdge

  • Our method can be viewed as a nonparametric Bayesian extension of dropout on GNNs, since we prune GNN architectures through a beta process over network depth and its conjugate Bernoulli process over network width.

  • In contrast, Paper 3862 can be seen as a generalization of the DropEdge method that applies a binary mask on graph edges for subgraph sampling.

Experiments

Our work focuses on two key aspects: 1) assessing the effectiveness of neighborhood scope determination via GNN architecture inference by evaluating our method's performance on both heterophilic and homophilic graphs, and 2) examining the consistency between our theoretical expressivity analysis and the properties of the learned node representations.

Submission 3862's empirical study does not evaluate heterophilic graphs or the expressivity of their method. Instead, their experiments focus on analyzing different kernel performance for edge sampling and its ablation study.

Phrasing similarities: We noticed some similar phrases in both works. We will revise our camera-ready version accordingly to address this problem.

Comment

Dear Reviewer jLYF,

As the discussion period is ending soon, we would appreciate your feedback on whether our responses have adequately addressed your concerns. If you have any additional comments or require further clarification, please let us know.

Thank you for your time and input!

Comment

We have made every effort to address the concerns and questions raised. As the discussion period is nearing its end, we kindly ask Reviewer jLYF to consider updating the score. If there are any further questions or matters to discuss, please don't hesitate to let us know.

Official Review
Rating: 8

This work considers the problem of determining the neighborhood scope in GNNs; in other words, determining the contribution of nodes at different distances from the target. To tackle the inefficiency of the standard train-then-validate approach, a technique called Bayesian Neighborhood Adaptation (BNA) is presented, which uses a beta process prior on the contributions of nodes at different distances, and a conjugate Bernoulli distribution over the feature dimensions used for message passing. The data likelihood is jointly optimized over the (mean-field) variational distribution and the model weights.

Strengths

  1. The idea of using a Bayesian model for inferring the neighborhood scope is novel.

  2. The empirical analysis is quite thorough, with an evaluation of model performance, over-smoothing analysis, performance on homophilic and heterophilic datasets, as well as small and large datasets, and computation time.

  3. The model performs well, or at least competitively, in comparison to other baselines, in various aspects like the ones mentioned above.

  4. The theoretical analysis of over-smoothing provides good support for BNA's expressivity.

Weaknesses

  1. The literature review in Section 2 looks weak because the description of the works is more about the proposed algorithms, but not their relevance to the current work. As in, the descriptions look like "ABC work does XYZ", but their strengths, weaknesses and/or relevance to BNA are not discussed. For example, near line 137, we have "half-hop adds slow nodes at each edge" - it is neither clear what slow nodes are (not that it is relevant for this work), nor is it obvious what context this piece of literature is supposed to add for the current work. In the absence of such an analysis, it is hard to contextualize the position of BNA in the current research landscape. An exception is Section 2.3, which does a good job describing how BNA differs from the rest of the Bayesian GNN frameworks, clarifying that these works target different problems. Another point is that despite the extensive literature review, in Section 5, no comparison is made with the Bayesian GNN methods, only with the architectural changes.

  2. A modelling intuition, including strengths and weaknesses, is not presented for the choice of the beta process prior. All I understand is that a low $\nu_j$ reduces the weight of nodes at $j$ hops and further away, so the model assumes that the message importance decays monotonically as distance between nodes increases.

  3. I am getting the impression that $\pi_l$ is the probability of receiving messages over each feature dimension from nodes up to $l$ hops away, and not exactly $l$ hops away, since the augmented adjacency matrix (with self-loops) is used for message passing at each step.

  4. The best and second best models should not be computed across all architectures. Rather, for each architecture, the baseline should be compared with its +BNA version to show how adding BNA affects model performance. As presented, the results in Table 1 are unconvincing; for example, ResGCN is competitive with ResGCN+BNA, especially considering the standard deviations.

  5. It takes 3-5x more time to train with BNA (Table 5). The authors claim that this is cheaper than performing CV over model depths, but this should be validated by comparing the methods under the same time budget. For example, set the budget to the time it takes for one BNA run, and perform CV over 3-5 model depth choices and then compare performance on test set.

  6. While the performance on benchmarks is competitive with the baselines, I would have been more interested in seeing conclusive evidence for neighborhood scope determination, perhaps using a synthetic dataset, where scope of the underlying ground-truth is known. All the other benefits are well and good, but I don't see compelling empirical evidence for automatic determination of neighborhood scope.

Questions

  1. What is the advantage of using a conjugate Bernoulli-process for the feature masks? Analytical tractability is anyway not available, and the choice of conjugate prior does not exactly help with the variational learning, does it? Can you also use other masks on the feature matrix, maybe like the one in DropMessage, where all elements in the feature matrix are masked iid?

  2. The symbol \otimes near line 204 should be defined; it is not obvious from the description. Am I correct to understand that this is not the usual Hadamard product, but rather a node-wise analogue?

  3. Kindly add the dimensions of the vectors and the matrices, as well as the spaces they belong to. For example, $\mathbf{H}_l \in \mathbb{R}^{N \times O}$.

  4. What is the need for adding residual connections in your model? I understand that they improve the expressivity and performance of GNNs in general, but what is its relevance for your theoretical analysis? I see that Res+BNA has better expressivity than Res alone, but I think you should also show how the expressivity of NoRes+BNA relates with that of NoRes. You do test your method with other message passing schemes, such as GAT and JK-Net, where residual connections are not added, so the residual connections aren't exactly needed in your model.

  5. The proof of Theorem 2 relies on feature sampling reducing message passing across the graph (line 854, $\mathbf{H} \leftarrow (\mathbf{A}\pi_L + 1)\mathbf{H}$), correct? In that case, does BNA exacerbate the over-squashing problem, making it unsuitable for long-range tasks?

  6. Can BNA result in over-fitting? I see that PubMed requires information aggregation over 6-hops, while I was under the impression that it is a short range task that is modelled better by shallow networks.

  7. Possible typos (correct me if I am wrong; in that case, I may have read the related parts wrongly):

  • I think you are using (\cite{...}) for citing inside parentheses, instead of the standard \citep{...}.
  • Near line 108, ... with a graph convolution layers.
  • Space after "beta process" in line 178.
  • At the end of Equation 6 and in Equation 8, the beta process samples should be in bold face.
  • At the end of line 239/240, it should be right-hand instead of left.
  • In Equation 9, $\leq \lambda^L d_{\mathcal{M}}(\mathbf{H}_0)$ instead of $\mathbf{H}_0$.
  • Line 488, AU-ROC instead of AUC-ROC.

Details of Ethics Concerns

Submission3862 has a suspicious resemblance with Submission3881. Both submissions present a variational inference algorithm for computing message aggregation weights in GNNs, modelling the contribution from different neighborhoods as a stochastic process.

  1. From the abstracts:
  • Submission3862 - We develop a novel variational inference algorithm to jointly approximate the posterior of the count of neighborhood hops and learn GNN weights while accounting for edge importance.
  • Submission3881 - We thus propose to model the GNNs’ message-passing behavior on a graph as a stochastic process by treating the number of hops as a beta process. This Bayesian framework allows us to infer the most plausible neighborhood scope for messsage aggregation simultaneously with the optimization of GNN parameters.
  2. Figure 1 in both manuscripts, describing the aggregation strategy, is clearly the same. Moreover, their captions have overlaps:
  • Submission3862 - The sticks located at the top represent random draws from the beta process, serving as layer-wise activation probabilities... The bottom shows the conjugate Bernoulli process.
  • Submission3881 - The sticks on top are random draws from a beta process, representing the probabilities over the number of hops. The bottom shows the conjugate Bernoulli process over node feature dimensions.
  3. The priors for the models (Equations 3 and 4 in Submission3862, and Equations 2 and 3 in Submission3881) are the same. So are the choices of the variational distributions (Equation 6 in Submission3862 and Equation 7 in Submission3881).

I can point out other similarities, but it is quite clear just by scrolling through the manuscripts that they are different versions of the same work, with Submission3881 being the newer version (eg. with Section 4 on over-smoothing analysis added).

PS. Gave a low score primarily because of the possible dual submission.

Comment

"The Conjugacy between Beta Process and Bernoulli Process and its Analytical Tractability": The intractability of the marginal likelihood is due to the nonlinearity of the neural network parameters (weights) and the layer number going to infinity.

"Can you also use other masks on the feature matrix, maybe like the one in DropMessage, where all elements in the feature matrix are masked iid?":

It is an interesting idea to apply the Bernoulli mask on the feature matrix. This can be part of our future work. We will cite DropMessage and discuss the possible extension.

"The Need for Adding Residual Connections in Our Model":

Residual connections feed the output from the last activated layer to the output layer by skipping all the inactivated layers in-between. In equation 4, without the residual term, the model output will collapse to zero when it goes through a deactivated layer. Hence, the residual connection is necessary for our model to realize a potentially infinite number of hidden layers.
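As a worked illustration of this point (using the layer update quoted elsewhere in this discussion, not a new result), a fully deactivated mask reduces the layer to the identity rather than to zero:

$$\mathbf{z}_l = \mathbf{0} \;\Rightarrow\; \mathbf{H}_l = \sigma(\hat{\mathbf{A}}\mathbf{H}_{l-1}\mathbf{W}_l) \otimes \mathbf{0} + \mathbf{H}_{l-1} = \mathbf{H}_{l-1},$$

whereas without the residual term the same deactivated layer would give $\mathbf{H}_l = \mathbf{0}$, and every subsequent layer, including the output layer, would receive a zero representation.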

"Theorem 2 about Feature Sampling in BNA and Over-smoothing. Does BNA Exacerbate the Over-squashing Problem, Making it Unsuitable for Long-range Tasks":

Over-squashing in deep GNNs occurs when adjacent node representations converge and become indistinguishably similar. Our theoretical analysis and empirical results demonstrate that node representations learned by GCN and ResGCN tend to collapse into a narrow angular region, reflecting a higher degree of collapse. In contrast, the representations of our method span a significantly wider angular region for the same number of layers (Theorem 2). Moreover, Figure 3 shows that applying our framework improves the performance of the baselines with deeper layers. These findings show that our approach effectively mitigates the over-squashing problem.

For long-range tasks, our method adapts the neighborhood scope accordingly by appropriately inferring contribution probabilities $\pi_l$, facilitating effective message aggregation from long-range neighbors. This makes our approach well-suited for such tasks.

"I see that PubMed requires information aggregation over 6-hops, while I was under the impression that it is a short range task that is modelled better by shallow networks":

Figure 3 shows that the majority of the contribution comes from the first 3 hops, and the neurons in the remaining 3 are sparsely activated in the aggregation process.

"Does BNA result in overfitting?

BNA mitigates the overfitting problem of the backbone GNNs. The overfitting analysis is in Appendix Section G, showing that applying our method makes them robust to overfitting.

Definition of \otimes: \otimes represents the element-wise multiplication as mentioned in line 208. It is the node-wise analogue of the Hadamard product.

Minor: As pointed out by the reviewer, we acknowledge the typos and the missing dimensionality information of the vectors and matrices. We will address this in the final version.

We will clarify the above points in the camera-ready version.

Comment

Inferring the Model Architecture vs. Inferring a Subgraph

While I do agree with the distinction in the models, I find the suggestion that Submission 3862 infers a subgraph slightly misleading. Rather something like 'infers the message-passing paths' seems more precise to me. Anyway, that is irrelevant to this submission.

Our method can be viewed as a nonparametric Bayesian extension of dropout on GNNs, since we prune GNN architectures through a beta process over network depth and its conjugate Bernoulli process over network width.

I really like this phrasing – it presents an intuition for the probabilistic model, which was otherwise rather abstract (as I understood it).

Submission 3862's empirical study does not evaluate heterophilic graphs or the expressivity of their method. Instead, their experiments focus on analyzing different kernel performance for edge sampling and its ablation study.

I do agree that the works are not identical, but they are sufficiently similar to raise suspicion. Particularly, the models are a lot more similar than they are dissimilar. Figure 1 is exactly the same (except for color choices). This work's Figure 3 and Submission 3862's Figure 2 are the same. I cannot believe the two submissions are not the same work, unless the ACs/PCs suggest so. Indeed, I did mention in my comment under Submission 3862 that it seems to be an older version of this work.

Comment

"The best and second best models" Comparison in Table 1: In each horizontal section in Table 1, we have compared the performance between the backbone GNNs and +BNA. The purpose of highlighting the overall best and second-best results for each benchmark dataset is to see which performs the best across all the configurations. As the reviewer suggested, we will discuss about individual backbone comparison in the camera-ready version.

"Comparison between +BNA and CV under same time budget": For the results reported in Table 1 and 4, we have adopted automatic search algorithms for hyper-parameter tuning for the layers L (neighborhood scope) for the baselines, specifically over the range [2,4,6,8,10] (as described in Appendix F.2.1). Table 5 shows that the time required for tuning the layer depth is comparable to or greater than that of our method when S=5. Therefore, the results in Tables 1 and 4 can be considered as comparisons under the same time budget.

Evidence for neighborhood scope determination: We are unaware of any real or synthetic datasets with a ground truth of neighborhood scopes. Therefore, we use a real-world biomedical dataset to provide evidence for neighborhood scope determination.

We conduct an experiment using Protein-Protein Interaction (PPI) data. The nodes in the graph are proteins and the edges represent interactions between the connected proteins. The task is to predict the presence of interactions (edges) for protein pairs in the test set (link prediction).

When training our method on the PPI dataset, we find that the inferred neighborhood scope is 2. The learned contribution probabilities ($\pi_l$) for the PPI dataset are: $\pi_1 = 0.77$, $\pi_2 = 0.25$, $\pi_3 = 0.06$, $\pi_4 = 0.01$, $\pi_5 = \ldots = \pi_{10} = 0.00$. The contribution probabilities drop below 0.1 starting from the third hop, indicating that effective message aggregation occurs only from nodes within the 2-hop range.

This aligns with the evidence presented in [1], which suggests that proteins two hops apart are likely to interact. Additionally, in the table below, we report the AUPRC and AUCROC metrics for link prediction, comparing ResGCN and our method.

| | AUPRC | AUCROC |
| --- | --- | --- |
| ResGCN | 0.894 ± 0.002 | 0.907 ± 0.006 |
| Ours+ResGCN | 0.907 ± 0.003 | 0.918 ± 0.002 |

[1] Kovács et al. Network-based prediction of protein interactions. Nat Commun 10, 1240 (2019).

We will add the above discussions and results in the camera-ready version.

Comment

Literature Review: We have discussed the related works' limitations mainly in the Introduction section. In Section 2.1, we review previously proposed message aggregation schemes due to their relevance to our research, since our primary contribution lies in effectively aggregating messages in a graph by identifying the most relevant neighbors. Following the reviewer's suggestion, we will elaborate on the points about DRew and Half-hop in the camera-ready version.

The reviewed works' "strengths, weaknesses and/or relevance to BNA are not discussed": We have discussed the weaknesses and relevance of the SOTA methods mainly in the Introduction section. Specifically, the methods discussed in Sections 2.1 and 2.2 have a common limitation: they rely on traditional two-stage approaches to search for the best setting. Since these empirical approaches involve training and validating GNN models for each candidate configuration of neighborhood scope, the search is a daunting task and tends to be biased. Besides, they also lack the capability to quantify the uncertainty of their predictions, and some are even prone to over-smoothing. We will include this discussion in the camera-ready version.

"No comparison is made with the Bayesian GNN methods": Among the discussed Bayesian GNNs, DropConnect (BBGDC) is the most relevant to ours in terms of applying regularizations to the GNN architectures in a Bayesian framework, while other methods perform inference over the properties of the input graph (features, structure and labels). We thus compare the performance of our method with BBGDC in both accuracy and calibration (ECE) and the results are in the table below:

| Accuracy | Cora | Citeseer | Chameleon | Cornell |
| --- | --- | --- | --- | --- |
| BBGDC | 86.80 ± 0.10 | 77.25 ± 0.32 | 55.71 ± 1.86 | 79.34 ± 5.50 |
| Ours+ResGCN | 86.83 ± 0.13 | 77.90 ± 0.37 | 64.25 ± 2.30 | 80.82 ± 3.60 |
| Ours+GCNII | 87.26 ± 0.25 | 78.36 ± 0.66 | 57.44 ± 3.35 | 87.87 ± 5.19 |
| Ours+ACM-GCN+ | 84.76 ± 0.76 | 74.73 ± 0.33 | 74.38 ± 1.69 | 93.61 ± 2.13 |

| ECE | Cora | Citeseer | Chameleon | Cornell |
| --- | --- | --- | --- | --- |
| BBGDC | 0.20 ± 0.03 | 0.09 ± 0.05 | 0.06 ± 0.01 | 0.44 ± 0.07 |
| Ours+ResGCN | 0.04 ± 0.02 | 0.12 ± 0.05 | 0.05 ± 0.01 | 0.48 ± 0.05 |
| Ours+ACM-GCN+ | 0.09 ± 0.03 | 0.13 ± 0.02 | 0.04 ± 0.01 | 0.10 ± 0.05 |

We will include the BBGDC baseline in Tables 1 & 2 in the camera-ready version.

"Intuition for using Beta Process": Beta process is a complete random measure over discrete numbers which allows each layer's activation probability to go from 0 to 1. Its flexibility, without the constraints such as the one imposed by Dirichlet process that all probabilities have to sum to 1, makes it an ideal and natural prior over. The decaying probabilities is due to a stick-breaking representation of a beta-process. The decaying trends can be smoothed by tuning the hyper-parameters.

The role of $\pi_l$: We aggregate messages from immediate neighbors at each layer, so the message from the $l^{th}$ hop reaches the target node only at the $l^{th}$ layer. $\pi_l$, which is sampled from a completely random measure, represents the probability of receiving aggregated messages up to the $l^{th}$ hop. $\pi_l$ is the activation probability of the $l^{th}$ hop's neighbors because messages from lower hops have already reached the target node in the previous aggregation steps (and they do not rely on $\pi_l$), but for the $l^{th}$ hop neighbors, $\pi_l$ determines whether their message is incorporated into the target node.
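The "first reachable at layer $l$" point can be checked on a toy graph; the snippet below is illustrative only and uses powers of the self-loop-augmented adjacency matrix as a proxy for how far information has propagated after $l$ aggregation steps.

```python
import numpy as np

# Path graph 0-1-2-3 with self-loops added; node 3 is exactly 3 hops from node 0.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

for l in range(1, 4):
    reach = np.linalg.matrix_power(A, l)   # (A^l)[i, j] > 0 iff node j can reach node i within l steps
    print(l, reach[0, 3] > 0)              # prints: 1 False, 2 False, 3 True
```

So information from a node exactly $l$ hops away first appears at the target in the $l$-th aggregation step, which is why $\pi_l$ (and the mask $\mathbf{z}_l$) gates precisely that hop's contribution.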

Comment

We thank Reviewer AsWV for your time and constructive comments on our work.

While Submission 3862 also proposes a Bayesian inference GNN method, our approaches differ significantly in four respects.

Inferring the Model Architecture vs. Inferring a Subgraph

  • Our work proposes to infer GNN architectures for neighborhood scope determination. Specifically, we adopt a beta process to model the depth of the GNN backbone networks. Since GNN depths correspond to the hops of the neighborhood, this determines the most plausible neighborhood scope for message aggregation warranted by the data. We then mask the node features (i.e., neurons) at each layer by element-wise multiplication of the conjugate Bernoulli vectors $\mathbf{z}_l$ with the neuron outputs:

    $\mathbf{H}_l = \sigma(\hat{\mathbf{A}}\mathbf{H}_{l-1}\mathbf{W}_l) \otimes \mathbf{z}_l + \mathbf{H}_{l-1}, \quad \mathbf{z}_l \in \{0,1\}^{O}$, for node feature sparsification.

  • In contrast, Submission 3862 proposes to directly apply a beta-Bernoulli process to sample a subgraph by identifying important edges. In practice, they mask the graph edges by multiplying $\mathbf{z}_l$ with the adjacency matrix of the graph instead:

    $\mathbf{H}_l = \sigma((\hat{\mathbf{A}} \otimes \mathbf{z}_l)\mathbf{H}_{l-1}\mathbf{W}_l) + \mathbf{H}_{l-1}, \quad \mathbf{z}_l \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$

Aggregation over the Entire Graph vs. Aggregation over a Subgraph

  • Since we propose to model GNN architectures instead of sampling subgraphs, every message aggregation is conducted through the whole graph with different contribution probabilities $\pi_l$.

  • Submission 3862, instead, only aggregates information within the inferred subgraphs by masking out edges.

Dropout vs. DropEdge

  • Our method can be viewed as a nonparametric Bayesian extension of dropout on GNNs, since we prune GNN architectures through a beta process over network depth and its conjugate Bernoulli process over network width.

  • In contrast, Paper 3862 can be seen as a generalization of the DropEdge method that applies a binary mask on graph edges for subgraph sampling.

Experiments

Our work focuses on two key aspects: 1) assessing the effectiveness of neighborhood scope determination via GNN architecture inference by evaluating our method's performance on both heterophilic and homophilic graphs, and 2) examining the consistency between our theoretical expressivity analysis and the properties of the learned node representations.

Submission 3862's empirical study does not evaluate heterophilic graphs or the expressivity of their method. Instead, their experiments focus on analyzing different kernel performance for edge sampling and its ablation study.

Phrasing similarities: We noticed some similar phrases in both works. We will revise our camera-ready version accordingly to address this problem.

Comment

We thank the reviewer for the further comments.

Particularly, the models are a lot more similar than they are dissimilar. Figure 1 is exactly the same (except for color choices).

Figure 1 is a high-level conceptual depiction of the problem we propose to solve. The similar appearance arises from the fact that both papers propose different approaches with a similar component (i.e., the stochastic process prior) for the problem. We will revise the figure to fix the problem in our camera-ready version.

This work's Figure 3 and Submission 3862's Figure 2 are the same.

The similar visual styles of the two figures stem from the use of the same plotting strategy for visualization. However, the reported results in the figures are different, produced by two distinct methods applied to two separate graph datasets.

Comment

Figure 1 is a high-level conceptual depiction of the problem we propose to solve.

It is not believable that both works chose the exact same problem to solve, proposed very similar solutions, and presented their approach using the same figures.

The similar appearance arises from the fact that both papers propose different approaches with a similar component (i.e., the stochastic process prior) for the problem. We will revise the figure to fix the problem in our camera-ready version.

Phrasing similarities: We noticed some similar phrases in both works. We will revise our camera-ready version accordingly to address this problem.

The purpose of the camera-ready version is not to resolve ethical concerns about dual submission by changing the presentation.

However, the reported results in the figures are different, produced by two distinct methods applied to two separate graph datasets.

The plotting styles are exactly the same, as are the choice of epochs for which the plots are shown – 10, 50, 100, 300; this sequence is not a standard choice, which is why it is hard for me to consider this as a coincidence.

Methods discussed in Sections 2.1 and 2.2 have a common limitation: they rely on traditional two-stage approaches to search for the best setting. Since these empirical approaches involve training and validating GNN models for each candidate configuration of neighborhood scope, the search is a daunting task and tends to be biased.

Fair enough. I think it can be more concise – you don't need to discuss these methods in such detail just to highlight that they use cross-validation for determining model depth.

We thus compare the performance of our method with BBGDC in both accuracy and calibration (ECE) and the results are in the table below.

Thank you! This resolves my concern.

for the $l^{th}$ hop neighbors, $\pi_l$ determines whether their message is incorporated into the target node.

I fail to see that since in Section 2, the manuscript says that "The adjacency matrix with added self-connections is denoted by $\mathbf{A}$ and $\hat{\mathbf{A}}$ is it's normalized form." Wouldn't that imply that masking the edges in the $l$-th step masks information from nodes reachable in $l$ hops, which includes all nodes up to $l$ hops away?

The contribution probabilities drop below 0.1 starting from the third hop, indicating that effective message aggregation occurs only from nodes within the 2-hop range. This aligns with the evidence presented in [1], which suggests that proteins two hops apart are likely to interact.

This is good supporting evidence! Although I was referring to designing a small synthetic dataset, this should suffice in addressing my concern.

The intractability of the marginal likelihood is due to the nonlinearity of the neural network parameters (weights) and the layer number going to infinity.

That does not address my concern. My concern was that emphasizing the choice of a conjugate prior seems misleading especially since it has nothing to do with analytical tractability.

In equation 4, without the residual term, the model output will collapse to zero when it goes through a deactivated layer. Hence, the residual connection is necessary for our model to realize a potentially infinite number of hidden layers.

Mm, I am not totally convinced, but I guess I see the point. I would appreciate some further clarification.

Over-squashing in deep GNNs occurs when adjacent node representations converge and become indistinguishably similar. Our theoretical analysis and empirical results demonstrate that node representations learned by GCN and ResGCN tend to collapse into a narrow angular region, reflecting a higher degree of collapse. In contrast, the representations of our method span a significantly wider angular region for the same number of layers (Theorem 2).

This is not over-squashing. I think you are thinking of over-smoothing.

For long-range tasks, our method adapts the neighborhood scope accordingly by appropriately inferring contribution probabilities $\pi_l$, facilitating effective message aggregation from long-range neighbors.

That's what the intuition suggests, but not what has been empirically demonstrated. I'll repeat my concern, which has largely gone unaddressed: The proof of Theorem 2 relies on feature sampling reducing message passing across the graph (line 854, $\mathbf{H} \leftarrow (\mathbf{A}\pi_L + 1)\mathbf{H}$), correct? In that case, BNA is expected to exacerbate the over-squashing problem, making it unsuitable for long-range tasks?

Figure 3 shows that the majority of the contribution comes from the first 3 hops, and the neurons in the remaining 3 are sparsely activated in the aggregation process.

How do you define "sparsely activated"? $\pi_4$ and $\pi_5$ do not seem to be below 0.1, a criterion you used in one of the responses I address above.

I have updated my rating of the work, but the ethical concerns persist.

Comment

We thank Reviewer AsWV for further comments and for increasing the rating for our work.

"It is not believable that both works chose the exact same problem to solve, proposed very similar solutions, and presented their approach using the same figures."

Although the plotting schemes of Figure 1 in both papers are similar, there are actually important and subtle differences. In our figure, we plot the nodes with varying color intensity to highlight that our approach regularizes node features through GNN architecture inference. In contrast, Submission 3862 highlights the edge colors to present their edge sampling method. These correspond to the different approaches addressing similar problems.

"The plotting styles are exactly the same, as are the choice of epochs for which the plots are shown – 10, 50, 100, 300; this sequence is not a standard choice, which is why it is hard for me to consider this as a coincidence."

To visualize the evolution of the neighborhood scopes in Figure 3, we selected epochs at which the changes in contribution probability are more distinctive. Submission 3862's choice of epochs is not exactly the same as ours: theirs are 10, 50, 100, and 500. The first three epochs are the same probably for the same reason stated above, or simply by coincidence.

"I fail to see that since in Section 2, the manuscript says that 'The adjacency matrix with added self-connections is denoted by and A\mathbf{A} and A^\mathbf{\hat{A}} is it's normalized form.' Wouldn't that imply that masking the edges in the ll-th step masks information from nodes reachable in ll-hops, which includes all nodes up to ll-hops away?"

The reviewer is correct in that masking features at the $l$-th step (using a mask sampled from $\text{Bernoulli}(\pi_l)$) masks the aggregated information from nodes up to $l$ hops away. The contribution probability $\pi_l$ determines whether these aggregated messages reach the target node. However, the node information is already aggregated up to $l-1$ hops even with a value of $\pi_l = 0$. This is because information from lower-hop neighbors already reached the target in earlier steps. In contrast, $\pi_l$ plays a critical role for $l$-th hop neighbors since their information cannot reach the target in fewer than $l$ steps. For these neighbors, $\pi_l > 0$ is essential to allow their messages to reach the target node.

"My concern was that emphasizing the choice of a conjugate prior seems misleading especially since it has nothing to do with analytical tractability."

Our mention of conjugacy was to highlight the relationship between the two processes (beta and Bernoulli). Both processes together construct the prior of the GNN architecture. However, the intractable marginal likelihood is due to the nonlinearity of the GNN weights in the likelihood. In sum, we have a convenient GNN architecture prior due to the conjugacy, but the nonlinear GNN likelihood requires approximation. We will clarify this in the camera-ready version.

"Mm, I am not totally convinced, but I guess I see the point. I would appreciate some further clarification."

In practice, our model activates a finite number of layers within a truncation. The learned node representations from the output of the last activated hidden layer need to be fed to the output layer. Residual connections address this by propagating the output from the last activated layer directly to the output layer, bypassing the remaining deactivated hidden layers in the truncation.

"That's what the intuition suggests, but not what has been empirically demonstrated. Is BNA expected to exacerbate the over-squashing problem, making it unsuitable for long-range tasks?"

In the linked figure, we visualize the contribution probabilities ($\pi_l$) learned by our method for a large dataset, ogb-arxiv. Compared to Figure 3, the inferred neighborhood scope is significantly larger. This empirical finding suggests that our method can capture long-range dependencies by inferring the contribution probabilities accordingly.

"How do you define 'sparsely activated'? π4\pi_4 and π5\pi_5 do not seem to be below 0.1, criteria you used in one of the responses I address above."

we use "sparsely activated" to generally describe the hidden GNN layers with low contribution probabilities. We do not specifically define a specific threshold for sparse activation, a lower contribution probability leads to fewer node features passing forward in the message aggregation process. In Figure 3, although the inferred neighborhood scope reachs 6 hops, the activations for the last 3 hops are relatively low—particularly for the 4th and 5th hops. This indicates that while these hops contribute to the aggregation, their influence is minimal.

We will clarify the points in the camera-ready version.

Comment

My concerns are pretty much resolved, and I'll update my rating to an 8, but the ethical concern of duplicate submission remains.

Comment

We are grateful to Reviewer AsWV for further raising the score and providing us valuable suggestions. The discussions are in-depth and helpful. We will include them in the camera-ready version.

AC Meta-Review

(a) Summary of Scientific Claims and Findings

This paper addresses the problem of determining the neighborhood scope in Graph Neural Networks (GNNs). The authors propose a Bayesian framework that employs a beta prior to model the importance of neighbor hops and incorporates feature dimension sampling for message passing. Experimental results demonstrate that the proposed algorithm outperforms baseline methods across several datasets.

(b) Strengths of the Paper

The paper introduces a novel Bayesian approach for adaptively determining neighborhood scopes in GNNs, a critical aspect of GNN design. The method is supported by experiments across homophilic and heterophilic datasets, showcasing its effectiveness.

(c) Weaknesses of the Paper and Missing Elements

Key weaknesses identified by reviewers include:

  1. Insufficient discussion in the literature review, particularly on the relevance and context of related works.
  2. A need for more thorough evaluation approaches and discussion on computational overhead.

(d) Decision and Rationale

While the paper presents an effective Bayesian approach for GNNs, several shortcomings remain:

  1. The literature review and discussions on related works require substantial improvement.
  2. More transparent evaluation of results and computational efficiency is needed.

Additional Comments on Reviewer Discussion

The authors had extensive discussions with Reviewer AsWV, who was eventually convinced about the technical aspects and experimental results.

Final Decision

Reject