Pure Message Passing Can Estimate Common Neighbor for Link Prediction
We demonstrate that node-level message passing can effectively capture link-level structural features, such as Common Neighbor, for link prediction.
Abstract
Reviews and Discussion
This paper studies the problem of encoding pairwise structural features to enhance link prediction (LP). The authors note that previous literature has shown that standard MPNNs cannot encode the pairwise information needed for LP. They argue that, with careful design, MPNNs can in fact estimate pairwise structural features (e.g., common neighbors). This is done by equipping each node with a quasi-orthogonal vector, which is used to estimate the counts of various structural features between two nodes. The authors then benchmark their method(s) on a number of datasets and compare against prominent baselines. They further highlight the efficiency of their method against other methods.
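The quasi-orthogonal mechanism described in the summary can be illustrated with a small numpy sketch (an illustrative toy, not the paper's implementation): each node draws a random Gaussian signature, one round of propagation sums neighbor signatures, and the inner product of two propagated signatures concentrates around the common-neighbor count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph as an adjacency matrix (illustrative data).
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)
n = A.shape[0]

F = 4096  # signature dimension; larger F -> lower estimation variance
# Quasi-orthogonal signatures: E[v_i . v_j] = 1 if i == j else 0.
V = rng.normal(0.0, 1.0 / np.sqrt(F), size=(n, F))

# One round of message passing: h_u = sum of neighbors' signatures.
H = A @ V

# h_u . h_v = sum over k in N(u), l in N(v) of v_k . v_l; the cross terms
# vanish in expectation, leaving ~|N(u) & N(v)| (the common-neighbor count).
cn_est = H @ H.T
cn_true = A @ A  # exact common-neighbor (length-2 walk) counts
```

With a few thousand dimensions the estimate lands very close to the exact count; the estimation variance shrinks as the signature dimension F grows.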
Strengths
- The paper is well-written and easy to understand. I think the motivation and the design of the method are explained quite clearly.
- The problem itself is an important one. Recently, many methods have been proposed for estimating the pairwise structural information between nodes. However, there is a trade-off between expressivity and efficiency (e.g., SEAL is very expressive but slow, while BUDDY or NCN is less expressive but much faster). Balancing these concerns is crucial for enhancing current LP methods, since those that are expressive tend to be computationally inefficient and vice versa.
- Overall, the performance is good, even more so when one considers the efficiency of MPLP+.
Weaknesses
Weakness 1:
My main concern is that I'm unsure why the performance of this method is this good. Let me explain. MPLP estimates the structural counts via the use of quasi-orthogonal vectors. The predictor (i.e., score function) then takes as input (a) the elementwise product of both node representations (output by an MPNN) and (b) the estimated structural counts, which are of the form #(p, q), where #(1, 1) corresponds to CNs. I'm just restating the Eq. at the end of 4.2.
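The predictor shape described here can be sketched as follows (a hypothetical head: the `link_score` helper and its MLP weights are illustrative, not the paper's exact architecture):

```python
import numpy as np

def link_score(h_u, h_v, counts, W1, b1, W2, b2):
    """Score a candidate link: concatenate the elementwise product of the
    two node embeddings with the estimated structural counts #(p, q),
    then apply a small MLP (illustrative sketch)."""
    z = np.concatenate([h_u * h_v, counts])
    hidden = np.maximum(0.0, W1 @ z + b1)  # ReLU layer
    return float(W2 @ hidden + b2)

rng = np.random.default_rng(0)
d, c, hdim = 16, 5, 32                         # embed dim, #counts, hidden dim
h_u, h_v = rng.normal(size=d), rng.normal(size=d)
counts = np.array([3.0, 1.0, 4.0, 0.0, 2.0])   # e.g. #(1,1), #(1,2), ...
W1, b1 = rng.normal(size=(hdim, d + c)), np.zeros(hdim)
W2, b2 = rng.normal(size=hdim), 0.0
s = link_score(h_u, h_v, counts, W1, b1, W2, b2)
```

The key point for the discussion below is that both BUDDY and MPLP feed the same two ingredients (elementwise product plus structural counts) into a head of this shape; only the count estimator differs.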
However, how is this appreciably different from BUDDY [16]? In their score function, they also include the elementwise product of both nodes in the target link. They then concatenate the same kind of structural counts considered by MPLP. They differ only in their method of estimating those counts: BUDDY (and ELPH) use subgraph sketching instead of quasi-orthogonal vectors. As such, I can't seem to understand why MPLP/MPLP+ do so much better than BUDDY (I know you report ELPH in your tables, but it only does marginally better than BUDDY). If the performance difference were small I wouldn't be as confused, but the gap on some datasets is enormous. For example, BUDDY has a 49.85 Hits@100 on ogbl-PPA while MPLP+ has 65.24. A smaller, but still noticeable, gap is present on ogbl-Citation2, with BUDDY at 87.56 and MPLP+ at 90.72. My point is, the difference in performance is non-trivial. The authors clearly report that their method can outperform BUDDY.
But as I noted, there doesn't seem to be any specific reason why this occurs, as they are estimating the same information. Is the approximation used by the authors better? If so, then why? The authors seem to argue this in Appendix F.1, where they compare the MSE of the label counts by ELPH/BUDDY against theirs. However, I have a few issues with the results: (a) The results aren't very clear to me. For example, is it on all test samples? Positive and negative? (b) Also, the authors don't specify the # of hops used when estimating the counts for ELPH/BUDDY. From personal experience using their code, I find setting hops=3 can help improve the counts while still being fairly efficient. (c) Most importantly, the authors only compare to MPLP and not MPLP+. To me, this is a very important point, as MPLP+ is itself a compromise between efficiency and estimation quality. As shown in Table 2, MPLP is impractical on larger datasets as it goes OOM. So why not compare the estimation quality of MPLP+, the model which still performs well but can scale to larger graphs? In my opinion, it is very important that the authors include MPLP+ in the experiments in Appendix F.1.
I apologize for the wall of text, but the magnitude of the increase over BUDDY seems very strange to me. To summarize, I suggest the authors: (a) give more explanation as to why the large discrepancy in performance may exist and (b) include MPLP+ in Figures 8 and 9. I think if the authors can show that MPLP+ still does a much better job at estimating the structural features, that may be the answer. Otherwise, I very strongly recommend the authors try to identify any differences between the two methods that could explain it.
Weakness 2:
I find it odd that MPLP+ is more efficient than BUDDY. MPLP+ and BUDDY both use L layers of message passing. Furthermore, as noted before, the scoring function is similar. The only difference is in estimating the structural counts. However, BUDDY pre-computes and caches the subgraph counts, so it is actually quite efficient, as they don't need to be recomputed. This is as opposed to MPLP+, which does have to recompute them each time. Because of this, I don't really see how BUDDY can be slower except for various differences in implementation; from the theoretical complexity, I shouldn't expect BUDDY to be slower. Again, I recommend the authors try to explain why this might be the case.
Weakness 3:
I should have noticed this earlier, but the similarity between MPLP and BUDDY hurts the novelty of the paper. As far as I can see, the authors are proposing a new method that can potentially estimate the structural counts a little better and faster. While noteworthy, this is a small contribution. However, I personally only find this to be a minor weakness of the paper.
Weakness 4:
Some benchmark datasets are missing, including Cora, Citeseer, and Pubmed (see [33]); I recommend including them. Also, for HeaRT, I recommend reporting the results on all datasets and not just ogbl-ppa/collab/citation2. To be clear, this is a very minor weakness.
Questions
- Can you reproduce Figures 8 and 9 with MPLP+ against BUDDY/ELPH?
- Can you better explain why MPLP+ is more efficient than BUDDY?
- Are there any other contributing factors that may lead to MPLP/MPLP+ doing better than a similar method like BUDDY?
Other:
- I recommend including the results of BUDDY in the main tables. I know you include ELPH and it typically does better, but it gets confusing when the results are for ELPH while the efficiency numbers are for BUDDY. I think it'll just make the experiments section easier to read.
Limitations
yes
We first want to thank the reviewer for the extensive feedback, which is extremely valuable to us given the heavy review load this year. We also greatly appreciate the reviewer's positive feedback on the importance of the problem we address, especially the acknowledgment of the importance of the expressivity-efficiency trade-off for link prediction tasks. We will address the following points in the rebuttal:
W1: Comparison with ELPH/BUDDY
Due to the space limit, we post this rebuttal as the "global" rebuttal on the very top of the page. Please refer to the global rebuttal for the detailed response.
W1 (cont.) Issues about Appendix F.1
We apologize for any confusion in Appendix F.1. There, we find that ELPH's MinHash/HyperLogLog can introduce high estimation variance compared to MPLP. We address the reviewer's concerns as follows:
For example, is it on all test samples? Positive and negative?
We report the estimation quality on a set of both positive and negative edges. For positive edges, we use all the training edges. For negative edges, we randomly sample the same number of negative edges as the positive edges. We keep the edges consistent across MPLP and ELPH to ensure a fair comparison.
In MPLP, we use the shortcut removal technique to avoid the distribution shift problem. For a fairer comparison, we also implement the shortcut removal technique for ELPH in Appendix F.1. We will clarify this in the revised version.
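A toy numpy sketch of why shortcut removal matters (illustrative graph and exact counts, not the paper's code): counts involving walks of length > 2 change once the target edge itself can be used as a shortcut, so leaving the edge in during training shifts the feature distribution for positive edges relative to test edges.

```python
import numpy as np

# Toy graph with edges 0-1, 0-2, 1-2, 2-3 (illustrative only).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def walk_count(adj, length, u, v):
    """Exact number of walks of the given length between u and v."""
    return int(np.linalg.matrix_power(adj, length)[u, v])

u, v = 0, 1
A_no_shortcut = A.copy()
A_no_shortcut[u, v] = A_no_shortcut[v, u] = 0.0  # shortcut removal

# Length-2 walks (common neighbors) never traverse the edge (u, v) itself,
# so they are unaffected by shortcut removal...
cn_with = walk_count(A, 2, u, v)
cn_without = walk_count(A_no_shortcut, 2, u, v)
# ...but length-3 walks can route through the target edge, so they differ.
w3_with = walk_count(A, 3, u, v)
w3_without = walk_count(A_no_shortcut, 3, u, v)
```

On this toy graph the common-neighbor count is 1 either way, while the length-3 walk count drops from 3 to 0 once the shortcut edge is removed.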
the authors don't specify the # of hops used when estimating the counts for ELPH/BUDDY.
For ELPH, we use the node signatures of 2-hop neighbors to obtain estimates for #(1,1), #(1,0), #(1,2), #(2,2) and #(2,0). Since the estimation variance becomes higher for both MPLP and ELPH at #(2,2) and #(2,0), we did not report the variance beyond 2-hop neighbors. It is very interesting to learn from the reviewer that setting hops=3 can help improve the counts, as we had thought the variance would be too high to be useful. In our experiments, the performance gain from 3-hop neighbors for MPLP(+) is marginal.
Most importantly, the authors only compare to MPLP and not MPLP+
MPLP+ differs from MPLP/ELPH/BUDDY in that it estimates the number of walks between two nodes rather than the number of nodes. Therefore, to have an apples-to-apples comparison, we can only compare MPLP to ELPH in Appendix F.1 when evaluating the quality of estimating the number of nodes.
W2
We apologize for any confusion about the efficiency of MPLP+. MPLP+ is similar to BUDDY in that it can also cache the node signatures at inference time. In Appendix D.3, we discuss the details of benchmarking the inference time for both BUDDY and MPLP+.
While the two methods should in theory have similar efficiency, their implementations cause the discrepancy. For MPLP(+), we utilize pytorch_sparse to implement the message passing, which is built on [3]. BUDDY utilizes pytorch_scatter to implement its message passing, which is both slower and more memory-hungry compared to the sparse-dense matrix multiplication (SpMM) operation in MPLP(+).
In fact, we also submitted a Pull Request to BUDDY's GitHub repo to change its message-passing implementation to SpMM. Even though BUDDY with SpMM shows comparable efficiency to MPLP+, it still exhibits higher estimation variance (Appendix F.1) and lower overall performance (Table 1/2) compared to MPLP(+).
[3] Design Principles for Sparse Matrix Multiplication on the GPU.
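The implementation difference can be illustrated with a numpy/scipy analogue (not the actual pytorch_scatter/pytorch_sparse code): both styles compute the same neighborhood aggregation, but the SpMM formulation does it as a single sparse-dense matrix multiplication instead of per-edge accumulation.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy edge list (illustrative): message flows from src[i] to dst[i].
src = np.array([0, 0, 1, 2, 2, 3])
dst = np.array([1, 2, 2, 0, 3, 1])
n, F = 4, 8
X = np.arange(n * F, dtype=float).reshape(n, F)  # node features

# Scatter-style aggregation: gather source features, accumulate at targets.
out_scatter = np.zeros((n, F))
np.add.at(out_scatter, dst, X[src])  # numpy analogue of scatter_add

# SpMM-style aggregation: the same operation as one sparse-dense matmul.
adj = csr_matrix((np.ones(len(src)), (dst, src)), shape=(n, n))
out_spmm = adj @ X
```

The two outputs are identical; the practical difference is that SpMM kernels on GPUs tend to be faster and less memory-hungry than gather-then-scatter pipelines.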
W3
Thank you for pointing this out. We discuss in the paper that our link prediction framework shares a similar spirit with ELPH/BUDDY in learning structural representations. However, beyond ELPH/BUDDY, MPLP(+) proposes several distinct components (Shortcut Removal, Norm Scaling, One-hot hubs, walk-based counting) that significantly boost model performance with a totally different but simple node/walk-estimation mechanism (quasi-orthogonal vectors).
W4
We appreciate the reviewer's constructive suggestion. We do not include Cora/Citeseer/Pubmed in our main experiments because recent studies [4,5], as well as our own experiments, find that these three datasets overwhelmingly rely on node attributes for link prediction. An MLP without any graph structural information can already achieve performance comparable to GNNs on these three datasets, which makes them insufficiently sensitive for distinguishing between link prediction methods.
For HeaRT setting, we will report the results on all datasets in a revised version.
[4] Linkless Link Prediction via Relational Distillation
[5] Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking
Q1
Please refer to rebuttal for W1 (cont.) Issues about Appendix F.1.
Q2
Please refer to rebuttal for W2.
Q3
Please refer to rebuttal for W1.
Q4
We thank the reviewer for the suggestion. We will include BUDDY's results in the main tables to make them clearer.
Thanks again for your comments and diligence in reviewing our work. We hope our responses have addressed your concerns. If so, we hope that you will consider raising your score. If there are any notable points unaddressed, please let us know and we will be happy to respond.
Proof Sketch for MPLP's Expressiveness over NCN:
We present a proof sketch to show that MPLP is more expressive than NCN, specifically because MPLP enables weighted node counting via the Norm Scaling technique (Sec 4.1). Consider an $m \times m$ rook's graph and an $n \times n$ rook's graph, where $m \neq n$. These two graphs are strongly regular graphs, but with different node degrees. Also, any two non-adjacent nodes in a rook's graph have exactly two common neighbors. We investigate whether NCN/MPLP can encode a non-adjacent node pair in the $m \times m$ rook's graph differently from a non-adjacent node pair in the $n \times n$ rook's graph.
For NCN, (1) the GNN encoder (GCN/SAGE) will generate the same representation for all nodes because the graphs are regular; (2) all non-adjacent node pairs will have two common neighbors, leading to the same NCN score.
For MPLP, because $m \neq n$, the node degrees differ, so the Norm Scaling technique will assign different norms to the quasi-orthogonal vectors of nodes in the $m \times m$ and $n \times n$ rook's graphs. Therefore, the non-adjacent node pairs in the $m \times m$ and $n \times n$ rook's graphs will have different MPLP representations due to the weighted node counting. This shows that MPLP is more expressive than NCN.
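The weighted-counting idea behind Norm Scaling can be illustrated numerically (an illustrative numpy sketch with assumed RA-style 1/degree weights, not the paper's exact scaling): shrinking each node's quasi-orthogonal signature by 1/sqrt(deg) makes the inner product of propagated signatures estimate a degree-weighted count (the Resource Allocation index) instead of a raw common-neighbor count.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy graph (illustrative data).
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)
n = A.shape[0]
deg = A.sum(axis=1)

F = 8192
V = rng.normal(0.0, 1.0 / np.sqrt(F), size=(n, F))
# Norm Scaling: give node k's signature squared norm ~ 1/deg_k, so a common
# neighbor k contributes ~1/deg_k to the inner product instead of ~1.
V_scaled = V / np.sqrt(deg)[:, None]

H = A @ V_scaled
ra_est = H @ H.T                      # estimates sum over CNs of 1/deg
ra_true = A @ np.diag(1.0 / deg) @ A  # exact Resource Allocation index
```

An unweighted-counting estimator cannot distinguish node pairs that share the same number of common neighbors but whose common neighbors have different degrees; the scaled estimator can, which is the crux of the expressiveness argument above.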
We appreciate the reviewer's feedback. We address the following points in your feedback:
Empirical advantages of MPLP+'s Norm Scaling and Shortcut Removal (Q2)
We thank the reviewer for the insightful comments. In our previous response, we discussed the theoretical advantages of Norm Scaling and Shortcut Removal. The ablation study in Tables 7/8 shows that these two techniques can effectively boost MPLP's performance, depending on the dataset. We also show in Appendix F.1 that MPLP(+) has better estimation accuracy compared to ELPH/BUDDY.
Here, we further investigate the empirical advantages of Norm Scaling and Shortcut Removal in MPLP+, especially on the PPA and Citation2 datasets. We keep all other hyperparameters the same as in the main experiments, and only remove Norm Scaling for PPA and Shortcut Removal for Citation2. The results are shown in the table below:
| | PPA (Hits@100) | Citation2 (MRR) |
|---|---|---|
| BUDDY | 49.85 | 87.56 |
| MPLP+ | 65.24 | 90.72 |
| MPLP+ w/o Shortcut Removal | 64.53 | 87.76 |
| MPLP+ w/o Norm Scaling | 53.94 | 90.25 |
As shown in the table, the performance of MPLP+ drops significantly when removing Norm Scaling on PPA. This indicates that Norm Scaling is crucial for capturing the nuanced structural information in the PPA dataset and is the driving factor behind MPLP+'s performance there. On the other hand, the performance of MPLP+ decreases when removing Shortcut Removal on Citation2, which suggests that Shortcut Removal is essential for avoiding the distribution shift problem on that dataset. Interestingly, the degraded performance of MPLP+ is close to BUDDY's when removing Norm Scaling on PPA and Shortcut Removal on Citation2. This empirically demonstrates that Norm Scaling and Shortcut Removal are effective techniques for improving the performance of MPLP+ over BUDDY.
Since NCN uses the common neighbors' node representations as the link representation, it is not directly comparable to MPLP(+). One therefore cannot conclude that Norm Scaling and Shortcut Removal would be ineffective for MPLP(+) just because NCN cannot exploit node degree information or the shortcut removal technique. We will include this empirical analysis in the revised version.
In fact, MPLP+ with Norm Scaling can be more expressive than NCN. We include a proof sketch at the end of this response.
The more expressive MPLP can face a more severe distribution shift problem than NCN (Sec 5.3.2 in [2]). Therefore, the Shortcut Removal technique is more crucial for MPLP than for NCN in avoiding the distribution shift problem.
[2] FakeEdge: Alleviate Dataset Shift in Link Prediction
Other concerns
We greatly appreciate the reviewer's suggestions. We will revise the paper to more explicitly describe the walk-based counting mechanism of MPLP+ and to add references for the techniques MPLP(+) builds on. We will also include a discussion when comparing the efficiency results of BUDDY and MPLP.
Q1: Estimation quality of MPLP+
We thank the reviewer for the suggestion. Similar to the experiments in Appendix F.1, we include the experimental results for the walk estimation quality (MSE) of MPLP+ on Collab below:
| Signature Dimension F | #(1,1) | #(1,2) | #(1,0) | #(2,2) |
|---|---|---|---|---|
| 1000 | 2.0 | 4.4*10^3 | 3.7*10^3 | 1.0*10^11 |
| 1500 | 0.49 | 1.0*10^3 | 2.1*10^3 | 1.3*10^11 |
| 2000 | 0.55 | 5.2*10^2 | 1.5*10^3 | 1.1*10^11 |
| 2500 | 0.23 | 4.3*10^2 | 1.9*10^3 | 1.2*10^11 |
Since there is no other baseline for counting walks on graphs, we can only examine the results on their own. As shown in the table, the walk estimation quality of MPLP+ becomes worse as the number of hops increases. This is because, with longer hops, the number of walks between two nodes grows exponentially, leading to higher estimation variance. By #(2,2), the estimation variance becomes too high to be useful. This is why we observe that the performance gain from neighbors beyond 2 hops is marginal for MPLP(+). In the meantime, increasing the signature dimension helps reduce the estimation variance. We will include this analysis in the revised version.
We hope that we have addressed your concerns. Please let us know if there are any other concerns. We will be happy to respond.
As shown in the table, the performance of MPLP+ drops significantly when removing Norm Scaling on PPA
I appreciate the experiments, but my argument isn't that these strategies can't help; rather, (a) the impact is limited and (b) these aren't techniques introduced by the authors of this paper (i.e., not novel contributions). Both points hold, as shortcut removal has a marginal impact on ogbl-ppa and norm scaling has the same on ogbl-citation2. I'll admit I'm also surprised that shortcut removal has such a big effect for MPLP+ on citation2. As shown in the NCN paper (linked in my last response), shortcut removal actually hurts the performance of NCN on ogbl-citation2.
we include the experimental results for the walk estimation quality of MPLP+ on Collab below
Thanks a lot. I agree there's no way to really judge these w/o a baseline (not your fault). However, at first glance the MSE seems high. Compared to the node estimations shown in Fig 8, it's a lot higher, though I guess part of that is because there are just many more walks. Regardless, this does suggest the estimation isn't great.
I really do appreciate all the discussion and additional experiments but my main concerns haven't been addressed. So I will keep my score.
We really appreciate the reviewer's active engagement and insightful questions during the rebuttal-discussion process. Since the reviewer still has a major concern about the performance advantage of MPLP+ over BUDDY, we want to summarize the key points discussed so far:
Summary of why MPLP+ can outperform BUDDY on PPA and Citation2:
- From the theoretical perspective: MPLP+ has two key components, Norm Scaling and Shortcut Removal, which can boost the performance of MPLP+ over BUDDY. Norm Scaling improves the expressiveness of MPLP+ by enabling weighted counting, while Shortcut Removal avoids the distribution shift problem during training.
- From the empirical ablation study: We conduct an additional ablation study to show that Norm Scaling and Shortcut Removal are crucial for the performance improvement of MPLP+ on PPA and Citation2. The empirical results demonstrate that Norm Scaling and Shortcut Removal are effective techniques for improving the performance of MPLP+ over BUDDY.
- From the experimental results: In Table 2, we show that MPLP+, with distinct components like Norm Scaling and Shortcut Removal, achieves better performance than BUDDY on both PPA and Citation2. In fact, MPLP+ not only outperforms BUDDY but also achieves state-of-the-art performance on PPA and Citation2 on the OGBL leaderboard.
We believe that the theoretical and empirical evidence provided above can explain how and why MPLP+ can outperform BUDDY on PPA and Citation2. We are eager to know what remains unclear or unsatisfactory to the reviewer and will be happy to provide further clarification. This can greatly help us improve the clarity of the paper and the presentation of the results.
I really do appreciate the author's continued indulgence of my comments and I don't mean to come off as overly critical. However, I spent a day thinking about it and my concerns still stand.
Norm Scaling + Shortcut Removal
As I mentioned, I'm not discounting that they can help. But:
- I don't see why I should view these as the authors' contributions, as the authors themselves state in the paper that they are taken from prior work. As such, they are not core contributions of the authors.
- Furthermore, I'm quite confused by the norm scaling ablation study. The authors show that performance massively decreases when ablating norm scaling on ogbl-ppa. However, earlier, they also mention that BUDDY uses a pre-computed RA score for ogbl-ppa. In that case, if weighted counting is so important, why does BUDDY still lag so far behind MPLP+ on ogbl-ppa? You can argue that it's because RA only considers CNs (i.e., walks of length (1, 1)) while MPLP+ considers higher-order walks. But (a) as the authors show in their earlier comment, the walk estimation outside of (1, 1) isn't very good, and (b) regardless of the estimation quality, why can NCN (which only considers CN node representations) do so well? I don't see how to reconcile the authors' findings with these points.
We really appreciate the reviewer's careful investigation of our work, and especially the time and effort involved. We want to better understand what concerns the reviewer so that we can provide a more detailed explanation. We address the following points in your feedback:
- We apologize for the confusion. We do not claim that these individual components are the core contribution of our work. We highlighted them in the rebuttal discussion because we were trying to address the reviewer's concern about why MPLP+ can outperform BUDDY. In general, our contribution is (1) theoretical perspective: a pure node-level MPNN can estimate the common neighbors/walks between two nodes; and (2) practical implementation: the proposed MPLP(+), built on a node-level MPNN and previous studies, achieves state-of-the-art performance on various datasets while maintaining node-level MPNN efficiency.
The following discussions are to provide a more detailed explanation about why MPLP+ can outperform BUDDY:
- Our claim is that MPLP+ can outperform BUDDY on PPA and Citation2 due to (1) lower estimation variance (Appendix F.1), (2) Norm Scaling, and (3) Shortcut Removal.
- We appreciate the reviewer reminding us that BUDDY also uses RA in its implementation. For a fairer comparison, we remove the RA feature from BUDDY and re-run the ablation study. The results show that BUDDY, without RA as a link feature, drops to 30.19 in Hits@100. This suggests that the RA feature is an important factor in BUDDY's performance on PPA. Since BUDDY's MinHash/HyperLogLog cannot perform weighted counting, it explicitly adds RA as a feature to improve performance on PPA. This is also why we believe that the weighted counting mechanism of MPLP+ is crucial to its performance advantage over BUDDY.
| | PPA (Hits@100) | Citation2 (MRR) |
|---|---|---|
| BUDDY | 49.85 | 87.56 |
| BUDDY w/o RA | 30.19 | - |
| MPLP+ | 65.24 | 90.72 |
| MPLP+ w/o Shortcut Removal | 64.53 | 87.76 |
| MPLP+ w/o Norm Scaling | 53.94 | 90.25 |
- Regarding the importance of weighted counting on PPA, the ablation study above clearly shows that MPLP+ with Norm Scaling performs significantly better. We respectfully disagree that the walk estimation with relatively high variance will be useless in practice. Firstly, we do not have baselines for walk estimation to compare against, so it is difficult to tell whether the estimation is good or bad. Moreover, even for the node estimation of MPLP/BUDDY in Appendix F.1, the estimation variance is also as high as the walk estimation of MPLP+. In fact, BUDDY, with such high-variance estimation, can still achieve performance considered state of the art at its publication.
- We appreciate the reviewer's suggestion about digging deeper into why MPLP(+) performs better than BUDDY. However, we respectfully disagree that bringing other methods like NCN into the discussion can resolve the concern. NCN represents links in a very different way, and it is too difficult to make a rigorous ablation study across NCN and MPLP(+). In fact, the ablation study above comparing components within MPLP(+) may also tell only part of the story, as in any ML paper, because there can be joint effects between the components of the system. In the end, we can only conclude that the entire MPLP(+), with all components shown to be effective both theoretically and empirically within its own framework, achieves state-of-the-art performance over both BUDDY and NCN(C).
- Another piece of evidence that it is difficult to ablate across two methods happens to come from BUDDY itself. In Table 2 of [5], the benchmarking results show that BUDDY, with precomputed RA, can actually perform slightly worse than RA alone. This observation alone cannot establish whether RA is good or bad for BUDDY. However, by ablating RA within BUDDY, our results show that RA is crucial for BUDDY's good performance on PPA.
We apologize for the extensive response and appreciate the reviewer's careful investigation. We really do hope that our responses have addressed your concerns.
[5] Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking
I'll start off my comment by saying that I'll raise my score to a 5. However, I'm really just doing it out of respect for the authors and their patience with me. To be honest, I'm still not convinced that my concerns have been properly addressed.
I tried to respond to most points below. However, one last question I have is whether you can give more details on how much GPU memory is needed to run MPLP+. I only ask because I noticed in D.4 that you used an 80GB GPU, which seems like a lot. Was that much memory necessary? I know this is the last day of rebuttals, so if you don't answer it's fine. I'm just curious.
In general, our contribution is (1) theoretical perspective ... (2) practical implementation
Thanks for the clarification, I agree that these are the main contributions of the paper. Apologies for any mixup. Regardless, my main problem is not with either of these but rather with my lack of understanding as to why MPLP+ does so well.
Our claim is that MPLP+ can outperform BUDDY on PPA and Citation2 due to (1) lower estimation variance (Appendix F.1) (2) Norm Scaling and (3) Shortcut Removal.
This really gets at the heart of my problem. As you mentioned, (2) and (3) aren't contributions of this work. Also, F.1 does not show that MPLP+ has a lower estimation variance, but rather that MPLP does. As you mentioned earlier, MPLP+ and BUDDY can't be compared.
We appreciate the reviewer reminding us that BUDDY also uses RA in its implementation. To have a more fair comparison, we remove the RA feature from BUDDY
I appreciate it but that wasn't really my point. My logic is the following:
- The authors argue that weighting matters, however BUDDY uses RA and still performs much worse.
- While one can argue that this is because RA ignores walks beyond (1, 1), I don't find this compelling as (a) NCN can do quite well considering only CNs and (b) the walk estimation of MPLP+ seems to decline steeply beyond (1, 1). Even if we consider Katz, which does compute the # of walks exactly and weights them, the performance is still the same as or worse than just RA (see [5]).
So let's consider ogbl-ppa: (a) with Katz, considering walks > 2 hurts performance, and (b) BUDDY includes the weighted counts of CNs. MPLP+ also doesn't do better due to shortcut removal, as the authors show in their ablation study that it has limited effect on ogbl-ppa. This again brings me to my "why" question: why can MPLP+ outperform BUDDY by such a large margin?
Maybe this is just going completely over my head (and apologies if so), but it doesn't make sense to me.
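For reference on the Katz point above: the Katz index sums exact walk counts of every length, down-weighted geometrically. A minimal numpy sketch (toy graph; the decay β is assumed small enough for the series to converge):

```python
import numpy as np

def katz_index(A, beta=0.05):
    """Katz index K = sum over l >= 1 of beta^l * A^l, computed in closed
    form as (I - beta*A)^{-1} - I; requires beta < 1/spectral_radius(A)."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Toy graph (illustrative only).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
K = katz_index(A)  # K[u, v] scores the candidate link (u, v)
```

The contrast being drawn in the discussion is that Katz weights walks only by their length, whereas a degree-weighted count (like RA or norm-scaled signatures) weights by the intermediate nodes' degrees.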
We respectfully disagree that the walk estimation with relatively high variance will be useless in practice
I did not call it useless, just that estimation doesn't look great.
Firstly, we do not have baselines for walk estimation to compare against. It is difficult to tell if the estimation is good or bad.
I understand and sympathize with the authors. However, I don't think the lack of a baseline means we shouldn't question the estimation quality. I think it's fair to say that we'd prefer not to be off by a factor of 10^3 when estimating the number of walks. To be clear, this isn't meant as a criticism.
In fact, BUDDY, with such high-variance estimation, can still achieve performance considered state of the art at its publication.
Yes, and I think this is a problem! If it can do well with high variance, are the counts themselves really important? If not, then why does it achieve SOTA (at the time)? This again goes to my general point about the need to understand why methods perform well when they do.
As an aside, my hunch as to why BUDDY can still do well is that the vast majority of the impact is gained by the inclusion of CNs, of which the variance is low (but just a guess).
We sincerely appreciate the reviewer's time and effort in reviewing our work. We address the following points in your feedback:
GPU memory:
We measure the peak GPU memory consumption of MPLP+ on both the PPA and Citation2 datasets during training and testing. As a comparison, we also measure the memory consumption of a typical 2-layer GCN model on the same datasets with the same batch size.
| GPU Mem(GB) | PPA | Citation2 |
|---|---|---|
| GCN | 18.73 | 68.59 |
| MPLP+ | 29.53 | 84.7 |
The results show that MPLP+ consumes more memory than GCN because MPLP+ needs to store the node signatures for all nodes in the graph. Even though the GPU memory cost of MPLP+ is high during training, it is much lower during testing. During inference, MPLP+ only needs to retrieve the node embeddings/signatures of the two nodes in the candidate link, and does not need to store the graph structure at all. Therefore, the time/space complexity of MPLP+ during inference is the same as that of GCN/BUDDY (linear in the number of edges to predict), and more efficient than NCN, which requires finding CN nodes (NCN) or neighboring nodes (NCNC) using the graph structure.
In our current implementation, we simply put everything into GPU memory and maximize the batch size to speed up training. Depending on practical needs, the memory consumption of MPLP+ could likely be further reduced with a more memory-efficient implementation, such as storing the node signatures in CPU memory and loading them into GPU memory only when needed (using a DataLoader).
Why MPLP+ can outperform BUDDY:
We sincerely appreciate the reviewer's in-depth discussion with us, which really encourages us to rethink the theoretical and empirical aspects of our work compared to BUDDY/NCN. In fact, both are among our favorite papers, and we are glad to see them compared with our work. In general, we still think it may be too difficult to rigorously compare components (such as RA) across methods beyond the performance results. However, we also believe such discussions are valuable for the community. Below are some additional thoughts that may help explain, though not necessarily thoroughly convince the reviewer of, why weighted counting can be important:
- Why NCN works: As the reviewer pointed out, NCN may also implicitly capture the node degree, albeit with bias. Even though we show that NCN cannot encode the node degree in the strongly regular graph, its GCN-style message passing, normalized by node degree, can capture node degree information in the CN nodes' representations in a biased way. We do not have a proof for this, but it may relate to the proof of Theorem 3.1 in our paper.
- Why BUDDY fails on PPA: Hyperparameters matter. NCN uses a heavily tuned hyperparameter setting compared to BUDDY. It is possible that BUDDY could also achieve comparable performance with a more sophisticated hyperparameter search. MPLP+ has a comparable set of hyperparameters to BUDDY, but we did run a hyperparameter search until we landed on a good enough result.
- BUDDY has some training difficulty: The slight performance drop from RA to BUDDY+RA in [5] may indicate that something is suboptimal in the training process of BUDDY. Ideally, adding RA should improve, or at least not degrade, the performance. However, there seems to be some noise confusing BUDDY during training.
- Response to "if we consider Katz": Katz is different from MPLP+ because Katz weights walks by the number of hops, whereas MPLP+ weights them by node degree. Therefore, the performance of Katz may not be indicative of MPLP+'s.
- Seemingly high variance is probably okay: BUDDY can somehow perform well with large variance. Besides the reviewer's observation that BUDDY can perform better beyond 3 hops, we also observe that MPLP+ can perform better with 2-hop neighbors ((1,2),(2,1)) even with relatively large variance. Therefore, it is possible that MPLP+ is better than BUDDY because of the path estimation with an acceptable variance (though the variance factor may sound unacceptable), while MPLP+ in turn provides (1) lower estimation variance and (2) weighted counting beyond (1,1).
If all of the above are incorrect:
Then we also wonder why BUDDY, with high-variance estimation, can still perform well. We agree with the reviewer that the CN counts with relatively low variance might be the essential factor behind BUDDY's good performance.
Again, we sincerely appreciate the reviewer's time and effort in reviewing our work. The discussion has been very insightful and valuable, especially in understanding the performance advantage of MPLP+, and the engaging, constructive feedback helps build a healthy ML research community.
Thanks for the detailed response. I list my further comments below.
| However, BUDDY cannot perform a weighted counting due to the mechanism of MinHash and HyperLogLog.
I agree with this in relation to BUDDY. But I'm skeptical that this is actually important. Of course RA/AA do better than just CN, but let's consider NCN. It doesn't consider the degree of the CNs and does very well. Of course one can argue that maybe the degree information gets encoded in the node representations in NCN, but we don't really know since it's not explicit in any way. So I'm technically not disagreeing with your claim, but I'm unsure if the proof really exists to claim that this matters.
| ELPH/BUDDY can cause a distribution shift problem, while MPLP will not
I agree in theory, but in practice recent work has shown that the effect of this distribution shift is quite minimal. NCN/NCNC also remove the link during training. On the OGB datasets, their studies show the effect is minimal at best (note: the results are only in an older version of the paper, in Table 3 under NoTLR). Also, [1] below studies the problem and concludes that it really only affects nodes of lower degree. They show that the overall performance on ogbl-collab and ogbl-citation2 barely changes when removing target links.
So again, I'm not disagreeing with you and I agree that this is a plus for MPLP/MPLP+. I'm just skeptical that it matters a lot in practice.
[1] Zhu, Jing, et al. "Pitfalls in link prediction with graph neural networks: Understanding the impact of target-link inclusion & better practices." Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 2024.
| In our experiments, the performance gain from 3-hop neighbors of MPLP(+) is marginal.
Interesting. It's likely dataset dependent.
| While theoretically, MPLP+ and BUDDY should have similar efficiency. However, the implementation of two methods causes the efficiency discrepancy.
Ok, that makes sense. Please include a discussion of this when showing the efficiency results in the paper, as the current results are misleading.
| We do not include the Cora/Citeseer/Pubmed in our main experiment because recent studies[4,5] and we find that these three datasets overwhelmingly rely on the node attributes for link prediction tasks.
Ok, that's fair.
| However, beyond ELPH/BUDDY, MPLP(+) proposes several distinct components (Shortcut Removal, Norm Scaling, One-hot hubs, walk-based counting)
I thank the reviewers for the clarification.
However, I'm not disagreeing that the mechanism for estimating node/walk counts is different. That is a notable achievement. My point was regarding, as you mentioned, the "spirit" of the method being quite similar. That is still true. Furthermore, some of the components mentioned, such as shortcut removal, aren't proposed by this paper; in fact, it's used by multiple link prediction methods. This is also true for norm scaling, where the authors explicitly refer to [12, 25] in their paper (line 246). To be clear, this isn't a bad thing, but I think the authors should be clear in their paper about what specifically they propose that's novel.
| MPLP+ estimates the number of walks between two nodes
Thanks for the clarification. I recommend the authors make this a little more explicit in their paper, as when introducing MPLP+ it's only mentioned briefly on line 294.
Furthermore, in my opinion, the fact that MPLP+ uses walk-based features raises 2 more questions:
Q1: How well does it estimate walks of disparate lengths? I recommend the authors incorporate this into future versions of the paper, because as of now we really don't know how well it actually estimates the walks. This matters because it would give us some understanding of whether estimating these counts is actually important and driving the performance gain.
Q2: It doesn't explain to me why MPLP+ can perform so well. In general, MPLP+ tends to do slightly worse than MPLP across almost all datasets. Naively, this would suggest that if it were computationally feasible to run MPLP on ogbl-ppa, it would also do very well, ~65 Hits@100. But this would still be much, much higher than the performance of BUDDY on ogbl-ppa, which is ~49. A similar observation stands for ogbl-citation2. I know this is hypothetical as the results don't exist, but I would find it hard to believe that estimating the node counts slightly better would lead to such a huge increase. Also, we don't really have any reason to expect that using walks in MPLP+ would help on the OGB datasets. In [33] they show that Katz (which just weights these walks) always performs the same as or worse than RA/AA. So that likely can't explain it either for MPLP+.
Basically, as I mentioned in my original review, I don't really understand why MPLP/MPLP+ does so well, especially compared to BUDDY. Because of that I'll keep my score.
The authors of this work introduce the Message Passing Link Predictor (MPLP), a link prediction model that uses pure message passing - as originally proposed by Gilmer et al. - to estimate structural similarities between nodes. The method is motivated by the fact that a single round of message passing using one-hot-encodings of nodes in un-attributed graphs can be used to estimate the common neighbors between node pairs. This idea is used to develop MPLP, a novel link prediction model that uses message passing to construct quasi-orthogonal vectors, whose inner product can be used to efficiently estimate structural similarities. Combining those vectors with node features and shortest path neighborhoods, MPLP obtains node representations that can be used by a classifier to predict links. Moreover, an advanced model MPLP+ is proposed with a simplified estimation of shortest path neighborhoods based on multiple rounds of message passing using one-hop information only. The two resulting models are evaluated in eight unattributed and seven attributed graphs, showing excellent performance compared to nine baseline methods.
Strengths
[S1] Showing that a single round of neural message passing in a Graph Convolutional Network (GCN) and SAGE can be used to estimate the number of common neighbors between node pairs, while multiple rounds of message passing can estimate the number of walks between node pairs, this work provides interesting insights into the design of Graph Neural Networks, as well as its inherent relations to heuristic link prediction techniques.
[S2] The work includes both theoretical and practical insights that make it interesting for a wide range of researchers and practitioners in graph learning.
[S3] The two proposed models are extensively evaluated in large attributed and unattributed graphs, showing promising performance across a wide range of data sets from different contexts.
[S4] The inference time of both proposed models and baseline methods is compared across three large data sets from the OGB benchmark, demonstrating that despite its improved performance the proposed methods are actually more efficient than some of the baseline methods.
[S5] In the appendix, the authors included extensive ablation studies in which they investigated the impact of different structural estimators, the sensitivity to parameter values, and the impact of removing specific components of the architecture.
[S6] The paper is very well-written, clearly motivating the problem and giving a good intuition of the proposed approach. I enjoyed reading this work, although some of the math in the appendix is hard to follow for non-specialists.
Weaknesses
[W1] I could not follow the role of the weight matrix W, which is assumed to be randomly initialized from a zero-mean distribution, in theorem 3.1, see Q1.
[W2] The experimental evaluation does not include results on the training time of the different models.
[W3] The evaluation is solely based on Hits@50, see Q3.
Questions
[Q1] Referring to W1, what would happen if we assume that the weight matrix is absent, i.e. if we remove the weight parameters in the aggregation function?
[Q2] Related to W2, how does the training time of the proposed models compare to those of the baseline methods?
[Q3] I understand that the authors followed the evaluation metrics of OGB. Nevertheless, it would be good to include additional evaluation metrics such as accuracy, AUC-ROC or AUPR, see e.g. [arXiv 1505.04094].
Limitations
The paper includes a detailed discussion of limitations in section H of the appendix, which I consider sufficient for this work.
We first want to express our gratitude for the reviewer's comment that the reviewer "enjoyed reading this work". For us, it is the most rewarding feedback to know that our work is well-received. We will address the following points in the rebuttal:
Q1
The weight matrix $W$ in Theorem 3.1 does not play any specific role in making the theorem hold. We keep it only because the standard GCN/SAGE formulation includes a weight matrix. If we remove the weight matrix $W$, the entire theorem still holds, except that the scaling constant in the expectation changes. In other words, the untrained zero-mean-initialized weight matrix only introduces a scaling factor to the expectation of the inner product.
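To make the scaling-only role of $W$ concrete, here is a minimal NumPy sketch (toy graph sizes and variable names are our own assumptions, not the paper's code) of one round of plain sum aggregation over random node signatures, as in Theorem 3.1 with $W$ removed; the inner product of the aggregated messages estimates the common-neighbor count, and multiplying by an untrained zero-mean $W$ would only rescale this estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4096  # toy sizes: n nodes, signature dimension d

# Random undirected graph as an adjacency matrix.
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T

# Quasi-orthogonal signatures: i.i.d. N(0, 1/d) entries, so
# E[<x_w, x_w>] = 1 and E[<x_w, x_w'>] = 0 for w != w'.
X = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))

# One round of plain sum aggregation without any weight matrix W:
# h_u = sum over neighbors w of x_w.  An untrained zero-mean W would
# only rescale the expectation of the inner product below.
H = A @ X

u, v = 0, 1
cn_true = int(A[u] @ A[v])   # exact common-neighbor count
cn_est = float(H[u] @ H[v])  # inner product of aggregated signatures
print(cn_true, round(cn_est, 2))  # the estimate concentrates around cn_true
```

With $d$ this large relative to the node degrees, the estimate typically lands within a fraction of a count of the true value.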
Q2
We appreciate the reviewer's suggestion. Generally, it is challenging to provide a fair evaluation of training times for different methods due to variations in learning rates, batch sizes, and early stopping criteria. Since the number of trainable parameters in GNNs is relatively small compared to other deep learning models, training time is usually not a bottleneck. Most of the computational cost in GNNs comes from processing the graph structure (see Table 1 in [7]), such as message passing [1,2] or the labeling trick [3] for link prediction. Therefore, inference time can be used as a proxy for the computational cost of training. In Figure 4 of the experimental section, MPLP is shown to be more efficient than other methods in terms of inference time, indirectly indicating that MPLP is more computationally efficient than other methods.
Q3
We thank the reviewer for the suggestion. We follow the evaluation protocol of previous works[4,5] to ensure a fair comparison. While we acknowledge that other metrics like AUC-ROC and AUC-PR are important for evaluating link prediction tasks, recent studies[6] have shown that performance evaluated by AUC-ROC and AUC-PR is nearly saturated for most methods. Therefore, we focus on the evaluation of MRR and Hits@K, which are more sensitive to the performance differences between methods.
[1] SIGN: Scalable Inception Graph Neural Networks
[2] Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
[3] Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning.
[4] Open Graph Benchmark: Datasets for Machine Learning on Graphs
[5] Graph Neural Networks for Link Prediction with Subgraph Sketching
[6] Neural Link Prediction with Walk Pooling
[7] MLPInit: Embarrassingly Simple GNN Training Acceleration with MLP Initialization
We hope that we have addressed your concerns. If we have left any notable points of concern unaddressed, please let us know, and we will be happy to respond.
Thank you for the responses to my questions. In light of those responses as well as the responses to the other reviewers, I am happy to retain my positive evaluation.
This paper first shows that pure message passing can count common neighbors. Based on the proposed theory, the paper develops MPLP for link prediction, where the common neighbor count is an important heuristic feature. Experiments on link prediction demonstrate the performance gains of MPLP over baselines.
Strengths
- This paper shows that pure message passing can count common neighbors.
- The paper is well-written.
- The authors provide the theoretical analysis and experimental details.
Weaknesses
- If MPLP aims to count the common neighbors exactly, why not directly use the pre-computed common neighbor counts? If MPLP aims to develop a neural network to approximate common neighbor heuristics, what is the novelty of MPLP compared with [1]?
- To capture structural link representation such as Common Neighbors (CN), Zhang et al. [2] propose labeling trick, which is a position encoding in my opinion. Indeed, the Quasi-orthogonal vector is also a position encoding. I suggest discussing the relationship between Quasi-orthogonal vectors and existing position encodings such as random walk position encodings (RWPE), relative random walk probabilities (RRWP) [3], Resistance Distance encodings [4], and Laplace position encodings (LapPE). Notably, Laplace position encodings are also orthogonal.
- Is MPLP scalable to large-scale graphs? If MPLP assigns each node a unique orthogonal vector, then orthogonality requires the dimension of the orthogonal vectors to be at least the number of nodes, as orthogonal vectors are linearly independent.
[1] Neural Common Neighbor with Completion for Link Prediction.
[2] Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning.
[3] Graph Inductive Biases in Transformers without Message Passing.
[4] Rethinking the Expressive Power of GNNs via Graph Biconnectivity.
Questions
See Weaknesses.
Limitations
The authors have adequately addressed the limitations.
We thank the reviewer for the feedback. We will address the following points in the rebuttal:
W1
We apologize for any confusion regarding MPLP. MPLP aims to count the number of nodes at varying distances from a target node pair. Compared to pre-computed heuristics like Common Neighbor, MPLP is significantly more efficient in terms of both time and space complexity. In a graph with $n$ nodes, pre-computed common neighbors require $O(n^2)$ space to store all pairwise information, whereas MPLP only requires linear space $O(nF)$, where $F$ represents the node signature dimension. Regarding time complexity, MPLP can be seen as an approximation of the labeling trick [2], trading variance for efficiency. Generally, MPLP is more scalable for large-scale graphs.
Unlike NCN[1], MPLP considers not only the common neighbors but also the nodes beyond the 1-hop neighborhood. This makes MPLP more expressive and powerful in capturing structural representation than NCN. The experimental results in our paper demonstrate the effectiveness of MPLP in link prediction tasks.
W2
We appreciate the reviewer's suggestion. In fact, our quasi-orthogonal (QO) vectors are used to build a link structural representation rather than positional encoding [5]. There are distinct differences between position encoding and our QO vectors:
- Unlike typical position encodings [3,4] designed for graph-level tasks, our QO vectors are specifically targeted for link prediction tasks. Position encodings represent the relative position of each node in the graph, whereas our QO vectors cannot represent individual node positions but can be decoded to represent pairwise structural representation for link prediction tasks.
- When applying position encodings to link prediction tasks, most of them are limited to transductive settings. This is because position encodings represent node positions within a specific graph; if the graph changes, the position encodings should also change. However, our QO vectors can be decoded as a structural representation and applied in both transductive and inductive settings.
We will add a detailed comparison between QO vectors and position encodings in the revised version.
W3
We apologize for any confusion regarding the scalability of MPLP. MPLP is designed to have greater scalability for large-scale graphs compared to methods like labeling tricks [2]. Each node is assigned a random quasi-orthogonal vector of a fixed dimension; these vectors do not need to be strictly orthogonal but should be orthogonal in expectation. Therefore, MPLP is scalable to large-scale graphs.
[1] Neural Common Neighbor with Completion for Link Prediction.
[2] Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning.
[3] Graph Inductive Biases in Transformers without Message Passing.
[4] Rethinking the Expressive Power of GNNs via Graph Biconnectivity.
[5] On the Equivalence between Positional Node Embeddings and Structural Graph Representations.
We hope that we have addressed your concerns, and that you will consider raising your score. If we have left any notable points of concern unaddressed, please let us know, and we will be happy to respond.
Thanks for your response. Unfortunately, my concerns about Weaknesses 1,2,3 remain unaddressed. The questions are as follows.
- The authors claim that MPLP is more expressive and powerful than NCN, but I cannot find the corresponding theoretical analysis.
- The shortest path distance is a pairwise structural representation (heuristic metric), as shown in NBFNet [6]. [3] shows that relative random walk probabilities (RRWP) can express the shortest path distance. Moreover, Resistance Distance encodings [4] are similar in my opinion. Finally, the discussion of Laplace position encodings (LapPE) [7], which are also orthogonal, is missing.
- The graph-level classification is inductive. They first train GNNs on the training set of graphs and then infer on the test set of unseen graphs. Why are the position encodings limited to transductive settings?
- If the random quasi-orthogonal vectors have dimension $d$, then their expectations also have dimension $d$. So why are they orthogonal in expectation? Can you provide a theoretical analysis or reference to prove that "these vectors do not need to be strictly orthogonal but should be orthogonal in expectation"?
[6] Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction.
[7] Rethinking graph transformers with spectral attention.
We thank the reviewer for the feedback on our rebuttal. We address the following points in your feedback:
Concern 1
We show that MPLP is more expressive and powerful than NCN:
Expressiveness: We present a proof sketch to show that MPLP is more expressive than NCN, specifically because MPLP can enable weighted node counting via the Norm Scaling technique (Sec 4.1). Consider an $n \times n$ rook's graph and an $m \times m$ rook's graph, where $n \neq m$. These two graphs are strongly regular graphs, but with different node degrees. Also, any two non-adjacent nodes in a rook's graph have exactly two common neighbors. We investigate whether NCN/MPLP can encode a non-adjacent node pair in the $n \times n$ rook's graph differently from a non-adjacent node pair in the $m \times m$ rook's graph.
For NCN, (1) the GNN encoder (GCN/SAGE) will generate the same representation for all nodes because the graphs are regular; (2) all non-adjacent node pairs have two common neighbors, leading to the same NCN score.
For MPLP, because $n \neq m$, the Norm Scaling technique will assign different norms to the quasi-orthogonal vectors of nodes in the $n \times n$ and $m \times m$ rook's graphs. Therefore, the non-adjacent node pairs in the two graphs will have different MPLP representations due to the weighted node counting. This shows that MPLP is more expressive than NCN.
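A small sketch (our own toy verification, not the authors' code, with $n = 4$ and $m = 6$ as arbitrary choices) of the two counting facts the argument relies on: every node of the $n \times n$ rook's graph has degree $2(n-1)$, while any non-adjacent pair has exactly two common neighbors:

```python
import numpy as np
from itertools import product

def rook_graph(n):
    """Adjacency matrix of the n x n rook's graph:
    cells are adjacent iff they share a row or a column."""
    cells = list(product(range(n), range(n)))
    A = np.zeros((n * n, n * n))
    for i, (r1, c1) in enumerate(cells):
        for j, (r2, c2) in enumerate(cells):
            if (r1, c1) != (r2, c2) and (r1 == r2 or c1 == c2):
                A[i, j] = 1
    return A

for n in (4, 6):
    A = rook_graph(n)
    deg = A.sum(axis=1)
    # Regular graph with degree 2 * (n - 1): the degrees differ across
    # the two graphs, which is what Norm Scaling can exploit.
    print(n, set(deg))
    # Any non-adjacent pair has exactly two common neighbors, so an
    # unweighted CN count (as in NCN) cannot tell the two graphs apart.
    u = 0
    v = int(np.argmax((A[u] == 0) & (np.arange(n * n) != u)))
    print(int(A[u] @ A[v]))  # 2 in both graphs
```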
Powerfulness: In Tables 1/2, we empirically show that MPLP(+) achieves better performance than NCN on various datasets, which demonstrates the effectiveness of MPLP for link prediction tasks.
We thank the reviewer for pointing this out. We will include the proof in the revised version.
Concern 2
We thank the reviewer for the suggestion. RRWP, Resistance Distance encodings, and LapPE can all provide a pairwise structural representation, especially distance information. Even though they are used for graph classification tasks, we also believe there can be a connection between these encodings and link prediction tasks, which is an interesting direction for future work. We will add a discussion on the connection between RRWP/Resistance Distance/LapPE and MPLP in the revised version, detailing how these positional encodings can also be decoded to provide a pairwise structural representation, similar to MPLP, and potentially be used for link prediction tasks. For LapPE, we will also cover that it is derived from the orthogonal eigenvectors of the Laplacian matrix, while MPLP's quasi-orthogonal vectors are randomly initialized.
Concern 3
We apologize for any confusion regarding the inductive ability of positional encodings. Previously, we referred specifically to the positional encodings formally defined in [5]. However, positional encodings like RRWP/Resistance Distance/LapPE contain both positional and structural information (this is also discussed in Sec 2 of [3]). We will clarify this when discussing the connection between RRWP/Resistance Distance/LapPE and MPLP in the revised version.
[5] On the Equivalence between Positional Node Embeddings and Structural Graph Representations.
Concern 4
We apologize for the confusion regarding the orthogonality of the vectors. Consider two vectors $\mathbf{x}$ and $\mathbf{y}$ independently sampled from a standard normal distribution. The inner product of these two vectors is likely to be close to zero, but not exactly zero. However, the expected value of the inner product is $\mathbb{E}[\langle \mathbf{x}, \mathbf{y} \rangle] = 0$, which means that the vectors are orthogonal in expectation due to their independence. This is why we refer to the vectors as quasi-orthogonal: they are orthogonal from a probabilistic perspective. We will clarify this in the revised version.
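This quasi-orthogonality can be checked numerically; the sketch below (dimensions chosen arbitrarily by us) samples far more Gaussian vectors than dimensions, so exact pairwise orthogonality is impossible, yet the pairwise inner products center on zero with spread on the order of $1/\sqrt{d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2000, 256  # many more vectors (m) than dimensions (d)

# i.i.d. Gaussian vectors scaled to roughly unit norm.
V = rng.standard_normal((m, d)) / np.sqrt(d)

G = V @ V.T                      # all pairwise inner products
off = G[~np.eye(m, dtype=bool)]  # off-diagonal entries only
print(round(float(off.mean()), 4))  # close to 0: orthogonal in expectation
print(round(float(off.std()), 4))   # about 1/sqrt(d): quasi-orthogonal
```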
We hope that we have addressed your concerns. If we have left any notable points of concern unaddressed, please let us know, and we will be happy to respond.
Thanks for your response. The response has addressed Concern 1. Unfortunately, my concerns about Concerns 2,3,4 remain unaddressed. The suggestions and questions are as follows.
- The authors may want to discuss the theoretical properties of the above-mentioned position encodings. For example, can the position encodings estimate common neighbors which is the key topic of this paper?
- Let $d < n$. As the rank of the matrix of vectors is less than $n$, the $n$ vectors in the $d$-dimensional vector space cannot be pairwise orthogonal.
We thank the reviewer for the feedback. We address the following points in your feedback:
Q1: Can the position encodings estimate common neighbors?
We thank the reviewer for the insightful question. Theoretically, the positional encodings like RRWP/Resistance Distance/LapPE can estimate the number of common neighbors between two nodes. We first show that RRWP can estimate the Resource Allocation (RA) score between two nodes with a bias term.
Consider a graph with two nodes $u$ and $v$ and adjacency matrix $A$. The RA score between node $u$ and node $v$ is defined as $\mathrm{RA}(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{d_w} = (A D^{-1} A)_{uv}$, where $D$ is the degree matrix. For RRWP in [3], the relative positional encoding between node $u$ and node $v$ is $[I, M, M^2, \ldots]_{uv}$ with $M = D^{-1} A$. Then, the 3rd element of this encoding is $(M^2)_{uv} = \frac{1}{d_u} (A D^{-1} A)_{uv} = \frac{\mathrm{RA}(u, v)}{d_u}$. Therefore, RRWP can estimate the RA score between node $u$ and node $v$, biased by the degree of node $u$.
Compared to MPLP(+)'s unbiased estimation, RRWP's estimation of RA is biased. Moreover, the computation of RRWP involves powers of the adjacency matrix, which is much more computationally expensive than MPLP(+)'s pairwise structural estimation. Therefore, MPLP(+) provides a more efficient and unbiased estimation of the number of common neighbors between two nodes.
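The degree-biased relation between the RRWP entry and the RA score is in fact an exact matrix identity, $M^2 = D^{-1}(A D^{-1} A)$, which a short numerical check confirms (random toy graph; the variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # toy random graph
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T
deg = A.sum(axis=1)
assert deg.min() > 0  # no isolated nodes in this draw

Dinv = np.diag(1.0 / deg)
M = Dinv @ A       # random-walk matrix used by RRWP
M2 = M @ M         # 3rd RRWP element of [I, M, M^2, ...]
RA = A @ Dinv @ A  # RA(u, v) = sum over common neighbors w of 1/deg(w)

# (M^2)_{uv} = RA(u, v) / deg(u): the RRWP entry is the RA score
# divided by (biased by) the source node's degree.
print(np.allclose(M2, RA / deg[:, None]))  # True
```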
Similar to RRWP, Resistance Distance and LapPE can also estimate the number of common neighbors between two nodes. Resistance Distance encoding can estimate the common neighbors between two nodes because of its equivalence to the commute time of the random walk used in RRWP [4]. LapPE can estimate the number of common neighbors between two nodes because it is derived from the eigenvectors of the Laplacian, which encode the relative distance information between two nodes [7]. A rigorous investigation of common-neighbor estimation for Resistance Distance and LapPE can be an interesting direction for future work.
[3] Graph Inductive Biases in Transformers without Message Passing.
[4] Rethinking the Expressive Power of GNNs via Graph Biconnectivity.
[7] Rethinking graph transformers with spectral attention.
Q2: Pairwise Orthogonality
We apologize for the confusion. Since the quasi-orthogonal vectors are independently sampled from a standard normal distribution, their expectation is the zero vector, which is orthogonal to every vector. For the expected inner product, the independence of the vectors ensures that we can split the expectation into the product of the expectations: $\mathbb{E}[\langle \mathbf{x}, \mathbf{y} \rangle] = \sum_i \mathbb{E}[x_i]\,\mathbb{E}[y_i] = 0$. We will clarify this in the revised version.
Thanks for the feedback. If there are any other concerns, please let us know, and we will be happy to respond.
Thanks for your rebuttal, which has answered some of my concerns. Due to Weakness 3 (Pairwise Orthogonality), I still have concerns about the correctness of the proofs in this paper. Therefore, I would like to lower my confidence level in my assessment to 1.
Dear Reviewers,
We appreciate your valuable feedback and constructive suggestions when reviewing our paper. Here, we want to address a concern raised by Reviewer CYaw regarding the difference between MPLP(+) and ELPH/BUDDY[1]:
W1: Comparison with ELPH/BUDDY
We appreciate the reviewer's question about the difference between MPLP and BUDDY. We agree that the principle of capturing structural information is similar between MPLP and BUDDY. However, there are several key differences between MPLP and BUDDY:
- MPLP is more expressive than ELPH/BUDDY due to the flexibility of orthogonal vectors: Sec 4.1 introduces that MPLP(+) can perform a weighted count with the norm rescaling technique. In general, by scaling the norm of the vectors according to the node degrees, MPLP can capture more nuanced structural information than BUDDY with a weighted counting mechanism. For example, Resource Allocation (RA) and Adamic-Adar (AA) are two weighted counting variants of Common Neighbor (CN), which show better overall performance than unweighted CN counting. MPLP generalizes these two methods beyond the 1-hop neighborhood by using degree-normalized orthogonal vectors. However, BUDDY cannot perform a weighted counting due to the mechanism of MinHash and HyperLogLog. Therefore, we believe that the incorporation of RA/AA into MPLP is one of the driving factors boosting its performance. In fact, in the GitHub repository of BUDDY, BUDDY uses a pre-computed RA score as an edge feature to improve the performance on OGBL-PPA, which also suggests the effectiveness of RA but, in our humble opinion, violates BUDDY's beauty of only using node-level features/signatures to perform link prediction tasks.
- ELPH/BUDDY can cause a distribution shift problem, while MPLP will not: Sec 4.2 introduces that during training, MPLP(+) performs a shortcut removal for the positive node pairs to avoid the distribution shift problem [2]. This technique is critical for both ELPH/BUDDY and MPLP. Consider a target node pair $(u, v)$ and another node $w$ that connects to node $u$ but not to $v$. If $(u, v)$ is a positive node pair, there is an observed edge connecting $u$ and $v$. When counting the 2-hop neighbors of node $v$, $w$ will always be counted because it can exploit node $u$ and the edge between $u$ and $v$ to reach node $v$. This causes a distribution shift because the model can easily distinguish the positive node pairs from the negative node pairs, but only during training. MPLP(+) performs a shortcut removal for the positive node pairs to avoid this problem, whereas ELPH/BUDDY does not, which causes a distribution shift problem.
- In fact, we believe that the authors of BUDDY are aware of this problem but have not addressed it in their paper. In their code, they have a `use_zero_one` flag to always discard the 2-hop neighbor counts due to the issue above. Recently, the GitHub repo of BUDDY has been updated with a new feature branch `rewiring`, where they also start to implement a shortcut removal technique to avoid the distribution shift problem. It can be found in the `remove_bridges` function in `src/datasets/elph.py`.
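The shortcut effect described above can be reproduced on a three-node toy graph (node names $u$, $v$, $w$ follow the example; the `two_hop` helper is our own illustration, not code from either paper): with the target edge present, $w$ is counted as a 2-hop neighbor of $v$ only via the shortcut, and removing the edge eliminates the leakage:

```python
import numpy as np

# Toy graph for the example: target pair (u, v) whose training edge u-v
# is observed, plus a node w attached to u only.
u, v, w = 0, 1, 2
A = np.zeros((3, 3))
for a, b in [(u, v), (u, w)]:
    A[a, b] = A[b, a] = 1

def two_hop(A, x):
    """Nodes at distance exactly 2 from x (reachable in 2 hops, not 1)."""
    one = A[x] > 0
    two = (A @ A)[x] > 0
    return set(map(int, np.flatnonzero(two & ~one))) - {x}

print(two_hop(A, v))  # {2}: w reaches v only through the target edge u-v

A_train = A.copy()
A_train[u, v] = A_train[v, u] = 0  # shortcut removal for the positive pair
print(two_hop(A_train, v))         # set(): the leakage disappears
```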
- MPLP+ is walk-based: MPLP+, unlike ELPH/BUDDY, estimates the number of walks between two nodes (Theorem 3.3) rather than the number of nodes. It is still an open question whether walk-based methods are better than node-based methods for link prediction (MPLP is node-based). However, MPLP+ shows better empirical performance than ELPH/BUDDY in our experiments.
- Lower estimation variance: With the same computational budget, MPLP can perform the pairwise structural estimation with lower variance than ELPH/BUDDY. This is empirically shown in Appendix F.1.
We again appreciate the reviewer's in-depth question on the difference between MPLP and BUDDY. While we have discussed the differences above in Appendix A, we have not gone into much detail in the main paper to avoid distracting the reader. We will consider adding the above detailed comparison between MPLP and BUDDY in the revised version.
[1] Graph Neural Networks for Link Prediction with Subgraph Sketching
[2] FakeEdge: Alleviate Dataset Shift in Link Prediction
Dear reviewers,
First of all, thank you for your service to the ML community. Writing high-quality reviews and engaging with authors' responses is essential to a healthy ML research community.
It is essential that you read the rebuttals and provide a response and/or follow-up questions within the next few days. This will allow the authors sufficient time to react. While a detailed response addressing all points is not necessary, at a minimum, you should indicate you have read and considered the review and whether you will maintain or revise your score. Please also take the time to read the other reviews. Understanding your fellow reviewers' key arguments can help converge towards more similar ratings in cases of diverging scores.
I want to thank you again, and I look forward to following the discussions here.
The reviewers all lean towards accepting the paper with one reviewer scoring the paper to be of very high quality and impact. Reasons to accept the paper are
(i) both theoretical and practical results that should be relevant to the graph learning community, (ii) a strong and extensive experimental evaluation with competitive results, (iii) extensive ablation studies, and (iv) a clearly and well-written paper with a strong motivation of the proposed approach.
There was one reviewer who was not satisfied with the reasons the authors provided for why their method works. Indeed, some explanations were inconsistent, and the authors should take this feedback into account in the final version of their paper.