PaperHub · NeurIPS 2025
Overall score: 6.8 / 10 (Poster)
4 reviewers · ratings 4, 4, 4, 5 (min 4, max 5, std. dev. 0.4)
Confidence: 3.3
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.5 · Significance: 2.8

Effects of Dropout on Performance in Long-range Graph Learning Tasks

OpenReview · PDF
Submitted: 2025-05-03 · Updated: 2025-10-29
TL;DR

We present the shortcomings of existing dropout-based methods in modeling long-range tasks.

Abstract

Keywords
Dropout, graph neural networks, long-range interactions

Reviews and Discussion

Review (Rating: 4)

Graph neural networks are known to suffer from issues like over-squashing (due to topological bottlenecks), which affects long-range interactions, and over-smoothing, wherein repeated rounds of information aggregation result in node representations losing their discriminative power. This work explores the connections between dropout-style algorithms such as DropEdge, DropNode, DropAgg and DropGNN, and over-squashing. While dropout-style algorithms have been shown to mitigate over-smoothing and improve downstream performance, the current work highlights the limitations of such methods by showing how they desensitize distant nodes to each other, consequently making over-squashing worse. As a resolution, DropSens is proposed, a sensitivity-aware variant of DropEdge that can control the proportion of information lost due to edge-dropping, ensuring that long-range interactions between distant nodes are enhanced. Extensive experiments on different datasets and GNN models (GCN and GIN) confirm their hypothesis.
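For reference, a minimal sketch of the kind of random edge masking DropEdge performs (an illustration only, not the paper's implementation): each edge is kept independently with probability $1 - q$ at every training forward pass.

```python
import torch

def drop_edge(edge_index: torch.Tensor, q: float) -> torch.Tensor:
    """DropEdge-style masking: edge_index has shape [2, num_edges];
    each edge is kept independently with probability 1 - q."""
    keep = torch.rand(edge_index.size(1)) > q  # Bernoulli(1 - q) keep mask
    return edge_index[:, keep]
```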

Strengths and Weaknesses

Strengths

  1. The characterization of interaction between random dropout style algorithms and over-squashing via the sensitivity analysis is really interesting.

  2. The proposed DropSens algorithm seems to increase the sensitivity of long-range nodes and thus possibly help with over-squashing.

  3. The paper is well-written and well-structured.

Weaknesses

  1. While I understand the dichotomy between the over-squashing and over-smoothing trade-off, I am not really sure the current motivation of the work is strong enough. The sensitivity analysis has already been proposed in [1] and [2], and using this analysis to show that dropout-style algorithms might make over-squashing worse, while an interesting observation, is perhaps not entirely unexpected, since the dropout-style algorithms were originally proposed with different design goals.

  2. The discussion on sensitivity analysis (Figure 1) could use a little more nuance. What would the sensitivity analysis look like for a heterophilic dataset? Does random edge dropping affect different-class neighbors more than same-class neighbors (which would be good)? Or does it reduce both? Could this potentially explain why we see no significant results on heterophilic datasets?

  3. The paper argues that random edge dropping harms long-range tasks, and the results on heterophilic datasets are presented as evidence. However, one might intuitively expect that dropping a small number of edges could be beneficial in heterophilic settings, since it is more likely that we prune connections between nodes of different labels. For instance, the work talks about the absolute sensitivity reduction between distant nodes through random dropout-style algorithms, but deleting a small number of edges in a heterophilic setting might reduce the influence of noisy neighbours and thus improve the relative sensitivity between same-class distant nodes.

  4. Experiments on the Long Range Graph Benchmark (LRGB) datasets [3] [4] are missing. I think this is an important benchmark to include if the theme is long-range interactions. It would be interesting to see how these dropout-style algorithms, including the proposed DropSens, perform on the LRGB datasets. I understand the datasets in LRGB have different tasks, but PASCAL-VOC is a node classification task, and datasets such as peptides-func and peptides-struct are graph classification and regression tasks respectively, so there should be no problem including them in the experiments (since results on graph classification are already included in the work anyway).

  5. I think the work currently assumes a somewhat pessimistic view, namely that dropout-style algorithms which remove edges are detrimental, but what about strategic edge deletions? For instance, [5] shows that specific edge deletions can increase the spectral gap via the Braess paradox and thus mitigate over-squashing while not making over-smoothing worse.

References

  1. Understanding over-squashing and bottlenecks on graphs via curvature. Topping et al, ICLR 2022.

  2. On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. Di Giovanni et al, ICML 2023.

  3. Long Range Graph Benchmark, Dwivedi et al 2022.

  4. Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark, Toenshoff et al, TMLR 2023.

  5. Spectral Graph Pruning Against Over-squashing and Over-smoothing, Jamadandi et al, NeurIPS, 2024.

Questions

See weaknesses

Limitations

Yes

Final Justification

The authors have done a good job of providing extra experiments. However, I maintain my position on the fundamental motivation: demonstrating that random dropout-style algorithms perform poorly on long-range tasks—when they were designed for entirely different purposes—feels like evaluating the wrong tool for the wrong job. Further, the results on LRGB, which actually has long-range interactions, suggest that the proposed DropSens (and dropout variants like DropEdge) performs poorly compared to NoDrop, which also reinforces my initial concern about the applicability of such methods for modeling long-range interactions. I have increased my score from 2 to 4 in recognition of the authors' responses and the technical merit of their approach, but I am not entirely convinced to champion this paper for acceptance.

Formatting Issues

L904 in the appendix has missing subsection content: only a heading called "Why influence scores?".

Author Response

Thank you very much for the insightful summary, for recognizing the strengths of our work, and for the overall feedback!

Before we address your concerns, we wanted to bring to your attention that we had accidentally omitted the DropSens configuration $(c, q_{\max}) = (0.8, 0.3)$ from our original hyperparameter sweep. This has now been included, and the results have been updated accordingly. The overall findings remain unchanged, with the exception that DropSens now ranks 3rd on the Proteins dataset.


Kindly find our responses to the concerns raised:

[...] making use of this analysis to show that dropout style algorithms might make over-squashing worse [...] perhaps not entirely unexpected, since the dropout style algorithms were originally proposed with different design goals.

Thank you for sharing your perspective. We agree that it may seem intuitive that dropout-style methods reduce sensitivity between distant nodes. However, our theoretical analysis serves a deeper purpose than simply confirming this behavior: it provides a concrete understanding of why and how these methods fall short in the context of long-range tasks, and it motivates the design of DropSens as a principled alternative.

While it may not be surprising that these methods diminish long-range sensitivity, the key concern we highlight is that this fundamentally undermines their utility for the very tasks that deeper GNNs aim to solve:

The original motivation behind alleviating over-smoothing was to enable training of deeper MPNNs, which are only necessary when long-range dependencies matter. But if dropout-based methods worsen over-squashing, they negate this benefit – deep models may remain trainable (i.e., node representations don't collapse), but the model is still unable to leverage information from distant nodes for prediction.

This, in our view, is a critical and underappreciated limitation. We are not claiming that these methods are ineffective in general, but that they should not be applied in isolation when long-range interactions are important. Our hope is that this analysis will motivate practitioners to thoroughly assess methods designed for alleviating over-smoothing and training deep GNNs with regard to their effects on over-squashing.

How would the sensitivity analysis look like for a heterophilic dataset? [...] could potentially explain why we see no significant results on heterophilic datasets?

Thank you for these thoughtful questions. First, we’d like to clarify a possible misconception: our sensitivity analysis (as shown in Figure 1) does not depend on node labels. It measures the norm of the Jacobian of the final node representations (e.g., logits or regressands) with respect to the input node features. As such, it is agnostic to whether a dataset is homophilic or heterophilic – the analysis captures how much a node’s output depends on the input features of other nodes, regardless of class.
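For concreteness, here is a minimal sketch of how such a node-to-node sensitivity could be computed with automatic differentiation (an illustration only; `gnn` is a placeholder for any model mapping node features and edges to final node representations, and the exact estimator and norm used in the paper may differ):

```python
import torch

def sensitivity(gnn, x, edge_index, i, j):
    """Entry-wise L1 norm of the Jacobian d h_i / d x_j: how strongly
    node i's final representation depends on node j's input features."""
    x = x.detach().clone().requires_grad_(True)
    h = gnn(x, edge_index)                      # [num_nodes, out_dim]
    total = 0.0
    for k in range(h.size(1)):                  # one backward pass per output dim
        grad_x = torch.autograd.grad(h[i, k], x, retain_graph=True)[0]
        total = total + grad_x[j].abs().sum()   # row of the Jacobian w.r.t. x_j
    return total
```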

Second, random edge-dropping methods like DropEdge are entirely blind to node or edge attributes, including class membership. They drop edges uniformly at random, and thus reduce influence from both same-class and different-class neighbors indiscriminately. This uniformity could indeed be suboptimal, and it points toward a promising direction, as you hint: incorporating structural or label-aware heuristics to guide edge dropping may help preserve task-relevant information more effectively, especially in settings where homophily cannot be assumed.

As for why dropout methods perform much worse on heterophilic datasets than on homophilic datasets, we refer to Section 5.2:

"[...] homophilic datasets have local consistency in node labels, i.e. nodes closely connected to each other have similar labels. On the other hand, in heterophilic datasets, nearby nodes often have dissimilar labels. Since DropEdge-variants increase the sensitivity of a node’s representations to its immediate neighbors, and reduce its sensitivity to distant nodes, we expect it to improve performance on homophilic datasets but harm performance on heterophilic ones […]"

[...] one might intuitively expect dropping a small number of edges might be beneficial in heterophilic settings [...] might reduce the influence of noisy neighbours and thus might improve relative sensitvity between same-class distant nodes?

Thank you for this thoughtful observation. We understand the intuition that dropping a small number of edges might reduce the influence of noisy (i.e., different-class) neighbors in heterophilic settings. However, we believe this intuition overlooks a key limitation of random edge-dropping methods like DropEdge: they do not use any information from node features, class labels, or graph structure when sampling edge masks. Edges are dropped independently and uniformly at random, with no consideration for whether an edge connects same-class or different-class nodes.

This means that while it is possible that a given sampled mask might coincidentally drop some "noisy" edges, the probability of consistently retaining the correct paths for information flow – especially between distant same-class nodes – is extremely low. This is because successful long-range propagation typically requires entire paths of edges to be active in the forward pass. Dropping even a single edge on such a path breaks the flow of information. Hence, random dropout is more likely to disrupt useful long-range communication than improve it, particularly in heterophilic graphs where same-class nodes tend to be farther apart.
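As a rough illustration of this point (our numbers, assuming the edges along a single path of length $d$ are masked independently with drop probability $q$):

$$\Pr[\text{a path of } d \text{ edges survives}] = (1-q)^d, \qquad \text{e.g. } q = 0.2,\ d = 6 \;\Rightarrow\; 0.8^{6} \approx 0.26,$$

so even a modest drop rate leaves only about a quarter of 6-hop paths intact in any given forward pass.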

This is precisely the phenomenon our sensitivity analysis and experiments aim to highlight: dropout methods, while helpful in alleviating over-smoothing, can severely exacerbate over-squashing by indiscriminately pruning edges – thereby reducing a node’s sensitivity to informative distant features.

Ultimately, if the goal is to reduce influence from different-class neighbors while preserving important long-range same-class connections, this requires structured or informed edge-dropping, not random sampling. Our work points toward this need by showing the limitations of uniform random edge-drop methods in heterophilic and long-range settings.

Experiments on Long Range Graph Benchmark are missing [...]

Thank you for this thoughtful suggestion. We fully agree that the LRGB benchmark is highly relevant for evaluating long-range reasoning capabilities, and we appreciate the opportunity to clarify our decision.

First, we would like to highlight the computational cost of our experiments. For each dataset × GNN × dropping method combination, we conducted 20 runs across 9 different dropping probabilities, followed by 30 additional runs for the best configuration (see "Number of Independent Runs" under Appendix E.2 to learn why we take 50 runs for evaluation), totaling 210 runs per setting. Extending this setup to large-scale LRGB datasets, which involve significantly larger graphs and longer training times, was unfortunately beyond our available compute budget.

Second, our current choice of datasets was motivated by the desire to make direct and meaningful comparisons with prior works on graph rewiring for alleviating over-squashing (Black et al., 2023; Karhadkar et al., 2023; Topping et al., 2022). These prior works have widely adopted the benchmark suite used in our study, so evaluating on these datasets allowed us to assess DropSens in the context of established baselines.

That said, we recognize the value of testing dropout-style methods on LRGB tasks, and we agree that it is an exciting direction, especially since the LRGB datasets include both node- and graph-level tasks, making them a natural complement to our current benchmark. For that reason, we are aiming to share a comparison between NoDrop, DropEdge and DropSens on the PeptidesStruct graph-regression dataset by the end of the discussion period. We would appreciate your patience as we work on it.

I think the work currently assumes somewhat pessimistic view [...] can mitigate over-squashing while not making over-smoothing worse?

Thank you for sharing this reference – we weren’t previously aware of it, and we found it very insightful. We fully agree that strategic edge deletion offers a promising approach to addressing both over-smoothing and over-squashing. In fact, this aligns with our broader message: it’s not the act of edge deletion itself that is harmful, but how it is done.

Our work critiques dropout-style methods like DropEdge not because they remove edges per se, but because they do so uniformly at random, without regard to the underlying graph topology or feature space. By contrast, methods that perform informed or topology-aware edge deletion – including those leveraging phenomena like the Braess paradox to enhance spectral properties – can potentially mitigate over-squashing without exacerbating over-smoothing. We view such methods as complementary to our analysis, and have discussed a few works on unified treatment of over-smoothing and over-squashing in Appendix A.4; we have also added the reference you shared in this section.


References:

Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in GNNs through the lens of effective resistance. In International Conference on Machine Learning, 2023.

Kedar Karhadkar, Pradeep Kr. Banerjee, and Guido Montufar. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In International Conference on Learning Representations, 2023.

Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022.

Comment

I would like to thank the authors for their responses. Some of my concerns still remain

  1. As I said earlier in my review, my main concern is that the paper presents, as a fundamental flaw of dropout-like methods, an observation about methods that were not intended to be used in the long-range context in the first place. To elaborate, all the dropout-style methods have primarily been evaluated on homophilic graphs [1], [2], [3] (while [3] also shows results on graph classification, which has LRIs, but there I think the idea of model expressivity is also under consideration), where the labels/features of nodes are homophilic and LRIs might not be a problem; there, dropout-style methods work and also meet their intended design goal of training deeper GNNs. By showing they perform poorly on heterophilic graphs, the work essentially confirms an expected outcome rather than uncovering a surprising flaw. This makes the motivation feel less like a novel insight and more like an application of a known tool to the wrong problem.

  2. I understand that the sensitivity analysis does not depend on the node labels and relies on the node features, but my motivation for asking that question was that the concept of heterophily is not limited to class labels; it can also refer to feature heterophily, where connected nodes have dissimilar feature distributions [4]. In many real-world heterophilic graphs, the heterophily is a combined effect of both label and feature dissimilarity. I believe presenting a sensitivity analysis on a heterophilic graph would be a valuable addition. It would help us understand whether the observed reduction in sensitivity just harms all long-range connections equally, or whether it has a more nuanced, feature-dependent effect that could better explain the performance on these graphs.

  3. I also think the paper could benefit from analysing what kind of edges are more likely to be dropped when applying dropout-style algorithms to heterophilic datasets (same-class vs. different-class). I understand that the method is label-agnostic, but what I am failing to understand is this: in a heterophilic graph, the node/feature heterophily patterns are more complex than the current theoretical analysis suggests. The analysis only identifies two regimes – immediate neighbors vs. long-distance interactions – but both of these categories can contain beneficial or harmful connections depending on label relationships. For instance, in a heterophilic graph, dropping an edge increases the sensitivity to nearby neighbours, which might be bad; but if the long-range interaction we are trying to capture also involves a node of a different label, then reducing it could be beneficial, whereas if that node had the same label, then it is bad. This brings me back to my original doubt about applying random dropout-style algorithms to heterophilic graphs in the first place. If the method were optimizing some objective, for instance the spectral gap, then it would not matter whether we dropped same-label or different-label edges, because we would be optimizing a global objective of information flow (which the spectral gap characterizes), and this is what random dropout-style algorithms lack.

References:

  1. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. Rong et al. ICLR 2020.

  2. Graph Random Neural Networks for Semi-Supervised Learning on Graphs. Feng et al. NeurIPS 2020.

  3. DropGNN: Random Dropouts Increase the Expressiveness of Graph Neural Networks. Papp et al. NeurIPS 2021.

  4. What Is Missing In Homophily? Disentangling Graph Homophily For Graph Neural Networks. Zheng et al NeurIPS 2024.

Comment

Thank you very much for the response!

[...] all the dropout-style methods have primarily been evaluated on homophilic graphs [...] where the label/feature of nodes are homophilic and LRIs might not be a problem and Drop-out style methods work and they also meet their intended design goal of training deeper GNNs. [...] This makes the motivation feel less like a novel insight and more like an application of a known tool to the wrong problem.

Thank you for raising this concern. We believe the disagreement stems from a different understanding of the purpose behind dropout-style methods and deep MPNNs.

Our work focuses on message-passing neural networks (MPNNs), where depth is synonymous with reach, since the number of layers directly governs how far information can propagate in the graph. Therefore, we believe the motivation for training deep MPNNs is precisely to model long-range interactions. In fact, using deep MPNNs on tasks with only short-range dependencies may result in dataset-model misalignment. This is why we argue that evaluating such methods solely on homophilic datasets (where long-range signals are not necessary) paints an incomplete picture. As we discuss in Appendix A.2, this has led to relatively little attention being paid to how these methods affect performance on tasks that do require long-range modelling.

Since dropout-based methods compromise sensitivity to distant nodes, as we show both theoretically and empirically, the very reason for going deeper is undermined. The model may remain trainable by mitigating over-smoothing, but it cannot leverage the long-range signals, defeating the very purpose of going deeper.

Does this clarify the motivation behind our work, and the significance of our findings?

[...] it can also refer to feature heterophily, where connected nodes have dissimilar feature distributions [...] It would help us understand whether the observed reduction in sensitivity just harms all long-range connections equally, or if it has a more nuanced, feature-dependent effect that could better explain the performance on these graphs.

Ah, I see – that makes total sense! Just to make sure we are on the same page, you're suggesting that comparing the sensitivity profiles of homophilic and heterophilic datasets is helpful because homophily extends beyond node labels, so despite being label-agnostic, sensitivity analysis may offer an explanation for the difference in performances on these datasets – correct?

Would a comparison between the sensitivity profiles of PubMed (edge- and feature-homophilic, as suggested by Zheng et al. (2024, Table 3)) and Chameleon (edge- and feature-heterophilic) help address your concern?

I also think the paper could benefit from analysing what kind of edges are more likely dropped when applying Drop-out style algorithms to heterophilic datasets (same-class vs different class) [...] For instance, in a heterophilic graph dropping an edge increases the sensitvity to nearby neighbours which might be bad but if the long range interaction we are trying to capture also has a different label then it could be beneficial vs if that node had a same label, then its bad.

If I understand correctly, the question here is whether random edge-dropping affects same-class and different-class node pairs differently, particularly in the context of heterophilic graphs. In other words, you're pointing to how the topology of heterophilic datasets may influence which types of connections are disproportionately impacted by dropout.

Would it be helpful if we conducted an analysis similar to what we did for Cora, but using Chameleon instead, and with same-class and different-class node pairs grouped and analyzed separately? This would allow us to see whether dropout has different effects depending on the label relationship between nodes.


Yilun Zheng, Sitao Luan, and Lihui Chen. What Is Missing In Homophily? Disentangling Graph Homophily For Graph Neural Networks. In Annual Conference on Neural Information Processing Systems, 2024.

Comment

Thank you for your response. Some clarifications

  1. I think the discussion here is a little nuanced. Let's break this down into two parts:

a) Firstly, increasing model depth does not help with long-range interactions, as authors in [1] show it quickly leads to vanishing gradients and of course there is the problem of over-smoothing. On the other hand, graph rewiring has largely been proposed to mitigate over-squashing/help model long-range interactions. All of these techniques modify the graph in a principled way either based on Ricci curvature, spectral gap, effective resistance etc. which optimize for certain topological properties of the graph, for instance, spectral gap is tied to how well information can flow in a graph and is guaranteed to help with long-range interactions. Random edge modifications do not have such guarantees.

b) Consequently, methods that talk about training deeper GNN models do not have the objective of modeling long-range interactions. For instance, in [2] DropEdge acts as a graph augmentor and is supposed to help with over-fitting and over-smoothing, and thus allows for training deeper GNNs; similarly, in [3] DropGNN helps with increasing model expressivity (Weisfeiler-Leman tests etc.) and they don't even train a deeper GNN model; in [4] the goal is to reduce memory consumption, and the OGB datasets they use are all largely homophilic.

This is precisely why I believe the motivation is somewhat limited. Random dropout-style methods have different objectives. While it would be an interesting experiment to see what happens when such methods are applied to heterophilic datasets, this does not necessarily mean it fundamentally undermines their utility for the very tasks that deeper GNNs aim to solve.

  2. Yes, exactly! I think having a figure like Figure 1 for a heterophilic dataset should provide more insights.

  3. Yes, I think that would provide valuable insight into whether random dropout-style algorithms affect all types of edges indiscriminately or are more likely to affect certain types of edges (i.e. edges connecting same-label nodes vs. different-label nodes). If, for instance, DropSens affected different-label pairs more often than same-label ones, then it could also act as a message-passing reducer, effectively allowing you to mitigate over-smoothing as well and possibly show that training deeper GNNs is possible. I'm not asking the authors to conduct this experiment - this is just another line of reasoning that might strengthen the paper's claims.

I maintain my concerns about the fundamental motivation and methodology.

References :

[1] On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. DiGiovanni et al. ICML 2023.

[2] DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. Rong et al. ICLR 2020.

[3] DropGNN: Random Dropouts Increase the Expressiveness of Graph Neural Networks. Papp et al. NeurIPS 2021.

[4] Training Graph Neural Networks with 1000 Layers. Li et al. ICML 2021.

Comment

[...] this does not necessarily mean it fundamentally undermines their utility for the very tasks that deeper GNNs aim to solve.

Kindly see our remark above. To reiterate, we are not suggesting that these algorithms are without merit – they are effective at preventing overfitting and can help alleviate over-smoothing, thereby improving the trainability of deeper MPNNs. However, this often comes at the cost of exacerbating over-squashing, which limits their suitability for long-range tasks – the very setting where deeper MPNNs are most needed.

Instead, we view these techniques primarily as different regularizers that can be complementary to methods specifically aimed at alleviating over-squashing. For instance, Dropout was used alongside graph rewiring approaches in works such as Black et al. (2023); Karhadkar et al. (2023); and Topping et al. (2022).

We hope this makes our motivation clear.


References

Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in GNNs through the lens of effective resistance. In International Conference on Machine Learning, 2023.

Kedar Karhadkar, Pradeep Kr. Banerjee, and Guido Montufar. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In International Conference on Learning Representations, 2023.

Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022.

Comment

First off, thank you very much for actively engaging in the discussion – we really appreciate the thoughtful feedback!

We're currently working on adding experimental results for heterophilic datasets, but in the meantime, we wanted to briefly respond to your first point.

Firstly, increasing model depth does not help with long-range interactions.

We mostly agree – increasing depth does improve the reach of MPNNs, but problems like over-smoothing and over-squashing limit their potential in practice. That’s precisely why much of the community, including us, is interested in mitigating these issues.

Graph rewiring has largely been proposed to mitigate over-squashing [...] these techniques modify the graph in a principled way [...] Random edge modifications do not have such guarantees.

We agree that dropout-based techniques don't provide formal guarantees for information flow. Therefore, their detrimental effects on over-squashing do not invalidate the claims made by those works – namely, that they alleviate over-smoothing and improve the trainability of deeper MPNNs.

Consequently, methods that talk about training deeper GNN models do not have the objective of modeling long-range interactions. For instance in [2] the DropEdge acts as a graph augmentor and is supposed to help with over-fitting and over-smoothing and thus allows for training deeper GNNs [...]

Respectfully, we’d disagree. The motivation for deep MPNNs largely stems from the goal of capturing long-range interactions (LRIs). As noted in Li et al. (2018), a foundational work on over-smoothing:

"Since the graph convolution is a localized filter [...] a shallow GCN cannot sufficiently propagate the label information to the entire graph with only a few labels. [...] the accuracy of GCNs decreases much faster than the accuracy of label propagation. [...] it reflects the inability of the GCN model in exploring the global graph structure."

If the task were inherently short-range, there would be little reason to train deep MPNNs, i.e. execute multiple message-passing steps, at all. In such cases, improving the capacity of the message, aggregation or update functions (e.g., using deeper MLPs as update functions in GIN), while keeping the number of message-passing steps low, would be a more effective and efficient alternative, and would avoid the challenge of over-smoothing altogether.

Alon & Yahav (2021) make a similar point when distinguishing between short- and long-range tasks:

"Over-smoothing was mostly demonstrated in short-range tasks [...] – tasks that have small problem radii, where a node’s correct prediction mostly depends on its local neighborhood. [...] Since the learning problems depend mostly on short-range information in these datasets, it makes sense why more layers than the problem radius might be extraneous."

This framing further supports our view that the primary motivation for alleviating over-smoothing and training deeper GNNs is, indeed, to capture LRIs.

Finally, regarding the point that random dropout-like algorithms target overfitting: while we acknowledge their effectiveness on short-range tasks, in the context of long-range tasks, their benefits are often overshadowed by increased over-squashing and a tendency to overfit to short-range signals. We explore this in Appendix F, and request you to check Figures 7 and 9 for supporting evidence.

[...] similarly in [3] DropGNN helps with increasing model expressivity (Weisfeler-Leman tests etc) they dont even train a deeper GNN model [...]

We agree, and thank you for pointing this out – we apologize for the oversight. We've now added a clarification noting that while DropGNN falls under the broader class of edge-dropping algorithms, its design was not motivated by the goal of alleviating over-smoothing or enabling the training of deeper GNNs.


Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021.

Qimai Li, Zhichao Han, and Xiao-ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Comment

Below, we present the average sensitivity under DropEdge, normalized by average sensitivity with NoDrop, to understand how the relative sensitivities vary for homophilic and heterophilic datasets:

| Dataset | $d_G(j, i) = 0$ | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | 1.783e+00 | 1.165e+00 | 9.752e-01 | 7.570e-01 | 6.473e-01 | 4.837e-01 | 2.519e-01 |
| CiteSeer | 1.727e+00 | 1.045e+00 | 9.798e-01 | 6.984e-01 | 5.559e-01 | 3.672e-01 | 1.946e-01 |
| PubMed | 3.511e+00 | 1.022e+00 | 1.089e+00 | 6.777e-01 | 6.151e-01 | 4.290e-01 | 4.576e-01 |
| Chameleon | 3.034e+00 | 1.295e+00 | 9.625e-01 | 8.401e-01 | 7.816e-01 | 5.815e-01 | 6.126e-01 |
| Squirrel | 1.602e+00 | 1.429e+00 | 1.235e+00 | 8.854e-01 | 8.444e-01 | 7.787e-01 | 6.763e-01 |
| TwitchDE | 5.657e+00 | 2.357e+00 | 1.203e+00 | 9.466e-01 | 1.044e+00 | 9.640e-01 | 4.968e-01 |
| Actor | 2.493e+00 | 1.663e+00 | 1.289e+00 | 8.653e-01 | 7.941e-01 | 7.081e-01 | 5.314e-01 |

It is interesting to note that the relative sensitivity is higher for heterophilic datasets than for homophilic datasets. That is, the graph topology and node features enable LRIs to be modelled better for heterophilic datasets than for homophilic datasets.

Yet, relative sensitivity falls below 1 (i.e., sensitivity under DropEdge is lower than under NoDrop) for nodes more than 2-hops away. This is not much of an issue for homophilic datasets, which are inherently short-range tasks, but it explains the poor performance on heterophilic datasets, which require modelling LRIs.


Next, we observe how DropSens improves on DropEdge by comparing their sensitivity profiles:

| Dataset | $d_G(j, i) = 0$ | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | 7.312e-01 | 1.025e+00 | 1.067e+00 | 1.123e+00 | 1.094e+00 | 1.155e+00 | 1.228e+00 |
| CiteSeer | 6.768e-01 | 1.066e+00 | 1.079e+00 | 1.318e+00 | 1.305e+00 | 1.633e+00 | 2.038e+00 |
| PubMed | 3.325e-01 | 1.025e+00 | 9.755e-01 | 1.387e+00 | 1.478e+00 | 1.708e+00 | 1.658e+00 |
| Chameleon | 5.106e-01 | 8.513e-01 | 1.042e+00 | 1.117e+00 | 1.198e+00 | 1.330e+00 | 1.174e+00 |
| Squirrel | 1.043e+00 | 9.056e-01 | 9.158e-01 | 1.060e+00 | 1.034e+00 | 1.019e+00 | 1.052e+00 |
| TwitchDE | 4.012e-01 | 6.809e-01 | 9.326e-01 | 1.041e+00 | 9.943e-01 | 9.586e-01 | 1.295e+00 |
| Actor | 5.030e-01 | 7.333e-01 | 9.069e-01 | 1.162e+00 | 1.153e+00 | 1.170e+00 | 1.353e+00 |

It is clear that DropSens allows node representations to be more sensitive to the features of distant ($d \geq 3$) nodes than DropEdge does, thereby improving performance on heterophilic datasets.

What is interesting is that for homophilic datasets, DropSens is more sensitive to nearby nodes than DropEdge, allowing the model to effectively capture short-range interactions. Although it is poorer at capturing short-range interactions for heterophilic datasets, the table below shows that DropSens improves over NoDrop on that front, similar to DropEdge:

| Dataset | $d_G(j, i) = 0$ | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | 1.304e+00 | 1.193e+00 | 1.040e+00 | 8.503e-01 | 7.081e-01 | 5.586e-01 | 3.094e-01 |
| CiteSeer | 1.169e+00 | 1.114e+00 | 1.057e+00 | 9.202e-01 | 7.254e-01 | 5.997e-01 | 3.966e-01 |
| PubMed | 1.167e+00 | 1.047e+00 | 1.063e+00 | 9.399e-01 | 9.090e-01 | 7.325e-01 | 7.586e-01 |
| Chameleon | 1.549e+00 | 1.102e+00 | 1.003e+00 | 9.387e-01 | 9.367e-01 | 7.732e-01 | 7.189e-01 |
| Squirrel | 1.671e+00 | 1.294e+00 | 1.131e+00 | 9.383e-01 | 8.729e-01 | 7.931e-01 | 7.118e-01 |
| TwitchDE | 2.269e+00 | 1.605e+00 | 1.122e+00 | 9.856e-01 | 1.038e+00 | 9.241e-01 | 6.436e-01 |
| Actor | 1.254e+00 | 1.219e+00 | 1.169e+00 | 1.005e+00 | 9.157e-01 | 8.286e-01 | 7.191e-01 |

This set of experiments suggests that DropSens strikes a perfect balance between NoDrop and DropEdge:

  1. improves sensitivity to nearby nodes ($d \leq 2$), over NoDrop
  2. improves sensitivity to distant nodes ($d \geq 3$), over DropEdge
  3. mitigates over-fitting (by stochastic regularization) and over-smoothing (by reducing message-passing), just as DropEdge
Comment

Thank you for your responses. I appreciate the authors taking time to conduct extra experiments. I am satisfied with most of these answers and the new experiments are also interesting. I will be increasing my score.

Comment

Thank you very much for acknowledging the new results!

We also request you to take a look at our official comment, where we have updated DropSens rankings compared to graph-rewiring methods, and also added results for the PeptidesStruct (LRGB) dataset.

Review (Rating: 4)

The paper looks at various dropout methods for graph NNs. These methods have been introduced to try to alleviate the over-smoothing problem. The paper postulates that these methods might actually make another problem, over-squashing, worse, because they introduce more bottlenecks in the graph. The authors show this is in fact true when the model is a linear GCN. The analysis is used to design a new dropout method, DropSens, which makes the dropout node-dependent in such a way that it reduces over-squashing. DropSens is shown to be more effective in enabling long-range interactions in GCNs than alternative graph rewiring methods.

Strengths and Weaknesses

Strengths:

  • The paper provides a clear and sufficient background for understanding the problem of over-squashing.
  • The proposed method (GCN+DropSens) is clear and targeted, and provides obvious improvements over the GCN+Rewiring in the node-classification setting.
  • The work provides a fairly convincing explanation as to why dropout methods damage long-range abilities for GCNs.

Weaknesses:

  • While performance improvements are clear for node-classification tasks, improvements are less clear on the graph-classification tasks (Table 2 (b)). For example, DropSens underperforms by $\sim 10$ points on Mutag and $\sim 15$ points on IMDb.
  • DropSens only performs well on GCN, and not on GAT or GIN.
  • The paper does not propose an alternative for these GAT/GIN.
  • The empirical analysis should include more datasets.
  • Empirical results from the paper are not clear from the abstract (i.e. that DropSens only works well on GCN).

Overall the results are lacking, and the paper could be drastically improved by adding more datasets and/or an improvement for GIN.

Minor issues:

  • Some notations are not clearly defined. For example, ND/DE in the expectations (line 195, line 145); the difference between $P$ and $\dot{P}$ could be clearer; and $E_{M^i}$ is not properly defined.
  • A comparison between DropSens and the other dropout variants would be helpful; perhaps this could be done by including the best dropout method in Table 2.
  • Missing error bars in Table 2, Figure 3

Questions

  • Do you mean to include a Kronecker delta ($\delta_{ij}$) in Eq. (3.3)?
  • Can you provide more experiments with a wider variety of datasets? Having only 3 homophilous and 3 heterophilous datasets is lacking.
  • Can you provide an explanation as to why you believe DropSens has an advantage over graph rewiring methods?
  • Can you speculate (or even better, evaluate) on a sensible version of DropSens for GIN/GAT?

Limitations

Authors should mention the small number of 'real world' datasets as a limitation.

Final Justification

I believe this is a clearly written paper, with the proposed DropSens method well motivated and producing benefits in practice regarding long-range interactions. I thank the authors for running more benchmarks.

While the authors addressed the majority of my points, I keep my rating because the method is weaker for the other common architectures (GAT/GIN) and there doesn't seem to be any obvious recourse here.

Formatting Issues

No concerns

Author Response

Thank you very much for the insightful summary, for recognizing the strengths of our work, and for the overall feedback!

Before we address your concerns, we wanted to bring to your attention that we had accidentally omitted the DropSens configuration $(c, q_{\max}) = (0.8, 0.3)$ from our original hyperparameter sweep. This has now been included, and the results have been updated accordingly. The overall findings remain unchanged, with the exception that DropSens now ranks 3rd on the Proteins dataset.


Kindly find our responses to the concerns raised:

While performance improvements are clear for node-classification tasks, improvements are less clear on the graph-classification tasks (Table 2 (b)). For example, DropSens underperforms by $\sim 10$ points on Mutag, $\sim 15$ points on IMDb

Thank you for highlighting this. We agree that DropSens underperforms the best rewiring method on the two datasets you mention. That said, we would like to point out that DropSens achieves the best performance on 3 out of the 6 graph classification tasks, and ranks 3rd on another. Note that in the case of the IMDb dataset, DIGL exhibits unusually strong performance, even outperforming several more recent graph rewiring methods such as SDRF, FoSR, and GTR. While we are unsure of the exact reason for this, we do observe that DropSens remains competitive with the rest of the baselines on this dataset.

DropSens only performs well on GCN, and not on GAT or GIN.

The paper does not propose an alternative for these GAT/GIN.

Can you speculate (or even better, evaluate) on a sensible version of DropSens for GIN/GAT?

Thank you – we agree that this is an important limitation, and we've acknowledged it in our conclusion.

DropSens was specifically derived for GCN-style architectures, where the edge weights used in message aggregation are simple functions of node degree. Extending it to other architectures raises nontrivial challenges:

  1. GIN uses constant edge weights, so the sensitivity of a node to its neighbors simplifies to $(1 - q)\|\mathbf{W}\|$ for dropping probability $q$. Enforcing a fixed information preservation ratio $c$ leads to $q = 1 - c$, which is equivalent to DropEdge (see the one-line check below this list). This limitation arises because GIN’s aggregation is insensitive to the local graph structure, making a principled variant of DropSens unnecessary.
  2. GAT, in contrast, uses feature-dependent attention weights that vary at every layer and iteration. This means a DropSens-style approach would require recomputing edge masks dynamically in each iteration and each layer, negating the simplicity we aim for. Moreover, the presence of softmax attention makes a closed-form derivation of sensitivity intractable (if at all possible).
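For completeness, the one-line check behind point 1 above (with $q$ the edge-drop probability and $c$ the target preservation ratio):

$$(1 - q)\,\|\mathbf{W}\| = c\,\|\mathbf{W}\| \quad\Longrightarrow\quad q = 1 - c,$$

i.e. a single, degree-independent drop probability, which is exactly DropEdge with drop probability $1 - c$.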

In short, while DropSens is not directly applicable to GIN or GAT, this is a consequence of their architectural design, not a shortcoming of the method per se. We do not propose alternatives for these architectures in this paper, but see this as an opportunity for future work – developing principled dropout strategies tailored to different message-passing schemes.

Does this address your concern regarding the architectural specificity of DropSens and the lack of proposed variants for GAT and GIN?

The empirical analysis should include more datasets.

Can you provide more experiments with a wider variety of datasets? Having only 3 homophilous and 3 heterophilous datasets is lacking.

Thank you for the suggestion. We used a total of 14 datasets in our empirical evaluation – one synthetic graph regression dataset, 7 real-world node classification datasets and 6 graph-classification ones. The node classification datasets were carefully chosen to span both short-range and long-range tasks, in line with prior work studying over-squashing effects. The graph classification datasets follow the benchmark selection used in recent influential work on rewiring strategies. While we agree that broader empirical coverage is always desirable, we believe our current set strikes a strong balance between diversity and relevance for the problem setting.

That said, we are aiming to share a comparison between NoDrop, DropEdge and DropSens on the PeptidesStruct graph-regression dataset from the Long Range Graph Benchmark by the end of discussion period. We would appreciate your patience as we work on it.

Empirical results from the paper are not clear from the abstract (i.e. that DropSens only works well on GCN).

Thank you for this feedback – we’ve updated the abstract to clarify that DropSens is specifically designed for GCNs, and that its empirical effectiveness is primarily demonstrated in that setting:

"To address this, we introduce DropSens, a sensitivity-aware variant of DropEdge, which is developed following the message-passing scheme of GCN. [...] DropSens with GCN consistently outperforms graph rewiring techniques designed to mitigate over-squashing, suggesting that simple, targeted modifications can substantially improve a model's ability to capture long-range interactions."

Some notations are not clearly defined. For example, ND/DE in the expectations (line 195, line 145); the difference between $P$ and $\dot{P}$ could be clearer; and $E_{M_i}$ is not properly defined.

Thank you for pointing this out – we’ve made the following clarifications to address these notational issues:

  1. The subscripts ND and DE now explicitly refer to NoDrop and DropEdge models, respectively.
  2. We’ve added a footnote clarifying that $\dot{\mathbf{P}}$ and $\ddot{\mathbf{P}}$ denote the expected propagation matrices under the asymmetric and symmetric propagation rules, respectively.
  3. To reduce ambiguity, we’ve renamed $\mathbf{P}$ – previously used for the transition matrix of a uniform random walk – to $\mathbf{T}$.
  4. We’ve also defined the expectation $\mathbb{E}_{\mathbf{M}_i}$ over edge masks, right before Equation 3.2.

Does this address your concerns regarding notation clarity?

A comparison between DropSens and the other dropout variants would be helpful; perhaps this could be done by include the best dropout method in Table 2.

Thank you for the suggestion. We agree that a broader comparison with other dropout variants could be valuable, especially to situate DropSens in the wider space of dropping methods. Accordingly, we have added the performance rankings for dropout methods in Appendix G.1 (Table 6). Notably, DropSens ranks 1st in $11/38 \approx 29\%$ of dataset × model combinations, and places within the top 3 in $21/38 \approx 55\%$ of cases, highlighting its efficacy across a broad range of settings.

Just to clarify our scope and intent: our goal in introducing DropSens was not to establish a new state-of-the-art dropout method, but rather to demonstrate that techniques originally designed for mitigating over-smoothing can be adapted to also address over-squashing in a principled way. For this reason, our comparisons focus on DropEdge and NoDrop as representative baselines.

Missing error bars in Table 2, Figure 3

Thank you for the observation. We omitted error bars from Table 2 primarily due to space constraints. Additionally, since Table 2 already reports p-values, which directly address statistical significance, we felt this conveyed the relevant information more concisely. For Figure 3, error bars are non-trivial to compute since we report relative improvement. Instead, in the revised version we have included the mean difference and standard error as text above the bars, to reflect variability across runs:

| Dataset | Mean difference ± standard error |
| --- | --- |
| Cora | $-0.11 \pm 0.07$ |
| CiteSeer | $-0.45 \pm 0.13$ |
| PubMed | $-0.00 \pm 0.06$ |
| Chameleon | $1.70 \pm 0.57$ |
| Squirrel | $0.36 \pm 0.21$ |
| TwitchDE | $0.08 \pm 0.15$ |
| Mutag | $2.80 \pm 1.98$ |
| Proteins | $1.41 \pm 0.82$ |
| Enzymes | $1.31 \pm 1.19$ |
| Reddit | $4.71 \pm 0.99$ |
| IMDb | $0.04 \pm 1.03$ |
| Collab | $-0.67 \pm 0.84$ |

Does this address your concern?

Do you mean to include a Kronecker delta ($\delta_{ij}$) in Eq. (3.3)?

Yes, thank you for catching that. We have specified that Equations 3.3 and 3.4 hold if and only if $(j \to i) \in \mathcal{E}$.

Can you provide an explanation as to why you believe DropSens has an advantage over graph rewiring methods?

Thank you for this question. We want to clarify that we do not claim DropSens to have an advantage over graph rewiring methods. Our goal was to demonstrate that a simple modification to DropEdge can lead to consistent performance gains, particularly by taking sensitivity to distant nodes into account.

That said, we believe DropSens performs competitively in practice for two reasons:

  1. It addresses both over-smoothing and over-squashing simultaneously, rather than focusing on just one of the problems, and
  2. As a stochastic method, it introduces a regularization effect that improves generalization.
Comment

We humbly ask if our rebuttal has helped clarify your concerns. If any points remain unclear or unaddressed, please don’t hesitate to let us know – we’d be glad to clarify them further before the discussion period ends on August 6th.

Comment

Thank you for taking the time to respond to my review. My points have been mostly addressed.

Does this address your concern regarding the architectural specificity of DropSens and the lack of proposed variants for GAT and GIN?

Is it possible to document these insights in the limitations section? These are helpful observations for understanding how DropSens fits in the broader context.

we believe our current set strikes a strong balance between diversity and relevance for the problem setting.

I disagree --- I do not think benchmarking more datasets is an unreasonable ask. Doing more than prior work would be a big improvement. I look forward to the benchmarking of PeptidesStruct graph-regression dataset. For this reason I won't be increasing my score.

Comment

Thank you again for your thoughtful feedback.

Of course, we will make sure to document the current limitations of DropSens in the revised manuscript, including why adapting it to architectures like GIN and GAT is non-trivial. We hope these observations can help guide future work toward extending DropSens-style approaches to a broader range of models.

We didn’t mean to imply that benchmarking additional datasets is an unreasonable request – rather, we aimed to strike a balance between coverage and relevance given space and resource constraints. That said, we understand if concerns about dataset diversity remain. In this regard, we’d like to point you to the new results for NoDrop, DropEdge, and DropSens on PeptidesStruct, shared in the comment titled "Updated DropSens Rankings and Results on the LRGB Dataset".

We truly appreciate your engagement and hope our responses have clarified our position.

Comment

Thank you for providing the further results, these will improve the completeness of the paper.

The issue of experimental setup is slightly worrying

We realized that our experimental setup was slightly different compared to the one used by Black et al. (2023); Karhadkar et al. (2023),

Are you able to confirm that you can reproduce figures in the referenced works (e.g. for vanilla or the other Drop- methods)?

Comment

Thank you for pointing this out! We understand how differences in experimental setup can raise concerns.

To clarify, the only difference lies in the training duration and stopping criterion:

  • In our original setup, we trained for a fixed 300 epochs and reported the test accuracy corresponding to the best validation accuracy.
  • In contrast, Black et al. (2023) and Karhadkar et al. (2023) trained for up to 100,000 epochs with early stopping, halting when there was less than 1% improvement in validation accuracy over the last 100 epochs, and then reported the final test accuracy.

We believe this early stopping criterion is problematic, as it implicitly demands greater accuracy gains as training progresses. For instance, with 10% accuracy, a 0.1% gain is sufficient; at 90%, a 0.9% gain is required to continue training. This penalizes progress at later stages, when improvements naturally plateau, and results in premature convergence. In fact, we observed that in many cases, training stopped well before 300 epochs under their setup.
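To make the criterion concrete, here is a minimal sketch of our reading of that stopping rule (illustrative only; the exact implementation in Black et al. (2023) and Karhadkar et al. (2023) may differ):

```python
def should_stop(val_acc_history, window=100, rel_improvement=0.01):
    """Stop when validation accuracy has improved by less than 1% (relative)
    over the last `window` epochs."""
    if len(val_acc_history) <= window:
        return False
    past, current = val_acc_history[-window - 1], val_acc_history[-1]
    # Relative threshold: at 10% accuracy a ~0.1-point gain suffices to keep
    # training, while at 90% accuracy a ~0.9-point gain is required -- hence
    # the concern that training halts prematurely once gains plateau.
    return current < past * (1.0 + rel_improvement)
```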

We have updated our implementation to match their stopping protocol, so the comparison is now aligned and fair. That said, we respectfully believe their criterion leads to suboptimal training in many cases.

If the concern is about selective reporting, we want to emphasize that while DropSens improved its rank on 5 datasets and dropped only on CiteSeer, the raw accuracy actually decreased on Cora, CiteSeer, PubMed, Chameleon, Squirrel, IMDb, and Collab with the updated setup. This reflects our commitment to transparency and fair comparison, even when it doesn't favor our method.


Since Black et al. (2023); Karhadkar et al. (2023) do not evaluate dropout-based methods, a direct comparison is unfortunately not possible. However, we can still compare against their baseline, which corresponds to the NoDrop setting in our work. We are in the process of collecting these results under the updated experimental setup and would appreciate your patience in the meantime.

Comment

Hi again, and thank you for your continued engagement.

We have run the experiments on graph classification tasks using the baseline model setup from Karhadkar et al. (2023) – that is, no rewiring and feature dropout with probability 0.5, which they apply across all rewiring methods. Below, we present a comparison between their reported results and ours under this setup:

| | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
| --- | --- | --- | --- | --- | --- | --- |
| Theirs | 72.15 | 70.98 | 27.67 | 68.26 | 49.77 | 33.78 |
| Ours | 72.00 | 66.52 | 25.08 | 67.77 | 50.95 | 32.99 |

The results are broadly consistent across datasets, suggesting that our implementation aligns well with theirs. We note a few minor differences:

  1. Their results are averaged over 100 independent runs, while ours are based on 20 runs due to the limited time available during the rebuttal period.
  2. For binary classification tasks, their implementation uses a two-dimensional output layer, while we use a single output logit. While both approaches are theoretically equivalent, they may lead to slight differences in optimization trajectories.
  3. They apply feature dropout after message passing, whereas we apply it to the layer input, which is the more standard approach. While this distinction is not directly relevant to DropSens (which does not use feature dropout), it may affect the comparison presented above. To the best of our knowledge, however, it does not significantly alter the model.

We hope this comparison helps clarify the alignment between our implementation and theirs. Please let us know if further details would be helpful.

Comment

Indeed, selective reporting would be a concern; your methodology sounds reasonable, however. Do you have error bars to show that these figures are/are not significantly different? I assume the differences are really mostly due to your point #2 and/or #3 above though.

Are you able to document the methodology you used, and how they are different to Karhadkar et al. (2023) in the appendix of your paper? A reference to show that applying feature dropout to the layer input is the more standard approach would help too.

Comment

Thank you for your reply!

Indeed, selective reporting would be a concern

Could you please clarify what you are referring to here? In the main paper, we have reported results on all datasets used in Topping et al. (2022) and Karhadkar et al. (2023). For comparing our implementation with Karhadkar et al. (2023), we evaluate the single method common to both works, i.e. feature dropout with probability 0.5, across all the datasets they test.

Do you have error bars to show that these figures are/are not significantly different?

Yes, our apologies for not including them earlier. Here you go:

| | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
| --- | --- | --- | --- | --- | --- | --- |
| Theirs | 72.15 ± 2.44 | 70.98 ± 0.74 | 27.67 ± 1.16 | 68.26 ± 1.10 | 49.77 ± 0.82 | 33.78 ± 0.49 |
| Ours | 72.00 ± 9.80 | 66.52 ± 4.43 | 25.08 ± 5.25 | 67.77 ± 3.52 | 50.95 ± 5.33 | 32.99 ± 0.46 |

Are you able to document the methodology you used, and how they are different to Karhadkar et al. (2023) in the appendix of your paper? A reference to show that applying feature dropout to the layer input is the more standard approach would help too.

Yes, of course, we will make sure to include it in the camera-ready version!

Comment

I was simply referring to your early comment ("If the concern is about selective reporting, we want to..."). I am not raising it as an issue! I agree that your work has a clear methodology.

Comment

Oh, understood! Apologies for the misunderstanding!

Review (Rating: 4)

This paper analyses how popular random‐dropping techniques for deep graph neural networks (GNNs) --- DropEdge, DropNode, DropAgg, DropGNN and DropMessage---affect over-squashing, a bottleneck that blocks long-range information flow. By computing the expected Jacobian of linear and nonlinear GCNs under random edge masks, the authors prove that DropEdge-style methods shrink sensitivity between distant nodes at an exponential rate, while marginally increasing sensitivity to immediate neighbours. To mitigate this, they introduce DropSens, which chooses a per-edge drop probability that preserves a user-specified fraction c of message strength, thus keeping long-range sensitivity constant while dropping the same number of edges. Experiments on SyntheticZINC, six node-classification benchmarks (homophilic vs heterophilic) and six graph-classification datasets show: (i) random dropping indeed helps short-range tasks but hurts or fails to help long-range ones; (ii) DropSens consistently outperforms or matches recent graph-rewiring baselines on long-range node tasks and is competitive on graph-level tasks while remaining projection-free.

Strengths and Weaknesses

Strengths:

  1. The theoretical contribution is clear and original: the authors give the first explicit sensitivity decay analysis for edge-dropping schemes and extend it to nonlinear MPNNs, filling a gap in the literature on over-squashing.

  2. The proposed DropSens is simple, requires no extra model parameters, and is derived directly from the theory, illustrating good “analysis-to-algorithm” design.

  3. Empirical evaluation spans synthetic and real datasets, uses three backbone architectures (GCN, GIN, GAT) and includes statistical significance tests and effect sizes, lending credibility to the claims. The authors also release pseudocode and discuss computational shortcuts, supporting reproducibility.

Weaknesses:

  1. Empirical scope is still modest: all large-scale experiments run on CPUs with no wall-clock comparisons, so the claimed training-speed advantage of DropSens remains anecdotal.

  2. DropSens relies on in-degree–specific formulas and was tuned for GCN; results for GIN and GAT are weaker, underlining limited architecture generality.

  3. The paper does not explore the effect of the preservation hyper-parameter c nor the sensitivity of DropSens to degree distribution outliers.

Questions

  1. Please report GPU wall-clock time and memory for DropSens vs DropEdge and two rewiring baselines on the Squirrel (heterophilic) dataset with a 32-layer GCN until convergence to a fixed validation loss.

  2. Provide an ablation on SyntheticZINC varying the preserved-information parameter $c \in \{0.7, 0.8, 0.9, 0.95\}$; show accuracy and number of edges kept.

  3. Extreme-degree graphs: How does DropSens behave on power-law graphs where a few hubs dominate degrees? Include an experiment or theoretical bound on maximum drop probability in that regime.

Limitations

The authors candidly state that the theory assumes standard message-passing with known parallel transport and that DropSens must be tailored to the propagation rule; extending it to attention or transformer-style layers is non-trivial. They also acknowledge that experiments are small-scale and limited to a few datasets.

Justification for Final Rating

Thank you to the author(s) for clarifying my previous misunderstanding, taking my suggestion into account, and completing several extra experiments, which can be incorporated into the revised version of the paper. I have no more questions, and at this stage I support the acceptance of the paper.

Formatting Issues

None noted. The paper adheres to NeurIPS formatting guidelines.

Author Response

Thank you very much for the insightful summary, for recognizing the strengths of our work, and for the overall feedback!

Before we address your concerns, we wanted to bring to your attention that we had accidentally omitted the DropSens configuration $(c, q_{\max}) = (0.8, 0.3)$ from our original hyperparameter sweep. This has now been included, and the results have been updated accordingly. The overall findings remain unchanged, with the exception that DropSens now ranks 3rd on the Proteins dataset.


Kindly find our responses to the concerns raised:

Empirical scope is still modest: all large-scale experiments run on CPUs with no wall-clock comparisons, so the claimed training-speed advantage of DropSens remains anecdotal.

Thank you for pointing this out. To clarify, the hardware setup used for all large-scale experiments is described in Appendix E: we use 4$\times$ NVIDIA GeForce GTX TITAN X GPUs (12 GB VRAM each).

You're absolutely right that wall-clock runtimes were not reported in the initial submission. We have now addressed this:

We first show the preprocessing time required to compute edge-dropping probabilities for DropSens as a function of node in-degree, since the cost of solving for $q_i$ grows with $d_i$. We plot results for $d_i = 1, 2, \ldots, 100$, averaged over $c = 0.1, 0.2, \ldots, 0.9$, in Figure 6b; for readability, we report values only for $d_i = 1, 2, 5, 10, 20, 50, 100$ in the table below.

| In-degree $d_i$ | 1 | 2 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|---|
| Exact computation time ($\times 10^{-2}$) | $3.316 \pm 5.961$ | $1.904 \pm 0.952$ | $2.713 \pm 0.728$ | $4.011 \pm 1.514$ | $10.181 \pm 6.186$ | $43.803 \pm 23.147$ | $169.897 \pm 52.593$ |
| Approximate computation time ($\times 10^{-6}$) | $3.253 \pm 0.568$ | $1.316 \pm 0.131$ | $1.097 \pm 0.021$ | $1.117 \pm 0.025$ | $1.113 \pm 0.013$ | $1.089 \pm 0.062$ | $1.103 \pm 0.011$ |

We now compare the initialization and sampling time of DropSens ($c = 0.8$, $q_{\max} = 0.5$) against the sampling time of DropEdge, averaged over 10 runs, isolating the runtime difference since sampling is the only point where the two methods differ computationally.$^1$ We have added this comparison in Figure 7, and also present the results in the table below, with all entries reported in milliseconds:

| | Cora | CiteSeer | PubMed | Chameleon | Squirrel | TwitchDE | Actor | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DropSens Initialization | 104 | 67 | 74 | 89 | 85 | 83 | 43 | 49 | 68 | 58 | 69 | 132 | 483 |
| DropSens Sampling | 17 | 13 | 51 | 50 | 61 | 44 | 16 | 13 | 39 | 27 | 92 | 58 | 1163 |
| DropEdge Sampling | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 6 |

As expected, DropSens sampling is more expensive due to the need to compute edge-wise dropping probabilities per graph. That said, this step accounts for only a small fraction of overall training time and was never the bottleneck in our experiments.

Note: Runtimes might vary from server to server, but we expect the trends to be similar.

We do want to clarify that we do not claim a training-speed advantage for DropSens. On the contrary, due to its preprocessing step, it may be marginally slower than DropEdge. We made this explicit in Appendix E.3, where we also provide a low-cost approximation for solving Equation 4.1.
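To make the runtime discussion above concrete, here is a minimal sketch of the two stages being timed: a per-in-degree computation of dropping probabilities, clipped at $q_{\max}$, followed by Bernoulli edge-mask sampling. This is not our released implementation: the helper `solve_q` is a hypothetical placeholder for solving Equation 4.1, and the default shown ($q = 1 - c$) is only a degree-independent fallback, not the GCN-specific solution.

```python
import torch

def dropsens_edge_probs(edge_index, num_nodes, c=0.8, q_max=0.5, solve_q=None):
    # Precompute per-edge dropping probabilities (DropSens initialization, sketch).
    # solve_q(d, c) is a hypothetical stand-in for solving Equation 4.1 for a
    # target node of in-degree d; the default below is NOT the GCN-specific form.
    if solve_q is None:
        solve_q = lambda d, c: 1.0 - c
    src, dst = edge_index
    in_deg = torch.bincount(dst, minlength=num_nodes)
    # Solve once per distinct in-degree and cache, then look up per edge.
    q_by_deg = {int(d): min(float(solve_q(int(d), c)), q_max) for d in in_deg.unique()}
    return torch.tensor([q_by_deg[int(in_deg[i])] for i in dst.tolist()])

def sample_edge_mask(edge_probs):
    # Keep each edge independently with probability 1 - q (the sampling step,
    # which is the only point where DropSens and DropEdge differ computationally).
    return torch.rand_like(edge_probs) > edge_probs

# Toy usage on a 4-node graph with 5 directed edges.
edge_index = torch.tensor([[0, 1, 2, 3, 0],
                           [1, 2, 3, 0, 2]])
q = dropsens_edge_probs(edge_index, num_nodes=4)
mask = sample_edge_mask(q)
kept_edge_index = edge_index[:, mask]
```

Caching the solved probabilities by distinct in-degree, as in this sketch, keeps the initialization cost proportional to the number of distinct degrees rather than the number of edges.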

Does this address your concern about the empirical scope and runtime reporting?

DropSens relies on in-degree–specific formulas and was tuned for GCN; results for GIN and GAT are weaker, underlining limited architecture generality.

Thank you – we agree that this is an important limitation, and we've acknowledged it in our conclusion.

DropSens was specifically derived for GCN-style architectures, where the edge weights used in message aggregation are simple functions of node degree. Extending it to other architectures raises nontrivial challenges:

  1. GIN uses constant edge weights, so the sensitivity of a node to its neighbors simplifies to $(1 - q)\|\mathbf{W}\|$ for dropping probability $q$. Enforcing a fixed information preservation ratio $c$ leads to $q = 1 - c$, which is equivalent to DropEdge (see the worked step after this list). This limitation arises because GIN’s aggregation is insensitive to the local graph structure, making a principled variant of DropSens unnecessary.
  2. GAT, in contrast, uses feature-dependent attention weights that vary at every layer and iteration. This means a DropSens-style approach would require recomputing edge masks dynamically in each iteration and each layer, negating the simplicity we aim for. Moreover, the presence of softmax attention makes a closed-form derivation of sensitivity intractable (if at all possible).
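Spelling out the reduction in item 1 for concreteness, with $q$ the dropping probability, $c$ the preservation ratio, and $\mathbf{W}$ the layer weight matrix, preserving a fraction $c$ of the constant-weight sensitivity amounts to

$$(1 - q)\,\|\mathbf{W}\| \;\geq\; c\,\|\mathbf{W}\| \quad\Longleftrightarrow\quad q \;\leq\; 1 - c,$$

so the most aggressive admissible choice is $q = 1 - c$, i.e., DropEdge at a rescaled rate.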

In short, while DropSens is not directly applicable to GIN or GAT, this is a consequence of their architectural design, not a shortcoming of the method per se. We see this as an opportunity for future work – developing principled dropout strategies tailored to different message-passing schemes.

Does this address your concern regarding the architectural generality of DropSens?

The paper does not explore the effect of the preservation hyper-parameter c nor the sensitivity of DropSens to degree distribution outliers.

Thank you for highlighting this. You’re absolutely right that we did not explicitly discuss the role of the preservation hyper-parameter $c$ in the main text. While Figure 6a does compare the dropping probabilities for different values of $c$, we have now added a discussion at the end of Section 4 to clarify how varying $c$ affects the dropping probabilities for different in-degrees.

Regarding sensitivity to degree outliers, DropSens assigns dropping probabilities based on the degree of the target node, which means that edges connected to extremely high-degree nodes may receive disproportionately high dropping probabilities. To prevent overly aggressive edge-removal in such cases, we clip the computed probabilities, as described under "DropSens Configurations" in Appendix E.2. We have also added this implementation detail at the end of Section 4 to make this design choice explicit:

"Note that higher values of cc encourage lower qiq_i, while lower values permit qiq_i to take higher values; see Figure 6a. Since this can result in abnormally high dropping probabilities, we clip the value of qiq_i in our experiments using another hyperparameter, qmaxq_{\max}; see details in Appendix E.2."

Does this address your concern about both the role of $c$ and the robustness to degree outliers?

Provide an ablation on SyntheticZINC varying the preserved-information parameter $c \in \{0.7, 0.8, 0.9, 0.95\}$; show accuracy and number of edges kept.

Thank you for the suggestion – we agree that this ablation would offer valuable insight. As a first step, we evaluate DropSens with $c = 0.9, 0.65$, which preserve approximately the same number of edges in expectation ($\approx 402$k and $248$k, respectively) as DropEdge with $q = 0.2, 0.5$ ($\approx 400$k and $250$k, respectively).
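For reference, these expected edge counts follow from a short calculation (a sketch using our notation $E_{\text{kept}}$ for the retained edge set): DropEdge keeps each edge independently with probability $1 - q$, while DropSens keeps an edge into node $i$ with probability $1 - q_i$, where $q_i$ is determined by the in-degree $d_i$ of the target node, so

$$\mathbb{E}\bigl[|E_{\text{kept}}|\bigr]_{\text{DropEdge}} = (1 - q)\,|E|, \qquad \mathbb{E}\bigl[|E_{\text{kept}}|\bigr]_{\text{DropSens}} = \sum_{i \in V} d_i\,(1 - q_i).$$

Matching these expectations is how $c = 0.9$ and $c = 0.65$ were paired with $q = 0.2$ and $q = 0.5$, respectively.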

Kindly find the results for DropSens with $c = 0.9$ below, where we observe that it improves on its DropEdge counterpart, $q = 0.2$; we will report on $c = 0.65$ soon.

Train MAE $\times 10^{-2}$ (lower is better)

| Method | $\alpha=0.1$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| DropSens ($c=0.9$) | 6.062 | 6.831 | 7.203 | 7.297 | 8.204 | 7.157 | 6.605 | 6.552 | 6.545 | 6.807 |
| DropEdge ($q=0.2$) | 6.369 | 7.016 | 7.389 | 7.339 | 7.455 | 7.374 | 6.993 | 7.420 | 6.588 | 7.213 |
| DropSens ($c=0.65$) | | | | | | | | | | |
| DropEdge ($q=0.5$) | 8.218 | 8.954 | 9.173 | 9.210 | 9.109 | 9.076 | 8.834 | 8.807 | 8.754 | 9.067 |

Test MAE $\times 10^{-2}$ (lower is better)

| Method | $\alpha=0.1$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| DropSens ($c=0.9$) | 6.739 | 7.221 | 7.830 | 8.156 | 9.134 | 8.221 | 7.442 | 7.118 | 6.760 | 6.355 |
| DropEdge ($q=0.2$) | 7.096 | 7.496 | 8.262 | 8.567 | 8.781 | 8.801 | 8.168 | 8.099 | 7.066 | 7.221 |
| DropSens ($c=0.65$) | | | | | | | | | | |
| DropEdge ($q=0.5$) | 13.326 | 13.459 | 14.805 | 15.215 | 14.753 | 14.501 | 14.101 | 13.567 | 12.566 | 13.381 |

How does DropSens behave on power-law graphs where a few hubs dominate degrees? Include an experiment or theoretical bound on maximum drop probability in that regime.

Thank you for raising this important point. You're absolutely right that on power-law graphs, a small number of high-degree hubs can lead DropSens to assign extremely high dropping probabilities, potentially close to $1$. Since DropSens does not impose a theoretical bound on dropping probability (except the vacuous bound, $1$), this behavior is expected when in-degrees are very large.

To mitigate this, we cap the dropping probabilities using a hyper-parameter $q_{\max}$, which ensures that even high-degree nodes retain a minimum level of connectivity. This clipping mechanism is already implemented in our experiments, but we realize this detail was not clearly communicated. We have now added a clarification at the end of Section 4 to make this design choice explicit.

Does this address your concern about DropSens on power-law graphs and the lack of a theoretical bound?


Footnotes:

  1. For graph classification tasks, edge masks are computed in one go (instead of one mini-batch at a time, as in practice).
Comment

Thank you to the author(s) for taking my suggestion into account and completing several extra experiments, which can be incorporated into the revised version of the paper. I have no more questions, and support the acceptance of the paper.

Comment

Thank you very much for actively engaging in the review process. If you have no further concerns remaining unresolved, we would appreciate it if you would consider updating your rating!

Comment

We present the full results for DropSens on SyntheticZINC below (bold indicates better performance in the comparisons 1. DropSens ($c = 0.9$) vs. DropEdge ($q = 0.2$), and 2. DropSens ($c = 0.65$) vs. DropEdge ($q = 0.5$)):

Train MAE $\times 10^{-2}$ (lower is better)

| Method | $\alpha=0.1$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| DropSens ($c=0.9$) | $\mathbf{6.062}$ | $\mathbf{6.831}$ | $\mathbf{7.203}$ | $\mathbf{7.297}$ | 8.204 | $\mathbf{7.157}$ | $\mathbf{6.605}$ | $\mathbf{6.552}$ | $\mathbf{6.545}$ | $\mathbf{6.807}$ |
| DropEdge ($q=0.2$) | 6.369 | 7.016 | 7.389 | 7.339 | $\mathbf{7.455}$ | 7.374 | 6.993 | 7.420 | 6.588 | 7.213 |
| DropSens ($c=0.65$) | 8.223 | $\mathbf{8.946}$ | $\mathbf{9.138}$ | 9.311 | 9.160 | $\mathbf{9.013}$ | 8.852 | $\mathbf{8.563}$ | 8.779 | $\mathbf{8.871}$ |
| DropEdge ($q=0.5$) | $\mathbf{8.218}$ | 8.954 | 9.173 | $\mathbf{9.210}$ | $\mathbf{9.109}$ | 9.076 | $\mathbf{8.834}$ | 8.807 | $\mathbf{8.754}$ | 9.067 |

Test MAE $\times 10^{-2}$ (lower is better)

| Method | $\alpha=0.1$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| DropSens ($c=0.9$) | $\mathbf{6.739}$ | $\mathbf{7.221}$ | $\mathbf{7.830}$ | $\mathbf{8.156}$ | 9.134 | $\mathbf{8.221}$ | $\mathbf{7.442}$ | $\mathbf{7.118}$ | $\mathbf{6.760}$ | $\mathbf{6.355}$ |
| DropEdge ($q=0.2$) | 7.096 | 7.496 | 8.262 | 8.567 | $\mathbf{8.781}$ | 8.801 | 8.168 | 8.099 | 7.066 | 7.221 |
| DropSens ($c=0.65$) | 14.138 | $\mathbf{13.223}$ | $\mathbf{13.704}$ | $\mathbf{15.088}$ | $\mathbf{14.699}$ | $\mathbf{14.226}$ | $\mathbf{13.784}$ | $\mathbf{12.240}$ | $\mathbf{12.166}$ | $\mathbf{10.458}$ |
| DropEdge ($q=0.5$) | $\mathbf{13.326}$ | 13.459 | 14.805 | 15.215 | 14.753 | 14.501 | 14.101 | 13.567 | 12.566 | 13.381 |

It is clear that DropSens outperforms DropEdge in most cases, demonstrating the benefits brought by simple sensitivity-aware modifications to algorithms originally designed for alleviating over-smoothing and training deep GNNs.

Review
5

The paper investigates GNN-specific dropout methods and their effect on two phenomena preventing high depth in message passing GNNs, namely, oversmoothing and oversquashing. While the positive effect of such dropout methods on oversmoothing has been investigated, their effect on oversquashing and long-range interactions is underexplored. The authors use an existing sensitivity analysis quantifying the dependence of a node's embedding on the initial features of distant nodes and generalize it to situations where edges are dropped randomly. This theoretical analysis suggests that higher depth under dropout does not necessarily lead to a better ability to capture long-range interactions. Experiments support the theoretical results.

Strengths and Weaknesses

Strengths

S1) The interplay between dropout, oversmoothing, and oversquashing is highly relevant and gives new insights into the capabilities and predictive performance of GNNs.

S2) The paper summarizes the related work well and puts the contribution in the context of state-of-the-art techniques.

S3) The paper makes theoretical contributions (albeit under limitations) and shows the relevance of the theoretical results empirically. The limitations are clearly stated and discussed.

Weaknesses

W1) The experimental evaluation has several weaknesses and does not provide a consistent picture of the topic:

  • Figure 1(b) is difficult to read since curves overlap. In Section 4, it is claimed that the proposed method, DropSens, improves the sensitivity compared to DropEdge. However, as both lines almost overlap, I cannot follow the interpretation.
  • I do not understand why DropSens is not included in the experiments of Sections 5.1 and 5.2; it would be highly relevant to see how the method compares to other dropout methods in these experiments.
  • Figure 2 shows that NoDrop achieves the best results. Why is this dataset suitable for comparing different dropout methods?

Minor remarks

  • In Eq. (2), the set should be the argument of the Out function.
  • Eq. (3.2): The index $\mathbf{M}^{(1)}, \ldots$ is not clear.

Questions

Q1) Could you clarify my questions regarding the experimental evaluation mentioned in W1?

Limitations

yes

Justification for Final Rating

The authors clarified my question regarding the interpretation of Figure 2 in the rebuttal, and I believe that the promised changes will fix the minor flaws I mentioned in my review. I keep my rating.

Formatting Issues

no

Author Response

Thank you very much for the insightful summary, for recognizing the strengths of our work, and for the overall feedback!

Before we address your concerns, we wanted to bring to your attention that we had accidentally omitted the DropSens configuration $(c, q_{\max}) = (0.8, 0.3)$ from our original hyperparameter sweep. This has now been included, and the results have been updated accordingly. The overall findings remain unchanged, with the exception that DropSens now ranks 3rd on the Proteins dataset.


Kindly find our responses to the concerns raised:

Figure 1(b) is difficult to read since curves overlap. In Section 4, it is claimed that the proposed method, DropSens, improves the sensitivity compared to DropEdge. However, as both lines almost overlap, I cannot follow the interpretation.

Thank you for pointing this out – we agree that the overlapping curves in Figure 1(b) make the comparison difficult to interpret. To address this, we have revised Figure 1(b) by embedding a subplot that explicitly shows the ratio of influence at different distances between DropSens and DropEdge. This makes the differences between the two methods much more visible. Kindly find the entries of the plot below, with the columns denoting the shortest distance between nodes:

| $d_G(j, i)$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Influence ratio (DropSens / DropEdge) | 0.7312 | 1.0247 | 1.0668 | 1.1233 | 1.0938 | 1.1548 | 1.2283 |

Compared to DropEdge, DropSens decreases the sensitivity of a node's final representations to its input features ($d = 0$), and increases it to other nodes ($d \geq 1$). This effect is especially significant for distant nodes, suggesting that DropSens should be better at modeling LRIs than DropEdge.

We hope this modification will make the comparison and the interpretation in Section 4 clearer.

I do not understand why DropSens is not included in the experiments of Sections 5.1 and 5.2; it would be highly relevant to see how the method compares to other dropout methods in these experiments.

That makes complete sense! We have included the results for DropSens, as well as a commentary on the results in Section 5.3:

| | Cora | CiteSeer | PubMed | Chameleon | Squirrel | TwitchDE |
|---|---|---|---|---|---|---|
| GCN | +0.308 | +0.239 | +0.383 | +1.069 | +0.372 | -0.018 |
| GIN | -0.022 | -0.065 | -0.666 | +0.174 | +0.330 | +0.616 |
| GAT | +0.577 | +0.663 | -0.003 | -3.859 | -1.583 | -0.090 |

| | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
|---|---|---|---|---|---|---|
| GCN | +1.700 | +3.161 | +0.716 | -3.670 | +1.340 | -1.818 |
| GIN | -2.700 | -1.304 | -1.650 | -4.500 | -6.200 | -1.379 |
| GAT | +4.300 | +1.036 | +0.897 | +0.400 | -0.760 | 0.000 |

"We now evaluate whether DropSens offers meaningful improvements over a NoDrop baseline, consistent with our analysis for other dropout methods. Results for GCN and GIN are shown in Table 1a, and those for GAT are in Table 7. The performance boost observed with DropSens is much more remarkable for GCN than for other architectures, which is unsurprising since DropSens was specifically designed to work with GCN's message-passing scheme. We also observe that while it offers significant improvement in 5/6 node classification tasks, its effect on graph classification – while positive in 4/6 datasets – is not statistically significant."

Just to clarify our scope: We do not claim that DropSens undoes all the negative effects of DropEdge. Rather, we intend to demonstrate that techniques originally designed for mitigating over-smoothing can severely underperform on long-range tasks, but they can be readily adapted to perform better on long-range tasks.

Does this address your concerns?

Figure 2 shows that NoDrop achieves the best results. Why is this dataset suitable for comparing different dropout methods?

Thank you for this question. The SyntheticZINC dataset – introduced in Giovanni et al. (2024, Section 5.1); described further in our Appendix E.1 – was specifically designed to study how different levels of information mixing in the underlying ground-truth affect model performance. Here's a brief overview of how it is designed:

  1. A desired mixing level $\alpha \in [0, 1]$ is fixed.
  2. Commute times between all node pairs in a graph are computed.
  3. A node pair is selected whose commute time lies at the $\alpha$-th percentile of the graph's commute time distribution.

As shown in Giovanni et al. (2024, Theorem 4.4), over-squashing in GNNs – especially with MAX pooling for graph classification (as is also our setup) – is directly linked to the commute time between nodes. Therefore, SyntheticZINC is well-suited for comparing dropout methods in this context: it provides a controlled environment to isolate and measure the over-squashing effects of different algorithms.
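For readers unfamiliar with the construction, below is a minimal sketch of steps 2–3 (our own naming, not the original implementation of Giovanni et al. (2024)). It uses the standard identity $C(u, v) = \mathrm{vol}(G)\,\bigl(L^{+}_{uu} + L^{+}_{vv} - 2 L^{+}_{uv}\bigr)$ expressing commute times through the pseudoinverse $L^{+}$ of the graph Laplacian, and then selects the pair at the $\alpha$-th percentile.

```python
import numpy as np

def commute_times(adj):
    # Pairwise commute times of an undirected graph:
    # C(u, v) = vol(G) * (L+_uu + L+_vv - 2 L+_uv), with vol(G) = sum of degrees.
    deg = adj.sum(axis=1)
    vol = deg.sum()
    L = np.diag(deg) - adj
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return vol * (d[:, None] + d[None, :] - 2 * Lp)

def pair_at_percentile(adj, alpha):
    # Pick the node pair whose commute time sits at the alpha-th percentile
    # of the graph's commute-time distribution (alpha in [0, 1]).
    C = commute_times(adj)
    iu = np.triu_indices_from(C, k=1)          # distinct pairs only
    target = np.quantile(C[iu], alpha)
    k = np.argmin(np.abs(C[iu] - target))
    return int(iu[0][k]), int(iu[1][k])

# Toy usage on a 5-node path graph.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
u, v = pair_at_percentile(A, alpha=0.5)
```

How the regression target is then attached to the selected pair follows Giovanni et al. (2024) and is omitted here.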

Does that clarify your concern about the suitability of this dataset?

In Eq. (2), the set should be the argument of the Out function.

Yes, thank you for the correction!

Eq. (3.2) The index $\mathbf{M}^{(1)}, \ldots$ is not clear.

Fair enough, we have now defined the variables to be the edge masks in the preceding line. Kindly let us know in case it still remains unclear.


References:

Francesco Di Giovanni, T. Konstantin Rusch, Michael Bronstein, Andreea Deac, Marc Lackenby, Siddhartha Mishra, and Petar Velickovic. How does over-squashing affect the power of GNNs? Transactions on Machine Learning Research, 2024.

Comment

We humbly ask if our rebuttal has helped clarify your concerns. If any points remain unclear or unaddressed, please don’t hesitate to let us know – we’d be glad to clarify them further before the discussion period ends on August 6th.

Comment

Thanks for carefully answering my questions. I believe that the promised changes will have fixed the minor flaws I mentioned in my review. The interpretation of Figure 2 is now clear to me.

Comment

We thank the reviewers very much for acknowledging our results and actively engaging in the review process!

We want to present some more results, and we hope you can take these into account when assessing our contributions:

Updated Comparisons of DropSens with graph-rewiring methods

We realized that our experimental setup differed slightly from the one used by Black et al. (2023) and Karhadkar et al. (2023), in that they allowed a much larger maximum number of training epochs than we did (originally 300) and instead relied on early stopping to determine convergence. Using the same setup, we conducted 20 runs with each hyperparameter setting of DropSens; we are working on collecting 50 samples for the best configurations for each dataset. Kindly find the performance and updated ranks of DropSens below:

GCN

| | Cora | CiteSeer | PubMed | Chameleon | Squirrel | Actor |
|---|---|---|---|---|---|---|
| DropSens | 84.57 (1st) | 71.70 (4th) | 83.80 (1st) | 52.06 (1st) | 39.33 (1st) | 22.63 (7th) |

| | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
|---|---|---|---|---|---|---|
| DropSens | 80.75 (1st) | 73.17 (2nd) | 50.86 (1st) | 78.50 (1st) | 49.05 (6th) | 61.16 (1st) |

GIN

| | Mutag | Proteins | Enzymes | Reddit | IMDb | Collab |
|---|---|---|---|---|---|---|
| DropSens | 79.75 (2nd) | 69.02 (8th) | 61.71 (1st) | 88.48 (2nd) | 62.40 (7th) | 67.21 (6th) |

Summary of updates in rankings:

  1. Performance with GCN on CiteSeer went down from 1st to 4th.
  2. Performance with GCN on Mutag went up from 7th to 1st.
  3. Performance with GCN on Proteins went up from 7th to 2nd.
  4. Performance with GIN on Mutag went up from 8th to 2nd.
  5. Performance with GIN on Enzymes went up from 6th to 1st.
  6. Performance with GIN on Reddit went up from 6th to 2nd.

These changes put DropSens with GCN on the podium for 2 more graph-classification datasets, totaling 5/6 datasets. Moreover, they put DropSens with GIN on the podium for 3 graph-classification datasets.


Performance on PeptidesStruct (LRGB) dataset

Below we present the Mean Absolute Error (averaged over 20 runs) of the best performing configurations of DropEdge and DropSens with GCN on the PeptidesStruct dataset (using early-stopping, as above):

| NoDrop | DropSens | DropEdge |
|---|---|---|
| 0.4181 | 0.4148 | 0.4278 |

We see that DropEdge results in a performance decline, while DropSens outperforms the NoDrop baseline.

We are working on collecting the full set of results for PeptidesStruct and PeptidesFunc, but in the meantime, we hope these results address any concerns about the variety of datasets used in our experiments.

Comment

Hi, in light of the extended discussion period, we kindly invite the reviewers to share any remaining concerns. We’d be glad to do our best to address them during this time.

Final Decision

This paper investigates the impact of dropout-based methods (DropEdge, DropNode, etc.) on over-squashing in Graph Neural Networks, revealing that while these methods address over-smoothing, they can exacerbate over-squashing by reducing sensitivity to distant nodes. The authors propose DropSens, a sensitivity-aware variant that preserves long-range interactions while maintaining the benefits of edge dropping. The work received scores of 5 (accept), 4 (borderline accept), 4 (borderline accept), and 4 (borderline accept) from four engaged reviewers.

Strengths include:

  1. Novel theoretical insight - The paper provides the first systematic analysis of how dropout methods affect over-squashing through sensitivity analysis, revealing an important but previously unrecognized trade-off between addressing over-smoothing and preserving long-range interactions.
  2. Strong theoretical foundation - The mathematical analysis extending sensitivity measures to edge-dropping scenarios is rigorous and well-motivated, leading naturally to the DropSens algorithm design.
  3. Comprehensive experimental validation - The authors conducted extensive experiments across synthetic and real-world datasets, responded thoroughly to reviewer concerns by adding results on LRGB datasets, and provided detailed computational analysis.
  4. Practical contribution - DropSens offers a principled solution that maintains simplicity while addressing the identified limitations.

Limitations acknowledged:

  1. Architecture specificity - DropSens is primarily designed for GCN architectures, with limited applicability to GIN/GAT due to their different message-passing schemes.
  2. Scope constraints - While the experimental evaluation is solid, it focuses on medium-scale datasets with some computational limitations for larger benchmarks.

The authors demonstrated exceptional responsiveness during the review process, conducting additional experiments, updating experimental setups to match prior work, and providing thorough theoretical explanations. All reviewers acknowledged the quality of the responses and the authors' transparency in reporting both positive and negative results. The work makes a valuable contribution to understanding the complex interplay between different GNN training techniques and their effects on fundamental graph learning challenges.