Thank you very much for the insightful summary, for recognizing the strengths of our work, and for the overall feedback!

Before we address your concerns, we wanted to bring to your attention that we had accidentally omitted the DropSens configuration from our original hyperparameter sweep. This has now been included, and the results have been updated accordingly. The overall findings remain unchanged, with the exception that DropSens now ranks 3rd on the Proteins dataset.

Kindly find our responses to the concerns raised:

Empirical scope is still modest: all large-scale experiments run on CPUs with no wall-clock comparisons, so the claimed training-speed advantage of DropSens remains anecdotal.

Thank you for pointing this out. To clarify, the hardware setup used for all large-scale experiments is described in Appendix E: we use 4 NVIDIA GeForce GTX TITAN X GPUs (12 GB VRAM each).

You're absolutely right that wall-clock runtimes were not reported in the initial submission. We have now addressed this:

We first show the preprocessing time required to compute edge-dropping probabilities for DropSens as a function of node in-degree, since the cost of solving for grows with . We plot results for , averaged over in Figure 6b; for readability, we report values only for in the table below.

	1	2	5	10	20	50	100
Exact Computation time
Approximation Computation Time

We now compare the initialization and sampling time of DropSens () compared to sampling time of DropEdge, averaged over 10 runs, isolating the runtime difference since sampling is the only point where the two methods differ computationally We have added this comparison in Figure 7, and also present the results in the table below, with all entries reported in milliseconds:

	Cora	CiteSeer	PubMed	Chameleon	Squirrel	TwitchDE	Actor	Mutag	Proteins	Enzymes	Reddit	IMDb	Collab
DropSens Initialization
DropSens Sampling
DropEdge Sampling

As expected, DropSens sampling is more expensive due to the need to compute edge-wise dropping probabilities per graph. That said, this step accounts for only a small fraction of overall training time and was never the bottleneck in our experiments

Note: Runtimes might vary form server to server, but we expect the trends to be similar.

We do want to clarify that we do not claim a training-speed advantage for DropSens. On the contrary, due to its preprocessing step, it may be marginally slower than DropEdge. We made this explicit in Appendix E.3, where we also provide a low-cost approximation for solving Equation 4.1.

Does this address your concern about the empirical scope and runtime reporting?

DropSens relies on in-degree–specific formulas and was tuned for GCN; results for GIN and GAT are weaker, underlining limited architecture generality.

Thank you – we agree that this is an important limitation, and we've acknowledged it in our conclusion.

DropSens was specifically derived for GCN-style architectures, where the edge weights used in message aggregation are simple functions of node degree. Extending it to other architectures raises nontrivial challenges:

GIN uses constant edge weights, so the sensitivity of a node to its neighbors simplifies to for dropping probability . Enforcing a fixed information preservation ratio leads to , which is equivalent to DropEdge. This limitation arises because GIN’s aggregation is insensitive to the local graph structure, making a principled variant of DropSens unnecessary.
GAT, in contrast, uses feature-dependent attention weights that vary at every layer and iteration. This means a DropSens-style approach would require recomputing edge masks dynamically in each iteration and each layer, negating the simplicity we aim for. Moreover, the presence of softmax attention makes a closed-form derivation of sensitivity intractable (if at all possible).

In short, while DropSens is not directly applicable to GIN or GAT, this is a consequence of their architectural design, not a shortcoming of the method per se. We see this as an opportunity for future work – developing principled dropout strategies tailored to different message-passing schemes.

Does this address your concern regarding the architectural generality of DropSens?

The paper does not explore the effect of the preservation hyper-parameter c nor the sensitivity of DropSens to degree distribution outliers.

Thank you for highlighting this. You’re absolutely right that we did not explicitly discuss the role of the preservation hyper-parameter in the main text. While Figure 6a does compare the dropping probabilities for different values of , we have now added a discussion at the end of Section 4 to clarify how varying affects the dropping probabilities for different in-degrees.

Regarding sensitivity to degree outliers, DropSens assigns dropping probabilities based on the degree of the target node, which means that edges connected to extremely high-degree nodes may receive disproportionately high dropping probabilities. To prevent overly aggressive edge-removal in such cases, we clip the computed probabilities, as described under "DropSens Configurations" in Appendix E.2. We have also added this implementation detail at the end of Section 4 to make this design choice explicit:

"Note that higher values of encourage lower , while lower values permit to take higher values; see Figure 6a. Since this can result in abnormally high dropping probabilities, we clip the value of in our experiments using another hyperparameter, ; see details in Appendix E.2."

Does this address your concern about both the role of and the robustness to degree outliers?

Provide an ablation on SyntheticZINC varying the preserved-information parameter ; show accuracy and number of edges kept.

Thank you for the suggestion – we agree that this ablation would offer valuable insight. As a first step, we evaluate DropSens with , which preserve approximately the same number of edges in expectation (k, k, respectively) as DropEdge with (k, k, respectively).

Kindly find the results for DropSens with below, where we observe that it improves on its DropEdge counterpart, ; we will report on soon.

Train MAE (lower is better)

Method
DropSens ()
DropEdge ()
DropSens ()
DropEdge ()

Test MAE (lower is better)

Method
DropSens ()
DropEdge ()
DropSens ()
DropEdge ()

How does DropSens behave on power-law graphs where a few hubs dominate degrees? Include an experiment or theoretical bound on maximum drop probability in that regime.

Thank you for raising this important point. You're absolutely right that on power-law graphs, a small number of high-degree hubs can lead DropSens to assign extremely high dropping probabilities, potentially close to . Since DropSens does not impose a theoretical bound on dropping probability (except the vacuous bound, ), this behavior is expected when in-degrees are very large.

To mitigate this, we cap the dropping probabilities using a hyper-parameter , which ensures that even high-degree nodes retain a minimum level of connectivity. This clipping mechanism is already implemented in our experiments, but we realize this detail was not clearly communicated. We have now added a clarification at the end of Section 4 to make this design choice explicit.

Does this address your concern about DropSens on power-law graphs and the lack of a theoretical bound?

Footnotes:

For graph classification tasks, edge masks are computed in one go (instead of one mini-batch at a time, as in practice).