PaperHub
5.5 / 10
Poster · 4 reviewers
Scores: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Commute Graph Neural Networks

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We propose an approach to integrate commute time into graph neural networks to enhance the analysis of directed graphs, effectively addressing the asymmetry and complex path interactions inherent in these structures.

Abstract

Keywords
Graph Neural Networks · Commute Time · Message Passing · Node Classification

Reviews and Discussion

Official Review
Rating: 3

The paper introduces Commute Graph Neural Networks (CGNN), a framework designed to improve learning on directed graphs by incorporating bidirectional path information.

Key components of CGNN include a digraph Laplacian (DiLap) leveraging random walk transition probabilities, efficient computation of node-wise commute times via Markov chain theory, integration of commute times into GNN message passing for refined aggregation, and a feature-based graph rewiring strategy to ensure the digraph remains irreducible, aperiodic, and sparse.

Extensive experiments demonstrate that CGNN outperforms existing methods, particularly in scenarios with important mutual relationships, while scaling efficiently with sparse graphs.
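For intuition, here is a hypothetical PyTorch sketch of what commute-time-weighted aggregation can look like; it is not the paper's exact Eq. (11), and the exponential weighting, temperature, and normalization are assumptions made for illustration.

```python
import torch

def commute_weighted_aggregate(h, edge_index, commute_time, tau=1.0):
    """Aggregate incoming messages, scaling each edge by exp(-commute_time / tau),
    so neighbors reachable via shorter commute times contribute more.

    h:            [N, d] node embeddings
    edge_index:   [2, E] directed edges (source, target); messages flow source -> target
    commute_time: [E]    commute time attached to each edge
    """
    src, dst = edge_index
    w = torch.exp(-commute_time / tau)                     # [E] edge weights
    out = torch.zeros_like(h)
    out.index_add_(0, dst, w.unsqueeze(1) * h[src])        # weighted sum of messages
    norm = torch.zeros(h.size(0), device=h.device).index_add_(0, dst, w) + 1e-9
    return out / norm.unsqueeze(1)                         # normalized aggregation
```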

Questions for Authors

  1. Were there specific scenarios or applications where CGNN's benefits over directed GNN were more pronounced than the incremental improvements reported in Table 1? Understanding these scenarios will help highlight the unique strengths and potential use cases of CGNN, indicating situations where its deployment would be most advantageous.
  2. What are the main differences between the feature-based rewiring strategy proposed in this submission and those in the existing literature [4, 5]? Clarifying these differences allows for a better assessment of the method's novelty and its contribution to the ongoing advancements in GNN research.
  3. How does the integration of commute time in directed graph neural networks build upon or differ from existing publications [2, 3] that utilise commute times for addressing the over-squashing phenomenon in undirected graphs? This will provide insight into whether the approach leverages or innovates upon established methodologies, specifically addressing challenges unique to directed graphs.
  4. Were there considerations on examining the impact of hyperparameters, such as the rank in SVD approximation and the number of additional edges in the rewiring strategy, on the performance and robustness of CGNN? Understanding the influence of hyperparameters is crucial for optimizing CGNN’s performance and can inform best practices for its implementation in different contexts.
  5. How does CGNN compare with transformer-based approaches for directed graphs, particularly regarding performance and scalability? Such a comparison provides a clearer picture of how CGNN stands relative to cutting-edge methods, potentially influencing researchers' choice of tools for tackling directed graph problems.

Claims and Evidence

The claims regarding DiLap's derivation and its connection to the fundamental matrix are supported by theoretical evidence, providing a sound basis for computing commute times.

Experimental evidence, including accuracy comparisons in Table 1 and accuracy versus runtime tradeoff analysis in Figure 4 across eight datasets, supports the claim that CGNN surpasses multiple baseline methods in performance.

The paper also validates its rewiring strategy to ensure the graph meets irreducibility and aperiodicity conditions, with experimental evidence in Table 2.

Methods and Evaluation Criteria

By integrating commute times through DiLap into the message passing scheme, CGNN directly addresses the issue of capturing asymmetric relationships in digraphs.

Empirical evaluation on eight benchmark datasets, encompassing both homophilic and heterophilic networks, is diverse and compelling.

Runtime comparisons and ablation studies, e.g. assessing the impact of graph rewiring and directed vs. undirected variants, further validate that the method is practically robust.

Theoretical Claims

All main theoretical claims, including Proposition 3.2, Theorem 3.3, Proposition 4.1, Lemma 4.2, and Theorem 4.3, are mathematically sound under the stated assumptions, with no significant issues identified.

However, more detailed steps in Lemma 4.2 and Theorem 4.3 would improve clarity.

Specifically, the definition of Moore–Penrose pseudoinverse of a matrix, as well as a detailed step-by-step derivation of Theorem 4.3 would aid clarity.

Experimental Design and Analyses

The experimental analyses, including the node classification baseline comparison in Table 1, tradeoff analyses in Figure 4 and Table 3, ablation studies in Tables 2 and 4, and the analysis of commute-time proximity related to label similarity in Figure 3, are thoughtfully constructed.

Although the experiments validate the core claims, the impact of hyperparameters—such as the rank (q) in the SVD approximation and the number of additional edges in the rewiring strategy—on CGNN's performance is not thoroughly examined, and further analysis could enhance understanding of its robustness.

The scalability analysis is promising, but since the commute time matrix is dense by nature, it would be beneficial to see more discussion or experiments on very large-scale graphs to fully understand potential limitations in memory usage.

Supplementary Material

A footnote on page 6, line 328, references an anonymized GitHub repository.

Although the repository's file names suggest a well-organised codebase, the individual files are not accessible.

This lack of transparency might weaken the credibility of the proposed method within the research community.

Relation to Broader Literature

Recent advances in GNNs for directed graphs focus on edge directions but usually handle one-way interactions, while this submission highlights the integration of mutual path dependencies.

Building on existing work that used transition probabilities and stationary distributions for directed graph Laplacians, this paper introduces a weighted digraph Laplacian (DiLap) derived from the signal's gradient divergence.

Inspired by PageRank and diffusion, the authors introduce a sparse rewiring strategy for computing commute times in CGNNs.

Essential References Not Discussed

Both commute times and rewiring strategies are well-recognized in the graph neural network literature [1], particularly in relation to the over-squashing phenomenon.

Prior findings have established that over‐squashing stems from high commute times [2] and that topology-aware rewiring can effectively mitigate such limitations [3].

Rewiring strategies based on node features have recently been a trend [4, 5] in addressing over-squashing, albeit not specifically in the context of directed graphs.

References:

  1. Rewiring Techniques to Mitigate Oversquashing and Oversmoothing in GNNs: A Survey. arXiv:2411.17429, 2024.
  2. On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. ICML 2023.
  3. DiffWire: Inductive Graph Rewiring via the Lovász Bound. LoG 2022.
  4. Delaunay Graph: Addressing Over-Squashing and Over-Smoothing Using Delaunay Triangulation. ICML 2024.
  5. GNNs Getting ComFy: Community and Feature Similarity Guided Rewiring. ICLR 2025.

Other Strengths and Weaknesses

Strengths:

[+] The paper is well-written and organised with clarity.

[+] Additional experimental details, e.g. hyperparameters, visualisation, and sensitivity analyses, in the appendix strengthen the submission.

Weaknesses:

[-] The improvements over directed GNNs shown in Table 1 are minimal.

Other Comments or Suggestions

This submission references recent research [6] on transformer models for directed graphs.

However, it does not discuss the trade-offs between CGNN and transformers on directed graphs.

Moreover, there is no empirical comparison with transformer baselines.

[6] Transformers meet directed graphs, In ICML 2023.

Author Response

Thanks for your insightful feedback. We've consolidated your comments into 8 questions and provided responses to each.

Q1: The definition of Moore–Penrose Pseudoinverse (MPP) in Lemma 4.2, and a detailed derivation of Theorem 4.3

A1: Here we provide the definition of the MPP. In linear algebra, the MPP of a matrix $K$ is the unique generalized inverse $Z$ satisfying four conditions: $KZK=K$, $ZKZ=Z$, $(KZ)^T=KZ$, and $(ZK)^T=ZK$.
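As a quick numerical sanity check of these four conditions (our illustration, not from the paper), `numpy.linalg.pinv` computes the MPP directly:

```python
import numpy as np

K = np.random.default_rng(0).normal(size=(4, 3))  # any (possibly rectangular) matrix
Z = np.linalg.pinv(K)                             # Moore-Penrose pseudoinverse of K

print(np.allclose(K @ Z @ K, K))      # K Z K = K
print(np.allclose(Z @ K @ Z, Z))      # Z K Z = Z
print(np.allclose((K @ Z).T, K @ Z))  # (K Z)^T = K Z
print(np.allclose((Z @ K).T, Z @ K))  # (Z K)^T = Z K
```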

Regarding Theorem 4.3, its derivation is straightforward. Specifically, Eq. (4) shows that the fundamental matrix $Z$ can be directly used to compute commute time. Since Lemma 4.2 provides an efficient method for obtaining $Z=\widetilde{T}^\dagger$, substituting $\widetilde{T}^\dagger$ into Eq. (4) yields the expression in Theorem 4.3. We will include a detailed derivation in the revised manuscript.

Q2: More discussion or experiments on large-scale graphs to understand limitations in memory usage

A2: We acknowledge that memory consumption is a limitation of CGNN due to the quadratic complexity of storing pairwise commute times. Although we threshold small entries to zero and optimize matrix storage, memory improvements remain limited. In contrast, vanilla GCN's memory usage scales with the number of edges, making it more efficient. In experiments on the large-scale Snap-Patents dataset (~3 million nodes), CGNN requires 13GB of memory compared to GCN's 6GB. Future work will explore localization, sampling, or sparsification methods to reduce memory demands. Following your suggestion, we will add these potential strategies to the future work section.
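To illustrate the thresholding idea mentioned above, here is a hypothetical sketch (the threshold value is an assumption) that zeroes out near-negligible commute-time entries and stores the rest sparsely:

```python
import numpy as np
import scipy.sparse as sp

def sparsify_commute_matrix(C, threshold=1e-3):
    """Drop near-zero pairwise commute-time weights and store the rest in CSR format.

    The savings depend on how many entries fall below the threshold; the dense
    intermediate is still O(n^2), which is the limitation acknowledged above.
    """
    C = np.where(np.abs(C) < threshold, 0.0, C)
    return sp.csr_matrix(C)
```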

Q3: The individual codebase files are not accessible

A3: Here we provide the download link for code files: https://drive.google.com/file/d/1lu-od07JgmQK9OsLxlePeOwHYiN4kp2r/view?usp=sharing

Q4: Essential references not discussed

A4: Thank you for highlighting these relevant works. We will include a discussion in Section 4.2 about the rewiring methods used in these studies and clarify how they differ from ours.

Q5: Scenarios or applications where CGNN's benefits over directed GNN were more pronounced

A5: We emphasize that CGNN demonstrates substantial advantages in scenarios characterized by asymmetric mutual relationships—conditions prevalent in real-world networks such as webpage and social networks (e.g., Squirrel and AM-Photo datasets). In these networks, directional relationships (e.g., celebrity-follower structures) create significant path asymmetry, making traditional directed GNN methods inadequate, as they inherently fail to capture the mutual dependencies of node interactions.

Q6: Differences between our feature-based rewiring strategy and those in [4, 5]

A6: Although [4, 5] also propose rewiring strategies, their motivations and methodologies differ from ours:

[Motivation] [4] rewires the graph to reduce negatively curved edges, thereby mitigating over-squashing (OS). [5] rewires the graph to optimize the spectral gap, thereby modifying the community structure to mitigate OS.

Our method rewires the graph to make it irreducible and aperiodic, so that deterministic commute times can be computed. Both the motivation and the goal differ from [4, 5].

[Methodology] [4] rewires the graph by performing Delaunay triangulation on node features. [5] rewires the graph by perturbing inter/intra-cluster edges and adding edges between similar nodes.

Our method rewires the graph by generating a strongly connected line graph ($G'$ in Fig. 2) and combining it with the original graph. Although [5] also uses feature similarity, our approach ensures the rewired graph is irreducible and aperiodic, which [5] does not achieve.

Q7: How does the integration of commute time in directed GNNs build upon or differ from [2,3]?

A7: They are different. In [2], commute time is used as an analytical tool to investigate over-squashing (OS), without directly computing it or using it to enhance message passing. In contrast, our work explicitly computes commute time to encode directional relationships between neighboring nodes, thereby enhancing message passing. Besides, our method focuses on adjacent nodes without considering distant nodes, and thus is not an OS study.

[3] also targets OS alleviation. Although it also embeds commute time into node embeddings, its optimization objective differs from ours. Specifically, [3] assumes undirected graphs, aiming for uniformly small commute times among neighbors (see Eq. (4) in [3]). Conversely, our method uses commute time to weight neighbor aggregation, accommodating the varying commute time scales inherent to digraphs.

Q8: [Q8.1] Impact of the hyperparameter $q$ in the SVD approximation; [Q8.2] the number of additional edges in the rewiring strategy; and [Q8.3] the comparison with Transformer-based models

A8: We address the hyperparameter ablation studies and comparison requests in link: https://anonymous.4open.science/r/K5269.

Official Review
Rating: 3

This paper proposes a directed graph Laplacian based on the commute time between nodes. The paper also proposes a commute graph neural network model based on rewiring the graph to satisfy model assumptions and the Laplacian matrix. Experiments are conducted to validate the model's efficacy, together with ablation studies and discussions.

Questions for Authors

  1. What is the performance of using DiLap directly as a graph shift operator without considering commute times?
  2. It is not clear why the proposed rewiring only minimally alters the overall semantics at first glance. The explanation provided at the end of the ablation study sheds some light, but this rewiring is highly dependent on feature similarity. The rewiring should only be helpful if features play a large role in the prediction. It is better to check whether not using the graph structure at all, or using a generated graph merely based on feature similarity, already suffices. Could you check whether, on these datasets, constructing graphs merely based on KNN graphs from features suffices?

Claims and Evidence

  1. It is not clear at first glance why we should compute commute time rather than hitting time twice with different source and target nodes, as these two are indeed related. The paper provides some sort of explanation only at the end of the main text.
  2. It is not clear why the proposed rewiring only minimally alters the overall semantics at first glance. The explanation provided at the end of the ablation study sheds some light, but this rewiring is highly dependent on feature similarity. The rewiring should only be helpful if features play a large role in the prediction. It is better to check whether not using the graph structure at all, or using a generated graph merely based on feature similarity, already suffices.

Methods and Evaluation Criteria

  1. The datasets seem fine.
  2. It is not clear why the proposed rewiring only minimally alters the overall semantics at first glance. The explanation provided at the end of the ablation study sheds some light, but this rewiring is highly dependent on feature similarity. The rewiring should only be helpful if features play a large role in the prediction. It is better to check whether not using the graph structure at all, or using a generated graph merely based on feature similarity, already suffices.

Theoretical Claims

There is no proof or reference of proof for Proposition 3.2.

Experimental Design and Analyses

  1. To better demonstrate the discussion claims, the authors could include some synthetic datasets.
  2. What is the performance of using DiLap directly as a graph shift operator without considering commute times?

Supplementary Material

I glanced through all of it.

Relation to Broader Literature

The paper is well connected to the wider literature.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The paper is not very well motivated.

Other Comments or Suggestions

There should be a "tp" after $G'$ on the RHS of line 198.

Author Response

Thanks for your insightful review. We have summarized all your feedback into 6 key questions and address them below.

Q1: Why computing commute time rather than hitting time twice with different source and target nodes

A1: In fact, our method precisely aligns with your suggestion -- "computing hitting time twice with different source and target nodes." By definition, commute time is the expected number of steps to travel from one node to another and return. It equals the sum of hitting times in both directions. Our approach follows this principle. Specifically, the proposed efficient commute time computation method first computes the hitting time $H$ based on DiLap as in Eq. (10). We then obtain the commute time by summing the mutual hitting times between source and target nodes via $C = H + H^\top$. Thus, you can indeed view our method as computing the hitting time in both directions and summing them to obtain the commute time.
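To make the relationship concrete, here is a small self-contained NumPy example (ours, not the paper's code) that computes pairwise hitting times for a toy ergodic Markov chain by solving linear systems and then forms the commute-time matrix as $C = H + H^\top$:

```python
import numpy as np

# Toy 3-state ergodic Markov chain; rows of P sum to 1.
P = np.array([[0.0, 0.7, 0.3],
              [0.2, 0.0, 0.8],
              [0.5, 0.5, 0.0]])
n = P.shape[0]

# Hitting times H[i, j] = expected steps from i to j, obtained by
# solving (I - P_without_j) h = 1 with the target state j removed.
H = np.zeros((n, n))
for j in range(n):
    keep = [i for i in range(n) if i != j]
    A = np.eye(n - 1) - P[np.ix_(keep, keep)]
    H[keep, j] = np.linalg.solve(A, np.ones(n - 1))

C = H + H.T  # commute time = sum of the two directional hitting times
print(np.allclose(np.diag(C), 0.0))  # True: zero commute time from a node to itself
```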

Q2: [Q2.1] Why the proposed rewiring is minimally altering the overall semantics; [Q2.2] Check whether omitting the graph structure or using a KNN graph is sufficient

A2.1: Our similarity-based rewiring method first generates a line graph $G'$ (see Fig. 2) based on feature similarity. Then merging $G'$ with $G$ adds at most two new edges per node, resulting in only minor changes to the original graph's structural semantics. Moreover, since many real-world graphs already link nodes with similar features, these newly added edges often overlap with existing ones, further narrowing the semantic shift caused by rewiring. Table 9 in Appendix D.3 confirms that the rewiring does not substantially change the graph.

A2.2: If we entirely ignore the graph structure and rely solely on node features, a natural baseline is an MLP. The performance gap between the MLP and our methods underscores the importance of structural information. Alternatively, we can omit the original structure and reconstruct the graph as a KNN graph based solely on node features, yielding a KNN-GCN baseline. We compare the performance of MLP, KNN-GCN, and CGNN. Results are presented in https://anonymous.4open.science/r/Rebuttal-107E/Q1.md, which shows that CGNN achieves the best performance, confirming the crucial role of the original graph structure.
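For reference, a KNN-GCN-style graph can be built from features alone along these lines; this is a generic sketch, and the number of neighbors $k$ is an assumption, not the value used in the rebuttal experiments:

```python
from sklearn.neighbors import kneighbors_graph

def knn_graph_from_features(X, k=5):
    """Build a feature-only adjacency: connect each node to its k nearest
    neighbors in feature space, then symmetrize. This structure replaces
    the original graph for a KNN-GCN baseline."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    return A.maximum(A.T)
```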

Q3: Proof for Prop 3.2

A3: Prop 3.2 states standard definitions from Markov chain theory, which are well-established in the literature (e.g., Aldous & Fill, 2002). Thus, this proposition serves to clarify these terminologies and does not require additional proof. We will add the reference in the revised manuscript.

Q4: Suggestion on including synthetic datasets

A4: While our paper prioritized experiments on established real-world datasets to highlight the practical significance of CGNN, synthetic datasets can indeed enhance the interpretability of CGNN. To address your request, we have now constructed a binary-classification dataset of 3000 nodes (1500 per class), with features drawn from Gaussian distributions (class one: $N(0,1)$, class two: $N(3,1)$). Within-class edges are created with a 0.2 probability, while across-class edges have a 0.02 probability, with random edge directions. We predefine a commute-path range of [2, 7], thus capturing asymmetric node-to-node relationships. We apply CGNN on this graph and compute the Mutual Information between the central node and its neighbors with short commute times, denoted $\alpha_s$, and with long commute times, denoted $\alpha_l$. If the average $\overline{\alpha}_s > \overline{\alpha}_l$, it confirms that our model effectively preserves commute relationships. In our experiments, CGNN achieves $\overline{\alpha}_s = 13.297$ and $\overline{\alpha}_l = 6.552$, showing that CGNN does capture the intended asymmetry.
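A rough reconstruction of this synthetic graph is sketched below (ours, not the authors' script); the response does not specify the feature dimensionality or how edge directions are assigned, so one-dimensional features and a fair coin flip per sampled edge are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n = 1500, 3000

# One-dimensional Gaussian features: class 0 ~ N(0, 1), class 1 ~ N(3, 1).
x = np.concatenate([rng.normal(0.0, 1.0, n_per_class),
                    rng.normal(3.0, 1.0, n_per_class)])
y = np.repeat([0, 1], n_per_class)

# Sample undirected pairs: within-class prob 0.2, across-class prob 0.02,
# then assign each sampled edge a random direction.
iu, ju = np.triu_indices(n, k=1)
p = np.where(y[iu] == y[ju], 0.2, 0.02)
keep = rng.random(p.shape) < p
src, dst = iu[keep], ju[keep]
flip = rng.random(src.shape) < 0.5
edge_index = np.stack([np.where(flip, dst, src), np.where(flip, src, dst)])  # [2, E]
```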

Q5: Performance of using DiLap as a graph shift operator (GSO) w/o commute times

A5: To treat the DiLap $T$ as a GSO, we need to decompose it into $T_{in}$ and $T_{out}$ based on incoming and outgoing edges. These sub-GSOs yield node embeddings $H_{in}$ and $H_{out}$, which are then combined using a learnable weight. This variant, called DiLapGNN, allows us to evaluate DiLap-induced GNNs without commute times. Please refer to https://anonymous.4open.science/r/Rebuttal-107E for a performance comparison.

Q6: Justification on motivation

A6: Thank you for raising this concern. We respectfully note that other reviewers found our work clearly motivated—reviewer 951e stated it provides a "nice introduction for commute time and a motivation for it," and reviewer 9WJX described our approach as "novel, well motivated, and technically sound." Our motivation is twofold: practically, commute time naturally captures real-world asymmetries (e.g., celebrity-follower relationships) common in directed social networks. Algorithmically, current GNNs fail to model such directional asymmetry effectively (Sec. 3.1). These insights motivated us to leverage commute time to enhance directed GNNs.

Official Review
Rating: 3

The paper presents a new approach to embed commute time information when learning on directed graphs, for the task of node property prediction. It gives a nice introduction to commute time and a motivation for it. Then, the work proposes to approximate the commute time information, which is usually computed in O(|V||E|), at O(q|E|), where q is constant. On the tested benchmarks, embedding the commute time information achieves state-of-the-art results.

Update after rebuttal

Although some issues were addressed in the authors’ response, others still remain unresolved. For example, why does q = 5 perform better than q = 10 in the ablation study, and how should a subgraph/neighborhood/sampled graph be chosen to compute B? It is also not clear from the response how the provided clarifications will be incorporated into the manuscript. For these reasons, I choose to maintain my original score.

Questions for Authors

I have no further questions.

Claims and Evidence

Overall, the claims presented by the authors are convincing and clearly articulated. However, one point of concern remains. The manuscript motivates the importance of preserving the original commute time information. Yet, in practice, the proposed method introduces new edges from G′ to G, thereby altering the original commute time characteristics. Additionally, the commute time matrix is computed via a low-rank approximation of Z with rank q. It would be beneficial if the authors could clarify how their method maintains effectiveness given that the commute time information is only partially preserved.

Methods and Evaluation Criteria

I believe that the evaluation is OK. Please refer to my point in the strengths and weaknesses section about the hyperparameter tuning of DirGNN.

Theoretical Claims

I did not find any issues with the proofs.

Experimental Design and Analyses

I did not find any issues with the experimental design or analyses.

Supplementary Material

I reviewed the full appendix.

Relation to Broader Literature

Developing symmetric matrices for directed graphs (to be used later by spectral techniques) such as the normalized directed Laplacian (Chung, 2005) is an exciting topic. Computing the commute time matrix in an efficient manner might contribute to this field of research.

Essential References Not Discussed

I believe that the authors covered most of the relevant literature. I would like to favorably point out the reference to the work of Chung (2005).

Other Strengths and Weaknesses

Strengths

The paper demonstrates that the proposed weighting based on commute time yields state-of-the-art results on several directed graph benchmarks.

By approximating the full commute time matrix via the matrix Z, the method offers a promising alternative to traditional, more computationally expensive approaches.

The proposed CGNN model is shown to be relatively efficient, with its time complexity being proportional to |E| when q is treated as a constant.

Weaknesses

The paper does not include an ablation study on the parameter q. Given that increasing the rank q in the SVD could potentially yield more accurate commute time computations, such a study would help clarify whether the improvements in CGNN performance can be attributed directly to the enhanced commute time information.

Timing experiments are reported for only a subset of datasets. In particular, it would be beneficial to see timing results for the Snap-Patents dataset, which has the largest number of nodes and edges.

The paper does not provide sufficient details on how the hyperparameters for DirGNN were tuned and whether they were optimized in a manner similar to those for CGNN.

There is a potential concern regarding the scalability of the method to dense graphs, as the dimensions of the matrix B may impose limitations.

Other Comments or Suggestions

The expression $h(v_i, v_k)$ is repeated on line 139 and should be corrected.

Lines 150–154 discuss an aspect that either requires experimental validation or a citation to existing literature.

Please fix the title at the beginning of each new page; currently it reads "Submission and Formatting Instructions for ICML 2025".

Author Response

Thanks for your feedback. Our detailed responses follow.

Q1: Clarify how to maintain effectiveness given that the commute time information is partially preserved

A1: The effectiveness can be well maintained, as the proposed graph rewiring approach does not drastically alter the original graph's structural semantics or commute time information. Specifically, our similarity-based rewiring method first generates a line graph $G'$ (see Fig. 2) based on feature similarity. Then merging $G'$ with $G$ adds at most two new edges per node, resulting in only minor changes to the original graph's structural semantics. Moreover, since many real-world graphs already link nodes with similar features, these newly added edges often overlap with existing ones, further narrowing the semantic shift caused by rewiring. Below, we verify that both the structural semantics and the commute times are well maintained, from two perspectives.

[Changes of structure] To quantify the changes brought by graph rewiring, we define edge density as $\delta = \frac{M}{M_{max}}$, where $M_{max}$ is the maximum possible number of edges and $M$ is the actual number of edges. We denote the edge density of the original graph $G$ as $\delta$ and that of the rewired graph $\widetilde{G}$ as $\widetilde{\delta}$. The relative change in graph density after rewiring can then be represented as $\Delta = \frac{\widetilde{\delta} - \delta}{\delta} \in (0, 1)$; a smaller $\Delta$ indicates a smaller effect of our method on graph density. The results in the table below reveal that on the AM-Photo dataset, graph rewiring increases density by 10.3%, while on Snap-Patents and Arxiv-Year the increases are only 6.7% and 3.2%, respectively. These findings show our rewiring method has a modest effect on structure.

|  | AM-Photo | Snap-Patents | Arxiv-Year |
| --- | --- | --- | --- |
| $\Delta$ | 0.103 | 0.067 | 0.032 |

[Changes of commute time] Since we cannot directly compute the commute time on the original graph, rewiring is necessary. We extract its largest connected component and remove absorbing nodes to ensure meaningful commute times. Denoting the average normalized commute times of the original and rewired graphs as $c_o$ and $c_r$ respectively, we quantify the change using $\delta = \frac{\|c_o - c_r\|_2^2}{\|c_o\|_2^2}$, representing the proportion of commute information altered. The results show that the rewiring method effectively preserves the original commute times.

|  | AM-Photo | Snap-Patents | Arxiv-Year |
| --- | --- | --- | --- |
| $\delta$ | 0.0227 | 0.0013 | 0.0125 |

Q2: Ablation study on q

A2: We conducted an ablation study on the SVD rank $q$ across three datasets of varying sizes. The results indicate that increasing $q$ beyond a certain point does not yield continuous performance gains and incurs additional computational cost. Our findings suggest that setting $q$ = 5 or 7 is optimal, as a small rank is sufficient for nodes to capture neighbor relationship strengths. (A generic sketch of the truncated-SVD approximation follows the table below.)

| $q$ | AM-Photo | Snap-Patents | Arxiv-Year |
| --- | --- | --- | --- |
| 3 | 88.36 | 71.53 | 64.20 |
| 5 | 90.42 | 72.89 | 66.16 |
| 7 | 90.19 | 73.01 | 66.21 |
| 10 | 90.37 | 72.97 | 66.45 |
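A generic sketch of a rank-$q$ truncated-SVD pseudoinverse is shown below; one common convention keeps the top-$q$ singular triplets (i.e., the pseudoinverse of the best rank-$q$ approximation), and the paper's exact routine for DiLap may differ.

```python
import numpy as np

def truncated_pinv(T, q):
    """Approximate the Moore-Penrose pseudoinverse of T using only the
    top-q singular triplets: Z ~= V_q diag(1/s_q) U_q^T."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    return Vt[:q].T @ np.diag(1.0 / s[:q]) @ U[:, :q].T
```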

Q3: Timing results for the Snap-Patents dataset

A3: Below are the timing results on the Snap-Patents dataset. Due to its large scale, models like GAT, DGCN, DiGCN, and Magnet ran out of memory. Among the remaining models, APPNP was the fastest at 10.75 mins, followed by GPRGNN at 13.89 mins, CGNN at 20.90 mins, and GCNII at 21.25 mins. Although CGNN requires extra commute time computation, it converges faster, with only 80 training epochs needed to achieve optimal validation accuracy.

Q4: How the hyperparameters (hps.) for DirGNN were tuned and whether they were optimized in a manner similar to those for CGNN

A4: We treat DirGNN and CGNN as separate models and tune their hps. independently. For datasets used in the original DirGNN paper, we follow its recommended settings; for other datasets, we carefully perform our own tuning. Because CGNN integrates commute‐time‐based weights, we cannot simply transfer DirGNN’s hps. For instance, on the Arxiv‐Year dataset, DirGNN uses a hidden dim of 256, whereas CGNN achieves optimal performance with 64.

Q5: Concern regarding the scalability on dense graphs, as the dimensions of the matrix B may impose limitations.

A5: Indeed, the dimensions of the matrix $B$ scale with the number of edges, which can become problematic for dense graphs. One could localize the commute time computation, approximating it within subgraphs or neighborhoods, rather than computing a global matrix. Another option involves applying sampling or sparsification to reduce edge density. We plan to explore these strategies in future work.

Q6: Other comments or suggestions

A6: Thanks for pointing out these issues. In the revised manuscript, we have 1) corrected $h(v_i, v_k)$ to $h(v_i, v_j)$ on line 139; 2) added supporting references (oversmoothing and non-local GNN papers) to lines 150–154; 3) modified the header on each page.

Official Review
Rating: 3

This paper proposes a novel method for learning on directed graphs. The authors argued that it is important to preserve a notion of mutual closeness in directed graphs via commute time, and introduced a method to compute/approximate this efficiently. They then encode this information into the design of message passing blocks in a directed graph neural network. Experimental results demonstrate the effectiveness of the proposed method, and detailed analyses are carried out to understand its behaviour.

Questions for Authors

See above.

Claims and Evidence

The main claims of the paper are that taking into account mutual closeness between a pair of nodes in a directed graph is beneficial in certain learning tasks, and commute time (or its approximation) is a useful tool to encode such relationship. I found this well supported both conceptually and empirically.

Methods and Evaluation Criteria

While the CGNN introduced in Section 4.4 is entirely reasonable, the commute time is used there only to moderate neighbour importance. I wonder whether the authors can think about alternative ways of making use of the commute time in designing a directed GNN, e.g. using it in a regularisation term to penalise message passing, or even defining a non-MPNN based framework (given that commute time is of general interest)?

The commute time is a measure entirely dependent on the graph topology, and importantly not on the specific downstream task. One may ask whether we should take into account the latter in assessing the importance of a node i to j and vice versa.

The Commute time has also recently been used to characterise the issue of over-squashing in undirected GNNs (see for example https://arxiv.org/pdf/2302.02941). I wonder if the authors can think about using commute time on directed graphs for similar analysis?

Theoretical Claims

The theoretical results seem correct to me.

Experimental Design and Analyses

Apart from an exponential function, have the authors tried other ways to convert commute time into a similarity measure? How stable is the exponential transformation?

Supplementary Material

Yes, I reviewed all parts of the Supplementary Material.

Relation to Broader Literature

The key contribution of the paper is to take into account commute time in designing the message passing framework in a directed GNN. As far as I know this aspect is missing in previous literature.

Essential References Not Discussed

The references seem adequate to me.

Other Strengths and Weaknesses

Strengths:

  • the paper addresses a relatively under-explored but important problem in graph machine learning
  • the proposed approach is novel, well motivated, and technically sound
  • empirical results and analyses are comprehensive and convincing
  • the paper is well written and easy to follow

Weaknesses:

  • while the analysis of commute time is sound and interesting, its utilisation in designing the directed GNN is rather straightforward. One might argue this design is a straightforward extension of existing methods such as DirGNN; however, it is also understandable that the authors chose to stick to the conventional message passing paradigm

Other Comments or Suggestions

Line 138: it looks like one of the h should be h(v_i, v_j)?

Author Response

Q1: Alternative ways of using the commute time in designing a directed GNN, e.g., regularisation or a non-MPNN?

A1: We appreciate your suggestion. While alternative approaches like incorporating commute time as a regularization term or exploring non-MPNN frameworks are interesting, our design intentionally leverages commute time to directly modulate neighbor importance within message passing, ensuring that nodes connected via shorter commute times inherently produce more similar embeddings. To rigorously evaluate your suggested alternative, we implemented a regularization-based variant ($CGNN_{reg}$). Specifically, we removed the weighting coefficients ($\widetilde{C}$) from Eq. (11) and instead introduced a commute-time regularization term ($\lambda L_{CT}$) added to the cross-entropy loss $L_{CE}$. The new term, $L_{CT} = \sum_{(u, v) \in E} \widetilde{C}_{uv} \|h_u - h_v\|_2^2$, penalizes pairs of nodes with long commute times if they end up with overly similar representations, while encouraging similarity for pairs with short commute times. We also included a simpler non-MPNN baseline, denoted $MLP_{reg}$, with $L_{CT}$ computed using MLP embeddings. Results are shown in the table below (a minimal sketch of this regularizer is given after the results).

|  | Chameleon | Roman |
| --- | --- | --- |
| MLP | 34.92 | 48.30 |
| $MLP_{reg}$ | 45.69 | 52.98 |
| $CGNN_{reg}$ | 78.33 | 90.62 |
| CGNN | 79.62 | 92.87 |

This shows that our original design choice not only aligns with the intended theoretical rationale but also empirically demonstrates superior performance -- weighting neighbors during message passing appears to be more effective than relying on the regularization-based approach.
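For clarity, here is a minimal PyTorch-style sketch of the $L_{CT}$ regularizer described above; it assumes a precomputed commute-time weight matrix $\widetilde{C}$ and an edge list, and all names are ours rather than the authors' code:

```python
import torch

def commute_time_regularizer(h, edge_index, C_tilde):
    """L_CT = sum over edges (u, v) of C~_uv * ||h_u - h_v||_2^2.

    h:          [N, d] node embeddings
    edge_index: [2, E] directed edge list (source, target)
    C_tilde:    [N, N] commute-time-derived weights (dense here for brevity)
    """
    u, v = edge_index
    diff = h[u] - h[v]                                   # [E, d]
    return (C_tilde[u, v] * diff.pow(2).sum(dim=1)).sum()

# Total loss of the CGNN_reg variant, with lam a tunable coefficient:
# loss = ce_loss + lam * commute_time_regularizer(h, edge_index, C_tilde)
```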

Q2: Should we consider the task-specific information in assessing importance?

A2: Thanks for raising this point. While we acknowledge that task-specific signals can indeed enhance targeted model performance, we would like to highlight that our use of commute time is intentionally rooted in capturing fundamental structural properties inherent to graph topology, which are particularly critical in digraphs. The topology-driven nature of commute time allows our model to robustly encode essential structural relationships independent of specific downstream tasks, enhancing generalizability and interpretability. Moreover, commute time has the potential to serve as a complementary enhancement to existing task-aware methods like Label Propagation (LP). We used GCN-LPA (Wang et al., TOIS 2021), an LP-based GNN, for evaluation. Integrating commute time improved accuracy by 2.74% on the Roman dataset. Thus, our method strategically leverages commute time to provide a solid structural foundation, while still enabling targeted task-specific refinements where advantageous.

Q3: Extending the use of commute time to analyze over-squashing in digraphs

A3: As you mentioned, commute time has been used to analyze the cause of over-squashing (OS), and extending the analytical framework to digraphs is a promising direction. While our work primarily focuses on using commute time to encode the directional strength of relationships between nodes, rather than studying the OS problem, we still see two ways that our method might enhance OS analysis: 1) Direction‐Aware Jacobian: The referenced paper models OS with a symmetric Jacobian suitable for undirected graphs. In contrast, our model captures the directionality of edges, allowing us to construct an asymmetric Jacobian for more targeted analysis in digraphs. 2) Quantifying “Long Commutes”: While previous OS work identifies that OS tends to occur between nodes that are “far apart,” it does not quantify exactly how large a commute time triggers OS. Our method, anchored by the DiLap‐based commute time computation, can explicitly quantify these distances.

Q4: Other ways to convert commute time into a similarity measure

A4: Yes, converting commute time into a similarity measure is analogous to turning distances into similarities, and multiple alternatives are possible. Beyond the negative exponential mapping (NEM) used in our paper, we also experimented with a shifted inverse function $\frac{1}{C + \epsilon}$. In empirical evaluations, NEM consistently performed better in continuous embedding spaces and boosted accuracy across datasets of various sizes.
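For illustration, both mappings can be written as a small hypothetical helper; the temperature $\tau$ and offset $\epsilon$ values are assumptions, not those used in the paper:

```python
import numpy as np

def commute_to_similarity(C, scheme="nem", tau=1.0, eps=1e-6):
    """Map commute times to similarity weights."""
    if scheme == "nem":            # negative exponential mapping used in the paper
        return np.exp(-C / tau)
    return 1.0 / (C + eps)         # shifted-inverse alternative from the rebuttal
```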

Q5: Other Weakness: Utilisation of commute time is straightforward

A5: Although weighting neighbors by commute time appears straightforward, it is an intuitive choice that allows us to isolate the effect of commute time while minimizing other interfering factors. The primary innovation of our method lies not merely in the weighting scheme itself but fundamentally in identifying and exploiting the notion of mutual path dependencies. Moreover, demonstrating effectiveness via an intuitive implementation is the basis for exploring more sophisticated designs.

Line 138: Thanks! We will correct the typo in the revision.

Reviewer Comment

I thank the authors for their detailed responses and in particular the added empirical experiments. I am therefore keeping my original score.

Final Decision

This paper proposes a neural architecture for directed graphs. The main idea is to re-weight the messages of in-/out-edges by commute time to integrate longer-range structure into the message passing. To compute commute time efficiently, the authors use a specific directed Laplacian, and approximate the computation of the required pseudoinverse with truncated SVD. The proposed method performs well when compared to other approaches, including one transformer for directed graphs. One weakness is the scalability to dense graphs, as some computations scale with the number of edges. During the discussion, the reviewers raised a number of questions, e.g., on hyperparameter tuning, additional comparisons (e.g. to transformers) and ablations. The authors answered these well, but should definitely include the experiments from the rebuttal in the main paper. Optional: In addition, I would be curious how the derived commute times would perform when used for positional encodings instead of re-weighting the message passing (but I did not take this into account for the decision).