PaperHub
Overall score: 6.1 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

HGOT: Self-supervised Heterogeneous Graph Neural Network with Optimal Transport

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Heterogeneous graph · Self-supervised learning · Optimal transport

Reviews and Discussion

Official Review (Rating: 3)

The paper proposes a self-supervised heterogeneous graph neural network (HGNN) coined "HGOT". It incorporates optimal transport (OT) into heterogeneous graphs to facilitate the learning of a more semantically accurate similarity measure between graph instances and structure. The method introduces three components: node feature transformation, which projects different types of node features into one latent space; aggregated view generation, which constructs multiple meta-path-based views and an aggregated view over the entire graph; and multi-view OT alignment, which calculates the optimal transport plan and minimizes the difference between views and the costs between feature and structural representations.

Questions For Authors

(1) The authors should provide clear definitions of the Wasserstein distance, the Gromov-Wasserstein (GW) distance, and the fused GW distance in the preliminaries, along with appropriate citations, to help readers unfamiliar with OT. Are Eqs. (8)-(13) the original definitions of these distances? What does the symbol $\otimes$ represent?

(2) The caption of Figure 1, "We first transform all nodes of the original heterogeneous graph into node features", is ambiguous. Does HGOT use special encoding methods to generate the node features $X$?

(3) HeCo conducted experiments under different dataset splits. What dataset split is used in this work?

(4) Will the code implementation of HGOT be open-sourced?

Claims And Evidence

Most parts of the paper are clearly written and easy to read, despite some grammatical errors. The authors have clearly demonstrated the semantic advantage of local-global alignment between the meta-path view and the aggregated view in their proposed method. However, I have some concerns about the clarity and persuasiveness of the methodology section.

(1) The authors do not explicitly clarify the significance of using OT for graph self-supervised learning. What are the advantages of calculating the distribution distance between views compared with other graph self-supervised learning methods? For instance, why is the OT-based approach superior to reconstruction-based methods (e.g., HGATE [1], HGMAE [2], RMR [3]) when neither requires augmentation methods or negative sampling?

(2) The explanation of the OT distance is unclear. The authors should provide an intuitive explanation of the OT distances designed for feature and edge information, ideally with theoretical or empirical comparisons to other metrics. Additionally, a more convincing analysis of the choice of cosine similarity and the absolute adjacency-matrix difference as $\mathcal{C}_X$ and $\mathcal{C}_A$ is expected.

(3) The ablation study results cannot validate the effectiveness of the aggregated view, as it contributes relatively little to the overall model.

[1] Wang, W., Suo, X., Wei, X., Wang, B., Wang, H., Dai, H. N., and Zhang, X. 2021. HGATE: Heterogeneous graph attention auto-encoders. TKDE.
[2] Tian, Y., Dong, K., Zhang, C., Zhang, C., and Chawla, N. V. 2023. Heterogeneous graph masked autoencoders. In AAAI.
[3] Duan, H., Xie, C., and Li, L. 2024. Reserving-masking-reconstruction model for self-supervised heterogeneous graph representation. In KDD.

Methods And Evaluation Criteria

On the methodology side, I believe that the proposed method is an application of existing self-supervised learning techniques to heterogeneous graphs, rather than presenting innovative insights or designs. (1) Node feature transformation and attention-based meta-path representation aggregation are very common practices in HGNNs, as seen in HAN [1] and HeCo [2]. (2) Self-supervised learning using Wasserstein distance across different contrastive views has already been proposed in COLES [3]. Although the authors claim that they are the first to apply OT to heterogeneous graphs, they do not explain or analyze how OT facilitates the learning of heterogeneous representations. Instead, heterogeneous information is only considered in feature transformation and the construction of contrastive views, while the OT-based loss itself seems unrelated to heterogeneous graph learning. For example, OT could have been used to compute the distance between different node/edge type distributions to facilitate the learning of cross-type heterogeneous representations. However, this paper only focuses on learning the distribution distances of node features and adjacency matrices under the homogeneous transformation, which raises doubts about the necessity of the OT strategy for HGNNs. Therefore, I find the innovation in this paper to be limited. On the evaluation side, the authors selected four heterogeneous benchmark datasets and equipped each downstream task with multiple evaluation metrics. The overall evaluation setup is sound to me except for the absence of dataset split information.

[1] Wang, X., Ji, H., Shi, C., Wang, B., Cui, P., Yu, P., and Ye, Y. 2019. Heterogeneous graph attention network. In WWW.
[2] Wang, X., Liu, N., Han, H., and Shi, C. 2021. Self-supervised heterogeneous graph neural network with co-contrastive learning. In KDD.
[3] Zhu, H., Sun, K., and Koniusz, P. 2021. Contrastive Laplacian Eigenmaps. In NeurIPS.

Theoretical Claims

The theoretical analysis is insufficient. I suggest that the authors include some theoretical analyses of the OT-based learning objective, including but not limited to the connection with other self-supervised objectives such as the InfoMax principle, generalization bounds, and cross-domain transferability. In particular, the contribution of OT to heterogeneous representation learning should be analyzed to support the claims and establish a solid theoretical foundation for the proposed method.

Experimental Designs Or Analyses

I appreciate that the authors conducted experiments on both node classification and node clustering to demonstrate the generalization of HGOT across different scales of graph structure. Although HGOT achieves significant performance improvements on all datasets and multiple evaluation metrics, I do not believe its performance reaches the state of the art claimed by the authors, as only one baseline published after 2020 (HGBER) is included; more recent works should be incorporated. Besides, the paper fails to provide an in-depth empirical analysis of HGOT's advantages. Some additional experiments need to be included:

(1) More cutting-edge reconstruction-based and attention-based methods should be added as baselines, such as those mentioned in Claims and Evidence [1], [2], [3]. Considering the similarity between HGOT and COLES [4], the authors should also elaborate on the differences and advantages of the proposed OT loss over COLES through comparative experiments.

(2) Since OT plays a positive role in domain adaptation, I recommend that the authors design transfer-learning experiments to evaluate HGOT's transferability across different data domains or to out-of-distribution samples.

(3) Considering the complexity of calculating distribution distances on large-scale networks, the efficiency and scalability of HGOT are not guaranteed, especially since it uses multiple meta-path views during training. Although the authors provide a rough efficiency analysis of HGOT in Appendix D, I suggest they further present a complexity comparison and efficiency evaluation of HGOT against other self-supervised models. This would more clearly validate HGOT's efficiency without augmentation methods and negative samples.

[1] Wang, W., Suo, X., Wei, X., Wang, B., Wang, H., Dai, H. N., and Zhang, X. 2021. HGATE: Heterogeneous graph attention auto-encoders. TKDE.
[2] Tian, Y., Dong, K., Zhang, C., Zhang, C., and Chawla, N. V. 2023. Heterogeneous graph masked autoencoders. In AAAI.
[3] Duan, H., Xie, C., and Li, L. 2024. Reserving-masking-reconstruction model for self-supervised heterogeneous graph representation. In KDD.
[4] Zhu, H., Sun, K., and Koniusz, P. 2021. Contrastive Laplacian Eigenmaps. In NeurIPS.

Supplementary Material

I have reviewed the supplementary material and noted that the authors discuss details of the data, experimental configurations, ablation studies, and complexity analysis. That said, I suggest that the authors incorporate Appendix C into the main text and further formalize and extend the relevant analyses; this would significantly enhance the soundness of the proposed method.

Relation To Broader Scientific Literature

Self-supervised learning on heterogeneous graphs is one of the key approaches for exploring heterogeneous data structure. This field can provide meaningful insights for building graph foundation models across different graph types in the future.

Essential References Not Discussed

See Claims And Evidence & Experimental Designs Or Analyses.

Other Strengths And Weaknesses

See other parts of the review.

Other Comments Or Suggestions

(1) The reference format is inconsistent, making the citations difficult to retrieve. (2) The mixed use of certain symbols, such as $\sigma$ and $\mathcal{E}$, can lead to confusion.

Author Response

R(1): Different from other self-supervised learning methods, optimal transport (OT) captures the matching information from the original graph space to the representation space, yielding node representations that are consistent with the optimal transport plans. Second, reconstruction-based methods usually involve mask-reconstruction mechanisms, which require damaging the input information to a certain degree, whereas our method directly extracts useful supervisory signals for matching information from the characteristics of the heterogeneous graph itself.

R(2): The Wasserstein distance is a metric that quantifies the difference between nodes in the graph, measuring the minimal cost required to transform one distribution into another. The Gromov-Wasserstein distance is a metric that quantifies the difference between edges in the graph. The fused Gromov-Wasserstein distance combines the two through a trade-off parameter to achieve optimal transport over the entire graph. The reasons for choosing $\mathcal{C}_X$ and $\mathcal{C}_A$ as the cost matrices are as follows: cosine similarity ($\mathcal{C}_X$) is widely used in related work, and the absolute adjacency-matrix difference ($\mathcal{C}_A$) conveniently reduces the computational cost of the four-dimensional tensor calculation.

R(3): The aggregated view, as a central view containing all semantic information, enables each branch view to capture the transport relationship between semantic information through optimal transport. By transporting each branch view to the central view, the transport relationship from each individual semantic to the comprehensive semantics is obtained, which better promotes heterogeneous representation learning. In summary, although the contribution of this component to the overall model is not large, as shown in the ablation study, it still improves performance, and the aggregated view plays an important role in the optimal transport process.
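For readers unfamiliar with OT, the standard discrete forms of these three distances are given below (following Peyré & Cuturi, 2019 and Vayer et al., 2019); the paper's Eqs. (8)-(13) may differ in notation and constants, so this is only a reference sketch rather than the authors' exact formulation.

```latex
% Standard discrete OT distances; notation may differ from the paper's Eqs. (8)-(13).
% Coupling set: U(\mu,\nu) = \{\Pi \ge 0 : \Pi\mathbf{1} = \mu,\ \Pi^{\top}\mathbf{1} = \nu\}
\begin{aligned}
W(\mu,\nu) &= \min_{\Pi \in U(\mu,\nu)} \textstyle\sum_{i,j} \mathcal{C}_{ij}\,\Pi_{ij}
  &&\text{(cross-view node/feature cost } \mathcal{C}\text{)}\\
GW(\mu,\nu) &= \min_{\Pi \in U(\mu,\nu)} \textstyle\sum_{i,j,k,l} \bigl|C^{1}_{ik}-C^{2}_{jl}\bigr|^{2}\,\Pi_{ij}\Pi_{kl}
  &&\text{(intra-view structure costs } C^{1},C^{2}\text{)}\\
FGW_{\alpha}(\mu,\nu) &= \min_{\Pi \in U(\mu,\nu)} \textstyle\sum_{i,j,k,l}
  \bigl[(1-\alpha)\,\mathcal{C}_{ij}+\alpha\,\bigl|C^{1}_{ik}-C^{2}_{jl}\bigr|^{2}\bigr]\Pi_{ij}\Pi_{kl}
  &&\text{(trade-off } \alpha\in[0,1]\text{)}
\end{aligned}
```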

R(1): By using node feature transformation, the feature dimensions of all nodes are unified, which facilitates the subsequent optimal transport calculations. Different from HeCo and HAN, which only use an attention mechanism to aggregate node-level representations, we aggregate nodes and edges separately to obtain a complete aggregated graph structure. The relationship between the aggregated view and each meta-path view is then obtained through optimal transport to promote heterogeneous representation learning.

R(2): A heterogeneous graph contains different kinds of semantic information, and meta-paths are used to capture this semantic information in our work. By using OT, we capture the optimal transport plan from each meta-path view to the aggregated view and calibrate the encoder by aligning these plans to obtain better node representations. Therefore, OT promotes heterogeneous representation learning and captures the rich semantic information in heterogeneous graphs. This will be clarified in the final version.

Theoretical Claims: We will discuss the connection between OT and other self-supervised learning objectives such as the InfoMax principle in the final version.

Experimental Designs Or Analyses: We will make revisions based on your comments. It is worth mentioning that, different from COLES, HGOT uses OT theory to capture the rich information in heterogeneous graphs and obtain better node embeddings. Our method does not require complex data augmentation or the selection of positive and negative samples, whereas COLES, which is based on the Laplacian eigenmap method, still requires the selection of positive and negative samples.

Supplementary Material & Other Comments Or Suggestions: We will make revisions based on your comments.

Questions For Authors: The responses to the four questions you raised are as follows:

(1) We will explain the three OT formulas in more detail in the final version, give the relevant definitions in the introduction, and add several references to this section. Formulas (8)-(13) are essentially the original definitions of these distances; we modified them following some related works (only some symbol definitions have been changed, so the underlying meaning remains unchanged). The symbol ⊗ denotes the Hadamard product (element-wise multiplication of matrices), which will be added in the final version.

(2) This is a mistake in our writing. We meant to say "we first project the node features in the heterogeneous graph into the same feature space". The node feature X is defined as the original feature of a node in the heterogeneous graph. We will make the corresponding changes in the final version.

(3) In our experiments, we divide each dataset into 8 (training set) : 1 (validation set) : 1 (test set) and report the results. As in HeCo, we also conducted experiments with other ratios, but they were not presented in the paper; we will add them in the final version.

(4) We will open-source our code in the future.
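As a concrete illustration of the 8:1:1 split described in (3), a minimal sketch is shown below; the function and variable names are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def split_nodes(num_nodes, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle node indices and split them 8:1:1 into train/val/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_nodes)
    n_train = int(ratios[0] * num_nodes)
    n_val = int(ratios[1] * num_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Example: split the labeled target nodes of a dataset with 4,000 nodes.
train_idx, val_idx, test_idx = split_nodes(4000)
```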

Reviewer Comment

I thank the authors for their response. I have read the authors' responses to all reviewers, and some of my concerns have been addressed. Overall, I am willing to increase my rating to Weak Accept if the authors address the following.

1. The authors fail to provide any of the requested empirical results in their response.

2. Some claims remain unpersuasive to me. For example, I understand that OT is used to align meta-path views with the aggregated view in heterogeneous graphs. But the question is whether one can replace the OT loss with any other self-supervised loss (e.g., InfoNCE, VICReg, or KL divergence) without losing any heterogeneous information, since such information is only related to the construction of meta-path views in the proposed method. I wonder "why is OT necessary for heterogeneous graphs".

Author Comment

Q: The question is whether one can replace the OT loss with any other self-supervised loss (e.g., InfoNCE, VICReg, or KL divergence) without losing any heterogeneous information, since such information is only related to the construction of meta-path views in the proposed method. I wonder "why is OT necessary for heterogeneous graphs".

R: Many thanks. Optimal transport (OT) is an optimization theory that studies the most efficient way to redistribute mass (or resources) from one probability distribution to another while minimizing a specified cost function. In our study, OT is adopted to calculate the transport plan between the meta-path views and the aggregated view. It has the following advantages: (1) it is no longer necessary to perform data augmentation or provide positive and negative samples for heterogeneous graph self-supervised learning; (2) it uses the optimal transport plan between the local semantics (branch views) and the global semantics (central view) of the heterogeneous graph to align the matching relationship between the graph space and the representation space, thereby obtaining higher-quality node representations. In contrast, InfoNCE, VICReg, and KL divergence are all used to capture the similarity between two distributions: InfoNCE relies on data augmentation and the selection of positive and negative sample pairs; VICReg cannot be used to extract the comparative information between the meta-path views and the aggregated view in the heterogeneous graph; and although KL divergence measures the difference between two distributions, it ignores the connection between the local semantics (branch views) and the global semantics (central view). Therefore, OT is a better self-supervised learning objective for heterogeneous graphs.
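To make the view-alignment step being discussed concrete, the sketch below computes a fused Gromov-Wasserstein transport plan between a meta-path view and the aggregated view with the POT library. It is illustrative only: the variable names, uniform marginals, and cost choices are assumptions, not the authors' implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def fgw_plan(z_meta, adj_meta, z_agg, adj_agg, alpha=0.5):
    """Transport plan between a meta-path view and the aggregated view.

    z_*:   node embeddings of each view, shape (n, d)
    adj_*: intra-view structure (e.g., adjacency) matrices, shape (n, n)
    alpha: trade-off between the feature cost and the structure cost
    """
    # Feature cost: 1 - cosine similarity between cross-view node embeddings.
    zm = z_meta / np.linalg.norm(z_meta, axis=1, keepdims=True)
    za = z_agg / np.linalg.norm(z_agg, axis=1, keepdims=True)
    M = 1.0 - zm @ za.T
    # Uniform marginals over the nodes of each view.
    p = np.full(z_meta.shape[0], 1.0 / z_meta.shape[0])
    q = np.full(z_agg.shape[0], 1.0 / z_agg.shape[0])
    # Fused Gromov-Wasserstein coupling (returns the transport plan).
    return ot.gromov.fused_gromov_wasserstein(M, adj_meta, adj_agg, p, q, alpha=alpha)

# Toy usage with random embeddings and identity "adjacency" matrices.
rng = np.random.default_rng(0)
plan = fgw_plan(rng.normal(size=(50, 16)), np.eye(50),
                rng.normal(size=(50, 16)), np.eye(50))
```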

Experimental supplement:

1. Performance comparison on node classification and node clustering tasks:

Classification:

| Method | DBLP (Micro-F1) | DBLP (Macro-F1) | ACM (Micro-F1) | ACM (Macro-F1) | IMDB (Micro-F1) | IMDB (Macro-F1) | Yelp (Micro-F1) | Yelp (Macro-F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HGCML [1] | 91.11 ± 0.5 | 90.47 ± 0.3 | 87.90 ± 0.3 | 88.46 ± 0.4 | 57.19 ± 0.7 | 51.03 ± 0.4 | 73.35 ± 0.8 | 52.90 ± 0.6 |
| HGMAE [2] | 91.89 ± 0.3 | 91.60 ± 0.5 | 89.15 ± 0.4 | 89.29 ± 0.4 | 60.13 ± 0.4 | 60.09 ± 0.6 | 72.57 ± 0.6 | 56.33 ± 0.3 |
| HGOT | 95.66 ± 0.4 | 95.14 ± 0.3 | 94.49 ± 1.0 | 94.60 ± 0.8 | 62.71 ± 1.3 | 62.34 ± 0.6 | 77.58 ± 0.8 | 65.12 ± 0.4 |

Clustering:

| Method | DBLP (ACC) | DBLP (NMI) | DBLP (ARI) | ACM (ACC) | ACM (NMI) | ACM (ARI) | IMDB (ACC) | IMDB (NMI) | IMDB (ARI) | Yelp (ACC) | Yelp (NMI) | Yelp (ARI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HGCML | 90.36 | 73.28 | 79.61 | 89.21 | 65.13 | 71.05 | 50.66 | 9.34 | 9.15 | 65.72 | 37.71 | 42.49 |
| HGMAE | 91.47 | 76.92 | 82.34 | 89.83 | 66.68 | 71.51 | 53.90 | 11.36 | 12.03 | 62.10 | 39.04 | 42.88 |
| HGOT | 93.41 | 78.05 | 84.00 | 89.77 | 66.12 | 71.94 | 60.80 | 16.28 | 18.04 | 66.23 | 39.15 | 43.07 |

We add two self-supervised learning methods as baselines: the contrastive method HGCML and the generative method HGMAE. The node classification results demonstrate that HGOT achieves the best performance over all baselines. In addition, HGOT achieves better performance on most datasets in the clustering task.

[1]Wang Z, Li Q, Yu D, et al. Heterogeneous graph contrastive multi-view learning[C]//Proceedings of the 2023 SIAM international conference on data mining (SDM). Society for Industrial and Applied Mathematics, 2023: 136-144.

[2]Tian Y, Dong K, Zhang C, et al. Heterogeneous graph masked autoencoders[C]//Proceedings of the AAAI conference on artificial intelligence. 2023, 37(8): 9997-10005.

2. Comparison of the average time consumption per training epoch of different self-supervised methods on the DBLP dataset:

| Method | Training time per epoch (s) |
| --- | --- |
| HeCo | 0.8193 |
| MEOW [3] | 1.9802 |
| HGCL [4] | 1.6933 |
| HGCML | 2.5048 |
| HGMAE | 2.1471 |
| HGOT | 0.3495 |

From the results, we can see that the training efficiency of our model HGOT surpasses that of the contrastive learning models that require positive and negative sample pairs and of the generative models with mask mechanisms. For a more detailed comparative analysis of the complexity of self-supervised learning methods, please see Sections C and D of the Appendix.

[3]Yu J, Ge Q, Li X, et al. Heterogeneous graph contrastive learning with meta-path contexts and adaptively weighted negative samples[J]. IEEE Transactions on Knowledge and Data Engineering, 2024.

[4]Chen M, Huang C, Xia L, et al. Heterogeneous graph contrastive learning for recommendation[C]//Proceedings of the sixteenth ACM international conference on web search and data mining. 2023: 544-552.

Official Review (Rating: 3)

This paper presents HGOT, a self-supervised heterogeneous graph neural network that harnesses optimal transport theory to establish an optimal transport plan between the meta-path and aggregated views. By compelling the model to learn node representations that faithfully preserve the intrinsic matching relationships between these two views, HGOT significantly enhances the quality of representation learning. Comprehensive experiments conducted on multiple benchmark datasets validate the superiority of the proposed approach, consistently achieving state-of-the-art performance.

Questions For Authors

In the ablation study (Section 5.5), the parameter sensitivity analysis indicates that node-only alignment outperforms the combined node-edge alignment. Does this suggest that incorporating edge information may introduce noise or redundancy in certain scenarios? Furthermore, would this effect vary depending on the dataset characteristics?

A more intuitive explanation of the effectiveness of Equations 13-16 is expected. Could the authors provide further insights or illustrative examples to clarify the reasoning behind their derivations?

Claims And Evidence

The paper claims the integration of optimal transport with heterogeneous graph neural networks (HGNNs) is novel. Specifically, the alignment between graph-space and representation-space transport plans provides a fresh perspective for self-supervised learning, circumventing the limitations of traditional contrastive learning. Experimental results validate the effectiveness of the proposed method, showing state-of-the-art performance on benchmark datasets.

Methods And Evaluation Criteria

This paper formulates the alignment problem using Gromov-Wasserstein optimal transport and employs a self-supervised learning framework to refine node representations. The proposed method is technically sound.

The evaluation metrics used to benchmark HGOT’s performance include accuracy and clustering-based evaluation on multiple real-world heterogeneous graph datasets. The evaluation criteria are widely used and appropriate.

Theoretical Claims

This work provides mathematical definitions of heterogeneous graphs and optimal transport. The proposed method itself has no theoretical claims.

Experimental Designs Or Analyses

The proposed method is evaluated on four real-world datasets under different tasks, including node classification, node clustering, and visualizations. The experimental designs and performance comparisons are sound.

Supplementary Material

The authors provide experimental details, additional parameter and complexity analyses, and discussions with graph contrastive learning in the supplementary material.

Relation To Broader Scientific Literature

This work may be helpful for researchers interested in molecule structure learning.

Essential References Not Discussed

NA

Other Strengths And Weaknesses

One of the key strengths of HGOT is its novel integration of OT with HGNNs, providing an alternative to contrastive learning. However, the complexity of Gromov-Wasserstein optimization raises scalability concerns for large graphs. The paper lacks a discussion on runtime and memory overhead, which are crucial for practical deployment. Furthermore, interpretability could be improved with case studies or attention weight analysis.

Other Comments Or Suggestions

In the caption of figure 3, explanations of different ablation settings could be provided for better readability.

Author Response

Other Strengths And Weaknesses:

Q: However, the complexity of Gromov-Wasserstein optimization raises scalability concerns for large graphs. The paper lacks a discussion on runtime and memory overhead, which are crucial for practical deployment. Furthermore, interpretability could be improved with case studies or attention weight analysis. R: Thanks for your comments. We discuss the complexity issue in Appendix D, and we will move this section into the main text in the final revision of the paper. We will also provide a discussion of runtime and memory overhead, possibly presenting the results as charts in the experiments.

Other Comments Or Suggestions:

Q: In the caption of figure 3, explanations of different ablation settings could be provided for better readability. R: Many thanks. We will provide the detailed settings of the ablation experiments in the caption of figure 3 in the final version.

Questions For Authors:

Q1: In the ablation study (Section 5.5), the parameter sensitivity analysis indicates that node-only alignment outperforms the combined node-edge alignment. Does this suggest that incorporating edge information may introduce noise or redundancy in certain scenarios? Furthermore, would this effect vary depending on the dataset characteristics? R1: Thanks for your comments. It can be inferred from the experimental results that in some cases (when edge information is redundant and complicated) we can appropriately abandon the edge connections and only consider node features; too much edge information may introduce noise and redundancy, and the experimental results verify this to a certain extent. In addition, we only show results on two datasets, but the same behavior is observed on the other two datasets. This will be clarified in the final version.
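An interpretive note on this exchange (my assumption, not a statement from the paper, is that its node/edge weight plays the role of the standard FGW trade-off α): at the feature-only extreme, the fused Gromov-Wasserstein objective collapses to a plain Wasserstein alignment of node features, so the node-only setting is a special case of the full objective rather than a separate mechanism.

```latex
% Feature-only extreme of the standard FGW objective (the structure term vanishes):
FGW_{\alpha}\big|_{\alpha=0}
  = \min_{\Pi \in U(\mu,\nu)} \textstyle\sum_{i,j} \mathcal{C}_{ij}\,\Pi_{ij}
  = W(\mu,\nu)
```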

Q2: A more intuitive explanation of the effectiveness of Equations 13-16 is expected. Could the authors provide further insights or illustrative examples to clarify the reasoning behind their derivations? R2: Thank you. Formula 13 is the fused Gromov-Wasserstein distance (considering both nodes and edges) commonly used in OT. Formula 14 is shorthand for the right-hand side of Formula 13. Formulas 15 and 16 are used to calculate the optimal transport plan Π from the corresponding OT distances. We will provide a more detailed explanation of the formulas in the final version.
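For readers who want a concrete picture of how a transport plan such as Π is typically obtained from a cost matrix, the standard entropic-regularized formulation and Sinkhorn updates are sketched below; whether the paper's Eqs. (15)-(16) use exactly this solver is an assumption.

```latex
% Entropic-regularized OT plan and Sinkhorn updates (standard form, Cuturi 2013);
% notation may differ from the paper.
\Pi^{*} = \arg\min_{\Pi \in U(\mu,\nu)} \langle \mathcal{C}, \Pi \rangle - \varepsilon H(\Pi),
\qquad K = e^{-\mathcal{C}/\varepsilon}
% Iterate to convergence, then recover the plan:
u \leftarrow \mu \oslash (K v), \qquad
v \leftarrow \nu \oslash (K^{\top} u), \qquad
\Pi^{*} = \operatorname{diag}(u)\, K\, \operatorname{diag}(v)
```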

Official Review (Rating: 3)

This paper proposes a novel self-supervised heterogeneous graph neural network (HGOT), which aims to address the limitations of existing contrastive learning methods on heterogeneous graphs by incorporating optimal transport. HGOT avoids data augmentation and the construction of positive and negative sample pairs, and proposes a new matching mechanism that performs optimal transport matching between local (branch view) and global (center view) semantics to improve the quality of node representations. Extensive experiments on four public datasets validate the effectiveness of HGOT.

Questions For Authors

Please refer to the above comments.

Claims And Evidence

Yes.

S1: Applying optimal transport to self-supervised learning on heterogeneous graphs is a unique contribution. This method helps bypass the challenges associated with graph augmentation and the selection of positive and negative samples in traditional contrastive learning.

S2: HGOT optimizes the self-supervised learning strategy of heterogeneous graph neural networks from the perspective of information matching, avoiding the dependence of contrastive learning on data augmentation and positive/negative sample pairs.

Methods And Evaluation Criteria

Yes.

S3: The paper has complete experiments and visually demonstrates the effectiveness of HGOT.

S4: This work integrates OT with heterogeneous graph learning, offering a principled alternative to contrastive SSL.

Theoretical Claims

Yes.

Experimental Designs Or Analyses

Yes.

S5: Achieves SOTA results on node classification (6%+ accuracy improvement) and clustering across four datasets, demonstrating robustness.

W2: The paper does not provide sufficient explanation or analysis for the selection of certain parameters (such as the σ and ρ parameters in optimal transport), particularly lacking a comprehensive discussion on the impact of different parameter values on model performance. As is well known, the training cost of deep learning is very high, and it is not feasible to manually adjust these hyperparameters through repeated experimentation. The paper needs to explain the specific values of these parameters (such as σ and ρ) and provide guidance for their reasonable selection in practical applications.

W3: Given the significant improvement in some metrics in this paper, it is recommended to provide significance analysis results, such as t-tests, in Table 3.

W5: Computational complexity: Solving fused Gromov-Wasserstein distances involves cubic time complexity in node count, limiting scalability to large graphs.

W6: Meta-Path Dependency: Performance may hinge on pre-defined meta-paths, yet the paper does not explore automated meta-path selection or robustness to suboptimal choices.

Supplementary Material

None.

Relation To Broader Scientific Literature

S6: Eliminates cumbersome graph augmentation and sample selection, simplifying self-supervised training pipelines.

Essential References Not Discussed

N.A.

Other Strengths And Weaknesses

Please refer to the above comments.

Other Comments Or Suggestions

W1: The abstract is overly verbose. For example, the implementation details of the method can be omitted. The abstract should clearly and concisely convey what the paper has done and what problem it addresses, rather than focusing too much on the specific details of the method. The author spends almost half of the abstract discussing the method's implementation, which makes the abstract appear redundant and detracts from its clarity and focus.

W4: The symbol definitions in the formulas are quite complex. The authors could add specific definitions for each symbol in the appendix and check for inconsistent definitions (such as σ in Eq. 13 and Eq. 2).

W7: Regarding parameter sensitivity: the experimental results show that the parameter σ (node vs. edge weight) has a significant impact on the results, and the best performance is achieved when the model relies only on node attributes, which may suggest a limitation in processing edge information.

W8: Incomplete experiments: the proposed method is an algorithm for heterogeneous graphs, but only node classification and node clustering experiments are conducted, with no discussion of other graph problems such as link prediction.

Author Response

Experimental Designs Or Analyses:

Q1: The paper does not provide sufficient explanation or analysis for the selection of certain parameters (such as the σ and ρ parameters in optimal transport), particularly lacking a comprehensive discussion on the impact of different parameter values on model performance. As is well known, the training cost of deep learning is very high, and it is not feasible to manually adjust these hyperparameters through repeated experimentation. The paper needs to explain the specific values of these parameters (such as σ and ρ) and provide guidance for their reasonable selection in practical applications. R1: Thank you. In the parameter experiments in Section 5.6, we can see that the model performs better when the parameter ρ is larger, indicating the importance of the implicit structural loss. Similarly, when the parameter σ is larger, the model performs better, indicating that the transport information between nodes is more important in the fused Gromov-Wasserstein distance formula. We will explain the selection of these hyperparameters and discuss the impact of different parameter values on model performance in the final version.

Q2: Given the significant improvement in some metrics in this paper, it is recommended to provide significance analysis results, such as t-tests, in Table 3. R2: Thanks for your comments. We will add a variance analysis to the clustering results in Table 3 to demonstrate the stability of our method in the clustering experiments.

Q3: Computational complexity: Solving fused Gromov-Wasserstein distances involves cubic time complexity in node count, limiting scalability to large graphs. R3: Thanks. Although the fused Gromov-Wasserstein distance requires more computation, it does not take up too much memory or running time. We give a complexity analysis in Appendix D. We employ the OT method instead of traditional contrastive learning to significantly reduce complexity; consequently, compared with other self-supervised learning methods, our approach maintains a slight edge in terms of complexity.
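For context (a standard result not stated in the response; the assumption is a square-loss formulation), the Gromov-Wasserstein coupling term can be evaluated without materializing the four-dimensional cost tensor, following Peyré, Cuturi & Solomon (2016), which is what makes solvers of this kind tractable in practice.

```latex
% For the square loss, the GW coupling term factorizes as
\mathcal{L}(C^{1},C^{2}) \otimes \Pi \;=\; c_{C^{1},C^{2}} \;-\; h_{1}(C^{1})\,\Pi\,h_{2}(C^{2})^{\top}
% so one gradient evaluation costs O(n^2 m + n m^2) instead of the naive O(n^2 m^2),
% where n and m are the node counts of the two views.
```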

Q4: Meta-Path Dependency: Performance may hinge on pre-defined meta-paths, yet the paper does not explore automated meta-path selection or robustness to suboptimal choices. R4: Thank you. Inspired by your suggestion, we will discuss how to automatically select meta-paths or directly decompose the heterogeneous graph into several subgraphs to reduce the dependence on meta-paths.

Other Comments Or Suggestions:

Q1: The abstract is overly verbose. For example, the implementation details of the method can be omitted. The abstract should clearly and concisely convey what the paper has done and what problem it addresses, rather than focusing too much on the specific details of the method. The author spends almost half of the abstract discussing the method's implementation, which makes the abstract appear redundant and detracts from its clarity and focus. R1: Many thanks. We will revise the abstract based on your valuable suggestions, delete unnecessary parts, and make the abstract more concise.

Q2: The symbol definitions in the formulas are quite complex. The authors could add specific definitions for each symbol in the appendix and check for inconsistent definitions (such as σ in Eq. 13 and Eq. 2). R2: Many thanks. Regarding the σ symbol in Formulas 2 and 13 that you mentioned, σ in Formula 2 represents the activation function, while σ in Formula 13 is an adjustable hyperparameter. To avoid confusion, we will replace the activation function's σ with a capital Greek letter. In addition, we will give the detailed meaning of each symbol in the appendix in the final version.

Q3: Regarding parameter sensitivity: the experimental results show that the parameter σ (node vs. edge weight) has a significant impact on the results, and the best performance is achieved when the model relies only on node attributes, which may suggest a limitation in processing edge information. R3: Thank you. Our experiments show that in some cases (when edge information is redundant and complicated) we can appropriately abandon the edge connections and only consider node features, which also reduces the complexity of the model. This will be clarified in the final version.

Q4: Incomplete experiments: the proposed method is an algorithm for heterogeneous graphs, but only node classification and node clustering experiments are conducted, with no discussion of other graph problems such as link prediction. R4: Thanks. Inspired by your advice, we will study this issue more deeply and discuss it in the final version.

Official Review (Rating: 4)

This paper proposes a novel self-supervised learning framework for heterogeneous graphs that leverages optimal transport theory to align meta-path views with an aggregated central view, eliminating the need for graph augmentation or explicit positive/negative sampling. The method achieves state-of-the-art performance on node classification, clustering, and visualization tasks across four real-world datasets.

Questions For Authors

Please refer to above sections.

Claims And Evidence

The claims made in the submission are well-supported by clear and convincing evidence. The authors propose a novel self-supervised heterogeneous graph neural network with optimal transport (HGOT) and provide detailed algorithm design.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria are appropriate and well-justified for the problem at hand.

Theoretical Claims

The paper primarily focuses on methodological innovation and empirical validation rather than formal theoretical proofs. The authors provide detailed formulations of the optimal transport-based alignment mechanism and its integration into heterogeneous graph learning.

A concern is why the fused Gromov-Wasserstein distance is the optimal choice for heterogeneous graphs? A theoretical or empirical comparison with alternative OT metrics (e.g., Wasserstein barycenters) is missing.

Experimental Designs Or Analyses

I have checked the soundness of the experimental designs and analyses, and they appear appropriate and valid. The authors evaluate HGOT on four widely used heterogeneous graph datasets (DBLP, ACM, IMDB, Yelp), covering diverse application domains. They compare against a comprehensive set of baselines, including both supervised and self-supervised methods, which ensures fair and relevant comparisons.

My concern about this part is that the parameter sensitivity analysis (Section 5.6) lacks depth. For example, the claim that "abandoning edge information improves performance" (p. 24) conflicts with the fundamental role of edges in graph learning. This requires further exploration (e.g., edge sparsity analysis).

Supplementary Material

N/A

Relation To Broader Scientific Literature

The paper builds on and extends prior work in self-supervised heterogeneous graph learning and optimal transport. It addresses key limitations of existing contrastive self-supervised methods (e.g., HeCo, HDGI) that rely on graph augmentations and positive/negative sample selection, by introducing an optimal transport-based alignment mechanism.

Essential References Not Discussed

Most of the key references are cited.

Other Strengths And Weaknesses

The method section (Section 4) lacks intuitive explanations. Visualizations of the OT alignment process would aid understanding.

Other Comments Or Suggestions

No

Author Response

Theoretical Claims:

Q: A concern is why the fused Gromov-Wasserstein distance is the optimal choice for heterogeneous graphs? A theoretical or empirical comparison with alternative OT metrics (e.g., Wasserstein barycenters) is missing. R: Thanks for your comments. The fused Gromov-Wasserstein distance considers both nodes and edges, so the information in the heterogeneous graph is covered more comprehensively. Therefore, compared with other OT distances, the fused Gromov-Wasserstein distance is the best choice for optimal transport on heterogeneous graphs. We will present an experimental analysis with other OT distance metrics in the final version.

Experimental Designs Or Analyses:

Q: My concern about this part is that the parameter sensitivity analysis (Section 5.6) lacks depth. For example, the claim that "abandoning edge information improves performance" (p. 24) conflicts with the fundamental role of edges in graph learning. This requires further exploration (e.g., edge sparsity analysis). R: Thank you. From the results of this experiment, it can be inferred that although the edge information of the graph is important, in some cases (redundant and complex edges) we may abandon some edge connections and only consider the node feature information, which may lead to better results. We will discuss this issue further in the final version.

Other Strengths And Weaknesses:

Q: The method section (Section 4) lacks intuitive explanations. Visualizations of the OT alignment process would aid understanding. R: Many thanks. We will try to provide a visualization of the optimal transport to help readers understand the transporting and matching process in the final version.

Final Decision

This paper proposes a heterogeneous graph neural network with optimal transport (HGOT), a self-supervised learning model without graph augmentation. It employs optimal transport theory to align meta-path views with an aggregated central view and avoids the construction of positive and negative sample pairs. Experimental evaluations show strong performance, though the evaluation could be strengthened.