5.8

/10

Poster4 位审稿人

最低5最高7标准差0.8

3.3

置信度

正确性2.8

贡献度3.0

表达2.3

NeurIPS 2024

Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective

Yujie Mo,Zhihe Lu,Runpeng Yu,Xiaofeng Zhu,Xinchao Wang

OpenReview PDF

提交: 2024-05-04更新: 2025-01-11

TL;DR

We theoretically revisited existing self-supervised heterogeneous graph learning from the spectral clustering perspective and introduced a novel framework to alleviate issues in existing works.

摘要

关键词

Heterogeneous graphSpectral clusteringSelf-supervised learning

评审与讨论

审稿意见

评分: 5置信度: 22024-07-01

The paper proposes a new framework that revisits SHGL from a spectral clustering perspective, incorporating rank and dual consistency constraints. This approach uses a rank-constrained spectral clustering method to refine the affinity matrix and remove noise, while also integrating node-level and cluster-level consistency constraints to better capture and use invariant and clustering information. Theoretical analysis and experimental results show that this new method significantly improves performance in downstream tasks compared to existing methods.

优点

This paper provides a theoretical investigation of previous SHGL methods from the perspective of spectral clustering.
This paper proposes a new framework to capture the cluster-level graph invariant representations.
Experiments demonstrate the effectiveness of the proposed methods.

缺点

Unfortunately, this paper is not easy to follow due to the missing parts in terms of motivation and logic.

In the introduction part, the authors mention three challenges abruptly from Lines 44 to 48. However, it is unclear why they are challenging to address. For example, why it is difficult to understand previous SHGL methods from the clustering perspective? It is also unclear the logical relationships between the three challenges.
In Section 2, the authors introduce some notations about the heterogeneous graph. However, it is unclear what is the objective of the problem.
In Section 2.1, it is unclear why authors need to bridge the gap between SHGL methods and graph-cut algorithm, and what is its benefits.
In Section 2.2, it is unclear why the rank constraint can mitigate noisy connections.
When reading Section 2.2, the reviewer gets lost in the lemma and equation details. It is better if the authors can highlight the contributions and leave all technical derivations to the appendix.
The logical relationship between Sections 2.2 and 2.3 is unclear. Why both the low-rank constraints and dual consistency constraints should be imposed, and what is the relationship between these two constraints?
In the experiments, it is unclear why two homogenous graph datasets were used for performance evaluation, given the objective of this paper is for heterogeneous graph learning.

问题

Please find questions in the weaknesses above.

局限性

This paper does not mention the limitations of this work or the potential negative societal impacts.

作者回复

2024-08-07

Thanks for the positive comments on the novelty, theoretical analysis, and experimental results of our method. We are so encouraged and will try our best to address the concerns one by one.

Q1. Unclear why three challenges are challenging to address and the logical relationships between them.

A1. Since previous SHGL methods are close to clustering techniques, as stated in lines 30-33, and suffer from two issues (i.e., noisy connections and neglect of cluster-level information), as stated in lines 34-41, we have challenges as follows.

(i) To formally understand previous methods from clustering perspective. This is challenging because no existing literature formally connects self-supervised graph learning with spectral clustering. This requires us to derive the whole process from scratch.

(ii) To learn an adaptive graph structure to avoid inter-class noise. The challenge is hard to solve because although some graph structure learning works [1-2] try to optimize graph structures, they may have noisy connections due to the lack of theoretical guarantees.

(iii) To effectively incorporate cluster-level information to boost downstream tasks. This is difficult as existing works often divide nodes within the same class into different clusters, resulting in sub-optimal cluster-level information that cannot be used in an effective way.

Relationships: Challenge (i) is the theoretical basis of the whole paper and points out the direction for solving the following two challenges. The latter two challenges build on each other under the theory framework of challenge (i). Specifically, challenge (ii) mitigates noise and outputs high-quality cluster-level information for challenge (iii), while challenge (iii) aligns intra-class representations, thus aiding challenge (ii).

[1] Towards Unsupervised Deep Graph Structure Learning. WWW 2022.

[2] Heterogeneous Graph Structure Learning for Graph Neural Networks. AAAI 2021.

Q2. In Section 2, unclear the objective of the problem.

A2. The goal is to learn a model receiving diverse types of nodes and edges in the heterogeneous graph as input, and output low-dimensional representations for downstream tasks. We will add it in the final version.

Q3. In Section 2.1, unclear why to bridge the gap between SHGL methods and graph-cut algorithm and what is its benefits.

A3. Theorem 2.3 bridges the gap between them to further point out the issues in previous SHGL methods. That is, previous SHGL methods cut the learned representations into $d$ clusters, which is much larger than the number of classes $c$ . Therefore, nodes within the same class may be divided into different clusters, thus the learned representations cannot be clustered well.

Benefits: Theorem 2.3 gives the theoretical motivation of Section 2.2 and provides a direction for improving previous SHGL methods, i.e., adding low-rank constraints to divide the learned representation into $c$ clusters instead of $d$ in previous methods.

Q4. Unclear why the rank constraint can mitigate noisy connections.

A4. The ideal noise-free affinity matrix is to only contain connections within the same class and have no noisy connections among different classes. That is, the ideal affinity matrix contains exactly $c$ (i.e., the number of classes) connected components. Based on Lemma 2.4, if the rank of $\mathbf{L}_\mathbf{S}$ equals $n-c$ , then the affinity matrix $\mathbf{S}$ contains exactly $c$ connected components to achieve the ideal case and mitigate noisy connections.

Q5. In Section 2.2, the reviewer gets lost in lemma and equation details. It is better to highlight the contributions and leave all technical derivations to the appendix.

A5. Section 2.2 is designed to mitigate noise with the rank constraint, and we highlight it as follows. First, we propose to learn an adaptive graph structure via Eq. (5). After that, Lemma 2.4 explains how we add the rank constraint to Eq. (5) to avoid noise. Then we rewrite the rank constraint to spectral clustering by Eq. (6)-Eq. (8) as the rank constraint is not easy to tackle. Finally, we solve Eq. (8) by alternating optimization, i.e., fix eigenvectors $\mathbf{F}$ to obtain the closed-form solution of the affinity matrix $\mathbf{S}$ by Eq. (9)-Eq. (10), while fixing the affinity matrix $\mathbf{S}$ to obtain the optimized eigenvectors $\mathbf{F}$ . To avoid the high time complexity of eignedecomposition, we replace it with Eq. (11)-Eq. (13) to efficiently approximate $\mathbf{F}$ with $\mathbf{Y}$ . We will leave most derivations in the appendix in the final version.

Q6. The logical relationship between Sections 2.2 and 2.3 is unclear. Why impose both low-rank and dual consistency constraints, and what is the relationship between them?

A6. Without the low-rank constraint, our framework would be affected by noisy connections. Without the dual consistency constraint, our framework would fail to consider the cluster-level information, thereby weakening downstream tasks. Hence, it is crucial for our method to impose both constraints.

Relationships: The rank constraint in Section 2.2 and the dual consistency constraint in Section 2.3 actually benefit from each other. Specifically, the rank-constrained spectral clustering provides high-quality cluster-level information for the dual consistency constraint. Moreover, the dual consistency constraint encourages the representations within the same class to align with each other to help the rank-constrained spectral clustering.

Q7. Unclear why homogeneous graph datasets were used for evaluation.

A7. We use both heterogeneous and homogeneous datasets to verify the effectiveness of our method on different types of graphs.

Q8. There is no discussion of limitations or potential negative societal impacts.

A8. We discussed limitations and potential negative societal impacts in Appendix F, and we will move them to the main paper in the final version.

评论- Thank you for the rebuttal

2024-08-08

Thank you for the detailed rebuttal, and I increase the score to 5.

审稿意见

评分: 5置信度: 32024-07-13

This work deals with the problem of self-supervised representation learning in heterogeneous graphs. First, a spectral-clustering based objective is presented to unify the objectives of existing methods. Second, a novel self-supervised method is proposed that tries to capture both the cluster information and node-level information in the learnt representation. Experiments show the effectiveness of the proposed method over existing methods.

优点

The proposed method is novel and follows a principled design.
Presented theoretical analysis is helpful in deeper understanding of existing and proposed method.
Experiemental section is comprehensive with adequate baselines, datasets, and ablations.

缺点

Presented theorems are often unclear or imprecise. For example

(a) In Theorem 2.6, authors use the term model complexity abruptly without any definition or reference. I could not understand the relation between the infimum and supremum of model complexity with generalization ability (which also needs to be rigorously defined).

(b) In Theorem 2.2, what exactly are the previous meta-path-based and adaptive-graph-based SHGL method mentioned in the theorem statement? What is the expression for the regularization term for the corresponding previous methods?

(c) There does not seem to be a clear delineation between what already exists in the literature and what is the new contribution of the paper. E.g., does Theorem 2.3 directly follow from existing results in the literature, or does this require custom analysis?

(d) Theorem 2.5 is not precise enough. When referring to "proposed method", it would be more precise to refer to specific objective function or algorithm block. Also "equivalent to performing spectral clustering based on the affinity matrix ....". Authors are recommended to replace the language description with precise objective function.
The writing is often unclear and could be greatly improved. Several terminologies (which might not be standard in the literature) are never explained/defined. For example:

(a) Section 1 Line 42 ".. it is feasible to analyse ..", it is not clear what exactly is feasible to analyse (although after Section 2.1, it can be understood).

(b) In abstract and introduction, it is not clear what is invariant information is

(c) Sentence in line 216-218 is unclear.

(d) How does H^T H = I imply statistical independence?

问题

Please refer to the questions and comments in Weaknesses section.

局限性

Limitations and impacts are adequately discussed in the Appendix.

作者回复

2024-08-07

Thanks for the positive comments on the novelty, theoretical analysis, and experimental results of our method. We are so encouraged and will try our best to address the concerns one by one.

Q1. Presented theorems are often unclear or imprecise. For example,

Q1-a. In Theorem 2.6, no definition or reference of the model complexity. Unclear relation between the infimum and supremum of model complexity with generalization ability.

A1-a. Due to the space limitations, in the submission, we followed previous works [1-2] and gave Definition C.5 of the model complexity in lines 686-691 in Appendix C.4. In the rebuttal, we further followed previous works [1-2] to define the generalization error $E$ of a model based on the model complexity, i.e.,

Definition 1. For any $\delta \in [0, 1]$ , with probability at least $1-\delta$ , the generalization error $E$ of a model follows the inequality, i.e., $E \le \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_{i})+\sqrt{\frac{C}{n}}+\mathcal{O}(\sqrt{\frac{\log (1 / \delta)}{n}}),$ where $(x_i, y_i)$ is a pair of labeled data, $f$ is the model, $l$ is the loss function, $n$ is the number of labeled data, $C$ is the model complexity measure. Based on Theorem 2.6 and the Definition above, we can obtain that the proposed method obtains a lower model complexity boundary, thus achieving a lower generalization error and higher generalization ability boundary than previous SHGL methods. We will move Definition C.5 and add the above Definition 1 to the main paper in the final version.

[1] Representation Based Complexity Measures for Predicting Generalization in Deep Learning. NeurIPS 2020.

[2] Methods and Analysis of The First Competition in Predicting Generalization of Deep Learning. NeurIPS 2020.

Q1-b. In Theorem 2.2, what exactly are the meta-path-based and adaptive-graph-based SHGL methods, and their regularization terms?

A1-b. In Theorem 2.2, the meta-path-based SHGL methods indicate the works that utilize meta-paths to build edges among nodes. Almost all SHGL comparison methods in the paper belong to this group, such as HGMAE, CPIM, HGCML, HeCo, HDMI, etc. We gave their regularization term in Appendix C.1 in Eq. (28). The adaptive-graph-based SHGL methods indicate the works that utilize the adaptive graph structure to build edges among nodes. Only the latest comparison method (i.e., HERO) belongs to this group. We gave the regularization term in Appendix C.1 in Eq. (35). We will clarify two group's methods and regularization terms in the main paper in the final version.

Q1-c. Clarify new contributions of the paper. E.g., does Theorem 2.3 require custom analysis?

A1-c. Theorem 2.3 connects the traditional graph-cut algorithm with existing SHGL methods, which has never been done in the existing literature and requires custom analysis.

The new contributions of the paper can be summarized as:

Motivation: We first attempt to revisit existing SHGL methods from the spectral clustering perspective in a unified manner.
Theory: This paper builds theoretically connections between previous SHGL methods and the spectral clustering as well as the graph-cut algorithm, which have never been analyzed in existing literature.
Method: This paper proposes for the first time to adaptively learn a rank-constrained affinity matrix and introduce a dual consistency constraint to alleviate the issues in previous methods.

Q1-d. Theorem 2.5 is not precise enough.

A1-d. Thanks for your suggestion, we will reformulate Theorem 2.5 in the final version as follows,

Theorem 2.5. Optimizing the spectral loss $L_{sp}$ leads to performing the spectral clustering based on the affinity matrix $\mathbf{S}$ and conducting RatioCut ( $V_1,\ldots, V_c$ ) algorithm to divide the learned representations into $c$ partitions, i.e.,

\min L_{sp} \Rightarrow \min \operatorname{Tr} (\mathbf{Y}^T\mathbf{L}_S\mathbf{Y}) \Rightarrow \min \operatorname{RatioCut}(V_1,\ldots,V_c),

where $\mathbf{Y}$ is the cluster assignment matrix, $\mathbf{L}_S$ is the Laplacian matrix of the affinity matrix $\mathbf{S}$ , $\operatorname{Tr}(\cdot)$ indicates the matrix trace, and $c$ is the number of classes.

Q2. The writing is often unclear and could be greatly improved.

(a). Section 1 Line 42, it is not clear what exactly is feasible to analyse (although after Section 2.1, it can be understood).

(b). In the abstract and introduction, it is not clear what the invariant information is.

(c). Sentence in line 216-218 is unclear.

(d). How does $\mathbf{H}^T\mathbf{H}= \mathbf{I}$ imply statistical independence?

A2. (a) We will modify it as "it is feasible to analyze previous SHGL methods from a clustering perspective thanks to their close connection to clustering techniques ...". Actually, we have mentioned this connection in lines 24-33.

(b) Although node representations collect information from nodes within the same type, and heterogeneous representations collect information from nodes of different types, they both contain the same feature of the node itself, which we call the invariant information.

(c) We will modify it as "In addition, according to Theorem 2.2, previous SHGL methods actually perform spectral clustering to learn node representations. However, previous SHGL methods fail to utilize the cluster-level information outputted by the spectral clustering, thus weakening the downstream task performance."

(d) The independence refers to the independence among different representation dimensions. Specifically, $\mathbf{H}^T\mathbf{H} \in \mathbb{R}^{d_1 \times d_1}$ calculates the correlation among different dimensions of $\mathbf{H}$ , where $d_1$ is the dimension of $\mathbf{H}$ . After that, the constraint $\mathbf{H}^T\mathbf{H}= \mathbf{I}$ enforces the off-diagonal elements of $\mathbf{H}^T\mathbf{H}$ to 0, thus minimizing the correlation among different dimensions of $\mathbf{H}$ to achieve the statistical independence.

2024-08-11

Thank you for the response.

I encourage the authors to revise the manuscript with discussed changes to improve the presentation and theoretical rigor.

Regarding $H^T H = I$ , note that uncorrelation does not imply statistical independence. Uncorrelation only prevents linear dependence but cannot prevent some nonlinear dependence. Consider using "uncorrelated" instead of "statistically independent".

Based on the overall quality of the paper, I maintain my score.

2024-08-11

Thank you for your valuable feedback and positive comments on our paper. According to your suggestions, we will replace "statistically independent" with "uncorrelated" to avoid confusion. Additionally, all discussed changes will be incorporated into the final version.

审稿意见

评分: 6置信度: 32024-07-13

This paper proposes a theory-backed method for Self-Supervised Heterogeneous Graph Learning (SHGL) based on spectral clustering and incorporates rank constraint and node/cluster consistency regularizers to generate better embeddings. In specific, the authors start by showing that existing algorithms divide the representations into a certain number of clusters that are much larger than the number of real classes. Then, an objective function with rank constraint is proposed to reduce the noise of message-passing and consistency terms are employed to improve downstream task performance. Experiments on the real datasets show that the proposed method outperforms other baselines.

优点

The paper analyzes the issue of the exiting methods that are based on meta-path-based graph or the adaptive graph structure theoretically. Furthermore, a rank and consistency constrained approach is proposed to tackle know issues of SHGL and it is shown to have less complexity theoretically.
Experimental results show the superiority of the proposed method consistently over a variety of datasets.

缺点

Notations are not always clearly defined. For example, what are the dimensions of the mapping $g_{\phi}$ and $p_{\phi}$ ?
The derivation of the first loss term is not clear to me. Why is there an entropy constraint term in (13)? How do we translate Eq. (8) to Eq. (13) and what is the correspondence of the terms? Line 183 states that ‘fitting eigenvectors F by (13)’, but it seems that F does not appear (13). Or is it that (13) is only for solving the last term in (8)? Clarity can be further improved upon the objective function, especially the term $\cal{L}_{sp}$ .
Why is $Y$ a probability matrix? Is there any guarantee on that? Only orthogonality is proved for the matrix. Do we need a simplex constraint on the columns of $Y$ ?

问题

In Line 172, whose independence are we referring to? Why $H^T H=I$ is equivalent to statistical independence?
In practice, is it always the case that number of clusters equal to $c$ will give the optimal results?
How do we reach the inequality in (53) and (54)? Does the term in (55) correspond to the loss function in (17)?

局限性

The authors are encouraged to discuss the limitations in the main paper.

作者回复

2024-08-07

Thanks for the positive comments on our theoretical and experimental results. We are so encouraged and will try our best to address the concerns one by one.

Q1. Some undefined notations, e.g., the dimensions of $g_\phi$ and $p_\varphi$ .

A1. We employ $g_{\phi}\in\mathbb{R}^{f\times d_1}$ and $f_\theta\in\mathbb{R}^{f\times d_1}$ to derive semantic and heterogeneous representations, where $f$ and $d_1$ are the dimensions of node features and representations, respectively. We then employ projection heads $p_{\varphi}\in\mathbb{R}^{d_1\times c}$ and $q_\gamma\in\mathbb{R}^{d_1\times d_2}$ to map representations, where $c$ and $d_2$ are mapped dimensions. We summarized the settings of the above notations in Table 3 in the uploaded PDF file.

Q2. Unclear derivation of the first loss. Why an entropy constraint in (13)? How do we translate (8) to (13) and what is the correspondence terms? Or is (13) only for solving the last term in (8)? Clarity can be further improved upon $\mathcal{L}_{sp}$ .

A2. The second term (i.e., the entropy constraint) in Eq. (13) is used to maximize the entropy of $\mathbf{Y}$ (the entropy is maximum when nodes are evenly assigned in different clusters), thereby preventing most nodes from being assigned to the same cluster.

The first term in Eq. (13) is only for solving the last term in Eq. (8). Specifically, as stated in lines 153-157, Eq. (8) is solved by the alternating optimization, i.e., fix $\mathbf{S}$ then optimize $\mathbf{F}$ , and fix $\mathbf{F}$ then optimize $\mathbf{S}$ . When fixing $\mathbf{F}$ , Eq. (8) only remains the first two terms to optimize $\mathbf{S}$ (the last term in Eq. (8) is fixed), thus we can obtain the closed-form solution of $\mathbf{S}$ with Eq. (10). When fixing $\mathbf{S}$ , Eq. (8) only remains the last term to optimize $\mathbf{F}$ (the first two terms in Eq. (8) are fixed), which requires cubic time complexity due to the eigendecomposition. Hence, we replace it with neural networks (i.e., $p_\varphi$ and orthogonal layer) and the spectral loss to reduce the time complexity.

Therefore, $\mathcal{L}_{sp}$ in Eq. (13) consists of two parts, i.e., the first term optimizes $\mathbf{Y}$ to approximate $\mathbf{F}$ to solve the last term of Eq. (8), and the second term assigns nodes evenly into different clusters.

Q3. Why is $\mathbf{Y}$ a probability matrix? Any guarantee on that? Do we need a simplex constraint on $\mathbf{Y}$ ?

A3. In the submission, we called $\mathbf{Y}$ as the probability matrix, because $\mathbf{Y}$ indicates the prediction to assign nodes to different clusters, and the larger the value $\mathbf{y}_{ij}$ , the greater the probability of node $v_i$ to be assigned to the $j$ -th cluster. There is no other constraint (e.g., the simplex constraint) on $\mathbf{Y}$ except the orthogonality and spectral loss. Indeed, $\mathbf{Y}$ is not a typical probability matrix, and we will rename it as the cluster assignment matrix in the final version to avoid confusion.

In the rebuttal, we further conducted experiments to add the simplex constraint on $\mathbf{Y}$ , and reported the results in Table 4 in the uploaded PDF file. In Table 4, the method with the simplex constraint obtains inferior results than the method without. This is reasonable because adding a simplex constraint to $\mathbf{Y}$ may hinder its approximation to $\mathbf{F}$ , affecting the quality of the affinity matrix.

Q4. In Line 172, whose independence is referring to? Why $\mathbf{H}^T\mathbf{H}=\mathbf{I}$ is equals to statistical independence?

A4. The independence refers to the independence among different representation dimensions. Specifically, $\mathbf{H}^T\mathbf{H}\in \mathbb{R}^{d_1\times d_1}$ calculates the correlation among different dimensions of $\mathbf{H}$ , where $d_1$ is the dimension of $\mathbf{H}$ . After that, the constraint $\mathbf{H}^T\mathbf{H}= \mathbf{I}$ enforces the off-diagonal elements of $\mathbf{H}^T\mathbf{H}$ to 0, thus minimizing the correlation among different dimensions of $\mathbf{H}$ to achieve the statistical independence.

Q5. Is $c$ clusters always yield optimal results?

A5. To verify it, we changed the number of clusters and reported the results in Figure 1 in the uploaded PDF file. Obviously, our method obtains the best results when the number of clusters equals $c$ and decreases as the number of clusters increases. This is reasonable because if the number of clusters is larger than $c$ , the nodes within the same class may be divided into different clusters. Moreover, if the number of clusters is smaller than $c$ , the nodes from different classes may be divided into the same cluster.

Q6. How do we reach the inequality in (53) and (54)? Does the term in (55) correspond to the loss function in (17)?

A6. We can reach (53) and (54) by proving the following inequality: $a^2b+(1-a)^2c\ge\frac{bc}{b+c},$ where $0\le a\le1$ , and $0\le b,c$ . To prove it, we construct $f(a) = a^2b+(1-a)^2c-\frac{bc}{b+c}$ . We then take the derivative of $f(a)$ , i.e., $f'(a)=2ab-2c+2ac.$ Then we let $f'(a)=0$ , and have $a=\frac{c}{b+c}$ . We have $f(\frac{c}{b+c})=\frac{c^2b}{(b+c)^2}+\frac{b^2c}{(b+c)^2}-\frac{bc}{b+c}=0.$ In addition, we take the second-order derivative of $f(a)$ and obtain $f''(a)=2(b+c)\ge0.$

Therefore, $f(a)$ is decreasing when $a<\frac{c}{b+c}$ and increasing when $a>\frac{c}{b+c}$ , and reaches its minimum 0 at $\frac{c}{b+c}$ . Hence, $f(a)\ge0$ always holds for $0\le a\le1$ , thus proving the above inequality. We can reach the inequality in (53) by replacing $a$ , $b$ , and $c$ with $P_0$ , $\sigma_0^2$ , and $\sigma_j^2$ , respectively. Similarly, we can also reach the inequality in (54). In addition, the term in (55) corresponds to the loss function in Eq. (17).

Q7. Discuss limitations in the main paper.

A7. Thanks for the suggestion. We discussed limitations in Appendix F, and we will move them to the main paper in the final version.

2024-08-12

Thank you for the response and clarification. I would suggest summarizing the notations in the Appendix and please add the details for deriving Eq. (53) and (54) as well. Besides, it looks like $H^{\top} H = I$ only implies uncorrelatedness instead of statistical independence. Please consider rephrasing the sentences. A follow-up question: my question on the loss term derivation was how do we go from (8) to (13)? If (13) is only for solving the last term in (8) (i.e., for solving Y with fixed S), then why does the final objective function (18) only contains ${\cal{L}}_{sp}$ but not the whole Eq. (8)? If this loss is for solving (8), how is the entropy constraint term derived from (8)?

2024-08-12

Thank you for your constructive feedback. According to your suggestions, we will summarize the notations and details for deriving Eq. (53) and Eq. (54) in the Appendix. Moreover, we will replace "statistically independent" with "uncorrelated" to avoid confusion.

For the follow-up question, Eq. (13) contains two terms, the first term in Eq. (13) is derived from and for solving the last term of Eq. (8) by fixing $\mathbf{S}$ and approximating $\mathbf{F}$ with $\mathbf{Y}$ . In contrast, the second term (i.e., the entropy constraint) in Eq. (13) is not derived from Eq. (8), and it is a widely used regularization term in existing literature [1-4] to help optimize $\mathbf{Y}$ by preventing most nodes from being assigned to the same cluster. We will add more clarification about Eq. (13) in the final version.

The final objective function Eq. (18) only contains Eq. (13) (i.e., $L_{sp}$ ) but not the whole Eq. (8) because, in the final objective function, we only need to approximate one variable (i.e., $\mathbf{F}$ ) in the last term of Eq. (8) with $L_{sp}$ . In contrast, another variable (i.e., $\mathbf{S}$ ) in the first two terms of Eq. (8) is already solved by its optimal solution in Eq. (10) and does not require gradient back-propagation to optimize. That is, the first two terms of Eq. (8) do not have to appear in the final objective function. Please let us know if you have any other concerns. Thanks!

[1] Sparse Subspace Clustering with Entropy-Norm. ICML 2020.

[2] Deep semantic clustering by partition confidence maximisation. CVPR 2020.

[3] Contrastive clustering. AAAI 2021.

[4] Entropy regularization for unsupervised clustering with adaptive neighbors. Pattern Recognition 2022.

2024-08-13

Thank you for the further clarification and addressing my concerns. I will increase the score accordingly.

审稿意见

评分: 7置信度: 52024-07-18

Overall, this paper makes the first attempt to theoretically revisit previous SHGL methods from the spectral clustering perspective in a unified manner. Specifically, this paper revisits SHGL from the spectral clustering and introducing a novel framework enhanced by rank and dual consistency constraints. Specifically, the proposed framework incorporates a rank-constrained spectral clustering method that refines the affinity matrix to exclude noise effectively. Additionally, the proposed method integrates node-level and cluster-level consistency constraints that concurrently capture invariant and clustering information to facilitate learning in downstream tasks.

优点

I really appreciate the idea that revisiting previous SHGL methos from the perspective of spectral clustering, which is interesting and may inspire some researchers in the graph learning as well as the self-supervised learning communities.
Theoretical analysis verifies the effectiveness of the proposed method, which divides the learned representations into distinct partitions based on the number of classes. Furthermore, the proposed method exhibits enhanced generalization ability than previous SHGL methods.
Extensive experiments on both heterogeneous and homogeneous graphs demonstrate the effectiveness of the proposed method. Visualization and case studies further verify the claims in this paper.
The proposed rank-constrained spectral clustering is novel and interesting, and it seems like can also be used to the graph structure learning in the homogeneous graph.

缺点

In the Introduction, the authors claim that previous methods conduct message-passing relying on meta-path-based graphs and adaptive graph structures, which inevitably include noise. Are there any real examples to better illustrate the noise in meta-path-path based graph as well as the adaptive graph structures?
In the dual consistency constraints, the proposed method designs the node-level consistency constraint to capture the invariant information between the node representations and the heterogeneous representations. How about replacing the node-level consistency constraint with other common loss, such as InfoNCE [1]? [1] Aaron van den Oord et al., Representation learning with contrastive predictive coding.
The authors design the rank-constrained spectral clustering to learn the affinity matrix for nodes belonging to the same node type. Although the visualization verifies the effectiveness of the affinity matrix, it would be better for the authors to add some ablations studies to further verify it, such as replacing the affinity matrix with a self-attention mechanism or simple cosine similarity.
What are the specific processes for different downstream tasks (e.g., node classification and node clustering)? It seems like training the proposed method first, and then freezing parameters of the model and applying outputted representations for downstream tasks.
The proposed method is designed for the heterogeneous graph while it can be implemented on both the homogeneous graph datasets and heterogeneous graph datasets. Is it just to replace the heterogeneous graph encoder with the graph convolutional layer? How to generate two different views for the dual consistency constraints?
It would be better to add some recent works about the self-supervised heterogeneous graph learning in the related work, especially those published in the past two years.
The paper needs further proofreading. For example,

In Eq. (17) I know $q_i$ is the i-th projected representation. What’s the $\hat{\mathbf{q}}_{\mathbf{y}_i}$ actually means?
The definitions of some symbols need to be further determined, such as \mathbf{I} in Definition 2.1. I guess it might represent the identity matrix.
Some mistakes in Table 3, the Freebase dataset should be removed from the Table.
The caption of Table 7 should be fixed.

问题

see weakness.

局限性

Yes, the authors have discussed the limitations and potential negative societal impact.

作者回复

2024-08-07

Thanks for the positive comments on the novelty, theoretical analysis, and experimental results of our method. We are so encouraged and will try our best to address the concerns one by one.

Q1. Any real examples to illustrate the noise in meta-path-based graphs and adaptive graph structures?

A1. Yes. Actually, the noise in meta-path-based graphs and adaptive graph structures indicates the edges connected nodes from different classes. Take the academic heterogeneous graph with several node types (i.e., paper, author, and subject) as an example. For the meta-path-based graph, if an author wrote two papers belonging to different classes (e.g., "data mining" and "computer vision"), then there will be a meta-path "paper-author-paper" to connect these two papers. For the adaptive graph structure, if two papers belonging to "data mining" and "computer vision" share part of keywords, there also may be an edge to connect them. As a result, such noisy edges in meta-path-based graphs and adaptive graph structures connect two paper nodes from different classes and introduce confusion into node representations after the message-passing.

Q2. How about replacing the node-level consistency constraint with other loss, e.g., InfoNCE?

A2. To verify the effectiveness of the node-level consistency constraint, we conducted experiments to replace the proposed node-level consistency constraint with the InfoNCE loss and reported the results in Table 1 in the uploaded PDF file. From Table 1, we can find that the variant method with InfoNCE loss obtains a similar performance to the proposed method. However, the InfoNCE loss generally requires the time complexity of $\mathcal{O}(n^2)$ , where $n$ is the number of nodes. This may introduce large computation costs during the training process. In contrast, the proposed method simply designs the node-level consistency constraint in Eq. (15) to capture the invariant information with the time complexity of $\mathcal{O}(nd^2)$ , where $d$ is the representation dimension and generally $d^2<n$ . We will add the experimental results and analysis in the final version.

Q3. It would be better to add some ablations studies to verify the effectiveness of the affinity matrix, e.g., replacing it with the self-attention mechanism or simple cosine similarity.

A3. In the submission, we conducted experiments to replace the affinity matrix with the self-attention mechanism and reported the results in Table 6 in Appendix E. To further verify the effectiveness of the rank-constrained affinity matrix, we further investigated to replace the affinity matrix with the simple cosine similarity and reported the results in Table 2 in the uploaded PDF file.

Obviously, the proposed method with the affinity matrix obtains superior performance than the cosine similarity and the self-attention mechanism on all datasets. The reason can be attributed to the fact that the affinity matrix in the proposed method is constrained to contain exactly $c$ components to mitigate noisy connections from different classes. In contrast, although either the cosine similarity or self-attention mechanisms may assign small weights for node pairs from different classes, it inevitably introduces noise during the message-passing process to affect the quality of node representations. As a result, the effectiveness of the rank-constrained affinity matrix is verified. We will add the experimental results and analysis in the final version.

Q4. What are the specific processes for downstream tasks (e.g., classification and clustering)?

A4. We follow the evaluation in previous works [1-2] to conduct node classification and node clustering as semi-supervised and unsupervised downstream tasks, respectively. Specifically, we first pre-train models with unlabeled data in a self-supervised manner and output learned node representations. After that, the resulting representations can be used for different downstream tasks. For the node classification task, we train a simple logistic regression classifier with a fixed iteration number and evaluate the effectiveness of all methods with Micro-F1 and Macro-F1 scores. For the node clustering task, we conduct clustering and split the learned representations into c clusters with the K-means algorithm, then calculate the normalized mutual information (NMI) and average rand index (ARI) to evaluate the performance of node clustering. We will add the details about downstream tasks in the final version.

[1] HDMI: High-order Deep Multiplex Infomax. WWW 2021.

[2] Multi-view Contrastive Graph Clustering. NeurIPS 2021.

Q5. Is it just to replace the heterogeneous graph encoder with the graph convolutional layer to implement on homogeneous graph datasets? How to generate two different views for the dual consistency constraints?

A5. Yes, we replace the heterogeneous graph encoder with the graph convolutional layer to implement the proposed method on homogeneous graph datasets. In addition, we follow previous work [3] to generate two different views of the homogeneous graph by removing edges and masking features.

[3] Graph Contrastive Learning with Adaptive Augmentation. WWW 2021.

Q6. It would be better to add some recent works about the SHGL in the related work.

A6. Thanks for your suggestion, we will add more recent related works in the final version.

Q7. The paper needs further proofreading. For example,

1) In Eq. (17), what does the $\hat{\mathbf{q}}_{\mathbf{y}_i}$ actually mean?

2) The definitions of some symbols need to be further determined, such as $\mathbf{I}$ in Definition 2.1.

3) Some mistakes in Table 3.

4) The caption of Table 7 should be fixed.

A7. 1) $\hat{\mathbf{q}}_{\mathbf{y}_i}$ indicates the cluster representation whose label equals to $\mathbf{y}_i$ .

2) $\mathbf{I}$ indicates the identity matrix.

3)-4) We will fix these mistakes in the final version.

2024-08-14

Thanks for the authors' clarification and my concerns are well addressed. Overall, this is an interesting work based on its insights, novelty, and theory framework. After acrossing other comments and the rebuttal, I maintain the acceptance of this paper.

作者回复

2024-08-07

We would like to thank all the reviewers for their insightful and constructive comments. Due to space limitations in the rebuttal, we listed tables and figures in the uploaded PDF file. All modifications will be found in the final version. Our key responses are summarized as follows:

> Clarification of unclear points.

As Reviewer NZZJ suggested, we provided more details to clarify some notations. For example, the dimensions of the encoders $g_\phi$ , $f_\theta$ , and projection heads $p_\varphi$ , $q_\gamma$ . Moreover, we provided more explanation to clarify the objective function $\mathcal{L}_{sp}$ . In addition, we renamed $\mathbf{Y}$ as the cluster assignment matrix to avoid confusion. Finally, we provided more explanations to clarify the statistical independence.

As Reviewer v7xk suggested, we provided more explanation to clarify previous SHGL methods in Theorem 2.2, and listed their corresponding expressions for the regularization term. Moreover, we highlighted the new contributions of the paper, compared to previous literature. In addition, we provided explanation and modified the sentences in abstract and introduction for a clear expression.

As Reviewer oBdx suggested, we explained three challenges and the logical relationships between them. Moreover, we clarified the objective of the problem, the reason for bridging the gap between SHGL methods and graph-cut algorithm, and its benefits. In addition, we explained why rank constraints can mitigate noisy connections and the logical relationship between Sections 2.2 and 2.3. Finally, we explained why homogeneous graph datasets were used for performance evaluation.

> Additional theoretical details.

As Reviewer NZZJ suggested, we provided the details to reach the inequality in (53) and (54).

As Reviewer v7xk suggested, we will move the definition of the model complexity in the appendix to the main paper. Moreover, we provided the definition of the relationship between model complexity and generalization ability. In addition, we reformulated language description with a precise objective function to achieve precise Theorem 2.5.

> Additional experimental results.

As Reviewer LzB2 suggested, we conducted experiments to replace the proposed node-level consistency constraint with the InfoNCE loss. Moreover, we conducted experiments to replace the affinity matrix with the simple cosine similarity and self-attention mechanism.

As Reviewer NZZJ suggested, we conducted experiments to add a simplex constraint on the columns of $\mathbf{Y}$ . Moreover, we conducted experiments to change the number of clusters in the proposed method.

> Other responses.

As Reviewer LzB2 suggested, we provided a real example to better illustrate the noise in meta-path-based graphs and adaptive graph structures. Moreover, we provided more details about the specific processes for different downstream tasks and the implementation on the homogeneous graph.

> Summary.

We thank all the reviewers again for the detailed and constructive review. It seems that almost all the reviewers have agreed with the novelty and contributions of the proposed method, and that most of the concerns are raised to part of the unclear expressions, and experiments. We hope our clarification, additional theoretical details, and additional experimental results in the rebuttal could address all of your concerns. Please let us know if you have any questions or concerns.

最终决定Accept (poster)

2024-09-25

This work introduces a novel loss function for graph embedding, drawing on insights from spectral clustering, along with dual consistency regularizations and orthogonality constraints.

The reviewers collectively appreciated the innovative perspective of the paper, particularly its connection to spectral clustering. They also commended the theoretical justification provided for the proposed loss function. Several reviewers noted that the experimental section was comprehensive, offering sufficient baseline comparisons.

However, the primary concerns raised were related to presentation. The reviewers pointed out issues with the clarity of notation and the use of certain terminologies, such as distinguishing between statistical independence and uncorrelatedness, and the definitions of terms like "probability matrix" and "complexity measure." These concepts were either not carefully defined or misused. In their rebuttal, the authors thoroughly addressed these concerns and acknowledged the need for improvements in writing and presentation.

Given the unanimous positive recommendations from the reviewers and the fact that the authors have the opportunity to address the clarity and presentation issues before publication, I recommend accepting this work.