PaperHub

Average rating: 3.8/10 — Rejected (4 reviewers; scores 3, 3, 3, 6; min 3, max 6, std 1.3)
Confidence: 3.8 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3
ICLR 2025

Towards an Understanding of Graph Sequence Models

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

Keywords
Graph Learning, Sequence Models, Graph Transformers, Hierarchical Clustering

Reviews and Discussion

Official Review (Rating: 3)

This work aims to study graph sequence models (GSMs) in a unified framework. A GSM in this framework consists of tokenization, local encoding (e.g., a GNN), and global encoding (e.g., recurrent models or Transformers). The authors discuss the pros and cons of recurrent models vs. Transformers in the global encoding stage, as well as node/edge/subgraph tokenizers in the tokenization stage. In addition, they propose a hierarchical tokenization technique using Hierarchical Affinity Clustering, allowing for graph encoding at different levels of granularity.
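
To make the three-stage framework concrete, here is a minimal, illustrative PyTorch sketch of the pipeline described above; the linear "local encoder" and the GRU "global encoder" are stand-ins chosen for brevity, not the components used in the paper.

```python
import torch
import torch.nn as nn

class GraphSequenceModel(nn.Module):
    """Illustrative three-stage GSM: tokenize -> local encode -> global encode."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.local = nn.Linear(d_model, d_model)        # stand-in for a GNN local encoder
        self.global_enc = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a recurrent/SSM global encoder

    def tokenize(self, node_feats: torch.Tensor) -> torch.Tensor:
        # Simplest tokenizer: each node is one token, in some fixed order.
        return node_feats.unsqueeze(0)                  # (1, num_nodes, d_model)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenize(node_feats)              # 1) tokenization
        tokens = torch.relu(self.local(tokens))         # 2) local encoding (per token)
        out, _ = self.global_enc(tokens)                # 3) global encoding (sequence model)
        return out

x = torch.randn(10, 64)                                 # 10 nodes with 64-dim features
print(GraphSequenceModel()(x).shape)                    # torch.Size([1, 10, 64])
```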

Strengths

  1. a comprehensive theoretical analysis of the different ingredients of graph sequence models
  2. a novel hierarchical tokenizer with hierarchical positional encodings

Weaknesses

  1. the theory part (Section 3) does not seem well-organized, self-consistent, and clear enough to follow. For example:
    • Line 219 "Based on the above results, we can see that when dealing with naturally sequential tasks, recurrent models are more powerful than softmax non-causal Transformers". The "above result" is that the softmax non-causal transformer cannot solve the counting task while a recurrent model can. However, the counting task does not seem sufficient to represent "naturally sequential tasks". In fact, counting node colors in a graph is a permutation-invariant function, since node ordering does not affect the count.
    • Line 215 "Can causality resolve this issue?". This question is raised by the authors, but there is no follow-up.
    • Line 268 "This result motivates us to explore hybrid models, combining SSMs with non-causal Transformers to take advantage of their inductive bias while avoiding representational collapse by self-attention layers". The statement relies on causal transformers having no representational collapse, which is not argued in the context.
    • Line 251 "Transformers have constant sensitivity on the tokens distances" and Line 257 "In both Transformers and SSMs, the information about tokens located near the start of the sequence have more opportunity to be maintained at the end". Given that the transformer has constant sensitivity w.r.t. distances, I wonder why the dependency would still concentrate on the start of the sequence?
  2. the experiment part misses some details:
    • what are the performance metrics shown in Table 1 and 2?
    • what is specifically "MPNN" in Table 1?
  3. from the ablation study in Table 4, it seems the proposed HAC tokenization has a very marginal improvement, while the main improvement comes from the simple hybrid of different models and tokenizations.

Questions

  • Line 303, Definition 1, $\arg\max_i \{e_i : v \in e_i\}$: what does it mean to take the maximum over an edge set?
  • in Table 1, does the node degree task refer to computing the node degree of each node? If so, why can GIN not perform well on the task?
Comment

Thank you so much for your time and constructive review. We really appreciate it. Please see below for our response to your comments:

W1: However, the counting task does not seem sufficient to represent "naturally sequential tasks".

Response: Thank you for mentioning that. Here, by naturally sequential tasks, we meant tasks that do not require graph inductive bias. We agree that this term has been confusing, and following your suggestion, we have modified the writing and changed this term. We hope this has addressed your concern.

W1: Line 215 "Can causality resolve this issue?". This question is raised by the authors, but there is no follow-up.

Response: We want to kindly bring to your consideration that this question is followed by Theorem 1, which answers it. In fact, causal Transformers are similar to recurrent models and, due to their causal nature, can count the number of nodes with each specific color iff $m \geq C$, where $m$ is the width and $C$ is the number of colors. To avoid any confusion and to further clarify the statements, we have revised this part in the revised version.
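
As an informal illustration of this claim (a sketch of the intuition only, not the construction used in the proof), a causal, left-to-right pass can count node colors with a state of width C, one counter per color:

```python
from collections import Counter

def count_colors(node_colors):
    """Single causal pass over the node sequence, maintaining one counter per
    color, i.e., a hidden state of width C (the number of distinct colors)."""
    state = Counter()
    for color in node_colors:   # left-to-right, order-sensitive scan
        state[color] += 1
    return dict(state)

print(count_colors(["red", "blue", "red", "green", "red"]))
# {'red': 3, 'blue': 1, 'green': 1}
```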

W1: The statement relies on causal transformers having no representational collapse, which is not argued in the context.

Response: Thank you for bringing this to our attention. Please note that non-causal transformers (traditional transformers) are permutation invariant, so the order of tokens cannot affect the final output after $L$ layers. Accordingly, they do not suffer from the representational collapse issue. Following your suggestion, we have added this argument in the revised version to further clarify this conclusion.

W1: Given that the transformer has constant sensitivity w.r.t. distances, I wonder why the dependency will still concentrate on the start of the sequence?

Response: Please note that these results correspond to different variants of Transformers. That is, non-causal transformers (traditional transformers) are permutation invariant and also have the constant sensitivity issue. On the other hand, causal transformers (i.e., decoder-only or causally masked transformers) have the issue of representational collapse and will concentrate on the start of the sequence. Following your suggestion and to avoid any misinterpretation, we discuss this in the revised version in lines 258-274. This is one of the main motivations for presenting a hybrid model (causal SSM + non-causal transformer). That is, non-causal transformers do not have representational collapse, so combining them with causal SSMs gives the best of both worlds, resulting in a model with linear sensitivity while avoiding representational collapse.

W2: what are the performance metrics shown in Table 1 and 2?

Response: Thank you for bringing this to our attention. Following previous studies (e.g., Sanford et al., NeurIPS 2024), the performance metric for node degree, cycle check, connectivity, and color counting is accuracy (all are balanced datasets), and the metric for triangle counting and shortest path is MSE. We have added all the performance metrics to the tables in the revised version. We hope this change has addressed the reviewer’s comment.

W2: what is specifically "MPNN" in Table 1?

Response: We thank the reviewer for bringing this to our attention. The MPNN module is the original neural message passing model from Gilmer et al. (ICML 2017). Accordingly, to further clarify and to address the reviewer’s concern, we have included this information in the revised version.

W3: from the ablation study in Table 4, it seems the proposed HAC tokenization has a very marginal improvement, while the main improvement comes from the simple hybrid of different models and tokenizations.

Response: Thank you very much for bringing this to our attention. We realize that in these ablation studies, we were still using the HAC-based hierarchical positional encoding, which means the model was still taking advantage of HAC rather than fully removing it. We have updated the results and also report results for more challenging datasets, which further show the significance of the presented modules. The results show that removing HAC results in a 1.5% performance drop on average. Also, please note that using hybrid models and the mixture of tokenization are contributions of this study, and to the best of our knowledge, no other studies have explored their effectiveness in graph learning tasks.

Comment

Q1: what does it mean to take maximum over an edge set?

Response: Please note that this formula does not take the maximum over an edge set; it takes the argmax over the indices of edges. That is, it represents the maximum index of an edge that node v belongs to. Please note that here, indices refer to the locations of the edges in the sequence.
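
A small, hypothetical example of this quantity, i.e., the sequence position of the last edge token that contains node v (illustrative code, not the paper's implementation):

```python
def last_edge_index(v, edge_sequence):
    """Return argmax_i { i : v in e_i }: the position of the last edge token
    in the sequence that contains node v."""
    return max(i for i, e in enumerate(edge_sequence) if v in e)

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # edge tokens in sequence order
print(last_edge_index(2, edges))            # 3, since edge (2, 3) is the last edge containing node 2
```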

Q2: does node degree task refer to compute the node degree of each node? If so, why GIN cannot perform well on the task?

Response: Yes, this task aims to compute the node degree of each node. GIN and other convolutional models do not perform well due to over-squashing. That is, if we want these models to perform perfectly on this task, we need to use a vector of size N (the number of nodes), which results in overparameterization in this experiment. The same pattern is also reported by Sanford et al. (NeurIPS 2024). Please also note that the goal of our study is not to compare with MPNN-based or convolutional graph learning models; these results are presented as a baseline to give the reader a sense of the hardness of the tasks. The main goal of these experiments is to compare different types of graph sequence models, meaning that we sequentialize the graph and use sequence models (e.g., transformers) on top of it.

Please also see our global response. Once again, we thank the reviewer for their time and review. We hope our above responses have fully addressed your concerns/questions about the paper. We would be more than happy to answer any further questions or discuss any remaining concerns of the reviewer.

Official Review (Rating: 3)

This paper aims to enhance the understanding of graph sequence models. It tries to draw insights via theoretical analysis to see which sequence models and node tokenization strategies would be more suitable for certain graph tasks. Then, the paper proposes to utilize Hierarchical Affinity Clustering (HAC) for graph tokenization and conducts experiments with different sequence models. Results on multiple standard datasets show good performance of the proposed method.

Strengths

  1. The manuscript is well-written and easy to follow.
  2. Sec. 3 is interesting to read, which provides some insights into the capability of different sequence models on graph tasks.

Weaknesses

  1. I'm a bit confused about the positioning of this paper. It seems to me the most interesting part is Sec. 3, which organizes some previous results and also provides a few new results. However, if we were to see the technical results of this paper, then they seem a bit limited. For example, the overall framework (tokenization + local + global) is not new, the usage of SSMs on graphs is not new, etc. Perhaps the use of HAC could be considered somewhat new, but the ablation studies cannot really show its benefits. In the meantime, the experiments exclude many recent baselines (even though they are actually cited), and it seems the empirical performance is not that excellent compared to recent baselines, e.g., Grit, Graph-Mamba (two concurrent works), GSSC, etc.
  2. If we only look at Sec. 3, it seems the results are not comprehensive enough.
    • I do appreciate that Sec. 3 organizes some previous results well and also includes some new results. But still, many results in Sec. 3 are not that original.
    • After reading Sec. 3, it still remains hard to really draw practical insights. For example,
      • I still find it hard to compare recurrent models and transformers from PROPOSITION 1 and THEOREM 1, because PROPOSITION 1 limits itself to cases without properly using PE, which may not be the case in the graph domain.
      • It seems hard to conclude that recurrent models could be more efficient on connectivity tasks in practice, because the bounds in Sec. 3.3 may not really show significant difference in practical cases.
    • The other thing is that it seems the experiments (table 2) cannot sufficiently support the takeaway listed in Sec. 3. The selected datasets may not be challenging enough, and in most cases using different sequence models yields close performance.

Questions

See above.

Comment

W1: empirical performance is not that excellent compared to recent baselines, e.g., Grit, Graph-Mamba (two concurrent works), GSSC, etc.

Response: We kindly bring to your consideration that GSM++ provides a 0.85 performance improvement on average over the best of these baselines. Please note that this is a larger improvement than what Graph-Mamba and GRIT have gained over their own baselines, which shows the significance of our contribution. Please also note that when the best performance approaches near-perfect scores on a benchmark dataset, designing an architecture that further improves performance becomes considerably more challenging. Despite these challenges, GSM++’s design has resulted in a performance gain that is higher than the performance gains of earlier studies.

W2: But still, many results in Sec. 3 are not that original

Response: We kindly refer the reviewer to our previous reply. This paper provides 10 distinct and new theorems with 5 corollaries, which, to the best of our knowledge, constitute one of the most extensive theoretical analyses among recent studies in top-tier conferences. All these results are original, and only 2 corollaries and 1 proposition are from previous studies. Moreover, many theoretical results presented in this paper are specifically tailored to graph learning tasks. This comprehensive theoretical exploration emphasizes the depth and novelty of our contributions.

W2: I still find it hard to compare recurrent models and transformers … limits itself to cases without properly using PE

Response: Please note that theoretical analysis of graph sequence models has been extremely challenging and has yielded limited insight. The reason is that existing studies have tried to evaluate their expressive power using the WL test, which is widely used for other types of graph neural networks. Graph sequence models, however, with proper positional encoding, are more powerful than any k-WL test, which makes comparing their expressive power with respect to this test impossible. To mitigate this, and to provide further insight into the ability of sequence models, in this paper, in addition to general-purpose theorems, we also aim to answer which type of sequence model has more inductive bias and is more suitable for the task at hand. We want to kindly bring to your consideration that the assumption of no proper PE is made only for Proposition 1 and Theorem 1. Even without these two results, the paper provides 9 theorems with 5 corollaries, all of which can provide insight about which sequence model is useful.

Please note that our theorems cover different possible cases, providing insights for different setups. That is, Theorem 1 and Proposition 1 consider sequence models without specific tokenization (i.e., node tokenization) and positional encoding. Theorems 3, 4, and 5 consider sequence models without tokenization but with any arbitrary positional encoding, and finally Theorems 6 and 7 consider the complete setup. These theorems are complementary and all together provide new insights about the power of sequence models for graph tasks.

W2: It seems hard to conclude that recurrent models could be more efficient on connectivity tasks in practice … may not really show a significant difference in practical cases.

Response: Please note that if we fix the node locality to a reasonable number, 12, and the number of edges to 1M, then these results indicate that a recurrent model with a hidden state size of 12 can solve this task, while a transformer requires at least 1M parameters. We believe that in practice (on a reasonable graph with 1M edges), the difference between 12 and 1M is indeed significant.

In addition to the above, we want to kindly bring to your consideration that the same argument could be made about the WL test, which is widely used in the literature, or about most theoretical results. That is, one might argue that representation learning on most graphs does not require even the 1- or 2-WL test, as in practice graphs are not that challenging and are similar to each other. However, theoretical results often serve as the motivation/justification of empirical results and support them. That is, theoretical results are mathematical facts about the power of architectures in the general (or worst) case, while empirical results are instances of performance on a subset of potential datasets. Thus, theoretical results guarantee good performance when larger, more complicated, or more challenging graphs arise in the future. Accordingly, we believe all the presented theoretical results are valuable and lead to a better understanding of graph sequence models.

Comment

W2: The selected datasets may not be challenging enough, and in most cases using different sequence models yields close performance.

Response: Thank you for bringing this to our attention. Following your suggestion, we have provided results on datasets with a more challenging setup. In the previous setup, the structure of the graph could be encoded simply and perfectly, so the differences between recurrent and transformer models were marginal. Using more complex datasets, the results now show a clearer difference between recurrent models, transformers, and hybrid models, supporting the theorems provided in the paper. That is, Mamba significantly outperforms transformers in connectivity tasks, while transformers are better at color counting and shortest path tasks. Please see Table 2 for details.

Please also see our general response. Once again, we thank the reviewer for their time and review. We hope our above responses have fully addressed your concerns/questions about the paper. We would be more than happy to answer any further questions or discuss any remaining concerns of the reviewer.

Comment

Thank you so much for your time and constructive review. We really appreciate it. Please see below for our response to your comments:

W1: I'm a bit confused about the positioning of this paper

Response: Thank you for raising this point. Recently, there has been huge interest in designing efficient alternative architectures to transformers, and accordingly, they can all be applied to graph data once we tokenize the graph. This results in a huge number of models based on different combinations of sequence models and tokenization methods (as well as their combination with local and global encoding). Although this can provide an interesting set of models with potentially good performance, it can also cause confusion about what constitutes a good sequence model for graphs and how one can compare these combinations. To address this, we provide extensive theoretical results and, motivated by these results, we present a new architecture, GSM++, that uses a hybrid sequence model with a novel tokenization process based on the HAC algorithm. We believe these contributions position the paper not only as a source of theoretical guidance but also as a proposal for a new architecture that leverages these insights for practical performance improvement.

W1: which organizes some previous results and also provides a few new results … if we were to see the technical results of this paper, then they seem a bit limited

Response: We would like to draw your attention to the depth and breadth of the theoretical results presented in this work. Specifically, we have 10 distinct theorems, 7 corollaries, and 1 proposition, of which 2 corollaries and 1 proposition are cited from prior work. This leaves the paper with 10 distinct and new theorems and 5 corollaries, which, to the best of our knowledge, constitute one of the most extensive theoretical analyses among recent studies in top-tier conferences. The theoretical results span topics such as graph connectivity, k-locality, motif counting, color counting, and hierarchical tokenization, providing both novel insights and practical design guidelines. Furthermore, please note that these theorems are considerably non-trivial, with about 3 + 8 pages of discussion and proofs. While a few theoretical results are revisited for completeness and to contextualize our contributions, the majority of results are new and directly inform the design of GSM++.

W1: the overall framework (tokenization + local + global) is not new

Response: We would like to clarify that to the best of our knowledge, none of the existing studies have explicitly presented this unifying framework. For example, graph mamba also discusses similar steps, but it lacks the global encoding step. Graph GPS also discusses a framework, but it lacks both tokenization and local steps, as it simply treats the graph as a sequence of nodes.

We would, however, be happy to discuss and consider any other model that has explicitly presented and formalized this framework. Also, we want to clarify that we do not claim the framework itself as a major contribution of this work, but we believe it is an important step that establishes a clear structure, which has allowed us to extensively evaluate (both theoretically and empirically) different sequence models and tokenizers. Please note that, as an example of this extensive evaluation, Figure 3 alone represents the results of 54 different (mostly new) models on 7 datasets.

W1: the usage of SSMs on graphs is not new

Response: Please note that we are not using SSMs alone for graphs. While standalone usage of state-space models (SSMs) in various tasks has been explored, to the best of our knowledge, this paper is the first study that presents a hybrid model (SSMs + attention) for graph-structured data and shows its effectiveness.

In addition to this fact, we want to kindly bring to your consideration that before the recent interest in SSMs, graph transformers, based on self-attention, were the only graph sequence models for several years, and several models based on graph transformers have been published in top-tier conferences due to their unique contributions in design or application. The novelty of a method comes from the combination of its proposed elements, which in our case is HAC + hybrid model + mixture of tokenization, motivated by theoretical analysis. To the best of our knowledge, none of these elements has been discussed in existing methods. Please see our general response about the significance of our contributions.

We hope this clarifies how GSM++ constitutes a significant and novel contribution to the field beyond a simple application of SSMs to graphs.

Comment

W1: Perhaps the use of HAC could be considered somewhat new

Response: Please note that in addition to presenting all the theoretical results, the hybrid model, and the mixture of tokenization, we have also presented two different tokenizations based on the HAC algorithm. We want to kindly bring to your consideration that presenting a tokenization method alone is an important contribution and has been the main contribution of several recent state-of-the-art studies. For example, NAGFormer’s contribution (Chen et al., ICLR 2023) is a neighborhood-based tokenizer in which each token is the k-hop neighborhood, improving the efficiency of graph transformers. Similarly, GRED’s main contribution (Ding et al., ICML 2024) is to treat the set of nodes in the k-hop neighborhood as a token (ignoring the inter-neighborhood connections). Finally, graph mamba’s main contribution (Behrouz et al., KDD 2024) is a random walk-based tokenization that can fully take advantage of SSMs. Here, in addition to all the abovementioned contributions, we also present a HAC-based tokenization, motivated by two of our theorems, which can fully take advantage of the linear sensitivity of SSMs and can provide proper k-node locality, fully exploiting the advantages of recurrent models. In other words, the entire architecture design and almost all of its elements are new.
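
For concreteness, here is a simplified single-machine sketch of affinity-style hierarchical clustering, in which every super-node merges with its highest-affinity neighbor in each round; the actual HAC tokenizer in the paper (and its parallel implementation) may differ in details such as how merged edge weights are aggregated.

```python
import networkx as nx

def affinity_hierarchy(G: nx.Graph):
    """Simplified hierarchical affinity clustering: in each round every super-node
    picks its highest-weight incident edge; connected components of the picked
    edges are merged. Returns one clustering (a list of node lists) per level."""
    clusters = {v: [v] for v in G}                 # super-node id -> original nodes
    levels = [list(clusters.values())]
    H = G.copy()
    while H.number_of_edges() > 0:
        picked = nx.Graph()
        picked.add_nodes_from(H)
        for v in H:
            if H[v]:                               # node has at least one neighbor
                best = max(H[v], key=lambda u: H[v][u].get("weight", 1.0))
                picked.add_edge(v, best)
        mapping, new_clusters = {}, {}
        for cid, comp in enumerate(nx.connected_components(picked)):
            new_clusters[cid] = [x for v in comp for x in clusters[v]]
            for v in comp:
                mapping[v] = cid
        # Contract: rebuild the graph over merged super-nodes (weights simplified to max).
        contracted = nx.Graph()
        contracted.add_nodes_from(new_clusters)
        for u, v, data in H.edges(data=True):
            cu, cv = mapping[u], mapping[v]
            if cu != cv:
                w = data.get("weight", 1.0)
                prev = contracted[cu][cv]["weight"] if contracted.has_edge(cu, cv) else float("-inf")
                contracted.add_edge(cu, cv, weight=max(prev, w))
        H, clusters = contracted, new_clusters
        levels.append(list(clusters.values()))
    return levels

for i, level in enumerate(affinity_hierarchy(nx.karate_club_graph())):
    print(f"level {i}: {len(level)} clusters")
```

Each level of the returned hierarchy can then serve as a set of tokens at one granularity.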

Please note that our study focuses on several aspects (i.e., a unifying framework, theoretical results on how to choose a sequence model, a new hybrid model for graphs, a new positional encoding, and two new tokenizations), all of which support and align with each other. Other recent studies, however, have focused on only a subset of these aspects, which shows the significance of the contributions of this paper.

W1: but the ablation studies cannot really show its benefits

Response: Thank you very much for bringing this to our attention. We realize that in these ablation studies, when removing HAC, we were still using the HAC-based hierarchical positional encoding, which means the model was still taking advantage of HAC rather than fully removing it. We have updated the results and also report results for more challenging datasets, which further show the significance of the presented modules. The results show that removing HAC results in a 1.5% drop in the F1 score and accuracy of the models. Note that this is a substantial benefit, as GSM++ without HAC underperforms several baselines.

W1: the experiments exclude many recent baselines (even though they are actually cited)

Response: To the best of our knowledge, we have included all the relevant baselines, including many recent ones (e.g., graph mamba, Behrouz et al., KDD 2024, Aug. 2024). Please note that graph mamba itself has been compared with and has outperformed previous baselines, including very recent studies (the other graph-mamba paper (Wang et al., 2024), GRED (Ding et al., ICML 2024), S4G, etc.). Nevertheless, following your suggestion, we have provided additional baseline results in the revised version.

We want to kindly bring to your consideration that the main goal of these experiments is to show the effectiveness of GSM++ compared to existing graph sequence models. Models like GSSC are (global) convolutional and are a different type of graph learning method. That is, graph sequence models translate the graph into sequences and employ sequence models on top of that, while models like GSSC naturally extend the concept of a sequence encoder (e.g., SSM) to graphs. The focus of this work is on graph sequence models. Each of these types (e.g., graph sequence models, convolutional, message-passing-based, random walk-based, rewiring-based, etc.) has its own advantages and disadvantages. All these research directions are valuable, sometimes incomparable (not fair to compare), and help to further understand the challenges of learning on graphs.

For example, GSSC, similar to transformers, has constant sensitivity to the distance of nodes, which results in weaker expressive power for local tasks (similar to non-causal transformers). It is also strictly less powerful than the 3-WL test, while, given the universality of graph transformers and graph mamba (proved in their original papers), it is simple to show that GSM++ with proper initial features can be more powerful than any k-WL test. Furthermore, while GSSC uses about the same memory as GSM++, it is slower in training but has better inference latency. Accordingly, it is fair to say that each of these models has its own advantages and disadvantages, but the focus of this work is on understanding and improving graph sequence models.

We hope our above response has addressed the reviewer’s concern. However, we would be happy to include any specific relevant baselines that the reviewer believes are required.

Official Review (Rating: 3)

This paper studies sequence models such as Transformers, RNNs, and SSMs on graphs. The authors unify graph sequence models with three operators: Tokenization, Local Embedding, and Global Embedding. Then, the authors study the pros and cons of applying different sequence models to Counting and Connectivity tasks. The authors also propose a hybrid model, GSM++, which incorporates HAC-based tokenization, a GatedGCN as the local encoder, and a hybrid global encoder that combines Mamba and Transformer models sequentially.

Strengths

  1. The exploration of sequence models on graphs is interesting.
  2. The authors investigate the effects of different sequence models on graph-based tasks, specifically focusing on Counting and Connectivity.
  3. The authors provide an examination of different tokenization methods.

Weaknesses

  1. Most of the theoretical foundations in this paper are derived from prior work focused on sequence models, such as Transformers and RNNs, originally designed for sequential data rather than graphs. The paper lacks theoretical adaptations specifically tailored to graphs, which limits its contribution in this area. Additionally, several theorems (e.g., Theorems 3, 5, 6, and 7) are presented without proof.

  2. The logical flow of the paper is unclear. While several theorems are presented, many do not lead to the conclusions that the authors claim, which makes it difficult to follow the arguments effectively.

Questions

  1. Why is the color counting problem important for graphs?

  2. The authors claim that "Both causal Transformers and SSMs can suffer from representational collapse, resulting in limited representational power. This result motivates us to explore hybrid models, combining SSMs with non-causal Transformers to take advantage of their inductive bias while avoiding representational collapse by self-attention layers". Why do non-causal transformers solve the representational collapse issue?

  3. This work appears incomplete in several respects. For example, although the authors introduce a variety of tasks, such as node- and graph-level tasks, and discuss relevant datasets in Table 6, they do not present corresponding results. Additionally, most of the theorems lack experimental support.

  4. The writing quality could be improved, as there are some typos, such as duplicate words (e.g., in line 137).

Comment

Q1: Why is the color counting problem important for graphs?

Response: Please note that there are four aspects to answer this question:

  • (1) Color counting is a fundamental task that we expect any effective graph learning model to be able to handle. The reason is that counting is a seemingly simple, yet important, reasoning task that requires only keeping track of a single scalar for each color.
  • (2) The task is structured and simple enough to be studied theoretically, while it is still hard enough for the model, as counting tasks require aggregating information from all nodes, and even missing a single node's color can potentially result in an incorrect prediction. Please note that theoretical analysis of a task is a highly non-trivial process, and most theoretical studies (even in a broader context) focus on tasks that can be modeled simply while still being hard enough for ML models.
  • (3) Color counting is related to important concepts in graph learning, such as color refinement and the Weisfeiler-Lehman (WL) test.
  • (4) As mentioned by Barbero et al. (NeurIPS 2024), counting is a problem that requires some notion of ‘unboundedness’ of the representations, whilst the normalisations used inside a Transformer work against this. As discussed/motivated in our paper (lines 206-211), color counting can be considered the graph counterpart of the popular copying task in sequence modeling, widely studied in language models (e.g., Mamba, Gu et al., 2023).

Also, please note that our study is not the first to use color counting tasks. As we cited in the initial submission, several papers, including Barbero et al. (NeurIPS 2024), also use graph counting tasks to theoretically analyze the power of deep learning architectures. For example, Barbero et al. (NeurIPS 2024) use this task to understand the reasoning capability of Transformers. Its simplicity for theoretical analysis, its connection to several fundamental concepts (e.g., color refinement, the WL test, and the copying task), and the fact that it is a minimal expectation for models reasoning over a scalar feature all make color counting an important and suitable problem for theoretical analysis of graphs.

Q2: Why do non-causal transformers solve the representational collapse issue?

Response: Thank you for mentioning that. Please note that the position of a token in non-causal transformers (traditional transformers) cannot affect its encoding, due to the fact that Transformers are permutation equivariant. Since the order of tokens cannot affect the final output after L layers, Transformers do not suffer from the representational collapse issue. Following your suggestion, we have made this point clearer in the revised paper.
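
This permutation equivariance is easy to check numerically; the sketch below uses a plain PyTorch attention layer (not the paper's model) and shows that permuting the input tokens merely permutes the outputs in the same way:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 6, 16
attn = nn.MultiheadAttention(embed_dim=d, num_heads=2, batch_first=True)
x = torch.randn(1, n, d)
perm = torch.randperm(n)

out, _ = attn(x, x, x)                                   # original token order
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])   # permuted token order

# Without positional encodings or masking, permuting the inputs only permutes the outputs:
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```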

Q3: This work appears incomplete in several respects. For example, although the authors introduce a variety of tasks, such as node- and graph-level tasks, and discuss relevant datasets in Table 6, they do not present corresponding results

Response: We respectfully disagree with the statement that the paper is incomplete. The results for both node-level and graph-level tasks were already included in the initial submission, specifically in: Tables 3 (page 10), 7 (page 30), and 8 (page 30). In the revised version, these tables can be found in Tables 3 (page 10), 8 (page 33), and 9 (page 33). In the main text, we also have referred to these results in line 524 (line 521 in revised version). As you also mentioned, the details for these experiments are also presented in Table 6.

Comment

Q3: Additionally, most of the theorems lack experimental support.

Response: We would like to clarify that experimental results supporting the theorems were already included in the initial submission. More specifically, we performed experiments on graph connectivity, color counting, motif counting, and additional basic tasks in Tables 1 and 2. We also reported results for Transformers, Mamba (a recurrent model), and hybrid models. We acknowledge that the assumptions of certain theorems (i.e., Theorem 1 and Proposition 1) differ from the experimental setup, as without positional encoding, transformers are incapable of performing some tasks. However, following the reviewer’s suggestion and to address remaining concerns, we have added additional experiments with more challenging setups, which fully reflect the assumptions in the theorems and support them (please see Table 2 in the revised version).

We want to kindly bring to your consideration that in the broader literature, theoretical results often serve as a complement to experimental findings. Theoretical results are mathematical facts that provide insight about the capabilities of models in a general context, while experimental results are instances of performance on a subset of potential datasets. In our paper, the empirical results are supported by our theorems, and additional theorems are presented to complement the good results of GSM++. Please note that although the theoretical results are experimentally supported, the main goal of our theoretical results has been to motivate the design of GSM++ and to provide a guideline for future studies. Accordingly, this paper has used both theoretical and empirical results to support each other.

We hope this clarification addresses the reviewer’s concern and highlights the completeness of the paper and the rigorous theoretical and experimental foundation of our work. We would be happy to address any additional concerns about any specific theorem that the reviewer believes lacks experimental support.

Q4: The writing quality could be improved, as there are some typos, such as duplicate words

Response: Thank you for pointing this out. Following your comment, we have proofread the paper to identify and correct all typos, including this and other minor issues. We hope the revised version has addressed the reviewer’s concern.

Please also see our general response. Once again, we thank the reviewer for their time and review. We hope our above responses have fully addressed your concerns/questions about the paper. We would be more than happy to answer any further questions or discuss any remaining concerns of the reviewer.

Comment

Thank you so much for your time and constructive review. We really appreciate it. Please see below for our response to your comments:

W1: Most of the theoretical foundations in this paper are derived from prior work focused on sequence models, such as Transformers and RNNs. The paper lacks theoretical adaptations specifically tailored to graphs

Response: We kindly bring to your attention that many theoretical results presented in this paper are indeed specifically tailored to graph learning tasks. For instance, graph connectivity (e.g., Theorem 3), its variant based on k-locality (e.g., Theorems 4 and 5), and motif counting (e.g., Theorem 7) directly address graph-specific tasks and are thus theoretical results about learning on graphs. To the best of our knowledge, none of these contributions are mentioned or used by other studies.

In total, the paper introduces 10 distinct theorems, 7 corollaries, and 1 proposition, of which only 2 corollaries and 1 proposition are derived from prior work. This comprehensive theoretical exploration emphasizes the depth and novelty of our contributions.

Furthermore, please note that some results, such as Theorems 1 and 2, that are not specifically tailored to graphs are in fact a strength of our work, as they provide general theorems for a general machine learning model applied to arbitrary data forms, resulting in a broader impact on diverse sub-communities. Moreover, these analyses lay the foundation for the understanding and logical flow of the paper and are important in the design of the final GSM architecture.

W1: Additionally, several theorems (e.g., Theorems 3, 5, 6, and 7) are presented without proof

Response: We kindly bring to your consideration that all the proofs for the mentioned theorems were already provided in the Appendix. Specifically: (1) Theorem 3: Proof presented on page 23, line 1220 (revised version: page 26, line 1365). (2) Theorem 5: Proof presented on page 25, line 1300 (revised version: page 27, line 1444). (3) Theorem 6: Proof presented on page 28, line 1469 (revised version: page 30, line 1611). (4) Theorem 7: Proof presented on page 27, line 1417 (revised version: page 29, line 1557).

Please note that for some theorems, since we provide more extensive theoretical analysis in the appendix, we have discussed their proofs in separate sections. This has allowed us to clearly and cohesively discuss and motivate the theoretical results. After each theorem, we have indicated where the reader can find additional details.

We hope this response addresses the Reviewer’s concern and highlights the rigorous theoretical foundation of our work.

W2: The logical flow of the paper is unclear. …

Response: Following the reviewer's suggestion, and to enhance the logical flow of the paper, we have revised the presentation and made sure to explain all the steps required for each derived conclusion. Specifically, we have updated the Takeaway summaries and discussed all the required steps. These Takeaway paragraphs at the end of subsections distill the key results and insights and clarify how each subsection connects to subsequent subsections and to the overall narrative of the paper. We have further added a “Contributions and Roadmap” paragraph at the end of the Introduction that provides a clear outline of the paper’s structure.

We hope the current revised version has addressed the Reviewer’s concern. We, however, would be happy to clarify any additional concerns about any specific conclusion that is not clear.

Official Review (Rating: 6)

This paper proposes a robust and unified framework that offers both flexibility and adaptability for designing sequence models intended for learning on graphs. The architecture includes three main modules: (1) Tokenization, (2) Local Encoding, and (3) Global Encoding, which facilitates effortless and efficient comparison between graph sequence models (e.g., RNN- and SSM-based models) in terms of their performance on graph learning tasks. The authors also provide a theoretical analysis of recurrent models and transformer-based graph models. An enhanced model, GSM++, is proposed based on empirical findings. GSM++ is specifically designed with three enhancement techniques: Hierarchical Affinity Clustering, a mixture of tokenization, and hybrid encoders.

Strengths

  1. The paper is well-written and easy to follow. Theorems are proved in appendix.
  2. The proposed framework (tokenizer, local encoding, and global encoding) unifies most existing graph sequence models, enabling effortless evaluation and creation of GSMs within a unified setting.
  3. Experimental results are promising. GSM++(MoT) achieves state-of-the-art performance in most cases.

Weaknesses

  1. Critical experimental details are missing. The paper omits essential hyper-parameter configurations, including the number of self-attention layers and hidden dimensions utilized in GSM++. Such omissions hinder reproducibility and meaningful comparison with other models.
  2. The architectural design choices in GSM++ lack comprehensive ablation studies. While the combination of Mamba and attention mechanisms demonstrates improved performance, alternative architectural variants (e.g., different stacking patterns of Mamba and attention layers) remain unexplored. The superior performance of the ensemble approach, though noteworthy, is an expected outcome rather than a novel contribution.
  3. Appendix H (Line 1544) mentions that baselines include state-of-the-art GTs and recurrent graph models (e.g., graph mamba and GRED). However, the empirical performance of these baselines is missing.
  4. MoT requires concatenation of sequence encodings generated by two different tokenizers, which significantly increases the computational complexity. GSM++ may suffer from scalability issues when applied to larger datasets and more complex tasks.
  5. The efficiency and complexity analysis of the proposed GSM++ is missing. The paper should provide a theoretical complexity analysis compared to existing Graph Transformers and recurrent models, as well as training and inference time, memory cost, etc.

Questions

  1. The term "softmax non-causal Transformers" requires a clearer definition. What is the relation between non-causality and permutation equivariance? How does this property specifically benefit graph-based tasks?
  2. The discrete router mechanism in the Model-of-Thought (MoT) framework requires further elaboration. Is it implemented as a learned component with trainable parameters, or does it follow predetermined heuristic rules?
  3. The HAC tree implementation raises several computational concerns. What is the time complexity for constructing the HAC tree, especially for large graphs? How computationally intensive is the hierarchical positional encoding computation? Are there any optimizations or approximations that can be used to make these steps more efficient?
Comment

Thank you so much for your time and constructive review. We really appreciate it. Please see below for our response to your comments:

W1: experimental details such as hyper-parameters are missing

Response: Thank you for bringing this to our attention. To improve the clarity of the paper, we have added a new table in the revised version (Table 7 in the revised paper) that reports all the hyper-parameter configurations.

W2: The architectural design choices in GSM++ lack comprehensive ablation studies.

Response: Thank you for mentioning this interesting point. We want to kindly bring to your consideration that in the paper we have not claimed that the current ordering of the hybrid architecture is the best possible one, and in fact we left this exploration for future studies. However, following your suggestion and to address your concern, we have added additional case studies on different stacking patterns of the hybrid approach. The results of these ablations are reported in Appendix J. Based on these results, the proposed SSM + attention ordering is both more effective and more efficient than other variants (e.g., attention + SSM).

W2: The superior performance of the ensemble approach, though noteworthy, is an expected outcome rather than a novel contribution.

Response: Please note that this study, to the best of our knowledge, is the first to suggest a hybrid model (SSM + attention) for graphs, whose effectiveness is non-trivial due to the small scale of graph learning models. We also want to kindly bring to your consideration that neither the mixture of experts (here, mixture of tokens) nor the hybrid architecture is an ensemble approach. In an ensemble method, we have different models, all nodes go through all models, and the average (or a vote) over all models is taken as the final output. In our hybrid model, we use the SSM and attention sequentially, rather than as two separate models. In the mixture of tokens (a variant of mixture of experts), nodes likewise do not go through all models (as in an ensemble method); instead, a router decides which type of tokenization is best for each node. Therefore, the effectiveness of both of these approaches for graph sequence models is non-trivial, not fully expected, and novel.
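
The following schematic illustrates such a sequential hybrid block, with a GRU standing in for the SSM/Mamba component; the layer sizes and the GRU substitution are assumptions for the sketch, not the GSM++ implementation. The point is only the ordering: a causal, order-sensitive pass followed by a non-causal attention pass over the same token sequence.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Sequential hybrid: a causal recurrent pass (stand-in for an SSM such as
    Mamba) followed by non-causal self-attention -- not an ensemble of the two."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ssm = nn.GRU(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.ssm(x)                      # causal, order-sensitive pass
        x = self.norm1(x + h)                   # residual connection
        a, _ = self.attn(x, x, x)               # non-causal, permutation-equivariant pass
        return self.norm2(x + a)

tokens = torch.randn(2, 10, 64)                 # (batch, graph-token sequence, features)
print(HybridBlock()(tokens).shape)              # torch.Size([2, 10, 64])
```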

W3: the empirical performance of baselines (such as graph mamba and GRED) is missing

Response: We kindly bring to your attention that the empirical performance of graph mamba was presented in Tables 3, 7, and 8 in the initial submission. Please note that in the original paper, their model is called GMN (graph mamba networks), so we have used GMN in the tables. Also, the results of GRED are already reported in, and outperformed by, the graph mamba paper, which we have compared against. Nevertheless, following your suggestion, we have included the results of GRED and a new efficient graph transformer (GRIT) in the revised version.

W4: MoT requires concatenation of sequence encodings generated by two different tokenizers …

Response: Please note that the concatenation of these two vectors only results in $d_{in}^2$ additional parameters, which is negligible in practice as $d_{in} \leq 100$ ($d_{in}$ is the dimension of the input of the sequence model). To further address your concern, we have provided a theoretical discussion in Appendix K. We have further evaluated the efficiency of MoT in Table 10 (page 34).

W5: The efficiency and complexity of the proposed GSM++ is missing.

Response: Thank you for mentioning this. Following your suggestion, we have provided the theoretical time complexity of GSM++ in Appendix K of the revised version. It is worth mentioning that, because of the linear-time training of SSMs, the time complexity of GSM++ is linear with respect to the graph size. We further provide empirical efficiency comparisons on two large-scale datasets in Appendix K.

Comment

Q1: The term "softmax non-causal Transformers" requires a clearer definition …

Response: Thank you for bringing this to our attention. Non-causal Transformers (or simply Transformers) refer to the traditional Transformer architecture that employs non-causal attention, while causal Transformers use attention with causal masking. Non-causality, in this context, indicates the use of the full attention matrix without masking, making the model permutation equivariant. Note that the term softmax non-causal Transformers in the initial submission has been updated to non-causal Transformers in the current version to enhance clarity. Following your suggestion, and to avoid any misunderstanding about the concepts, we have modified this term and also provided this clarification in Section 3.2 and in Appendix A (under the Graph Transformer definition).
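
To make the distinction concrete, the minimal example below contrasts the two variants with a standard PyTorch attention layer: the non-causal version uses the full attention matrix, while the causal version masks out attention to future positions (this is a generic illustration, not the paper's model).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 5, 16
attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
x = torch.randn(1, n, d)

# Non-causal (traditional) attention: the full attention matrix is used.
_, w_full = attn(x, x, x)

# Causal attention: token i may only attend to tokens j <= i (upper triangle masked out).
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
_, w_causal = attn(x, x, x, attn_mask=causal_mask)

print(torch.all(w_causal[0].triu(diagonal=1) == 0))  # True: no attention to future tokens
print(torch.all(w_full[0].triu(diagonal=1) == 0))    # False: every token attends to every token
```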

Q2: The discrete router mechanism in the MoT framework requires further elaboration

Response: We kindly bring to your consideration that for the training process of MoT and its router mechanism, we adopt the same mechanism as the widely used Mixture of Experts (MoE) approach. Specifically, the router is a trainable linear module that is learned end-to-end alongside the other components of the architecture. The novelty of our work lies in adapting this mechanism for tokenization, enabling dynamic selection of tokens during training and inference. To address the reviewer’s suggestion, we have provided a detailed explanation of this process in Appendix A4.
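
A minimal sketch of such an MoE-style router over per-node tokenizer outputs is given below; the two-tokenizer setup, the names, and the top-1 gating trick are illustrative assumptions rather than the exact MoT implementation.

```python
import torch
import torch.nn as nn

class TokenizationRouter(nn.Module):
    """MoE-style router: a linear layer scores the available tokenizers for each
    node, and the highest-scoring tokenizer's encoding is selected per node."""
    def __init__(self, d_model: int = 64, num_tokenizers: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_tokenizers)

    def forward(self, node_feats: torch.Tensor, encodings: torch.Tensor) -> torch.Tensor:
        # encodings: (num_tokenizers, num_nodes, d_model), one encoding per tokenizer.
        scores = torch.softmax(self.gate(node_feats), dim=-1)   # (num_nodes, num_tokenizers)
        choice = scores.argmax(dim=-1)                          # discrete routing decision
        routed = encodings[choice, torch.arange(node_feats.size(0))]
        # Scale by the selected (soft) score so the router stays trainable end-to-end,
        # as in standard top-1 MoE gating.
        return routed * scores.gather(-1, choice.unsqueeze(-1))

router = TokenizationRouter()
feats = torch.randn(10, 64)
encs = torch.randn(2, 10, 64)      # e.g., node-based vs. HAC-based token encodings
print(router(feats, encs).shape)   # torch.Size([10, 64])
```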

Q3: The HAC tree implementation raises several computational concerns …

Response: Please note that the HAC tree algorithm is a well-established clustering method known for its high parallelizability and efficiency; it is capable of scaling to graphs with billions of nodes and trillions of edges in less than an hour. Also, this clustering algorithm, similar to other positional encodings, is a one-time computation and can be done as a preprocessing step, ensuring that it does not impact runtime efficiency during model training. Following your suggestion, and to fully address your concern, we compare the construction time of HAC with common positional encodings in Appendix I4 (Figure 4). Based on these results, the HAC algorithm does not introduce a significant computational burden.
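
For intuition only, one simple way a cluster hierarchy can be reused as a one-time positional encoding is to record, for every node, the id of the cluster it belongs to at each level of the tree; the toy sketch below is purely illustrative and may differ from the paper's hierarchical positional encoding.

```python
def hac_positional_encoding(levels):
    """Map each node to the list of cluster ids it belongs to at every level of
    the hierarchy, i.e., its path in the cluster tree (fine-to-coarse)."""
    pe = {}
    for clustering in levels:
        for cluster_id, members in enumerate(clustering):
            for node in members:
                pe.setdefault(node, []).append(cluster_id)
    return pe

# Toy 3-level hierarchy over 4 nodes: singletons -> pairs -> one cluster.
levels = [
    [[0], [1], [2], [3]],
    [[0, 1], [2, 3]],
    [[0, 1, 2, 3]],
]
print(hac_positional_encoding(levels))
# {0: [0, 0, 0], 1: [1, 0, 0], 2: [2, 1, 0], 3: [3, 1, 0]}
```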

Please also see our general response. Once again, we thank the reviewer for their time and review. We hope our above responses have fully addressed your concerns/questions about the paper. We would be more than happy to answer any further questions or discuss any remaining concerns of the reviewer.

Comment

We sincerely thank all the reviewers for their time and constructive reviews, which have helped us to improve the paper.

Revisions


Here, we have summarized the changes in our revised version:

  • Presentation: We really appreciate that Reviewers brCn and 7fVS have found the presentation of our work well-written, easy to follow, and interesting to read. Following the reviews of Reviewers WpZs and 42GL, however, we have made some revisions to further improve the presentation of the paper, as follows:
    • We consistently use the terms causal and non-causal Transformers, when differentiation is required, to avoid confusion.
    • We added a paragraph at the end of the Introduction section to further clarify our goals and the roadmap of the paper.
    • We revised the takeaway paragraphs for all subsections, which can help the reader better understand the main goals of each subsection. We further made sure that all the steps required for each conclusion are explained.
    • We have specified the metrics used in each table.
  • Additional Details about Mixture of Tokens: We have provided additional details about the router we use in the Mixture of Tokens in Appendix A4.
  • Experimental Setups: We have added all the configurations used for all the datasets in Table 7.
  • Additional Experiments:
    • We provide additional experiments on the effect of different stacking patterns of layers in the hybrid model. The results are presented and discussed in Appendix J.
    • We have performed additional experiments on two large-scale datasets to compare the efficiency of GSM++ with state-of-the-art methods.
    • We have performed additional experiments on the efficiency of our HAC tokenization and compared it with commonly used positional encodings.
    • Using hyperparameter tuning, we have improved the results presented in the paper.
    • We have updated the results in Table 2 with a more challenging setup to better support our theoretical results.
    • We have added a new baseline (w/o PE) to Table 2 that supports Theorem 1.
    • We have updated the results of the ablation study for HAC, as we had been using the HAC-based positional encoding, which could not show the effect of fully removing HAC from the model.
    • We present the results of more baselines, including GRED and GRIT.
  • Theoretical Time Complexity: We provided the theoretical time complexity of GSM++ in Appendix K.

Contributions


We further want to kindly bring to your consideration the main contributions of this paper:

  • Due to the recent interest in adapting sequence models for efficient graph learning, understanding their power beyond the WL test is important. This paper is a first step in this direction.
  • This paper has made different types of contributions:
    • First, we provide extensive and new theoretical results to better understand the power of sequence models for graph tasks. While the main focus of this paper is on graph learning, some of these results are generally about the power of sequence models, which have a broader impact on different communities. Most of these theorems are through the lens of circuit complexity, which can help understand graph sequence models’ power beyond the WL test.
    • Second, motivated by these theoretical results, we have presented a novel method in which all components (except message passing) are new. In particular, we present (1) a novel tokenization method based on the HAC tree algorithm, (2) a new hierarchical positional encoding based on HAC, (3) a new hybrid architecture of SSM + Transformer, and (4) a new technique, mixture of tokenization, that allows flexible graph tokenization. To the best of our knowledge, none of these contributions are mentioned or used by other studies. We want to kindly bring to your attention that designing an architecture in which most components are new is challenging and has rarely been done in the literature (for example, GraphGPS (NeurIPS 2022) presents new positional encodings, NAGPhormer (ICLR 2023) presents a new tokenization, Graph-Mamba (KDD 2024) presents the use of Mamba, GRIT (ICML 2023) presents a new positional encoding, etc.).
    • Third, our extensive experimental evaluation can be useful for the community as a benchmark of different existing sequence models on graph tasks. In particular, Figure 3 is the result of 378 experimental evaluations on 9 different sequence models and 6 different graph tokenizations.

Once again, we thank all the reviewers for their time and kind consideration. We hope that our responses have fully addressed their concerns. We would be happy to provide further information or address any remaining concerns.

Comment

Dear Reviewers, ACs,

We sincerely appreciate your time in helping us improve the paper.

We want to kindly remind you that the discussion period is coming to an end. We believe this discussion is particularly important for our submission, as there were some misunderstandings in the initial reviews, which we have tried to clarify in our rebuttal. We also kindly bring to your attention that in our responses we fully addressed the reviewers' suggestions by providing new experiments to further support our claims, adding a discussion of time complexity, and improving the presentation.

We hope you will kindly consider a renewed assessment of the paper in light of the addressed misunderstandings and the revised manuscript.

Please let us know if you have any additional questions or doubts; we would be happy to address them.

AC Meta-Review

The majority of the reviewers are concerned about this paper in major ways, and the paper received weak support overall. Thus, a reject is recommended.

Additional Comments from Reviewer Discussion

The authors provided rebuttals, but the reviewers were not willing to engage intensively during the discussion. However, I feel the original concerns are extensive and thus may not be addressable during the rebuttal period given the time constraints. Thus, the paper would benefit from a major revision beyond the rebuttal period.

Final Decision

Reject