PaperHub
Average rating: 6.0/10
Rejected · 4 reviewers (lowest 5, highest 8, standard deviation 1.2)
Ratings: 6, 5, 8, 5
Confidence: 3.0
Correctness: 2.8
Contribution: 2.3
Presentation: 2.0
ICLR 2025

GRAMA: Adaptive Graph Autoregressive Moving Average Models

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We introduce a selective graph neural ARMA-based method, drawing links to Graph SSMs, to address oversquashing in graphs.

Abstract

Keywords
Graph Neural Networks, Auto-regressive Moving Average, State Space Models

Reviews and Discussion

Review
Rating: 6

The paper introduces an approach to enhance GNNs by modeling long-range interactions through a learnable ARMA framework. The proposed GRAMA transforms static graph data into sequential data, leveraging the strengths of ARMA while maintaining permutation equivariance. GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation.

Strengths

  • GRAMA can see further in the graph data, capturing long-range interactions that traditional methods might miss.
  • GRAMA works well with various GNN backbones, enhancing their capabilities without losing their original strengths.
  • The paper is well-written and easy to understand.

Weaknesses

  • How does the integration of the ARMA framework with GNNs affect the model's ability to capture long-range dependencies compared to traditional GNNs?
  • What insights can the authors provide on the design choices behind the selective attention mechanism for learning ARMA coefficients?
  • How do the theoretical connections to SSMs influence the model's architectural decisions and performance?
  • How do the results of GRAMA compare to other state-of-the-art methods in terms of accuracy and computational efficiency?

Questions

See above.

Comment

We genuinely thank the Reviewer for their thoughtful comments and for highlighting key strengths of our work, such as GRAMA’s ability to effectively capture long-range interactions, its seamless integration with diverse GNN backbones, and the clarity of the paper’s presentation. We are also grateful for the insightful questions and suggestions, which have allowed us to better articulate our contributions and refine our work. Below, we provide detailed responses to each of your comments, and we hope these clarifications and improvements address your concerns and lead to a positive reevaluation.

Regarding Weakness 1 (GRAMA vs. Traditional GNNs): Thank you for the important comment. A similar question has been raised by Reviewer s9VW. We would like to split our response to this point into two parts:

  1. The transformation from a graph to a sequence of graphs: as discussed in Section 3, to initialize a sequential model like ARMA or SSM, it is required to have an initial sequence. The question of “how to transform a graph into a sequence” is a challenging open problem, and previous studies like [R1, R2] use heuristics to “sort” the graph into a sequence, which loses permutation-equivariance, while other approaches like [R3] retain permutation-equivariance by considering pairwise interactions, but this approach may not fully capture the long-range abilities offered by models like ARMA and SSM. Therefore, we proposed a novel approach that transforms a graph into a sequence of graphs.

  2. The difference from standard GNNs, and how it allows capturing long-range dependencies: this stems from the fact that here we utilize a neural, selective sequential mechanism by learning ARMA coefficients that are aware of the graph, the downstream task, and, importantly, all the hidden states of the node features. This allows GRAMA to operate in two separate domains: the spatial, graph domain (i.e., the connectivity between nodes in the graph), obtained by the use of a GNN backbone, and the learned temporal, sequence domain, which is obtained by our neural selective ARMA mechanism. Therefore, this process is not equivalent to standard MPNNs that update node features based on current node features; rather, the state update is selective and based on all states of the sequence of graphs. Combined with our theoretical derivations in Section 4, which lay the foundation for the design of our GRAMA, this allows capturing long-range dependencies. This is a major difference, and we added this discussion to the revised paper. Thank you for the insightful question.
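To make this state update concrete, below is a minimal, illustrative sketch of one GRAMA-style block in PyTorch-like code (our own simplified naming and shapes, not the released implementation): an ARMA-type recurrence over a length-L sequence of node-feature states, with a linear GNN backbone providing the spatial mixing. The residual update and the exact way the backbone enters the recurrence are assumptions for illustration only.

```python
import torch

def grama_block(states, residuals, gnn, a, b):
    """Illustrative ARMA-style recurrence over a sequence of graph states.

    states, residuals: lists of L tensors, each of shape [num_nodes, channels]
    gnn: a *linear* message-passing layer (callable), providing spatial mixing
    a, b: ARMA coefficients of shape [L]; in the selective variant they would
          be predicted per node from the hidden states (see the separate sketch
          further below)
    """
    L = len(states)
    for _ in range(L):  # R = L recurrences per block, reusing the same weights
        ar = sum(a[i] * gnn(states[-(i + 1)]) for i in range(L))   # AR term
        ma = sum(b[j] * residuals[-(j + 1)] for j in range(L))     # MA term
        new_state = ar + ma
        residuals.append(new_state - states[-1])   # schematic residual update
        states.append(new_state)
    return states[-L:], residuals[-L:]             # sequences passed onward
```

The point of the sketch is that every update mixes all L past states and residuals (the temporal domain) after they have been propagated over the graph (the spatial domain), rather than updating a node from its current features alone.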

Regarding Weakness 2 (Insights on the Design of Selective Mechanism in GRAMA): Thank you for the important question about a core aspect of our work. In Section 3.2, we provide a dedicated subsection on the learning of selective coefficients and discuss how it differs from using naively learned (non-selective) coefficients. Inspired by your question, we now discuss two key details:

(1) Results that show the performance difference between non-selective (denoted by GRAMA-Non-Selective) and our selective GRAMA (denoted by GRAMA): The results in the Table below show that selectivity is important for downstream performance, similar to findings in other non-graph SSM studies like [R4, R5]. These results were included in our original submission, now in Tables 7 and 9, where a naive model means that the ARMA coefficients are learned but not selective, as discussed in Section 3.2. For the completeness of the rebuttal, we provide Table 9 here for your reference:

| Model | Roman-empire (Acc ↑) | Peptides-func (AP ↑) |
| --- | --- | --- |
| GCN | 73.69 ± 0.74 | 59.30 ± 0.23 |
| GRAMA_GCN (non-selective ARMA coefficients) | 85.13 ± 0.58 | 68.98 ± 0.52 |
| GRAMA_GCN | 88.61 ± 0.43 | 70.93 ± 0.78 |

(2) Discussion on the specific implementation of selective coefficient learning: In this work, we opted for leveraging attention due to its tradeoff between performance and flexibility in parameterizing learned coefficients. This enabled our key advancement, that is, the integration of SSM/ARMA mechanisms with GNNs while preserving permutation equivariance and allowing long-range processing of graph sequences. In this sense, our choice of attention allowed us to build such mechanisms while fully utilizing the strength of sequential models like SSM and ARMA, without losing the desired graph learning properties. Additional parametrization details were considered in our current work, such as the normalization of the coefficients, guided by our theoretical analysis of GRAMA, as discussed in Section 3.2. In future work, we plan to explore further ideas along the lines of the developments that took place between S4 [R4] and Mamba [R5]. In the revised paper, we incorporated this discussion to provide readers with a better rationale for the design of our GRAMA. Thank you.
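For illustration only, a selective coefficient module along these lines could look as follows (our naming; the attention parameterization and, in particular, the theory-guided normalization used in GRAMA are described in Section 3.2 and may differ from this sketch):

```python
import torch
import torch.nn as nn

class SelectiveCoefficients(nn.Module):
    """Sketch of attention-based, node-wise ARMA coefficient learning."""

    def __init__(self, channels, L, heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.to_coeffs = nn.Linear(channels, 2 * L)   # L AR + L MA coefficients

    def forward(self, state_sequence):
        # state_sequence: [num_nodes, L, channels] -- all hidden states per node
        ctx, _ = self.attn(state_sequence, state_sequence, state_sequence)
        coeffs = self.to_coeffs(ctx.mean(dim=1))      # [num_nodes, 2L]
        a, b = coeffs.chunk(2, dim=-1)
        # keep sum_i |a_i| <= 1 per node: one possible stability-motivated
        # normalization (the paper's theory-guided choice may differ)
        a = a / a.abs().sum(dim=-1, keepdim=True).clamp(min=1.0)
        return a, b
```

Because the coefficients are produced from the hidden states themselves, the resulting ARMA system is input-dependent ("selective") while remaining permutation-equivariant, since the same module is applied to every node.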

评论

Regarding Weakness 3 (Connection between Theoretical Properties and Model Design): Thank you for the question. In our response to Weakness 2 we also touched on this aspect. Namely, the learning mechanism of the ARMA coefficients includes a normalization term that is informed by the theoretical understanding, in order to maintain stability and allow long-range propagation. This was discussed in Section 3.2, but we have now better highlighted the connection between this subsection and the theoretical properties shown in Section 4. Thank you.

Regarding Weakness 4 (GRAMA vs. State-of-the-art in terms of performance and computational efficiency): We appreciate your question. In our paper, we included extensive comparisons between GRAMA and recent state-of-the-art methods, both in Section 5 and the Appendix. For your convenience, we now provide a Table that summarizes the results on all datasets from the main paper and compares them with the best-performing method. As can be seen, in most cases our GRAMA offers better performance, and comparable performance in the others. We also added the table to our revised paper for added clarity. Thank you.

| Model / Dataset | Diameter (log₁₀(MSE) ↓) | SSSP (log₁₀(MSE) ↓) | Eccentricity (log₁₀(MSE) ↓) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) | MalNet-Tiny (Acc ↑) | Roman-empire (Acc ↑) | Amazon-ratings (Acc ↑) | Minesweeper (AUC ↑) | Tolokers (AUC ↑) | Questions (AUC ↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Best baseline out of all methods in Tables | -0.5981 ± 0.1145 | -3.599 ± 0.1949 | -0.0739 ± 0.2190 | 71.50 ± 0.44 | 0.2478 ± 0.0016 | 94.22 ± 0.24 | 91.37 ± 0.35 | 54.17 ± 0.37 | 97.31 ± 0.41 | 84.52 ± 0.21 | 80.02 ± 0.86 |
| GRAMA (best performing model out of 3 variants) | -0.8663 ± 0.0514 | -4.1289 ± 0.0988 | -1.3012 ± 0.1258 | 70.93 ± 0.78 | 0.2436 ± 0.0022 | 94.37 ± 0.36 | 91.82 ± 0.39 | 53.71 ± 0.57 | 98.33 ± 0.55 | 86.23 ± 1.10 | 80.47 ± 1.09 |

Secondly, we understand that the Reviewer would also like to see a comparison of runtime and downstream performance with recent state-of-the-art methods. To address your question, we added a Table below that compares the required runtime (training and inference) of our GRAMA and other models, as well as the obtained downstream performance. As can be seen, our GRAMA requires similar time to other selective models like GPS+Mamba and GMN, while requiring significantly less time than GPS. For reference, we also provide the training runtimes and accuracy of GCN as a baseline. We believe that this comparison further highlights the applicability of GRAMA; the results were added to the revised paper in the appendix. Thank you.

| Model | Training runtime (ms) | Accuracy (%) |
| --- | --- | --- |
| GatedGCN | 18.38 | 73.69 ± 0.74 |
| GPS | 1139.05 | 82.00 ± 0.61 |
| GPS + Mamba [R2] | 320.39 | 83.10 ± 0.28 |
| GMN [R2] | 387.04 | 87.69 ± 0.50 |
| GRAMA (Non-selective, Ours) | 249.68 | 85.13 ± 0.36 |
| GRAMA (Ours) | 362.41 | 88.61 ± 0.43 |

References:

[R1] Wang C., et al. (2024). Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789.

[R2] Behrouz A., and Hashemi A. (2024). Graph mamba: Towards learning on graphs with state space models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[R3] Huang Y., Miao S., Li P. (2024). What Can We Learn from State Space Models for Machine Learning on Graphs?. arXiv preprint arXiv:2406.05815.

[R4] Gu A., Goel K., Ré C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations (ICLR).

[R5] Dao T., Gu A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. International Conference on Machine Learning (ICML).

Comment

Thank you for the authors' response, which has partially alleviated my concerns; I keep my score.

Comment

Dear Reviewer WBsY,

Thank you again for your positive feedback on our work, and the response to our rebuttal.

We would like to highlight that we made significant efforts to address all your comments and questions, as similarly done for the other reviewers, including additional comparisons and experiments, and concrete revisions to the paper, that we believe further highlight the relevance, significance, and contribution of our work to the graph learning community.
Therefore, we are keen to clarify any aspects of our study that concern the Reviewer during the remainder of the rebuttal period and we welcome your continued feedback.

We appreciate your engagement and consideration, and the feedback provided so far, which helped us to improve the quality of our paper.

Best regards,

The Authors.

Review
Rating: 5

The article presents GRAMA, a novel framework designed to enhance Graph Neural Networks (GNNs) by addressing their limitations in handling long-range dependencies. Traditionally, GNNs struggle with issues like oversquashing, where distant node information is compressed, making it challenging to capture extensive interactions across the graph.

GRAMA is built on the Autoregressive Moving Average (ARMA) model framework, allowing GNNs to convert static graphs into sequences of graphs, effectively enabling autoregressive processing. GRAMA adapts SSM concepts to a graph setting while maintaining permutation invariance. GRAMA includes a selective attention mechanism that dynamically adjusts ARMA coefficients across nodes in the graph, which efficiently propagates long-range information across graph structures.

GRAMA is built on blocks combining the AR and MA frameworks. In these blocks, the graph representation is transformed into a graph sequence. A GNN backbone is incorporated into each GRAMA block, allowing further information exchange between nodes based on graph structure. The model stacks multiple GRAMA blocks to form a deep GRAMA network. GRAMA includes an attention mechanism, which adjusts the AR and MA coefficients.

The paper also presents some theoretical basis for the model's interpretation (equivalence between ARMA and SSM) and its stability conditions. The model is evaluated on 4 datasets, including synthetic and real-world graphs, where it consistently outperforms its backbone models.

Strengths

The paper presents an innovative approach for sequential learning on graphs, specifically adapting the ARMA/SSM framework to graph data while preserving permutation equivariance.

Experimental results demonstrate that GRAMA achieves strong performance, consistently outperforming its backbone model and competing effectively with state-of-the-art methods.

NB: These are briefly formulated but important strengths.

Weaknesses

The paper has three main weaknesses: (1) model presentation readability, (2) lack of references and attributions, and (3) limited discussion on potential model limitations.

  1. Model Presentation Readability: The model presentation is challenging to follow, and improvements in clarity are necessary for readers to fully grasp the structure and function of GRAMA.
  • For instance, in the caption of Figure 1, it is stated that the blocks are linear systems; however, they contain GNN layers, which are often nonlinear. Clarification is needed on what aspects are considered linear.
  • There is a lack of clarity regarding the non-linearity between blocks mentioned in the caption of Figure 1, as it does not appear in the model presentation.
  • In Equations (4), (5), and (6), the notation involving $L$ is ambiguous; it is unclear when $L$ represents the sequence length and when it is used as a superscript for a varying position (i.e., $l \in \{L, \dots, S \cdot L\}$).
  • The distinction between GRAMA blocks and all possible moving windows remains unclear (i.e., between $[\mathbf{f}^0, \mathbf{f}^{L-1}]$ and $[\mathbf{f}^a, \mathbf{f}^{L-1+a}]$, with $a<L$); a more ordered presentation structure could help delineate their unique aspects.
  2. Lack of References and Attributions
  • The theoretical properties discussed in Section 4 are primarily inherited from the ARMA/SSM framework rather than novel contributions of the paper. It is necessary to explicitly acknowledge that these inherited properties are not original contributions of the paper and to ensure all relevant references are included in Section 4.
  3. Lack of Discussion on Limitations: The paper does not adequately discuss potential limitations, which would be important to understand the broader applicability of the model.
  • Computational Cost: While computational cost is briefly mentioned in Appendix E, it should be addressed directly in the main text, as it appears to be a clear limitation of GRAMA. Since GRAMA may increase the parameter count and computational cost, an ablation study comparing models with equivalent computational budgets would be insightful.
  • The experiments evaluate GRAMA on long-range dependency modeling, the tasks for which it has been specifically designed. It would be interesting to assess how the model performs on other types of tasks. If not, the paper should at least explicitly state that the evaluation focuses only on long-range tasks.

I still think the model is interesting, and I am willing to improve (significantly) my evaluation if the authors address my concerns.

Questions

The questions relate to point (1) of 'Weaknesses'.

  • In what sense are the blocks linear systems, and in what sense are they not?

  • How is the non-linearity between blocks implemented?

  • More generally, what are the relations between blocks?

  • More fundamentally, beyond defining the initial and readout blocks $F^{(0)}$ and $F^{(S)}$, in what sense are blocks relevant entities?

  • In Equations (4), (5), and (6), does $L$ represent the sequence length? If not, could you please clarify? If yes, how is $\mathbf{f}^{L+1}$ computed?

Comment

We thank the Reviewer for their detailed feedback and for recognizing key aspects of our work, including the innovative adaptation of ARMA and SSM frameworks to graph data, the introduction of permutation-invariant sequential learning mechanisms, and the strong experimental performance of GRAMA across synthetic and real-world benchmarks. We also appreciate the constructive suggestions regarding the clarity of the model presentation, the need for explicit attributions, and discussions of limitations. These points have significantly helped us refine our paper. Below, we address each of the Reviewer’s comments in detail and provide clarifications and revisions that we hope address your concerns. We also hope that you will consider revising your score.

Regarding Weakness 1 (Presentation): Thank you for the detailed comments that guided us in improving the presentation of our paper. Below, we address each of your comments on that aspect of the paper, and we included them in our revised paper:

  1. Regarding GNN blocks: in our experiments, the GNNs are linear (i.e., no activation function is applied to their output). Instead, the activation function $\sigma$ is applied to the overall updated state and residual sequences, similar to works like Mamba [R1]. We revised our paper to include a new paragraph and Equation (7) that explains this process, now aligned with Figure 1. Thank you.

  2. Regarding Figure 1: our response to point (1), and in particular the addition of Equation (7) now aligns it with Figure 1. Thank you.

  3. Regarding notation of sequence length and index $L$: we appreciate your comment. Our goal was to show what the $(L+1)$-th state (indexed by $L$) should look like. This was represented by Equation (6) in the original submission. Following your guidance, we now revised Section 3.1 such that Equation (6) shows the general recurrence at the $\ell$-th recurrence, which yields the $(\ell + L)$-th state. We think that the description is now more general and clearer. Thank you.

  4. Regarding the connection between the moving window and GRAMA blocks: Thank you for the comment. We revised Section 3.1 to clarify the relation. The moving window has a size of $R$, which denotes the number of recurrences to be computed at each GRAMA block. In practice, we take $R=L$, which, as discussed in the paper, is a standard approach in sequential models. We thank you for the feedback that allowed us to improve our paper.

Comment

Regarding Weakness 2 (References and Attributions): We sincerely appreciate this comment. Please allow us to clarify that our full intention is to take inspiration and theoretical understanding and properties from frameworks like SSM and ARMA. Throughout our paper, starting from Section 1, elaborated in Section 2, and then formally discussed in Section 4, we acknowledge and discuss the attributes of ARMA and SSM and why they are desired in graph learning, particularly to address the oversquashing problem. Put concretely, we provide the following references and acknowledgments:

  1. A summary of classical works of SSM (Hamilton, 1994b; Aoki, 2013) and recent works on the topic of SSM (Gu et al., 2021c; Fu et al., 2023, Gu et al., 2023; Liu et al., 2024) as well as recent attempts to include SSM in graph learning (Wang et al., 2024, Behrouz & Hashemi, 2024; Huang et al., 2024), in Section 2.

  2. An overview of ARMA methods, including references to its origin (Whittle, 1951), its popularization (Box, 1970), and the equivalence between ARMA and SSM shown in the literature (Hamilton, 1994a), in Section 2.

  3. An overview of ARMA methods for graphs (Isufi et al., 2016; Bianchi et al., 2019).

  4. A discussion of why we are motivated to learn selective ARMA coefficients based on findings in Gu et al. (2023), in Section 3.

  5. References to literature that we base our proofs on, in Section 4 and Appendix B

Nonetheless, we fully agree with the Reviewer that Section 4 and Appendix B can be further revised to highlight the fact that the properties that stem from ARMA and SSM methods are inherited, and this was our full intention to begin with. In our revised paper, we implemented your important comments in Section 4 and Appendix B. Lastly, it is important for us to point out that the main novelty of this work is identifying issues in previous attempts to combine SSMs with graphs, as discussed in Sections 1, 2, and 3. These issues are the sacrifice of permutation-equivariance and the use of pairwise interactions instead of longer sequences, which may not fully utilize the properties inherited from ARMA and SSM. This identification motivates us to develop a novel framework, GRAMA, that allows learning selective ARMA coefficients coupled with GNNs and is grounded in the understanding and properties derived in the ARMA and SSM literature. We would like to express our gratitude again for the important comment, which helped us to give credit to the pioneers of the fields of SSM and ARMA and to distinguish the novelty of our work from others. Thank you.

Comment

Regarding Weakness 3 (Limitations): Thank you for the thorough reading of our paper and the actionable suggestions. We split our response according to the two parts of your comment:

(1) Regarding computational cost: indeed, as noted by the Reviewer, our submission includes thorough time and space analyses, as well as runtime measurements, in Appendix E due to space limitations. We agree with the Reviewer that this aspect of our work can be better presented in the main text, and we revised the paper accordingly, such that it better refers to Appendix E and also provides a discussion in the main text. Specifically, it is important for us to clarify that in our main paper, the number of parameters was similar between models to ensure fairness; for example, in the LRGB benchmarks, we followed the standard of maintaining a budget of 500k parameters. We highlighted this in the revised paper. Also, following your suggestion about benchmarking our GRAMA against other models under an equivalent computation budget, we would like to point out that our results in Table 7 in the paper include that; to be precise, we can compare the performance of GRAMA with GCN and GPS when changing the number of layers, thereby comparing the results under similar runtimes. For your convenience, we provide this Table below as well, and it can be seen that our GRAMA offers a substantial reduction in computational cost compared with methods of similar accuracy (GPS) in this scenario as well. It can also be seen that our non-selective GRAMA maintains strong predictive performance if one seeks to reduce the computational load. These results further highlight the applicability of GRAMA, whether in a limited computation time budget scenario or not. We added this discussion to Appendix E. Thank you.

| Metric | Method | Depth 4 | Depth 8 | Depth 16 | Depth 32 |
| --- | --- | --- | --- | --- | --- |
| Training (ms) | GCN | 18.38 | 33.09 | 61.86 | 120.93 |
| Inference (ms) | GCN | 9.30 | 14.64 | 27.95 | 53.55 |
| Accuracy (%) | GCN | 73.60 | 61.52 | 56.86 | 52.42 |
| Training (ms) | GPS | 1139.05 | 2286.96 | 4545.46 | OOM |
| Inference (ms) | GPS | 119.10 | 208.26 | 427.89 | OOM |
| Accuracy (%) | GPS | 81.97 | 81.53 | 81.88 | OOM |
| Training (ms) | GRAMA_GCN (Non-selective Coefficients) | 41.16 | 98.83 | 249.68 | 747.26 |
| Inference (ms) | GRAMA_GCN (Non-selective Coefficients) | 13.03 | 26.83 | 63.61 | 164.87 |
| Accuracy (%) | GRAMA_GCN (Non-selective Coefficients) | 83.23 | 84.72 | 85.13 | 85.04 |
| Training (ms) | GRAMA_GCN | 75.75 | 141.79 | 463.76 | 1378.91 |
| Inference (ms) | GRAMA_GCN | 40.33 | 70.91 | 240.78 | 702.17 |
| Accuracy (%) | GRAMA_GCN | 86.33 | 88.14 | 88.24 | 88.22 |
Comment

(2) Non-long-range datasets: Thank you for the query. We agree with the Reviewer that the experiments should be aligned with the goal of the paper, and this was our aim in the experiments. As discussed throughout the paper, starting from Section 1, we focus on the problem of oversquashing in graphs and improving the limited ability of MPNNs to perform long-range propagation. This is also discussed at the beginning of Section 5 (Experiments). Nonetheless, inspired by your comment, we agree that although our paper focuses on addressing oversquashing and long-range propagation, including more general datasets, which may not necessarily be related to the specific tasks we try to solve in this paper, can further show the applicability of GRAMA. To this end, we added results on ZINC-12k, OGBG-MOLHIV, Cora, CiteSeer, and PubMed. The results are reported in the revised paper in Appendix F, and are also shown in the Table below. We can see that, compared to the backbone GNNs, our GRAMA offers significantly better performance. Thank you for the suggestion that helped us to improve our paper.

| Method / Dataset | ZINC-12k (MAE ↓) | OGBG-MOLHIV (ROC-AUC ↑) | Cora (Acc. ↑) | CiteSeer (Acc. ↑) | PubMed (Acc. ↑) |
| --- | --- | --- | --- | --- | --- |
| GCN | 0.278 ± 0.003 | 76.06 ± 0.97 | 85.77 ± 1.27 | 73.68 ± 1.36 | 88.13 ± 0.50 |
| GatedGCN | 0.254 ± 0.005 | 76.72 ± 0.88 | 86.21 ± 1.28 | 74.10 ± 1.22 | 88.09 ± 0.44 |
| GPS | 0.125 ± 0.009 | 77.39 ± 1.14 | 85.42 ± 1.80 | 73.99 ± 1.57 | 88.23 ± 0.61 |
| GPS + RWSE | 0.070 ± 0.004 | 78.80 ± 1.01 | 86.67 ± 1.53 | 74.52 ± 1.49 | 88.94 ± 0.49 |
| GRAMA-GCN (Ours) | 0.142 ± 0.010 | 77.47 ± 1.05 | 88.02 ± 1.01 | 77.09 ± 1.53 | 90.20 ± 0.47 |
| GRAMA-GatedGCN (Ours) | 0.140 ± 0.008 | 77.60 ± 0.98 | 88.13 ± 0.99 | 77.63 ± 1.38 | 90.07 ± 0.45 |
| GRAMA-GPS (Ours) | 0.100 ± 0.006 | 78.19 ± 1.10 | 87.95 ± 1.72 | 77.13 ± 1.51 | 89.76 ± 0.64 |
| GRAMA-GPS + RWSE (Ours) | 0.061 ± 0.003 | 79.21 ± 0.94 | 88.37 ± 1.64 | 77.68 ± 1.55 | 90.31 ± 0.58 |
Comment

Regarding Questions: Thank you for your willingness to reconsider the score and for asking these important questions, which we address below.

Regarding Question (1): The recurrence shown in Equation (6) is linear, in the sense that the dependency on previous states and residuals is linear, and the GNN is linear as we do not apply a non-linearity to its output. After $R$ recurrences, shown in the added Equation (7), we obtain updated state and residual sequences, to which we apply a non-linearity. This means that a GRAMA block is linear, with the non-linearity applied between blocks. This design choice takes inspiration from architectures like Mamba [R1] and also allows us to theoretically understand the behavior of GRAMA, as discussed in Section 4. We clarified this important point in the revised paper. Thank you.

Regarding Question (2): Similar to our response to Question (1), the non-linearity is applied between GRAMA blocks, as can be seen from Equation (7) and Figure 1. As discussed in the revised Section 3, the nonlinearity is applied in an element-wise fashion, such that popular activation functions like ReLU and GeLU can be used. This is also shown in the added Equation (7).

Regarding Question (3): The block weights are independent of each other, as these are different ARMA systems that are learned. As discussed in Section 3, each block has $L$ recurrences (using the same weights) to produce a new sequence of states and residuals of length $L$, which is similar to the operation of other mechanisms like classical SSM or ARMA frameworks. We clarified this in the revised text. Thank you.

Regarding Question (4): Thank you for the question. We agree that showing the intermediate steps is beneficial and helps to better understand the method. We therefore revised Section 3.1 and added Equation (8), which shows the flow between GRAMA blocks.
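For intuition, this flow can be sketched as follows (our schematic notation, not Equation (8) verbatim; grama_block is a hypothetical helper standing in for the linear per-block recurrence of Equation (6)): each block runs its recurrences, and an element-wise non-linearity is applied to the resulting state and residual sequences before they enter the next block.

```python
import torch.nn.functional as F

def deep_grama(states, residuals, gnns, coeffs):
    """Stack of S GRAMA-style blocks; gnns and coeffs hold one independent set per block."""
    for gnn, (a, b) in zip(gnns, coeffs):
        # hypothetical helper: linear ARMA recurrence of a single block (Eq. (6))
        states, residuals = grama_block(states, residuals, gnn, a, b)
        # element-wise non-linearity applied *between* blocks (e.g. ReLU/GELU)
        states = [F.gelu(x) for x in states]
        residuals = [F.gelu(r) for r in residuals]
    return states   # a readout is then applied to the final state sequence
```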

Regarding Question (5): The Reviewer is correct that Equations (4)-(6) use $L$ as the sequence length. Because we start counting the entries from 0, the latest step of the input sequence is at index $L-1$. The definition of the state at step $L+1$ was given in Equation (6) in our original submission. Following your questions, we revised it to show the general recurrence of GRAMA at the $\ell$-th recurrence, which yields the state at step $L+\ell$ and thus also includes the case of $L+1$. We would like to express our gratitude for the important questions and guidance, which overall, in our opinion, helped us to improve the quality of the paper. Thank you.

References:

[R1] Dao T., Gu A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. International Conference on Machine Learning (ICML).

Comment

Dear Reviewer nRqm,

We thank you again for the detailed feedback provided in your review, and for the willingness to revise your score.

In our rebuttal and revised paper, we made significant efforts that we believe address your comments. In particular, we improved the presentation of the method, clarified the attributions to previous works, addressed the limitations of our model, and added experiments as per your request, which further highlight the significance of our GRAMA.

Overall, we believe that these responses address your comments, which helped us to improve the quality of our paper. We thus would like to express our gratitude for your guidance.


Before the discussion period ends, we would like to know if there are additional clarifications you would like us to make, or if you are satisfied with our responses and consider revising your score.

Best regards,

The Authors.

Comment

Regarding Results: We would like to kindly note that our GRAMA offers significant and consistent improvements over the baseline backbone GNNs (GCN, GatedGCN, and GPS). This is reflected in all the Tables and experiments in Section 5. To be concrete, we summarize the results obtained with the baseline backbones and with our GRAMA in the Table below.

| Model | Diameter (log₁₀(MSE) ↓) | SSSP (log₁₀(MSE) ↓) | Eccentricity (log₁₀(MSE) ↓) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) | MalNet-Tiny (Acc ↑) | Roman-empire (Acc ↑) | Amazon-ratings (Acc ↑) | Minesweeper (AUC ↑) | Tolokers (AUC ↑) | Questions (AUC ↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCN | 0.7424 ± 0.0466 | 0.9499 ± 9.18·10⁻⁵ | 0.8468 ± 0.0028 | 59.30 ± 0.23 | 0.3496 ± 0.0013 | 81.00 | 73.69 ± 0.74 | 48.70 ± 0.63 | 89.75 ± 0.52 | 83.64 ± 0.67 | 76.09 ± 1.27 |
| GatedGCN | 0.1348 ± 0.0397 | -3.261 ± 0.0514 | 0.6995 ± 0.0302 | 58.64 ± 0.77 | 0.3420 ± 0.0013 | 92.23 ± 0.65 | 74.46 ± 0.54 | 43.00 ± 0.32 | 87.54 ± 1.22 | 77.31 ± 1.14 | 76.61 ± 1.13 |
| GPS | -0.5121 ± 0.0426 | -3.599 ± 0.1949 | 0.6077 ± 0.0282 | 65.35 ± 0.41 | 0.2500 ± 0.0005 | 92.64 ± 0.78 | 82.00 ± 0.61 | 53.10 ± 0.42 | 90.63 ± 0.67 | 83.71 ± 0.48 | 71.73 ± 1.47 |
| GRAMA-GCN | 0.2577 ± 0.0368 | 0.0095 ± 0.0877 | 0.6193 ± 0.0441 | 70.93 ± 0.78 | 0.2439 ± 0.0017 | 93.43 ± 0.29 | 88.61 ± 0.43 | 53.48 ± 0.62 | 95.27 ± 0.71 | 86.23 ± 1.10 | 79.23 ± 1.16 |
| GRAMA-GatedGCN | -0.5485 ± 0.1489 | -4.1289 ± 0.0988 | 0.5523 ± 0.0511 | 70.49 ± 0.51 | 0.2459 ± 0.0020 | 93.66 ± 0.40 | 91.82 ± 0.39 | 53.71 ± 0.57 | 98.19 ± 0.58 | 85.42 ± 0.95 | 80.47 ± 1.09 |
| GRAMA-GPS | -0.8663 ± 0.0514 | -3.9349 ± 0.0699 | -1.3012 ± 0.1258 | 69.83 ± 0.83 | 0.2436 ± 0.0022 | 94.37 ± 0.36 | 91.73 ± 0.59 | 53.36 ± 0.38 | 98.33 ± 0.55 | 85.71 ± 0.98 | 79.11 ± 1.19 |

We therefore believe that the proposed GRAMA offers a significant improvement that stems from its architecture. Also, our results in Appendix E, which were also discussed in our responses to your review above, compare different runtime budgets in terms of model depth, and show that other models with larger runtime budgets are not better than our GRAMA. Thus, we think that these results reflect the significance of the GRAMA architecture.

Moreover, we note that compared with other state-of-the-art methods, like DRew, Co-GNN, and GMN, our GRAMA also offers similar or better performance. Overall, in 12 out of 14 experiments, our GRAMA offers better performance than the current state of the art. Additionally, considering the standard deviations of the methods, we see that GRAMA is statistically better, with higher mean downstream performance and minimal or no overlap in terms of standard deviations, on 7 out of the 11 datasets below. We believe that this highlights the significance of GRAMA. For your convenience, we provide a summary of the results in the Table below, which was also included in our revised paper (please see Appendix G):

| Model / Dataset | Diameter (log₁₀(MSE) ↓) | SSSP (log₁₀(MSE) ↓) | Eccentricity (log₁₀(MSE) ↓) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) | MalNet-Tiny (Acc ↑) | Roman-empire (Acc ↑) | Amazon-ratings (Acc ↑) | Minesweeper (AUC ↑) | Tolokers (AUC ↑) | Questions (AUC ↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Best baseline out of all methods in Tables | -0.5981 ± 0.1145 | -3.599 ± 0.1949 | -0.0739 ± 0.2190 | 71.50 ± 0.44 | 0.2478 ± 0.0016 | 94.22 ± 0.24 | 91.37 ± 0.35 | 54.17 ± 0.37 | 97.31 ± 0.41 | 84.52 ± 0.21 | 80.02 ± 0.86 |
| GRAMA (best performing model out of 3 variants) | -0.8663 ± 0.0514 | -4.1289 ± 0.0988 | -1.3012 ± 0.1258 | 70.93 ± 0.78 | 0.2436 ± 0.0022 | 94.37 ± 0.36 | 91.82 ± 0.39 | 53.71 ± 0.57 | 98.33 ± 0.55 | 86.23 ± 1.10 | 80.47 ± 1.09 |

We would like to thank you again for your valuable feedback that helped us to improve the paper and for your response to our rebuttal.
While the discussion period is still ongoing, we appreciate your feedback and continued engagement.
We hope that you find our responses satisfactory, and that you will reconsider your score.

With kind regards,

The Authors.

Comment

Thank you for your answer.

Your revised version partially addresses my comments.

Even with the improvements, I still think the presentation is hard to follow, and the motivation for the ARMA model remains unclear.

Regarding the experiments, since most leading models' results lie within the standard deviations of the others, it is not clear whether the improvement, if any, comes from the architecture or from the additional computation.

However, I noticed that this revised version is better than the previous one, so I revised my score accordingly.

Comment

Dear Reviewer nRqm,

Thank you for your responses and feedback, and for finding our revised paper to be improved and better, as well as for raising your score.

Below, we respond to your comments.


Regarding Presentation: Thank you for the response. We have made significant efforts to address all your comments including the presentation. This includes a substantial revision to the method section according to your comments, questions, and guidance. We think that these modifications improved the quality of our paper.
We are happy to accommodate any additional changes required by the reviewer for the final version of our paper, and kindly ask for your concrete feedback during this author-reviewer discussion period.

Regarding Motivation: In our revised paper, following discussions with you and with the other reviewers, we have highlighted and significantly expanded the discussion of the motivation for using GRAMA compared with other relevant architectures (e.g., please see our responses to Reviewer fRxB regarding the logic of the paper and its motivation). In particular, we included several revisions to Sections 1-4, as well as the following discussion prompted by the important comments from Reviewer s9vW:

GRAMA offers several advantages compared with other GSSMs [D1,D2], namely, its improved downstream performance, its permutation-equivariance, and the benefit from sequences beyond pairwise interactions. We believe that these properties are important, because they were not provided by any one method previous to ours. Furthermore, inspired by your question, we elaborate on why these properties ultimately yield a better method: (i) Permutation-equivariance was shown to be important for long-range dependency modeling in graphs [D3, D4]. (ii) Turning the graph into a sequence as in [D1] not only compromises permutation-equivariance, but it also compromises the utilization of the input graph because it may not capture all the relations encoded by the graph. For example, sorting nodes by some policy (e.g., their node degree) is not sufficient to uniquely describe an input graph, because there could be many different graphs with the same node degrees. Moreover, this procedure can potentially make relevant pairs of nodes further distant along the sequence, depending on the policy. On the other hand, our GRAMA utilizes the input graph by using a GNN backbone, as shown in Figure 1. (iii) Using pairwise interactions relies on using powers of the adjacency matrix, which, as shown in [D5], exponentially decreases the bound of feature sensitivity between nodes after K hops. On the other hand, as shown in our theoretical results in Section 4, our GRAMA does not suffer from the same issue. Particularly, it does not rely on powers of the adjacency matrix solely for transferring information between nodes, but also on the selective and learnable ARMA mechanism.

These are important differences, and we highlighted them in our revised paper. We hope that the Reviewer finds the added clarifications satisfactory, and we are keen to learn if there are additional comments from the Reviewer regarding this point.

Review
Rating: 8

This paper introduces GRAMA (Graph Adaptive Autoregressive Moving Average Models), a novel framework that enhances Graph Neural Networks (GNNs) by integrating adaptive neural Autoregressive Moving Average (ARMA) models. GRAMA addresses the limitations of existing Graph State Space Models (SSMs), which either compromise on permutation equivariance or focus only on pairwise interactions rather than sequences. The key innovation is the transformation of static graph data into sequential graph data, allowing GRAMA to leverage the strengths of the ARMA framework while preserving permutation equivariance. GRAMA also incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, leading to efficient and flexible long-range information propagation.

Strengths

  1. This paper introduces a new way of combining ARMA with GNNs, addressing limitations in handling long-range interactions and permutation equivariance in graph data.
  2. This paper provides a theoretical foundation that establishes connections between GRAMA and SSMs.
  3. The authors have tested the proposed GRAMA on a wide range of tasks and benchmarks, and the overall paper is experimentally solid.

Weaknesses

  1. While I acknowledge that the authors conducted extensive experiments, several highly relevant baselines or benchmarks are notably absent. In Section 5.2, the graph property prediction datasets employed are not mainstream. Why not utilize more widely recognized real-world datasets such as ZINC or OGBG-MOLHIV/MOLPCBA? Additionally, the chosen heterophily GNNs in Table 4 do not represent the current state-of-the-art models; please refer to [1,2].

  2. The baselines selected by the authors across various tables exhibit significant variation. For instance, in Tables 2 and 3, despite both being graph-level tasks, the baseline choices for the graph transformer differ. Furthermore, Table 1 includes only a single GPS, contrasting sharply with the diverse graph transformer baselines in other tables. Figure 2 does not specify which GNN is combined with GRAMA, and Table 1 appears to lack results for GRAMA+GatedGCN. Such inconsistencies may confuse readers; I believe the authors should standardize baseline selections or provide detailed explanations in the main text.

  3. (I am a non-expert in SSMs, and the following comment may be somewhat amateurish) My understanding is that the sequence of $L$ node features $\mathbf{F}^{(0)}$ is generated by MLPs, which somewhat artificially constructs a dynamic graph sequence where node features evolve over time. However, I fail to see how this leads to a performance gain; generating a length-$L$ node feature sequence seems equivalent to performing $L$ transformations on node embeddings in an MPNN. Even with the GRAMA block, it appears to be somewhat equivalent to an MPNN's ego-node update. I feel (though I may be incorrect; please correct me if so) that this approach is better suited for real-world dynamic graph sequence modeling.

Minor:

  1. I would appreciate it if the authors could clarify in Table 7 which attention implementation is combined with GPS. If vanilla attention is used, this comparison is somewhat unfair.

[1] The Heterophilic Snowflake Hypothesis: Training and Empowering GNNs for Heterophilic Graphs

[2] The Heterophilic Graph Learning Handbook: Benchmarks, Models, Theoretical Analysis, Applications and Challenges

Questions

see weaknesses

Comment

We sincerely thank the Reviewer for the thoughtful feedback and for acknowledging key aspects of our work, including the novel integration of ARMA with GNNs, the theoretical connections between GRAMA and SSMs, and the robustness of our experimental evaluation across diverse tasks and benchmarks. We also appreciate the actionable and constructive suggestions, that have helped us improve the paper. Below, we provide detailed responses to each of your comments. We hope that these clarifications and revisions are satisfactory, and that you will consider revising your score.

Regarding Weakness 1 (More evaluations): Thank you for the invaluable suggestions and questions which we address one-by-one, below.

  1. Regarding the graph property prediction tasks, we agree that these are specific datasets, and their purpose is to demonstrate whether a method suffers from oversquashing or not. To be precise, the tasks in these datasets require capturing long-term dependencies between nodes, as solving them requires computing shortest paths within the graph. Moreover, as described in [R1], similar to standard algorithmic approaches (e.g., Bellman-Ford, Dijkstra's algorithm), accurate solutions depend on the exchange of multiple messages between nodes, making local information insufficient for this task. In addition, the graphs used in these tasks are sampled from caveman, tree, line, star, caterpillar, and lobster distributions, all of which include bottlenecks by design, which are known to be a cause of oversquashing [R2]. In addition to these, we include results on other real-world datasets that are well-known in the graph learning community, such as the Long Range Graph Benchmark (LRGB) in Section 5.3, and heterophilic node classification datasets in Section 5.4.

  2. Regarding the comparisons with other recent methods outlined in [1,2], we now refer to [1,2] in our revised paper, and we also added comparisons with methods reported in these works. Thank you.

  3. Regarding evaluations on ZINC and MOLHIV, we welcome your suggestion, and we now added experimental results with GRAMA on the following datasets: ZINC-12k, OGBG-MOLHIV, Cora, CiteSeer, and PubMed. On ZINC-12k we follow the standard splits and protocols from [R3], on OGBG-MOLHIV we follow the protocols and splits from [R4], and on Cora, CiteSeer, and PubMed we used the splits and protocol from [R5]. The results are shown in the Table below, and were also added to the paper. In particular, we compare the performance of our GRAMA combined with three different backbones (GCN, GatedGCN, and GPS) with baseline models, showing that our GRAMA significantly improves the downstream performance of the baseline backbone models. Thank you for the suggestion.

| Method / Dataset | ZINC-12k (MAE ↓) | OGBG-MOLHIV (AUC ↑) | Cora (Accuracy ↑) | CiteSeer (Accuracy ↑) | PubMed (Accuracy ↑) |
| --- | --- | --- | --- | --- | --- |
| GCN | 0.278 ± 0.003 | 76.06 ± 0.97 | 85.77 ± 1.27 | 73.68 ± 1.36 | 88.13 ± 0.50 |
| GatedGCN | 0.254 ± 0.005 | 76.72 ± 0.88 | 86.21 ± 1.28 | 74.10 ± 1.22 | 88.09 ± 0.44 |
| GPS | 0.125 ± 0.009 | 77.39 ± 1.14 | 85.42 ± 1.80 | 73.99 ± 1.57 | 88.23 ± 0.61 |
| GPS + RWSE | 0.070 ± 0.004 | 78.80 ± 1.01 | 86.67 ± 1.53 | 74.52 ± 1.49 | 88.94 ± 0.49 |
| GRAMA-GCN (Ours) | 0.142 ± 0.010 | 77.47 ± 1.05 | 88.02 ± 1.01 | 77.09 ± 1.53 | 90.20 ± 0.47 |
| GRAMA-GatedGCN (Ours) | 0.140 ± 0.008 | 77.60 ± 0.98 | 88.13 ± 0.99 | 77.63 ± 1.38 | 90.07 ± 0.45 |
| GRAMA-GPS (Ours) | 0.100 ± 0.006 | 78.19 ± 1.10 | 87.95 ± 1.72 | 77.13 ± 1.51 | 89.76 ± 0.64 |
| GRAMA-GPS + RWSE (Ours) | 0.061 ± 0.003 | 79.21 ± 0.94 | 88.37 ± 1.64 | 77.68 ± 1.55 | 90.31 ± 0.58 |
Comment

Regarding Weakness 2 (Baselines): We appreciate your comment and actionable feedback. We agree with the Reviewer that it is important to explain the comparison choices. In our experiments, we have attempted to provide a wide comparison between our GRAMA and other methods like GPS. For example, it can be seen that in Table 12 we have 13 comparisons of different variants of GPS. In Figure 2 we used GRAMA coupled with GCN. In Table 1, we used the basic GPS because our GRAMA also uses it, where the motivation is to measure the impact of GRAMA rather than of other components that may improve GPS's performance. In Table 2 we report different graph transformers that use Laplacian positional encoding for a direct comparison between them, which only improves the performance of the basic architecture of GPS. In Table 3 we used the results from [R6], which also use Laplacian positional encoding, similar to Table 2. We also added GatedGCN and GRAMA-GatedGCN to Table 1. In our revised paper, we added all the explanations and clarifications discussed here. We think that this helps to better understand our results – thank you.

Regarding Weakness 3 (Performance gain with GRAMA): Thank you for the interesting discussion and suggestion. We would like to split our response to this point into three parts:

  1. The transformation from a graph to a sequence of graphs: as discussed in Section 3, to initialize a sequential model like ARMA or SSM, it is required to have an initial sequence. The question of “how to transform a graph into a sequence” is a challenging open problem, and previous studies like [R6, R7] use heuristics to “sort” the graph into a sequence, which loses permutation-equivariance, while other approaches like [R8] retain permutation-equivariance by considering pairwise interactions, but this approach may not fully capture the long-range abilities offered by models like ARMA and SSM. Therefore, we proposed a novel approach that transforms a graph into a sequence of graphs, which is our initialization process.

  2. The difference from standard MPNNs: stems from the fact that here we utilize a neural, selective sequential mechanism by learning ARMA coefficients that are aware of the graph and downstream task. This allows GRAMA to operate in two separate domains: the spatial, graph domain (i.e., the connectivity between nodes in the graph) obtained by the use of a GNN backbone, and the learned temporal, sequence domain, which is obtained by our neural selective ARMA mechanism. Therefore, this process is not equivalent to standard MPNNs that update node features based on current node features, but rather the state update is selective and based on all states of the sequence of graphs. This is a major difference between standard MPNNs and our GRAMA, and we added this discussion to the revised paper. Thank you for the insightful question.

  3. Regarding dynamic graph datasets: We agree with the Reviewer that applying our GRAMA to dynamic graphs is a natural research direction. In this work, as discussed in Section 1 and later elaborated in Sections 2 and 3, our goal was to first address the fundamental issues in existing graph SSMs (which, in the literature, have so far mostly focused on static graphs) – the utilization of long sequences offered by models like ARMA or SSM (i.e., processing information beyond pairwise interactions), and the maintenance of permutation-equivariance, which is often sacrificed in such methods. Following that, interesting future work will include studies on dynamic graphs. Nonetheless, inspired by your suggestion, we conducted experiments with our GRAMA on three spatio-temporal graph datasets and evaluation protocols from [R9], where the goal is to predict future values given past values. The results, shown in the Table below, which were also added to the revised paper, show that our GRAMA is a promising approach for processing such datasets. We thank you for the fruitful discussion.

| Model / Dataset | Chickenpox Hungary | PedalMe London | Wikipedia Math |
| --- | --- | --- | --- |
| A3T-GCN [R10] | 1.114 ± 0.008 | 1.469 ± 0.027 | 0.781 ± 0.011 |
| T-GCN [R11] | 1.117 ± 0.011 | 1.479 ± 0.012 | 0.764 ± 0.011 |
| GRAMA-GCN (Ours) | 0.790 ± 0.031 | 1.089 ± 0.049 | 0.608 ± 0.019 |
Comment

Regarding Minor: Thank you for the question. As further discussed in our response to W2, the architecture we compared with is indeed the baseline GPS, because our GRAMA also utilizes it, in order to directly measure the contribution of our GRAMA. In Table 12 in our original submission (now Table 13), we included 13 variants of GPS, where it is evident that our GRAMA offers better performance. Nonetheless, we agree with the Reviewer that Table 7 can also benefit from a comparison with the best-performing variant of GPS on this dataset, and we now added these results. It can be seen that in this case as well, our GRAMA obtains better performance. We revised our paper accordingly to ensure clarity. Thank you.

References:

[R1] Gravina A., Bacciu D., Gallicchio C. (2023). Anti-Symmetric DGN: a stable architecture for Deep Graph Networks. International Conference on Learning Representations (ICLR).

[R2] Topping J., Di Giovanni F., Chamberlain B.P., Dong X., Bronstein M.M. (2023). Understanding over-squashing and bottlenecks on graphs via curvature. International Conference on Learning Representations (ICLR).

[R3] Dwivedi V.P., Joshi C., Luu A.T., Laurent T., Bengio Y., Bresson X. (2023). Benchmarking Graph Neural Networks. Journal of Machine Learning Research (JMLR).

[R4] Hu et al. (2020). Open Graph Benchmark: Datasets for Machine Learning on Graphs. Advances in Neural Information Processing Systems (NeurIPS).

[R5] Pei H., et al. (2020). Geom-GCN: Geometric Graph Convolutional Networks. International Conference on Learning Representations (ICLR).

[R6] Wang C., et al. (2024). Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789.

[R7] Behrouz A., and Hashemi A. (2024). Graph mamba: Towards learning on graphs with state space models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[R8] Huang Y., Miao S., Li P. (2024). What Can We Learn from State Space Models for Machine Learning on Graphs?. arXiv preprint arXiv:2406.05815.

[R9] Rozemberczki B., et al. (2021). PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models. Conference on Information and Knowledge Management (CIKM).

[R10] Zhu J., et al. (2020). A3T-GCN: Attention Temporal Graph Convolutional Network for Traffic Forecasting. ISPRS Int. J. Geo-Inf. 10(7), 485.

[R11] Zhao L., et al. (2019). T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, 21(9):3848–3858.

Comment

Thank you for the detailed response. As a non-expert in SSM, my background may not be sufficient to justify raising my score to an 8. However, personally, I find this work to be very solid, with extensive experiments addressing a meaningful problem.

Comment

Dear Reviewer s9vW,

We thank you for the positive feedback on our work, stating that it is a solid paper with extensive experiments that addresses a meaningful problem. We also would like to thank you for the swift and prompt response to our rebuttal.

In addition, we would like to highlight that we made significant efforts to address all your comments and questions, as we similarly did for the other reviewers. These efforts, in response to your comments, resulted in a better paper that also includes additional experiments on multiple datasets to show the relevance and significance of our work.

We are more than happy to clarify any aspects of our study that may be unfamiliar, recognizing that the Reviewer may not have a background in SSM, during this rebuttal time.

Conversely, if there are no further questions, we kindly remind the Reviewer that if they feel our work merits a score of 8 but are uncertain due to differences in expertise, they can assign their desired score while adjusting the confidence level accordingly.

We appreciate your engagement and consideration, and the feedback provided so far.

Best regards,

The Authors.

Comment

Thank you for your patience and detailed response. To gain a more thorough understanding of this work, I have revisited the updated manuscript and the discussions between the authors and other reviewers. At this point, my only remaining concern is as follows:

I still feel that the graph sequence initialization in GRAMA appears somewhat unnatural. While I understand, as the authors have noted, that GRAMA offers a unique approach to “transforming a graph into a sequence,” this approach seems, in my view, quite similar to the existing discrete-time dynamic graph neural networks (DyGNN) [1] or spatial-temporal graph neural networks (ST-GNN) [2]. Specifically, these methods typically involve the coupling of a temporal module (which could be RNN, LSTM, GRU, Attention, or ARMA in the context of GRAMA) with a spatial module (such as GNN and Attention, GNN in the GRAMA context) in various configurations (e.g., factorized or coupled, interleaved in the case of GRAMA). These modules then output node embeddings that integrate both spatial and temporal information.

The only distinction, as I see it, is that the graph sequences in these works are naturally occurring, whereas the sequences in GRAMA are generated in a somewhat artificial manner (by MLP). Consequently, I find GRAMA's positioning somewhat perplexing. Its methodology appears to highly align with DyGNN or ST-GNN, yet it is being applied to address the issue of over-squashing.

To make our communication more efficient, I have distilled my concerns into the following points:

  1. Could you briefly summarize the key advantages of GRAMA’s graph serialization approach compared to prior GSSMs, beyond the experimental evidence? Specifically, how does it better address the issue of long-range dependencies? While I understand the authors’ points about the shortcomings of prior GSSMs—such as their loss of permutation-equivariance or reliance on pairwise interactions (Line 115)—it is unclear how these limitations concretely hinder their ability to model long-range dependencies.

  2. The overall framework of GRAMA appears to align closely with the style of DyGNNs or ST-GNNs. As such, I feel that GRAMA might be better positioned as a new type of DyGNN specifically tailored to address dynamic graph challenges (I sincerely appreciate the additional experiments on dynamic graphs provided in your response). However, this would require a substantial restructuring of the current manuscript's narrative and organization.

  3. Can ARMA in GRAMA be replaced with other temporal modules, such as RNNs or LSTMs? If so, it might suggest that GRAMA is simply swapping the temporal component of current ST-GNNs for the less commonly used ARMA. If not, could you clarify why ARMA is indispensable? Is this due to specific design choices in GRAMA that are particularly advantageous for modeling long-range dependencies?

I deeply appreciate the authors' extensive efforts and thorough experimentation; however, making these questions clear would greatly help me in further evaluating the manuscript and adjusting the score. Thank you!


[1] A Comprehensive Survey of Dynamic Graph Neural Networks: Models, Frameworks, Benchmarks, Experiments and Challenges

[2] Spatio-Temporal Graph Neural Networks for Predictive Learning in Urban Computing: A Survey

Comment

We thank the Reviewer for the continued discussion and engagement with us. Below, we provide our responses.

Regarding Question 1: Thank you for the question. As acknowledged by the Reviewer, our GRAMA offers several advantages compared with other GSSMs [D1,D2], namely, its improved downstream performance, its permutation-equivariance, and the benefit from sequences beyond pairwise interactions. We believe that these properties are important, because they were not provided by any one method previous to ours. Furthermore, inspired by your question, we elaborate on why these properties ultimately yield a better method: (i) Permutation-equivariance was shown to be important for long-range dependency modeling in graphs [D3, D4]. (ii) Turning the graph into a sequence as in [D1] not only compromises permutation-equivariance, but it also compromises the utilization of the input graph because it may not capture all the relations encoded by the graph. For example, sorting nodes by some policy (e.g., their node degree) is not sufficient to uniquely describe an input graph, because there could be many different graphs with the same node degrees. Moreover, this procedure can potentially make relevant pairs of nodes further distant along the sequence, depending on the policy. On the other hand, our GRAMA utilizes the input graph by using a GNN backbone, as shown in Figure 1. (iii) Using pairwise interactions relies on using powers of the adjacency matrix, which, as shown in [D5], exponentially decreases the bound of feature sensitivity between nodes after K hops. On the other hand, as shown in our theoretical results in Section 4, our GRAMA does not suffer from the same issue. Particularly, it does not rely on powers of the adjacency matrix solely for transferring information between nodes, but also on the selective and learnable ARMA mechanism. These are important differences, and we highlighted them in our revised paper. Thank you.
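For reference, the sensitivity bound from [D5] invoked in point (iii) has, schematically, the following form (with constants collapsed into a single $c$ that absorbs width, Lipschitz, and weight-norm factors; please see [D5] for the precise statement):

$$\left\| \frac{\partial \mathbf{h}_u^{(K)}}{\partial \mathbf{x}_v} \right\| \;\le\; c^{K}\,\big(\hat{A}^{K}\big)_{uv},$$

where $\hat{A}$ is a normalized message-passing operator; the entry $(\hat{A}^{K})_{uv}$ decays rapidly across bottlenecks, which is the exponential decrease in feature sensitivity referred to above.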

Regarding Question 2: As discussed in our previous responses, the goal of this paper is to focus on fundamental issues that exist in current GSSMs: (i) permutation-equivariant processing, and (ii) the utilization of sequences beyond pairwise interactions, via a principled design that is supported by theoretical foundations and empirical validation. The focus and motivation of GSSMs in the literature, as well as ours, is to harness the strength and long-range capabilities offered by recent sequential and selective mechanisms to improve the long-range modeling abilities of GNNs and to address the oversquashing problem on static graphs. As such, our paper is written according to these goals and research directions. We definitely believe that future works can study the development of such architectures for temporal data. We added this discussion to our revised paper. Thank you.

Regarding Question 3: Thank you for the question. Our motivation for this work stems from the recent findings in [D6, D7] that show the strength of sequential and selective mechanisms over other sequential architectures, such as those suggested by the Reviewer, in allowing long-range propagation. Thus, the utilization of our selective ARMA mechanism is crucial for our work, for two main reasons: (i) it allows us to learn a selective mechanism, which is not the case in other architectures such as the standard RNNs mentioned by the Reviewer; and (ii) there is an equivalence between SSM and ARMA methods, which allows us to bridge these fields and provide a solid theoretical foundation for our model, in addition to the strong performance reflected in our experiments. We thank the Reviewer for the important discussion, and we have added it to our revised paper.
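To illustrate point (ii), here is a standard textbook state-space realization of an ARMA model; we give it as a generic example for exposition, and it is not the specific construction used inside GRAMA. For an ARMA(2,1) recurrence $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \varepsilon_t + b_1 \varepsilon_{t-1}$, one equivalent state-space form is

$$z_t = \begin{pmatrix} a_1 & 1 \\ a_2 & 0 \end{pmatrix} z_{t-1} + \begin{pmatrix} 1 \\ b_1 \end{pmatrix} \varepsilon_t, \qquad x_t = \begin{pmatrix} 1 & 0 \end{pmatrix} z_t,$$

i.e., the ARMA coefficients populate the state-transition and input matrices of an SSM. This kind of equivalence is what allows GRAMA to inherit the stability and long-range analysis tools developed for SSMs [D6, D7].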


We would like to thank the Reviewer again for the thoughtful comments and intriguing questions. We have now uploaded a revised version of our paper that includes all the discussions made above. Overall, we think that these discussions and suggestions made by the Reviewer helped us to improve the quality of our paper. We hope that you find our responses satisfactory, and that you will consider revising your score.

Comment

References:

[D1] Behrouz A., and Hashemi A. (2024). Graph mamba: Towards learning on graphs with state space models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[D2] Huang Y., Miao S., Li P. (2024). What Can We Learn from State Space Models for Machine Learning on Graphs? arXiv preprint arXiv:2406.05815.

[D3] Pan H., Kondor R. (2022). Permutation Equivariant Layers for Higher Order Interactions. International Conference on Artificial Intelligence and Statistics (AISTATS).

[D4] Schatzki, L., Larocca, M., Nguyen, Q. T., Sauvage, F., & Cerezo, M. (2024). Theoretical guarantees for permutation-equivariant quantum neural networks. npj Quantum Information, 10(1), 12.

[D5] Di Giovanni, F., Giusti, L., Barbero, F., Luise, G., Lio, P., & Bronstein, M. M. (2023). On over-squashing in message passing neural networks: The impact of width, depth, and topology. In International Conference on Machine Learning (ICML)

[D6] Gu A., Goel K., Ré C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations (ICLR).

[D7] Dao T., Gu A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. International Conference on Machine Learning (ICML).

Comment

I would like to thank the authors for their continuous efforts and responses. I have adjusted my score accordingly.

Comment

Dear Reviewer s9vW,

We would like to thank you for the continued and rigorous feedback, which we believe helped us to improve the quality of our work. We are also grateful for your acknowledgment of our work and for raising your score.

Best regards,

The Authors.

Review
5

This paper presents a design that applies an SSM-style model, via ARMA, to the graph domain, and claims that it can solve the long-range interaction problem in graphs. The authors design a simple deep ARMA framework with a GNN backbone inside, storing many state representations for higher-order long-range information. In the experiments, the authors show a certain improvement over baselines.

Strengths

  1. I agree that using more states, as is done in SSMs, can preserve more higher-order information in long-range problem settings, which can be helpful for certain long-range multi-polynomial regression problems. Extending SSMs to the graph domain can be a good attempt.
  2. Overall the design is easy to follow.

Weaknesses

  1. The paper reads like a naive design of a new model without clear motivation. It appears to simply extend SSM/ARMA to the graph domain for the sake of a new model or a new paper, without clear motivation from the graph-problem side. While the long-range problem is an issue in the graph domain, adapting an existing model to this domain without domain-specific designs and motivations is not helpful for the domain. This is just like a "combine A and B" paper.

  2. While the presented framework "claims" to support any GNN backbone, it does not give a particular explanation of what kinds of functions or graph information can be obtained by adding this framework. Clearly, there should be a discussion of the GNN backbone vs. the GNN backbone plus the proposed framework, in terms of computational cost, memory requirements, function approximation ability, long-range regression error analysis, and so on. Without these, the paper does not provide good content or guidance for the development of this domain.

  3. Looking at the results, I do not see a dramatic improvement over simple baselines, although the proposed framework clearly uses far more representations, memory, and computation.

Questions

I suggest the authors focus on the problem they want to solve and clearly demonstrate why the proposed method solves that graph problem, with solid theoretical support.

Comment

We thank the Reviewer for their time and for recognizing certain aspects of our work, including the extension of structured state-space models (SSMs) and ARMA frameworks to the graph domain. We appreciate the detailed feedback and constructive criticism, which we address point-by-point in our responses below. We hope our clarifications and revisions will address your concerns and prompt a reconsideration of your evaluation and score.

Regarding Weakness 1 (Motivation for GRAMA): Thank you for the important comment. Let us kindly note that in our paper, we discuss the motivation for the construction and design of our GRAMA multiple times, especially throughout Sections 1–3. The main discussed motivations are as follows:

  1. Many existing GNNs suffer from the oversquashing problem (Lines 34-35; Section 1)

  2. Several proposals have been made to address this problem, from different perspectives: rewiring, weight regularization, multi-hop methods, and graph transformers (Lines 36-39; Section 1)

  3. Graph transformers suffer from computational and practical limitations, and it was suggested in the last year to utilize Graph State Space Models (GSSMs) (Lines 40-49; Section 1).

  4. The proposed GSSMs, however, suffer from either the loss of permutation-equivariance or the use of only pairwise interactions, while SSMs are designed for longer sequences (Lines 44-57; Section 1).

  5. We propose a novel method called GRAMA that addresses these limitations, maintaining permutation-equivariance and operating on longer sequences, grounded in a theoretical understanding of the model and an extensive set of experiments (Lines 57-76; Section 1).

  6. Extended discussions of points 1-5 are provided in Section 2.

  7. An extended technical discussion that leads to our GRAMA is provided in Lines 147-199; Section 3, including Figure 1 that illustrates the approach.

We understand that the Reviewer feels that the motivation can be further stressed in the paper, and to accommodate your comment, we made revisions to Sections 1-3 to better highlight the motivation and mathematically grounded concepts taken into consideration when designing our GRAMA, as well as its theoretical and practical significance. Also, we would like to kindly note that combining ideas across disciplines often drives breakthroughs—for instance, the development of fiber optics [R1] merged principles of physics and materials science, while convolutional neural networks (CNNs) [R2] combined insights from neuroscience and computational mathematics to revolutionize machine learning. Similarly, our work integrates graph learning and sequence processing approaches in a principled and theoretically sound manner, which we believe is a strength because it: (i) bridges between different fields to form a principled architecture; (ii) addresses foundational issues like oversquashing in existing GNNs; and (iii) demonstrates practical merit through comparable or better downstream performance against recent state-of-the-art methods. We have revised the paper to better emphasize these three key strengths. Thank you.

Comment

Regarding Weakness 2 (GNN backbones): We appreciate your comment that helped us to improve the paper. Before we proceed with describing the changes made to our paper, let us note that because of the flexible construction of our GRAMA (please see Equation (6) and Figure 1), in practice, we expect that any GNN can be coupled with our GRAMA. To validate this claim, we experimented with three different and diverse backbones: GCN, GatedGCN, and GPS, in Section 5 and throughout the Appendix. Nonetheless, we agree with the Reviewer that this part of the paper can be improved, and we worked to address that during the rebuttal period. In particular, we added the following discussions to address your suggested aspects:

  1. Combining GRAMA with GNN backbones. We revised the text in Section 3 to explain why any GNN can be combined with our framework. Because our ARMA block is built on a sequence of graphs, it performs sequential processing of each node's sequence of states, interleaved with GNN layers at each step. Thus, any GNN architecture that can process a graph can be incorporated into our GRAMA framework, whether it is an MPNN or a Graph Transformer, as shown in our experiments (a minimal illustrative sketch is provided after this list).

  2. Computational cost of a GNN backbone vs. a GNN backbone + GRAMA. In our original submission, we included asymptotic space and time complexity analyses, followed by runtime measurements (both inference and training) with a GCN backbone at depths of 4, 8, 16, and 32 layers. We also compared the runtimes and performance obtained by using GPS. After reading the paper again, we agree that a clearer link to Appendix E can be made. In our revised paper, we now refer to Appendix E from the main text, specifically from Section 3 (method) and Section 5 (experiments). We believe that this revision improves readability and helps the reader assess this aspect of our GRAMA. Thank you.

  3. Function approximation and long-range ability: We thank you for raising this important point and fully agree that this is a crucial aspect of our work. In terms of function approximation ability, our GRAMA combines a GNN backbone with our neural selective ARMA layers. In theory, it can also learn to reduce to the standard residual connections widely used in GNNs [R3, R4]; therefore, the overall GRAMA framework retains the approximation power of its GNN backbone. In our original submission, we included a detailed theoretical analysis of GRAMA. In particular, Theorem 1 shows the equivalence between ARMA and SSM models, which allows us to build on the strong foundations and insights gained in recent years by SSM methods such as [R5, R6]. Lemma 4.2 and Theorem 4.3 then discuss the stability of GRAMA, which is important for training stability, and Theorem 4.4 discusses the ability of GRAMA to perform long-range interactions, which is significant in addressing the oversquashing problem in GNNs. In our revised paper, we included the discussion above and improved the presentation of the significance of our theoretical analyses. Thank you.
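As referenced in point 1 above, below is a minimal, illustrative sketch of how a GRAMA-style block can wrap an arbitrary GNN backbone. This is not our actual implementation: the class and argument names are hypothetical, the backbone is assumed to expose a `forward(x, edge_index)` interface, and the coefficient predictor is a simple pooling-based stand-in for the selective attention mechanism described in Section 3.

```python
import torch
import torch.nn as nn


class GRAMABlockSketch(nn.Module):
    """Illustrative GRAMA-style block (hypothetical names; not the authors' code).

    Maintains a length-L sequence of node-feature states and, at each of S recurrence
    steps: (i) predicts AR/MA coefficients from the current states with a small selective
    module, (ii) mixes past states and residuals with those coefficients (sequence domain),
    and (iii) applies a user-supplied GNN backbone (graph domain).
    """

    def __init__(self, gnn_backbone: nn.Module, hidden_dim: int, seq_len: int, num_steps: int):
        super().__init__()
        self.gnn = gnn_backbone          # any GNN with forward(x, edge_index) -> updated x
        self.seq_len = seq_len           # L: number of stored states/residuals
        self.num_steps = num_steps       # S: recurrence steps per block
        # Stand-in for the selective mechanism: pooled states -> AR and MA coefficients.
        self.coef_net = nn.Sequential(
            nn.Linear(hidden_dim * seq_len, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * seq_len),
        )

    def forward(self, states, residuals, edge_index):
        # states, residuals: lists of length seq_len, each a [num_nodes, hidden_dim] tensor.
        for _ in range(self.num_steps):
            # Permutation-invariant summary of all states -> data-dependent coefficients.
            pooled = torch.cat([s.mean(dim=0) for s in states], dim=-1)
            coefs = self.coef_net(pooled)
            ar, ma = coefs[: self.seq_len], coefs[self.seq_len:]
            # AR part over past states + MA part over past residuals (sequence domain).
            x = sum(a * s for a, s in zip(ar, states))
            x = x + sum(m * r for m, r in zip(ma, residuals))
            # Graph domain: one backbone application per recurrence step.
            x = self.gnn(x, edge_index)
            # Slide the window: drop the oldest entries, append the newest.
            residuals = residuals[1:] + [x - states[-1]]
            states = states[1:] + [x]
        return states, residuals
```

Because the coefficients are computed from a permutation-invariant pooling of the states and the backbone itself is permutation-equivariant, the block as a whole remains permutation-equivariant, matching the property emphasized in the paper.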

Comment

Regarding Weakness 3 (Experimental results): Thank you for the comment. We would like to kindly note that our GRAMA offers significant and consistent improvements over the baseline backbone GNNs (GCN, GatedGCN, and GPS). This is reflected in all the tables and experiments in Section 5. To be concrete, we summarize the results obtained with the baseline backbones and with our GRAMA in the table below.

| Model | Diameter ($\log_{10}(\mathrm{MSE})$ ↓) | SSSP ($\log_{10}(\mathrm{MSE})$ ↓) | Eccentricity ($\log_{10}(\mathrm{MSE})$ ↓) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) | MalNet-Tiny (Acc ↑) | Roman-empire (Acc ↑) | Amazon-ratings (Acc ↑) | Minesweeper (AUC ↑) | Tolokers (AUC ↑) | Questions (AUC ↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | 0.7424 ± 0.0466 | 0.9499 ± 9.18·10⁻⁵ | 0.8468 ± 0.0028 | 59.30 ± 0.23 | 0.3496 ± 0.0013 | 81.00 | 73.69 ± 0.74 | 48.70 ± 0.63 | 89.75 ± 0.52 | 83.64 ± 0.67 | 76.09 ± 1.27 |
| GatedGCN | 0.1348 ± 0.0397 | -3.261 ± 0.0514 | 0.6995 ± 0.0302 | 58.64 ± 0.77 | 0.3420 ± 0.0013 | 92.23 ± 0.65 | 74.46 ± 0.54 | 43.00 ± 0.32 | 87.54 ± 1.22 | 77.31 ± 1.14 | 76.61 ± 1.13 |
| GPS | -0.5121 ± 0.0426 | -3.599 ± 0.1949 | 0.6077 ± 0.0282 | 65.35 ± 0.41 | 0.2500 ± 0.0005 | 92.64 ± 0.78 | 82.00 ± 0.61 | 53.10 ± 0.42 | 90.63 ± 0.67 | 83.71 ± 0.48 | 71.73 ± 1.47 |
| GRAMA-GCN | 0.2577 ± 0.0368 | 0.0095 ± 0.0877 | 0.6193 ± 0.0441 | 70.93 ± 0.78 | 0.2439 ± 0.0017 | 93.43 ± 0.29 | 88.61 ± 0.43 | 53.48 ± 0.62 | 95.27 ± 0.71 | 86.23 ± 1.10 | 79.23 ± 1.16 |
| GRAMA-GatedGCN | -0.5485 ± 0.1489 | -4.1289 ± 0.0988 | 0.5523 ± 0.0511 | 70.49 ± 0.51 | 0.2459 ± 0.0020 | 93.66 ± 0.40 | 91.82 ± 0.39 | 53.71 ± 0.57 | 98.19 ± 0.58 | 85.42 ± 0.95 | 80.47 ± 1.09 |
| GRAMA-GPS | -0.8663 ± 0.0514 | -3.9349 ± 0.0699 | -1.3012 ± 0.1258 | 69.83 ± 0.83 | 0.2436 ± 0.0022 | 94.37 ± 0.36 | 91.73 ± 0.59 | 53.36 ± 0.38 | 98.33 ± 0.55 | 85.71 ± 0.98 | 79.11 ± 1.19 |

Moreover, we note that compared with other state-of-the-art methods, such as DRew, Co-GNN, and GMN, our GRAMA also offers similar or better performance. Overall, on 12 out of 14 experiments, our GRAMA outperforms the current state of the art. For your convenience, we provide several key examples in the table below:

| Model / Dataset | Diameter ($\log_{10}(\mathrm{MSE})$ ↓) | SSSP ($\log_{10}(\mathrm{MSE})$ ↓) | Eccentricity ($\log_{10}(\mathrm{MSE})$ ↓) | Peptides-func (AP ↑) | Peptides-struct (MAE ↓) | MalNet-Tiny (Acc ↑) | Roman-empire (Acc ↑) | Amazon-ratings (Acc ↑) | Minesweeper (AUC ↑) | Tolokers (AUC ↑) | Questions (AUC ↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Best baseline out of all methods in the tables | -0.5981 ± 0.1145 | -3.599 ± 0.1949 | -0.0739 ± 0.2190 | 71.50 ± 0.44 | 0.2478 ± 0.0016 | 94.22 ± 0.24 | 91.37 ± 0.35 | 54.17 ± 0.37 | 97.31 ± 0.41 | 84.52 ± 0.21 | 80.02 ± 0.86 |
| GRAMA (best-performing model out of 3 variants) | -0.8663 ± 0.0514 | -4.1289 ± 0.0988 | -1.3012 ± 0.1258 | 70.93 ± 0.78 | 0.2436 ± 0.0022 | 94.37 ± 0.36 | 91.82 ± 0.39 | 53.71 ± 0.57 | 98.33 ± 0.55 | 86.23 ± 1.10 | 80.47 ± 1.09 |

We therefore believe that the proposed GRAMA offers a significant contribution, not only in providing a principled and mathematically grounded model that is equivalent to a Graph SSM, preserves permutation-equivariance, and allows long-range propagation, but also in terms of downstream performance on real-life applications and benchmarks. We added these summary tables, together with the discussion provided here, to Appendix G of our revised paper. Furthermore, in Appendix E we have expanded the discussion on the efficiency of GRAMA compared with other leading methods such as GPS and GMN. We believe that this helps to better highlight the contributions of our paper. Thank you.

Comment

Regarding the Question: Thank you for the suggestion. Our responses to the weaknesses above discuss and outline the points raised by the Reviewer: we address the oversquashing issue in GNNs through the perspective of architectures that stem from dynamical systems, such as ARMA and SSM models. It is of key importance to note that our method is deliberately designed so that it benefits from the theoretical and practical advantages of models like ARMA and, in contrast to other approaches in this field, also retains permutation-equivariance. These advantages of our GRAMA are supported by the theoretical analyses in Section 4 and are verified on multiple experiments, benchmarks, and backbones in Section 5. Your guidance helped us to improve the paper, and we hope that you will reconsider your score accordingly. Thank you.

References:

[R1]: Kao, C. K., & Hockham, G. A. (1966). Dielectric-fiber surface waveguides for optical frequencies.

[R2]: LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

[R3] Chen M., Wei Z., Huang Z., Ding B., Li Y. (2020). Simple and Deep Graph Convolutional Networks. International Conference on Machine Learning (ICML).

[R4] Di Giovanni F., Rowbottom J., Chamberlain B.P., Markovich T., Bronstein M.M. (2023). Understanding convolution on graphs via energies. Transactions on Machine Learning Research (TMLR).

[R5] Gu A., Goel K., Ré C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations (ICLR).

[R6] Dao T., Gu A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. International Conference on Machine Learning (ICML).

Comment

Dear Reviewer fRxB,

We thank you for your provided review. In our responses, we have made significant efforts to address each of your comments. We believe that our extensive responses should clarify your concerns. In particular, we have:

  1. Clarified the motivation and reasoning for our approach, including revisions to Sections 1, 2, and 3 to better highlight the motivation.

  2. Revised the text to address your comments on the GNN backbones used in our paper and on the generality of our GRAMA, as well as extended the discussion on the computational costs and the abilities of GRAMA, theoretically and practically.

  3. Provided a focused summary of the experimental results of GRAMA, to better illustrate its significance. We also added the discussions and results to the paper in Appendix G.

  4. Addressed your suggestion and question in our responses, covering the motivations, limitations, and results of GRAMA.

We kindly ask you to share your feedback after reading our responses and revised paper, and we are happy to clarify any comment or concern you may have. We have done significant work to address your review, and we think that your suggestions and guidance helped us to improve our paper; thank you.
We hope that you find our responses satisfactory and that you will consider revising your score.

With kind regards,

The Authors.

Comment

First, I thank the authors for the detailed response and the revision of the original paper. Nevertheless, the rebuttal does not address the main problem of the designed method, while providing many "motivations" and "explanations".

The authors claim that the proposed method solves (1) the long-range interaction problem and (2) the over-squashing problem. Intuitively, the method does solve a certain over-squashing problem, though I personally do not think over-squashing is a valid problem, as the method simply provides more hidden states for each node to store more historical information, at the cost of significantly increased memory and parameter usage. Hence, to validate (2), the authors need to make a fair comparison against all baselines, such as a naive GCN and GatedGCN: you need to make sure that they use the same number of parameters and roughly the same amount of memory. The reason is simple: you can always increase the hidden size of a GCN to increase memory usage and widen the information flow, since more hidden states are available for information storage. As for the parameter count, it matters for function approximation ability. The authors need to show that the designed ARMA block, with its increased parameter and memory cost, works better than naively increasing the parameter count and hidden size of simple baselines (GCN or otherwise). I do not trust the current experimental results without proper control over parameters and memory usage.

As for the authors' claim of solving problem (1), it is actually invalid. Looking carefully at Equation 6, we can easily see that graph-level interaction is still captured solely by the simple GCN part (the last part of Equation 6). The first two parts of Equation 6, corresponding to ARMA, involve no interaction among nodes and edges and only increase the "hidden size" of the node representations. Hence, given infinitely wide hidden states, the ability to capture long-range interactions among nodes and edges is exactly the same as that of the backbone.

In general, I still feel that the design is basically a "combine A + B" paper. I do not see enough contribution to the graph domain, given that it does not directly address the information exchange problem among nodes and edges. Regarding the contribution of solving over-squashing, the authors should demonstrate that the cost of introducing ARMA is more worthwhile than simply increasing the hidden size and parameter count of simple, older baselines.

Comment

Thank you for the response. Please see our responses to each of your comments below.

Regarding:

The authors claim that the proposed method solves (1) the long-range interaction problem and (2) the over-squashing problem. Intuitively, the method does solve a certain over-squashing problem, though I personally do not think over-squashing is a valid problem, as the method simply provides more hidden states for each node to store more historical information, at the cost of significantly increased memory and parameter usage. Hence, to validate (2), the authors need to make a fair comparison against all baselines, such as a naive GCN and GatedGCN: you need to make sure that they use the same number of parameters and roughly the same amount of memory. The reason is simple: you can always increase the hidden size of a GCN to increase memory usage and widen the information flow, since more hidden states are available for information storage. As for the parameter count, it matters for function approximation ability. The authors need to show that the designed ARMA block, with its increased parameter and memory cost, works better than naively increasing the parameter count and hidden size of simple baselines (GCN or otherwise). I do not trust the current experimental results without proper control over parameters and memory usage.

We fully agree that one can always increase the size (whether depth or width) of the baseline network. However, doing so for the baseline does not yield better results; we have shown this in our experiments when considering deeper networks. Furthermore, please note that in our experiments we maintain similar parameter budgets for the different networks; please see Table 2 and Section 5.3, which cover the long-range benchmarks, where the parameter budget is 500K. Moreover, in Appendix E we have addressed both the theoretical time and space complexity and provided a runtime vs. accuracy comparison for multiple methods, from GCN to GPS.

Following your request, we now provide and summarize, in the tables below, results for different depths/widths of the baseline GCN compared with GRAMA. As can be seen, the contribution of GRAMA does not hinge on simply adding more parameters but on its principled design: there is a large downstream performance gap between the GCN baseline and our GRAMA, even when more parameters are added to the baseline.

Fixing S=2 and d=256 while varying the sequence length L (yielding an effective depth of S·L) for GRAMA, vs. a GCN with d=256 and a varying number of layers (depth). These results were also reported in our original submission; we have only added parameter counts here for your convenience.

| Method | Metric | Depth 4 | Depth 8 | Depth 16 | Depth 32 |
|---|---|---|---|---|---|
| GCN | Training (ms) | 18.38 | 33.09 | 61.86 | 120.93 |
| GCN | Inference (ms) | 9.30 | 14.64 | 27.95 | 53.55 |
| GCN | Accuracy (%) | 73.60 | 61.52 | 56.86 | 52.42 |
| GCN | Parameters | 34,355 | 260,569 | 61,129,984 | 2,178,560 |
| GRAMA-GCN (Ours) | Training (ms) | 75.75 | 141.79 | 463.76 | 1378.91 |
| GRAMA-GCN (Ours) | Inference (ms) | 40.33 | 70.91 | 240.78 | 702.17 |
| GRAMA-GCN (Ours) | Accuracy (%) | 86.33 | 88.14 | 88.24 | 88.22 |
| GRAMA-GCN (Ours) | Parameters | 628,224 | 781,824 | 1,089,024 | 1,703,424 |

Fixing S=L=2 (depth 4) and varying the network width d=32, 64, 128, 256, 512 for GRAMA, vs. a 4-layer GCN with the same varying widths.

| Method | Metric | Width 32 | Width 64 | Width 128 | Width 256 | Width 512 |
|---|---|---|---|---|---|---|
| GCN | Training (ms) | 1.39 | 5.53 | 11.93 | 18.38 | 58.63 |
| GCN | Inference (ms) | 0.71 | 2.49 | 5.74 | 9.30 | 29.69 |
| GCN | Accuracy (%) | 72.29 | 73.42 | 73.63 | 73.60 | 72.88 |
| GCN | Parameters | 14,272 | 36,736 | 106,240 | 404,480 | 1,464,320 |
| GRAMA-GCN (Ours) | Training (ms) | 1.44 | 5.73 | 18.94 | 75.75 | 303.01 |
| GRAMA-GCN (Ours) | Inference (ms) | 0.73 | 2.52 | 8.38 | 40.33 | 161.32 |
| GRAMA-GCN (Ours) | Accuracy (%) | 85.98 | 86.19 | 86.29 | 86.33 | 86.27 |
| GRAMA-GCN (Ours) | Parameters | 39,552 | 86,784 | 215,808 | 628,224 | 2,042,880 |

Moreover, we would like to kindly invite the Reviewer to read our Theoretical Properties section (Section 4), where we provide theoretical, mathematical grounding for why GRAMA works, in addition to the comprehensive and extensive benchmarking on multiple datasets, from synthetic to real-world ones. We hope that the Reviewer will appreciate the fact that we have considered and provided both theoretical and practical aspects of our work.

Comment

Regarding:

As for the authors' claim of solving problem (1), it is actually invalid. Looking carefully at Equation 6, we can easily see that graph-level interaction is still captured solely by the simple GCN part (the last part of Equation 6). The first two parts of Equation 6, corresponding to ARMA, involve no interaction among nodes and edges and only increase the "hidden size" of the node representations. Hence, given infinitely wide hidden states, the ability to capture long-range interactions among nodes and edges is exactly the same as that of the backbone.

We kindly invite the Reviewer to read Section 4, where we provide ample theoretical justification for why our GRAMA method works. The Reviewer correctly notes that Equation 6 involves two parts: a selective sequence-space module (ARMA) and a backbone GNN. However, this does not mean that GRAMA does not allow the long-range propagation of information. Please note that Equation 6 is a single recurrence step, while in GRAMA we have multiple recurrences and GRAMA blocks (as can be seen right after Equation 6). This means that the sequence space and the graph space (nodes and edges) are mixed in a selective and learnable manner. Furthermore, regarding your comment on the width of the hidden state, let us clarify that it is not infinite; we also provide, in the tables above, a summary of results comparing performance and parameter counts. The Reviewer can see that the contribution of GRAMA does not stem from simply adding more parameters or width but from its principled design.
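Schematically, and using generic notation rather than the paper's exact Equation (6), one recurrence step can be read as

$$\tilde{F}^{(t)} \;=\; \sum_{i=1}^{p} \alpha_i^{(t)}\, F^{(t-i)} \;+\; \sum_{j=1}^{q} \beta_j^{(t)}\, R^{(t-j)}, \qquad F^{(t)} \;=\; \mathrm{GNN}\big(\tilde{F}^{(t)},\, \mathcal{G}\big),$$

where $F^{(\cdot)}$ are node-feature states, $R^{(\cdot)}$ are residuals, and the coefficients $\alpha^{(t)}, \beta^{(t)}$ are produced by the selective mechanism from all current states. Unrolling this step over multiple recurrences and stacked GRAMA blocks repeatedly alternates sequence-domain mixing with graph-domain propagation, so the sequence part is not merely an enlargement of the hidden size.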

Regarding:

In general, I still feel that the design is basically a "combine A + B" paper. I do not see enough contribution to the graph domain, given that it does not directly address the information exchange problem among nodes and edges. Regarding the contribution of solving over-squashing, the authors should demonstrate that the cost of introducing ARMA is more worthwhile than simply increasing the hidden size and parameter count of simple, older baselines.

As discussed above, the novelty of GRAMA is in the selective learned mechanism that allows the mixture between graph and sequence spaces. This is thoroughly explained in our paper, examined from a theoretical standpoint in Section 4, and later extensively evaluated in Section 5. Furthermore, following your request, we provided two Tables above that summarize results from the paper and add more examples, to demonstrate that: (1) no, adding more parameters to the baseline does not yield better or similar performance to GRAMA; and (2) the contribution of GRAMA is not simply adding more parameters. Rather, it is the principled and mathematically grounded design of our architecture.


We kindly ask the Reviewer to take these clarifications into consideration towards their evaluation and scoring of our paper. We believe that our responses address your concern, and we are happy to continue the discussion if you have additional questions.

Comment

Rebuttal Summary:

We sincerely thank the reviewers for their thoughtful and detailed feedback, as well as for recognizing the key strengths of our work. Reviewer fRxB appreciated that GRAMA "preserves more higher-order information in long-range problem settings" and found the extension of SSM to the graph domain to be "a good attempt." Reviewer s9vW highlighted that GRAMA introduces a "new way of combining ARMA with GNNs," noting its ability to address "long-range interactions and permutation equivariance" with "solid experimental evidence." Similarly, Reviewer WBsY acknowledged that GRAMA "captures long-range interactions that traditional methods might miss" and is "well-written and easy to understand." Finally, Reviewer nRqm described our theoretical foundation as innovative and appreciated GRAMA's strong experimental performance, stating that it "achieves strong performance, consistently outperforming its backbone models."

We are also thankful for the actionable and constructive feedback, which has significantly improved the clarity, presentation, and scope of our paper. Specifically:

Additional Experiments:

  1. We have expanded our experiments to include new datasets, such as ZINC-12k, OGBG-MOLHIV, Cora, CiteSeer, and PubMed, to assess GRAMA's performance in more diverse scenarios. These experiments demonstrate that GRAMA consistently enhances baseline backbones across a range of tasks, including those not solely focused on long-range dependencies.

  2. To address computational cost concerns, we conducted ablation studies comparing GRAMA and baseline models under equivalent parameter budgets and runtime constraints. These results confirm GRAMA’s ability to maintain superior performance with minimal additional computational overhead.

  3. Inspired by Reviewer WBsY’s suggestion, we tested GRAMA on dynamic graph datasets. Results on Chickenpox Hungary, PedalMe London, and Wikipedia Math further show GRAMA’s potential for spatio-temporal applications.

Revisions to the Paper:

  1. In Section 3, we clarified the role of linearity and non-linearity in GRAMA blocks, explicitly addressing ambiguities in Equation (6) and Figure 1. We also clarified the relationship between moving windows and GRAMA blocks.

  2. We improved the presentation of the experimental results.

  3. We improved Section 4 by explicitly attributing inherited theoretical properties to ARMA/SSM frameworks while emphasizing GRAMA’s novel contributions, such as maintaining permutation equivariance and addressing oversquashing in GNNs.

  4. We expanded discussions on GRAMA’s computational cost in the main text, linking it to results in Appendix E. Additionally, we highlighted GRAMA’s applicability to both long-range and more general tasks, offering a balanced view of its strengths and limitations.

  5. We incorporated all added results and discussion from the rebuttal into the paper.


We hope these revisions address the reviewers' concerns and highlight the significance and robustness of our contributions, and that you will consider revising your scores. Thank you again for your constructive feedback.

Comment

Dear Reviewers,

The authors have uploaded their rebuttal. Please take this opportunity to discuss any concerns you may have with the authors.

AC

AC Meta-Review

The paper introduces a novel and innovative framework, GRAMA, which adapts the ARMA/SSM model for sequential learning on graphs. This approach addresses key challenges in graph data, particularly permutation equivariance and long-range interactions, effectively. However, several concerns remain unresolved after the rebuttal:

  1. While the paper demonstrates that GRAMA can capture long-range interactions, it has not sufficiently explained why combining ARMA/SSM with GNNs offers clear improvements over existing methods. The paper's contribution of transforming a graph into a sequence to address long-range interactions lacks a strong, domain-specific motivation. Given that SSMs have not significantly outperformed other methods, such as Transformers, in other domains, the paper does not convincingly justify the adaptation of SSMs to graph learning. The theoretical results in Theorem 4.4 provide some insight into GRAMA’s ability to capture long-range dependencies, but they do not sufficiently explain how this improves existing models or how the assumptions are met in the graph setting. The theoretical connection between GRAMA and previous methods needs to be more explicit and substantiated with concrete evidence.

  2. The introduction of GRAMA increases computational costs and memory usage significantly. While the paper compares the performance of GRAMA-GCN and past methods, it fails to compare more advanced models such as GRAMA-GatedGCN and GRAMA-GPS, which are expected to perform better. These models, based on more complex backbones, are likely to have higher computational costs and running times, which could make them less practical for large graphs. Without comparisons with these better-performing variants, it is difficult to assess the real impact of GRAMA's added complexity on its scalability and applicability to large-scale graph data.

  3. The paper's presentation needs further refinement. Despite revisions, the excessive changes made during the rebuttal period have not been adequately organized, which hinders clarity. Additionally, Theorem 4.4 could benefit from a more formal, academic presentation to enhance its clarity and rigor, making it more accessible to readers. Improving the organization of the theoretical and model sections will help ensure the framework is clearly understood.

Consequently, while the paper presents an interesting framework, the authors have not fully addressed key concerns related to the model’s motivation, scalability, and clarity. Further evidence to support the motivation and a more structured presentation of the model and theoretical aspects will strengthen the paper's contribution.

Additional Comments from the Reviewer Discussion

The authors revised the paper by adding more GNN backbones for comparison, including new experiments that highlight performance and address computational aspects like training time. They also improved the paper’s clarity and organization, particularly regarding complex theoretical results. In response to reviewer concerns, the authors elaborated on the motivation for using ARMA in GNNs, clarified how GRAMA handles long-range dependencies, and provided additional results on scalability and computational cost. They also included more established benchmarks and addressed inconsistencies in baseline comparisons. Despite these improvements, the paper still requires further clarification on the motivation for certain design choices and a more thorough discussion of computational complexity to fully address the reviewers’ concerns. As a result, I recommend rejection based on the reviewers’ feedback.

Final Decision

Reject