PaperHub
Overall score: 7.3 / 10
Decision: Poster · 4 reviewers (min 4, max 5, std 0.5)
Ratings: 4, 5, 5, 4
Confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

Flatten Graphs as Sequences: Transformers are Scalable Graph Generators

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

A novel autoregressive model for generating attributed graphs using decoder-only transformers.

Abstract

Keywords
graph generation, transformers, autoregressive modeling, language models, LLMs

Reviews and Discussion

Review (Rating: 4)

This paper tackles the problem of scalable and structurally valid graph generation, a task that remains challenging for existing diffusion-based and autoregressive (AR) models due to either quadratic-to-cubic complexity or limited long-range consistency. The authors propose AUTOGRAPH, a framework that bridges graph generation with large language models (LLMs) through a novel graph-to-sequence transformation:

  1. Segmented Eulerian Neighborhood Trails (SENTs) – an extension of segmented Eulerian trails that augments each trail segment with local neighborhood tokens, ensuring complete edge coverage while preserving subgraph-induced structures.
  2. Lossless flattening and tokenization – a formal procedure that converts SENTs into token sequences with special delimiters, accompanied by theoretical proofs guaranteeing one-to-one recoverability between graphs and sequences.
  3. Autoregressive language modeling – any decoder-only transformer can now treat graph generation as next-token prediction, yielding O(n) time and memory complexity via an efficient SENT sampling algorithm.
  4. Empirical validation – using a 12-layer LLaMA backbone, AUTOGRAPH achieves:
    • Up to 100× inference speed-up and 3× training speed-up over diffusion baselines on Planar and SBM datasets.
    • Competitive performance on transfer learning and substructure-conditioned generation, demonstrating the method’s flexibility.

Overall, AUTOGRAPH offers a principled and efficient pathway to leverage modern LLMs for graph generation, effectively mitigating negative transfer issues and outperforming prior diffusion and AR approaches in both accuracy and scalability.
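To make the lossless graph-to-sequence idea summarized above concrete, the following is a minimal round-trip sketch: trail segments of a small graph are flattened into a token sequence with a delimiter, and the edge set is recovered exactly. The token names and helper functions are illustrative only and do not reproduce the paper's actual SENT vocabulary or neighborhood tokens.

```python
# Illustrative only: a toy "flatten trail segments to tokens and back" round trip.
# NEW_SEGMENT is a hypothetical delimiter token, not the paper's actual special token.
NEW_SEGMENT = "<new>"

def flatten(segments):
    """Flatten a list of node-index trail segments into one token sequence."""
    tokens = []
    for seg in segments:
        tokens.append(NEW_SEGMENT)
        tokens.extend(str(v) for v in seg)
    return tokens

def unflatten(tokens):
    """Recover the edge set: consecutive nodes inside a segment share an edge."""
    edges, prev = set(), None
    for tok in tokens:
        if tok == NEW_SEGMENT:
            prev = None
            continue
        v = int(tok)
        if prev is not None:
            edges.add(frozenset((prev, v)))
        prev = v
    return edges

# A 4-cycle plus a pendant node, covered by two trail segments.
segments = [[0, 1, 2, 3, 0], [2, 4]]
tokens = flatten(segments)
assert unflatten(tokens) == {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0), (2, 4)]}
```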

Strengths and Weaknesses


Strengths

  • S1. Lossless graph-to-sequence transformation
    The paper moves beyond prior random-walk-based graph representations by introducing a lossless conversion—Segmented Eulerian Neighborhood Trails (SENTs)—that preserves complete edge coverage and subgraph-induced structures. This theoretically grounded mapping enables direct, reversible use of language models for graph generation without sacrificing structural validity.

  • S2. Comprehensive empirical evaluation
    The authors conduct meticulous head-to-head comparisons against both diffusion-based and conventional autoregressive baselines across four benchmarks encompassing small, medium, and large graphs. Results consistently demonstrate AUTOGRAPH’s superiority in accuracy (MMD, VUN) and efficiency (up to 100× faster inference), underscoring the method’s robustness to graph size and domain diversity.

  • S3. Formal guarantees and linear-time sampling
    Theoretical proofs establish a one-to-one correspondence between SENT sequences and their source graphs, while the proposed sampling algorithm operates in linear time O(m) with respect to edge count—crucial for scalability.
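For intuition about what such a sampler can look like, here is a minimal sketch of a random trail sampler with restarts ("breaks"), written under my own assumptions: it consumes every edge exactly once and starts a new segment whenever the walk gets stuck. It is not the authors' actual SENT sampling algorithm and omits the neighborhood tokens; the node re-scan at each restart keeps this toy version simple rather than strictly linear.

```python
import random
from collections import defaultdict

def sample_trail_segments(edges, seed=0):
    """Walk random trails, restarting ("breaking") when stuck, until every edge is used."""
    rng = random.Random(seed)
    remaining = defaultdict(set)            # node -> set of neighbors not yet visited via this node
    for u, v in edges:
        remaining[u].add(v)
        remaining[v].add(u)
    segments = []
    while any(remaining.values()):
        # Restart ("break") at a node that still has unused incident edges.
        cur = rng.choice([v for v, nbrs in remaining.items() if nbrs])
        seg = [cur]
        while remaining[cur]:
            nxt = rng.choice(sorted(remaining[cur]))
            remaining[cur].discard(nxt)     # consume the edge in both directions
            remaining[nxt].discard(cur)
            seg.append(nxt)
            cur = nxt
        segments.append(seg)
    return segments

print(sample_trail_segments([(0, 1), (1, 2), (2, 0), (2, 3)]))
```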

Weaknesses

These are elaborated under the Questions section below.

Questions

While the manuscript exhibits many notable strengths, I would appreciate clarification on the following points:


  1. Positioning of prior work on graph-to-sequence representations
    The Introduction occasionally reads as if no effective graph-to-sequence approach existed before this work. However, random-walk-based representations are acknowledged later in the Related Work section. It would be clearer—and fairer to prior art—if the Introduction explicitly positioned SENT as an improved (lossless and generation-oriented) extension of these earlier methods, rather than implying the absence of any sequence paradigm.

  2. Interpretability of evaluation metrics
    Although Maximum Mean Discrepancy (MMD) and VUN metrics are standard, readers unfamiliar with graph generation benchmarks may struggle to intuitively grasp what each metric indicates. A brief high-level explanation of the intuition behind each metric, perhaps in an appendix footnote or a short paragraph, would greatly improve result interpretability.

  3. Speed comparison beyond diffusion baselines
    The efficiency analysis highlights impressive speed-ups over diffusion-based models (Table 2). Is it feasible to provide analogous wall-clock comparisons against other baselines? If not, could the authors explain the practical obstacles? Such numbers would further substantiate AUTOGRAPH’s scalability claims across paradigms.

  4. Validity scores in molecular generation (Table 4)
    In the molecular benchmarks, AUTOGRAPH achieves strong uniqueness and novelty but appears to match—or slightly underperform—some baselines in validity. To the best of my knowledge, validity is the most critical criterion for molecule generation, so the authors should provide additional discussion of this point.

Addressing these points would further strengthen the paper’s clarity and persuasiveness.

Limitations

All key limitations have been included within the Questions above.

Final Justification

Many of my minor concerns have been sufficiently alleviated. Therefore, I have revised my initial score upward by one point, and my final recommendation is a score of 4.

Formatting Issues

N/A

Author Response

We thank the reviewer for their constructive feedback, which we believe will strengthen our paper. Below, we address each point in detail.

W1. Positioning of prior work on graph-to-sequence representations

We thank the reviewer for their suggestion. Our intention was not to disregard prior work but to highlight our specific contributions within the scope of graph generation. We will revise the Introduction to position our work more clearly and fairly. Specifically, we will explicitly frame SENT as an advancement of previous random-walk-based sequence representations, emphasizing its key innovations: it is lossless, it is designed for prompt-based, substructure-conditioned generation, and it can be seamlessly integrated into language models such as decoder-only transformers, addressing critical limitations of earlier generative methods. If the reviewer has any further concrete suggestions, we are more than happy to incorporate them.

W2. Interpretability of evaluation metrics

While we have already provided a short description of the evaluation metrics in Lines 278-291, we agree that providing intuition for these metrics will make our results more accessible. To improve clarity, we will expand subsection E.2 in the Appendix to briefly explain the high-level meaning of our primary evaluation metrics. For instance:

MMD (Maximum Mean Discrepancy): This measures the distributional similarity between generated and real graphs. Lower MMD scores indicate that the generated graphs' statistics (e.g., degree, clustering) more closely match the training data. Some details and references have already been provided in Appendix E.2.

VUN (Validity, Uniqueness, Novelty): This assesses sample quality. Details have already been provided in Appendix E.2.
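For readers who want to see the computation behind the MMD numbers, the sketch below follows the common recipe in this literature (GraphRNN/GRAN-style evaluation): each graph is summarized by a normalized statistic histogram (e.g., its degree distribution), and the two sets of histograms are compared with a Gaussian kernel on total-variation distance. This is my own illustrative reconstruction, not the paper's evaluation code; the bandwidth sigma and the toy histograms are arbitrary.

```python
import numpy as np

def gaussian_tv_kernel(p, q, sigma=1.0):
    """Gaussian kernel on the total-variation distance between two normalized histograms."""
    tv = 0.5 * np.abs(p - q).sum()
    return np.exp(-(tv ** 2) / (2 * sigma ** 2))

def mmd2(ref_hists, gen_hists, sigma=1.0):
    """Squared MMD between two sets of per-graph statistic histograms (lower is better)."""
    k = lambda A, B: np.mean([gaussian_tv_kernel(a, b, sigma) for a in A for b in B])
    return k(ref_hists, ref_hists) + k(gen_hists, gen_hists) - 2 * k(ref_hists, gen_hists)

# Toy example: normalized degree histograms for two "real" and two "generated" graphs.
real = [np.array([0.5, 0.5, 0.0]), np.array([0.4, 0.4, 0.2])]
fake = [np.array([0.2, 0.3, 0.5]), np.array([0.1, 0.4, 0.5])]
print(mmd2(real, fake))   # small positive value; 0 would mean the statistics match exactly
```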

W3. Speed comparison beyond diffusion baselines

We emphasize that Table 2 already provides a speed comparison across three distinct paradigms of graph generation models:

  • DiGress: A representative diffusion model.
  • GRAN: A state-of-the-art autoregressive model.
  • ESGG: A hybrid diffusion and autoregressive model specifically designed for efficient generation.

Our results clearly demonstrate that AutoGraph achieves inference speed comparable to GRAN, which is a leading autoregressive model, while simultaneously delivering significantly better sample quality (VUN scores) across various datasets. This comparison already substantiates AutoGraph's scalability and efficiency across different widely-used generation paradigms.

W4. Validity scores in molecular generation (Table 4)

AutoGraph's performance should be viewed through the lens of it being a general-purpose graph model competing against highly domain-specific methods (e.g., SMILES-based models).

The fact that AutoGraph, by learning from graph data alone, can achieve validity scores competitive with these specialized models is a strong demonstration of its learning capabilities. It successfully infers the complex, implicit rules of molecular construction. Furthermore, some models that guarantee 100% validity, such as MCTS (a search-based method), do so at the cost of poor distributional similarity (e.g., a low FCD score). From a practical perspective, one can easily reject non-valid graphs to achieve near-perfect validity while generating over-simplified, basic molecules that have low distributional similarity (FCD) with the training data, which is of little practical interest. AutoGraph achieves a superior overall balance. In short, both validity and FCD should be considered simultaneously when interpreting the model's performance. We will include this discussion in our revision.
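As a concrete illustration of the post-hoc filtering mentioned above, one could simply discard any generated molecule that RDKit cannot parse. This is a generic sketch of the idea, not part of the paper's pipeline, and it assumes generated graphs have already been converted to SMILES strings.

```python
from rdkit import Chem

def validity_filter(smiles_list):
    """Keep only molecules that RDKit can parse and sanitize (i.e., chemically valid)."""
    valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
    return valid, len(valid) / max(len(smiles_list), 1)

samples = ["CCO", "c1ccccc1", "C1CC1C(=O)O", "C(C)(C)(C)(C)C"]  # last one has a 5-valent carbon: invalid
kept, validity = validity_filter(samples)
print(kept, validity)  # rejecting invalid graphs boosts raw validity but not distributional fidelity
```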

Comment

Thank you for your thorough and sincere responses to my comments. I found your clarifications helpful and reassuring, and they have addressed the main concerns I initially raised. Based on your detailed rebuttal, I am now more confident in the paper’s contributions and will update my score accordingly.

Comment

We thank the reviewer for considering our rebuttal and their timely response. We are happy to continue the discussion if they have any further questions.

Comment

Dear reviewer,

Following the NeurIPS 2025 guidelines, I kindly encourage you to read and respond to author rebuttals as soon as possible. Please engage actively in the discussion with authors during the rebuttal process, update your review by filling in the "Final Justification" and acknowledge your involvement.

Thank you, Your AC

Review (Rating: 5)

This paper proposes a graph generation method called AutoGraph, which serializes graph structures by treating sampled graph trails as tokens, enabling the use of decoder-only Transformer models for graph generation tasks. The authors introduce a newly designed structure named SENT (Segmented Eulerian Neighborhood Trail) to ensure that the prefix of the generated sequence corresponds to an induced subgraph. Thanks to these innovations, the proposed model demonstrates efficient generation, strong transferability, and controllable substructure-conditioned generation capabilities across multiple synthetic and real-world graph datasets. Compared to diffusion-based models, the proposed approach achieves 100× faster generation and 3× faster training.

Strengths and Weaknesses

Strengths

  1. The authors propose AutoGraph, the first method that serializes graph structures to enable direct application of language models to graph generation.
  2. The authors carefully design the SENT structure and theoretically prove that sampling SENTs allows for graph serialization, while ensuring that the prefix of the resulting sequence corresponds to an induced subgraph.
  3. The chosen method of serializing and tokenizing graphs using SENT guarantees that isomorphic graphs are mapped to identical token sequences, which facilitates better generalization.
  4. The proposed random trail sampling algorithm with breaks is cleverly designed and enables efficient sampling of SENTs.
  5. AutoGraph achieves significant inference and training speedups while maintaining almost state-of-the-art generation quality, demonstrating strong scalability.
  6. AutoGraph also shows impressive transferability and substructure conditioned generation capabilities without requiring additional finetuning.

Weaknesses

  1. The main advantage of the proposed AutoGraph method lies in its generation speed, while it shows some limitations in generation quality and does not consistently outperform other methods across all evaluation metrics:
    • As shown in Table 1, AutoGraph achieves the best VUN performance on the SBM dataset. However, it underperforms compared to methods like DiGress and GRUM in metrics such as node degree distributions (Deg.), clustering coefficients (Clus.), and orbit count statistics (Orbit). This may indicate that AutoGraph emphasizes the structural feasibility of the generated graphs.
    • According to Table 4, AutoGraph scores lower than DiGress and JT-VAE on the FCD metric in the MOSES dataset, and lower than MCTS on the VALID metric in the GuacaMol dataset. This suggests that AutoGraph still has room for improvement in generating realistic molecular distributions.
  2. The interpretability of the token sequences generated by AutoGraph is less intuitive compared to representations like SMILES, and may not be suitable for all types of graph structures.
  3. For dense graphs, where each node may have a large number of neighbors, the authors’ strategy of explicitly encoding adjacent nodes could lead to an excessive number of tokens, potentially affecting generation efficiency.

Questions

  1. What is the difference between AutoGraph-Graph and AUTOGRAPH (Pretrained) in the last two rows of Table 4?
  2. Currently, AutoGraph uses a 12-layer LLaMA model to align with GPT-2’s smallest variant, which is still relatively small in scale. Have the authors evaluated how AutoGraph scales with larger language models?
  3. Does the proposed method support generation conditioned on combinations of different motifs as substructure condition? Currently, the experiments only demonstrate generation conditioned on repetitive occurrences of the same motif.

Limitations

The authors note that their experiments are still conducted on relatively small-scale benchmarks and attribute this limitation to the current lack of comprehensive graph generation benchmarks and well-curated pre-training datasets in the community.

Final Justification

I appreciate the authors' detailed clarification, which has addressed all of my concerns. I will keep my original score.

Formatting Issues

No

Author Response

We thank the reviewer for their constructive feedback, which we believe will strengthen our paper. Below, we address each point in detail.

W1: The main advantage of the proposed AutoGraph method lies in its generation speed

The advantage of AutoGraph is more than the generation speed. We believe scalability is one of the most crucial factors for success in many recent machine learning applications. Our work paves the way for scaling graph models to very large and diverse datasets, with potential applications in molecular generation, general graph understanding tasks, and even multi-modal LLMs.

W1.1: In Table 1, AutoGraph achieves the best VUN performance on the SBM dataset. However, it underperforms compared to methods like DiGress and GRUM in metrics such as node degree distributions (Deg.), clustering coefficients (Clus.), and orbit count statistics (Orbit). This may indicate that AutoGraph emphasizes the structural feasibility of the generated graphs.

This performance profile is an intentional result of our hyperparameter choice, which highlights the controllability of AutoGraph's generation process. The trade-off between distribution matching (e.g., Deg., Clus., Orbit) and sample quality (VUN) is explicitly managed by the top-k sampling parameter.

  • On the Planar and SBM datasets, we tuned top-k to optimize for VUN, demonstrating AutoGraph's ability to generate valid and novel graphs, where it excels.
  • On the Proteins and Point Cloud datasets, we tuned top-k to optimize for MMD on the validation set, where AutoGraph significantly outperforms DiGress and GruM.

This demonstrates that users can steer AutoGraph to prioritize either distribution fidelity or sample quality, depending on the specific application. We discuss this trade-off in detail in Appendix F.5 (Effect of top-k sampling).
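To make this knob concrete, here is a minimal sketch of top-k next-token sampling, the mechanism the authors tune. The logits and the value of k are toy examples for illustration; this is not the authors' code.

```python
import torch

def sample_next_token(logits, top_k):
    """Top-k sampling: keep the k highest-probability tokens, renormalize, then sample.

    Small k favors high-likelihood, structurally "safe" continuations (better VUN);
    large k follows the learned distribution more faithfully (better distribution matching).
    """
    topk_vals, topk_idx = torch.topk(logits, k=top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # toy next-token logits over 5 tokens
print(sample_next_token(logits, top_k=2))
```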

W1.2: In Table 4, AutoGraph scores lower than DiGress and JT-VAE on the FCD metric in the MOSES dataset, and lower than MCTS on the VALID metric in the GuacaMol dataset. This suggests that AutoGraph still has room for improvement in generating realistic molecular distributions.

AutoGraph represents a significant advance for general graph generative models in the specialized domain of molecular generation.

  • Regarding MOSES: We kindly remind the reviewer that on this benchmark, lower FCD indicates better performance. As a result, AutoGraph outperforms DiGress and JT-VAE in terms of FCD.
  • Regarding GuacaMol: It is critical to analyze the trade-offs. MCTS is a search-based algorithm engineered to achieve 100% chemical validity. However, its extremely poor FCD score shows it fails to capture the underlying data distribution. In contrast, AutoGraph strikes a much better balance, achieving a strong FCD score while maintaining high validity.

We believe this shows that AutoGraph is one of the first general-purpose graph models to achieve performance competitive with highly specialized, SMILES-based approaches.

W2: The interpretability of the token sequences generated by AutoGraph is less intuitive compared to representations like SMILES, and may not be suitable for all types of graph structures.

Our SENT representation is both more general and highly intuitive.

  • While SMILES is an effective representation for chemists, it is a domain-specific language strictly limited to molecules. A primary contribution of our work is developing SENT as a general-purpose graph representation applicable to any type of non-attributed or attributed graph, as detailed in Sections 2.4 and 2.5.
  • We respectfully suggest that SENT sequences are quite intuitive. They directly represent a trail through the graph, with explicit tokens for actions like moving to a neighbor, restarting with a new node, or indicating whether the current node is a neighbor of a previously visited node. This provides a clear, step-by-step generative trace of the graph's construction.

W3: For dense graphs, where each node may have a large number of neighbors, the authors’ strategy of explicitly encoding adjacent nodes could lead to an excessive number of tokens, potentially affecting generation efficiency.

SENT is intentionally designed for sparse graphs, which represent the vast majority of real-world datasets in key domains like chemistry, social networks, and biology. While specialized methods might be more efficient for denser graphs, our approach is tailored to the problem space where generative modeling is most impactful. We also note that AutoGraph performs well on the SBM dataset, which contains relatively dense community structures.

Questions:

  1. What is the difference between AutoGraph-Graph and AUTOGRAPH (Pretrained) in the last two rows of Table 4?

AutoGraph (Pretrained) refers to the model that was first pre-trained on the large-scale PubChem-10M dataset and subsequently fine-tuned on the downstream GuacaMol dataset. AutoGraph-Graph is the model trained on GuacaMol from scratch. This is detailed in Lines 351-353 of the main paper.

  2. Currently, AutoGraph uses a 12-layer LLaMA model to align with GPT-2’s smallest variant, which is still relatively small in scale. Have the authors evaluated how AutoGraph scales with larger language models?

Due to computational constraints, our main scaling experiments were conducted on the small Planar dataset (see Appendix F.5, Table 14), where we found our current 12-layer model to be sufficient.

However, we strongly believe, in line with the well-established scaling laws in NLP, that training larger AutoGraph variants on massive graph datasets would unlock significant performance gains. Investigating these scaling laws for graphs is a high-priority and exciting direction for future work.

  3. Does the proposed method support generation conditioned on combinations of different motifs as a substructure condition? Currently, the experiments only demonstrate generation conditioned on repetitive occurrences of the same motif.

Yes, absolutely. The model handles this complex conditioning out-of-the-box, without any modifications.

We performed this exact experiment on two distinct motifs that were studied before, known as the linker design task. Due to space limitations, we placed the full results in Appendix F.4 and included visualizations in Figure 6.

Crucially, AutoGraph performs this multi-motif conditioned generation effectively without any additional fine-tuning, showcasing the power and flexibility of our prompting framework.

Comment

I appreciate the authors' detailed clarification, which has addressed all of my concerns. I will keep my original score.

Comment

We thank the reviewer for their timely response. We are happy to continue the discussion if they have any further questions.

Review (Rating: 5)

This paper is focused on the topic of graph generation using Transformers and builds on a principle of recent interest that transforms graphs into sequences. It presents AUTOGRAPH, an autoregressive graph generation framework that converts graphs into token sequences to enable direct use of decoder-only Transformers for efficient and scalable generation. The key to the approach is the concept of Segmented Eulerian Neighborhood Trails (SENTs), a reversible graph-to-sequence transformation that preserves topological structure while producing token sequences amenable to language modeling. The method also preserves subgraph-aware information. The paper demonstrates the effectiveness of AUTOGRAPH on synthetic and molecular benchmarks, showing improved efficiency, performance gains, and support for substructure-conditioned generation without fine-tuning. As such, this work aims to bridge the gap between graph generation and language modeling, and arguably improves on prior methods for flattening graphs to sequences by making the mapping well-defined and lossless.

Strengths and Weaknesses

Strengths:

  • A key strength of this paper is their progress on graph to sequence conversion methods which has been of much interest in recent literature.
  • The flattening/tokenization process involves a more principled encoding using Eulerian trails, where each token corresponds to a structurally interpretable graph event. This is in contrast to multiple prior works that use heuristics, neighborhood or hop encodings, and manually engineered templates targeted at using the resulting sequences as prompts for LLMs.
  • The SENT sequences based method provides improved validity in graph generation.
  • The construction of SENT sequences is transparent and interpretable, and is complemented with theoretical analysis.
  • The proposed method achieves significantly faster inference and training speedups compared to previous methods.

Weaknesses:

  • The paper includes an extension of SENT tokenization to (discrete) attributed graphs, but many graph tasks involve non-discrete (continuous) features.
  • While this is not a weakness per se and may not be the focus of this paper, the proposed graph-sequence approach has not been evaluated on general graph tasks to which prior graph tokenization approaches seem to apply.
  • Closely related recent works on graph to sequence methods such as [1], [2] are not discussed.

[1] Fatemi, B., Halcrow, J. and Perozzi, B., 2023. Talk like a graph: Encoding graphs for large language models.
[2] Chen, R., Zhao, T., Jaiswal, A., Shah, N. and Wang, Z., 2024. LLaGA: Large language and graph assistant.

Questions

  • Can the proposed SENT based graph to sequence representation be adapted for general graph modeling/understanding/downstream tasks (e.g., classification, regression)?
  • If yes, how can SENT sequencing be applied to large graphs, where existing works generally present only a subgraph or partial graph to a GNN?

Limitations

yes

Final Justification

The rebuttal has addressed the questions I raised in the initial review and the authors agree to revise the manuscript appropriately. I will keep my original positive score for the merits and contribution of the paper.

Formatting Issues

NA

Author Response

We thank the reviewer for their constructive feedback, which we believe will strengthen our paper. Below, we address each point in detail.

W1. Many graph tasks may involve non-discrete features.

We agree with the reviewer that handling continuous features is an important challenge. This is a known limitation for all discrete, token-based autoregressive models.

Fortunately, this is a well-studied problem with established solutions. A standard and effective approach is to employ a vector quantization autoencoder, such as a VQ-VAE [1], to discretize continuous attributes into a finite set of tokens. AutoGraph's framework is modular, allowing for the straightforward integration of such a quantization front-end.

[1]: Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." NeurIPS 2017.
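As a minimal sketch of the discretization idea (assuming a codebook has already been learned, e.g., by a VQ-VAE), nearest-neighbor assignment maps each continuous attribute vector to an integer code that can then be emitted as an ordinary token. The codebook and node features below are toy values, not the paper's.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous attribute vector to the index of its nearest codebook entry.

    features: (num_nodes, d) continuous attributes; codebook: (num_codes, d) learned entries.
    The returned integer indices can be used as attribute tokens in a sequence model.
    """
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # toy 3-entry codebook
node_feats = np.array([[0.1, -0.05], [0.9, 0.2], [0.2, 0.8]])
print(quantize(node_feats, codebook))                          # -> [0 1 2], usable as attribute tokens
```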

W2. While this is not a weakness per se and maybe not the focus of this paper, the proposed graph sequence approach has not been evaluated on general graph tasks which prior graph tokenization approaches seem to apply to.

We thank the reviewer for raising this point. While our primary focus was on establishing a new paradigm for graph generation, a core motivation for connecting graphs to language models is precisely to unlock their power for understanding tasks. Our SENT representation is well-suited for this. For any graph, one can generate one or multiple SENT sequences and feed them to the trained AutoGraph model. The resulting (averaged) hidden-state representations from the transformer layers can then be used for downstream tasks. We will add a paragraph discussing this promising direction for future work.

W3. Closely related recent works on graph to sequence methods such as [1], [2] are not discussed.

We are grateful to the reviewer for bringing these relevant works to our attention. We have reviewed them and will incorporate a detailed discussion comparing their graph-to-sequence methodologies with our own in the revised related work section.

Questions:

Can the proposed SENT based graph to sequence representation be adapted for general graph modeling/understanding/downstream tasks?

As outlined in our response to W2, the SENT representations are easily adaptable for graph understanding tasks like classification and regression. By processing a graph's sequence through our trained model, the resulting transformer hidden states can serve as effective graph embeddings. This is similar to how GPT-like models are used for downstream tasks in NLP. While these embeddings are not specifically designed for graph understanding tasks, they should still be effective with some modification and fine-tuning, as supported by the fact that AutoGraph learns to generate Planar and SBM graphs without supervision.
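A rough sketch of this embedding recipe, under my own assumptions: pool the final-layer hidden states of the trained sequence model over one flattened sequence, and optionally average the result over several sampled sequences of the same graph. The TinyBackbone below is only a stand-in for the trained AutoGraph transformer (it is not even causal), and the token ids are arbitrary; the pooling pattern is the point.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for a trained sequence model; returns per-token hidden states."""
    def __init__(self, vocab_size=32, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                     # (1, seq_len) -> (1, seq_len, hidden_dim)
        return self.encoder(self.embed(token_ids))

def graph_embedding(model, token_ids):
    """Mean-pool final-layer hidden states over one flattened sequence of a graph."""
    with torch.no_grad():
        hidden = model(token_ids)
    return hidden.mean(dim=1).squeeze(0)              # fixed-size vector for a classifier/regressor

model = TinyBackbone()
tokens = torch.tensor([[1, 5, 7, 5, 9, 2]])           # toy token ids for one flattened sequence
print(graph_embedding(model, tokens).shape)           # torch.Size([16])
```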

If yes, how can SENT sequencing be applied to large graphs, where existing works generally present only a subgraph or partial graph to a GNN?

In general, SENT sequencing can be directly applied to large sparse graphs as demonstrated in our Point Cloud and NetworkX datasets. For even larger graphs, one can represent the graph with a set of subgraphs, from which one can sample SENT sequences. The resulting embeddings can be aggregated to form a representation of the entire graph, analogous to subgraph-based GNN evaluation protocols.

Comment

Dear reviewer,

Following the NeurIPS 2025 guidelines, I kindly encourage you to read and respond to author rebuttals as soon as possible. Please engage actively in the discussion with authors during the rebuttal process, update your review by filling in the "Final Justification" and acknowledge your involvement.

Thank you, Your AC

Comment

Dear reviewer,

Following the NeurIPS 2025 guidelines, I kindly encourage you to read and respond to author rebuttals as soon as possible as the reviewer-author discussion period is coming to the end. Please engage actively in the discussion with authors during the rebuttal process, update your review by filling in the "Final Justification" and acknowledge your involvement.

Thank you, Your AC

Comment

I thank the reviewers for their response and on the revision of the manuscript with the prior close works as well as applicability of the proposed method on general graph tasks.

Review (Rating: 4)

This paper presents AUTOGRAPH, an autoregressive framework for generating graphs using decoder-only transformers. The core contribution lies in a reversible flattening procedure that transforms graphs into sequences, enabling the use of language modeling. The authors claim that this approach is scalable and demonstrate its effectiveness across various synthetic and molecular graph generation benchmarks.

Strengths and Weaknesses

[Strengths]

  • The paper is clearly written and well-organized.
  • It provides comprehensive experimental evaluations across four general graph generation benchmarks and two molecular graph generation benchmarks.

[Weaknesses]

  • Overstatement of novelty: The claim that “AUTOGRAPH establishes the first fundamental connection between graph and language modeling” (in the conclusion) appears overstated. Prior works such as [1,2,3] have already employed decoder-only transformers for attributed graph generation. In particular, GEEL [1] shares a similar motivation and framework to AUTOGRAPH.
  • Omission of baseline metrics: The omission of MMD metrics for GEEL in Tables 1 and 3 is concerning, especially since the original GEEL paper includes these values for the same benchmarks. Given that GEEL is a highly relevant work as it shares the same motivation and is a walk-based autoregressive graph generative model, its complete evaluation should be included for fair comparison.
  • Insufficient implementation details: Key implementation details are missing, such as which version of LLaMA was used, the number of training epochs, and whether the codebase will be made publicly available.

Questions

  • Clarification on token definition: The paper states that graphs are “composed of tokens” (line 37) and that it “explicitly represents graphs as token sequences” (lines 230–232). Could you clarify what is meant by “tokens” in this context? For instance, GEEL [1] represents each edge as a token—arguably a more direct analogy to tokens in language modeling—without requiring specialized architectures. How does AUTOGRAPH’s node-based representation differ from or improve upon this? Additionally, what does it mean to “interpret the sequences as language” (lines 187–189)?
  • Pre-training data scale: The graph generation benchmarks used for training involve relatively small datasets to train LLMs. Was this data volume sufficient for pre-training a large model like LLaMA? How did you address the challenges posed by the limited scale of training data?

Limitations

yes

Final Justification

The author rebuttal addressed my concerns and I raised my score to borderline accept.

Formatting Issues

No

Author Response

We thank the reviewer for their constructive feedback, which we believe will strengthen our paper. Below, we address each point in detail.

Overstatement of novelty

We appreciate the reviewer raising this point, as it allows us to clarify the specific nature of our contribution. Our full claim is that "AutoGraph establishes the first fundamental connection between graph and language modeling–where graphs are losslessly represented as token sequences, and prefixes in these sequences serve as meaningful patterns in both paradigms."

While prior works, including GEEL [1], have used autoregressive models, they do not establish this specific connection. The crucial distinction is that our representation ensures that the sequences are subgraph-induced (Theorem 2.14), meaning that all sequence prefixes generate induced subgraphs. This property is what enables high sample novelty and quality as well as prompt-based, substructure-conditioned generation without fine-tuning, a cornerstone of modern language models that previous graph autoregressive models lack.

To summarize, our key novel contributions, particularly in contrast to GEEL, are:

  1. We provide a thorough theoretical framework for studying walk-based generative models, identifying the "subgraph-induced" property as essential for meaningful graph generation as well as identifying necessary and sufficient conditions for sequence representations that satisfy such property.

  2. Our resulting sequence representation is the first to directly support substructure conditioning via prompting, aligning graph generation with language modeling capabilities.

  3. We are the first to show that an autoregressive model can scale effectively to large and diverse graph datasets such as NetworkX and PubChem-10M, marking a critical step towards graph foundation models.

Regarding references [2,3], we would be grateful if the reviewer could provide the full citations. We are committed to discussing all relevant prior work and will gladly add them to our revision. If the reviewer has any further concrete suggestions, we are also happy to incorporate them into our revision.

Omission of GEEL's MMD metrics

We understand the reviewer's concern about the missing MMD metrics for GEEL. We omitted them initially due to a critical, undocumented difference in evaluation: GEEL's public code uses a Gaussian EMD kernel for MMD, whereas all recent works (GRAN, SPECTRE, DiGress, and ours) use a Gaussian total variation kernel. This makes a direct comparison with their reported numbers misleading.

Following the reviewer's suggestion, we have invested significant effort in reproducing their results with the correct kernel. Despite challenges with the official codebase (which required debugging and contained missing files for the Point Cloud dataset), we successfully ran experiments for the Planar, SBM, and Proteins datasets using the same hyperparameters as reported in their paper.

The results, presented below, show that GEEL largely memorizes some subset of the training data (indicated by near-zero VUN and Novelty scores and low uniqueness, aligning with the numbers reported in the original paper) and thus fails to generalize. Additionally, GEEL struggles to scale to the relatively larger Proteins dataset with very high MMD metrics.


Planar

| Method | Deg. | Orbit | Clus. | Spec. | VUN | Novelty | Uniqueness |
|---|---|---|---|---|---|---|---|
| Training set | 0.0002 | 0.0310 | 0.0005 | 0.0038 | -- | -- | -- |
| GEEL | 0.0039 | 0.0013 | 0.0062 | 0.0234 | 0.0 | 0.0 | 90.0 |
| AutoGraph | 0.0004 | 0.0605 | 0.0003 | 0.0064 | 87.5 | 100.0 | 100.0 |

SBM

| Method | Deg. | Orbit | Clus. | Spec. | VUN | Novelty | Uniqueness |
|---|---|---|---|---|---|---|---|
| Training set | 0.0008 | 0.0332 | 0.0255 | 0.0027 | -- | -- | -- |
| GEEL | 0.0106 | 0.0616 | 0.0023 | 0.0381 | 5.0 | 7.5 | 87.5 |
| AutoGraph | 0.0077 | 0.0519 | 0.0439 | 0.0040 | 92.5 | 100.0 | 100.0 |

Proteins

| Method | Deg. | Orbit | Clus. | Spec. | Novelty | Uniqueness |
|---|---|---|---|---|---|---|
| Training set | 0.0003 | 0.0068 | 0.0032 | 0.0005 | -- | -- |
| GEEL | 0.2110 | 0.3753 | 0.1768 | 0.1689 | 84.6 | 100.0 |
| AutoGraph | 0.0004 | 0.0244 | 0.0056 | 0.0013 | 100.0 | 100.0 |

These results further underscore the novelty and significance of AutoGraph. We will add these new results and our analysis to the revised manuscript to ensure a fair and complete comparison.

Insufficient implementation details

All hyperparameter details are provided in Appendix E.4. We also clarify that Llama 1 and Llama 2 share the same decoder-only transformer architecture, differing only in model hyperparameters, and our model uses exactly this architecture. To ensure full reproducibility, we will release all documented code and pretrained model weights upon publication, as stated in Appendix B.1.

Questions:

Clarification on token definition

We kindly direct the reviewer to Section 2.4 (Tokenization of SENT) which describes tokens and our tokenization process. In short, a token can represent a node index, a special token, a node label, or an edge label in our sequential representation.

Unlike GEEL, our SENT sequences are subgraph-induced. This property is the key reason AutoGraph supports substructure-conditioned generation and achieves high VUN, while other sequence representations, such as GEEL and our SET representation, tend to memorize training samples, as evidenced by GEEL's low VUN and novelty scores in our new experiments.

Pre-training data scale

Our model is a significantly smaller variant of Llama with 113M parameters (Line 271), not the multi-billion parameter versions, making it suitable for standard dataset sizes.

In addition, as our SENT sequences are random sequences, we resample them at each iteration. This acts as a powerful form of data augmentation, mitigating overfitting even on smaller datasets like Planar. This makes AutoGraph a versatile and powerful model, effective in both small- and large-data regimes.
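A minimal sketch of this resampling-as-augmentation setup, assuming a hypothetical sample_sequence sampler (the paper's SENT sampler would plug in here): the dataset draws a fresh flattening of each stored graph on every access, so the model rarely sees the exact same sequence twice.

```python
import random
from torch.utils.data import Dataset

class ResampledSequenceDataset(Dataset):
    """Each __getitem__ draws a fresh random flattening of the stored graph.

    Because many distinct valid sequences represent the same graph, resampling at every
    iteration acts as data augmentation. `sample_sequence` is a hypothetical stand-in
    for the actual sequence sampler.
    """
    def __init__(self, graphs, sample_sequence):
        self.graphs = graphs
        self.sample_sequence = sample_sequence

    def __len__(self):
        return len(self.graphs)

    def __getitem__(self, idx):
        return self.sample_sequence(self.graphs[idx])   # new token sequence on every access

# Toy usage: "graphs" are edge lists; the sampler here just shuffles the edge order.
toy_sampler = lambda edges: [tok for e in random.sample(edges, len(edges)) for tok in e]
ds = ResampledSequenceDataset([[(0, 1), (1, 2), (2, 0)]], toy_sampler)
print(ds[0]); print(ds[0])   # two calls usually give different sequences for the same graph
```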

Comment

Sorry for the missing references. I have attached them below.

[1] A simple and scalable representation for graph generation. Jang et al., ICLR 2024

[2] Graph generation with K2-trees. Jang et al., ICLR 2024

[3] Graphgen: A scalable approach to domain-agnostic labeled graph generation. Goyal et al, Proceedings of the web conference, 2020

[4] Improving graph generation by restricting graph bandwidth. Diament et al., ICML 2023.

As the rebuttal addresses my concerns, I raised my score.

Comment

We thank the reviewer for considering our rebuttal and their additional references. Please note that we already discussed [1,3,4] in our related work section, and we will discuss [2] in our revision. We are happy to continue the discussion if they have any further questions.

Comment

Dear reviewer,

Following the NeurIPS 2025 guidelines, I kindly encourage you to read and respond to author rebuttals as soon as possible. Please engage actively in the discussion with authors during the rebuttal process, update your review by filling in the "Final Justification" and acknowledge your involvement.

Thank you, Your AC

Comment

Dear reviewer,

Following the NeurIPS 2025 guidelines, I kindly encourage you to read and respond to author rebuttals as soon as possible as the reviewer-author discussion period is coming to the end. Please engage actively in the discussion with authors during the rebuttal process, update your review by filling in the "Final Justification" and acknowledge your involvement.

Thank you, Your AC

Final Decision

This paper studies the graph generation problem by transforming it into a graph-to-sequence problem. Specifically, the authors proposed a new invertible tokenization to transform a graph into a sequence representation. Empirical evaluations were conducted on large graphs and molecular graphs. Though the performance on molecular graphs was not particularly impressive, the additional experiments on transferability and substructure-conditioned generation demonstrated promising results.

One critical concern raised by multiple reviewers during the rebuttal was the discussion of related work and the overclaiming of novelty. I hope the authors will address these points in the revision.

Two reviewers gave borderline reject but raised their scores to borderline accept after the rebuttal. As all the reviewers are positive about the paper, I vote to accept this paper.