Pard: Permutation-Invariant Autoregressive Diffusion for Graph Generation
Combining autoregressive methods with diffusion to model graph distributions, achieving SOTA graph generation performance.
Abstract
Reviews and Discussion
The paper introduces PARD, a graph generation model that combines autoregressive and diffusion models. Traditional autoregressive models are effective but sensitive to order, while diffusion models are permutation-invariant but need many denoising steps and extra features. PARD overcomes these issues by generating graphs block-by-block using a partial order of nodes and edges. It employs a shared diffusion model with an equivariant network and a higher-order graph transformer, supporting parallel training like GPT. PARD achieves state-of-the-art performance on various datasets, including large ones like MOSES, without extra features.
Strengths
- This work presents a successful showcase of combining autoregressive modeling with diffusion models on graphs.
- The proposed partial order ensures the permutation-invariance in the autoregressive generation process.
- Impressive experimental results show the effectiveness and efficiency.
Weaknesses
- It is unclear how the diffusion model is employed in PARD. Sec 3.1 and the second part of Eq. 6 are not quite relevant to each other. Can you elaborate on that?
- Please provide proofs or references for some statements, e.g., the 2-FWL expressivity of the proposed higher-order transformer.
- Please provide the results of other baselines on QM9 if possible.
Questions
- Seems that in Fig. 2 the prior of the diffusion process is not the same as the second part of Eq. 6. What is the choice of the graph distribution at timestep T?
- The total number of diffusion steps is directly related to the number of blocks. I anticipate the generic graphs will have more blocks than QM9. Can you show the total number of diffusion steps for other datasets?
- Some of the discussions are redundant and can be moved to the appendix, such as the comparison to DiGress and GRAN in Sec. 3.1 and Sec. 3.2, and the energy view in Sec. 3.4. The running time comparison could be moved to the experiment section.
Limitations
There is no discussion of limitations in the manuscript.
Thank you for these detailed questions and for the feedback on improving the presentation. Let us provide a detailed response to all your questions.
- It is unclear how the diffusion model is employed in PARD. Sec 3.1 and the second part of Eq. 6 are not quite relevant to each other. Can you elaborate on that?
Sure! As you know, diffusion requires the input and output to have exactly the same size (the same number of nodes and edges), as shown in Sec. 3.1. Hence, to use diffusion for each block’s conditional distribution, we pad the input (which has B blocks) to the size of the output (which has B+1 blocks). That is why the second part of Eq. 6 is conditioned on the target block’s size, while the first part of Eq. 6 captures the distribution of the next (i.e., (B+1)-th) block’s size (|.| denotes size). The padding procedure is also shown in Fig. 2, where dashed (virtual) nodes and edges are added. After padding, we use the second part of Eq. 6 to model the diffusion process, following the procedure introduced in Sec. 3.1, except that we DO NOT modify any of the previously generated B blocks: noise is only introduced in the target (B+1)-th block. Please let us know whether this is clear; we would be glad to hear your feedback on how to make it easier to understand. You can also study the exact implementation in the provided anonymous repo, under pard/parallel/task.py.
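To make the padding-and-denoising procedure more concrete, here is a minimal Python-style sketch of one generation step. All helper and method names (`size_model.sample`, `pad_with_virtual`, `sample_prior`, `denoise_step`) are hypothetical placeholders for illustration, not the actual API in pard/parallel/task.py:

```python
def generate_next_block(graph, size_model, diffusion_model, pad_with_virtual, num_steps):
    """Hypothetical sketch of one autoregressive step of block-wise generation (Eq. 6)."""
    # Eq. 6, first factor: sample the size of the (B+1)-th block,
    # conditioned on the B blocks generated so far.
    next_size = size_model.sample(graph)

    # Pad with virtual nodes/edges (dashed elements in Fig. 2) so the
    # diffusion input has the same size as its output.
    padded, target_mask = pad_with_virtual(graph, next_size)

    # Eq. 6, second factor: discrete diffusion restricted to the target block.
    # The previously generated B blocks are never noised; the mask confines
    # all noising/denoising to the (B+1)-th block.
    x = diffusion_model.sample_prior(padded, mask=target_mask)
    for t in reversed(range(num_steps)):
        x = diffusion_model.denoise_step(x, t, mask=target_mask)
    return x  # graph with B+1 blocks
```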
- Proof for the statement: 2-FWL expressivity for the proposed higher-order transformer.
As the higher-order transformer combines PPGN and Transformer, it shares the same expressivity as PPGN. The reason is simple: the Transformer part (with weaker expressivity) can implement the identity mapping and therefore does not affect the expressivity of the PPGN part. For the 2-FWL expressivity of PPGN, please see [this paper](https://arxiv.org/abs/1905.11136) for a detailed proof.
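To spell out this reasoning in symbols (our paraphrase, with generic notation rather than the paper's): let $f_{\mathrm{PPGN}}$ denote a PPGN block and $f_{\mathrm{TF}}$ a Transformer block with a residual connection, $f_{\mathrm{TF}}(X) = X + g(X)$. Zeroing the weights of $g$ gives $f_{\mathrm{TF}} = \mathrm{id}$, hence

$$
\{\, f_{\mathrm{TF}} \circ f_{\mathrm{PPGN}} \,\} \;\supseteq\; \{\, \mathrm{id} \circ f_{\mathrm{PPGN}} \,\} \;=\; \{\, f_{\mathrm{PPGN}} \,\},
$$

so the hybrid can distinguish every pair of graphs that PPGN distinguishes, i.e. it is at least as expressive as PPGN, which matches 2-FWL.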
- Please provide the results of other baselines on QM9 if possible.
For QM9, we adopt the more challenging setting that considers explicit hydrogens, as introduced in DiGress. We avoid the easier setting because most baselines, including DiGress and SwinGNN, can achieve 99% accuracy on it, rendering it an ineffective testbed due to the lack of distinction between models. However, we do not have the performance numbers for other baselines as they did not report their numbers in this more challenging setting. Given the time limitation of the rebuttal period, we apologize that re-running all baselines for the new setting is unrealistic, and hope you can understand.
- The total number of diffusion steps is directly related to the number of blocks. I anticipate the generic graphs will have more blocks than QM9. Can you show the total number of diffusion steps for other datasets?
As shown in Appendix Table 6, we can achieve the same result using only 140 total diffusion steps in QM9. To demonstrate the number of blocks for a generic setting, we report the numbers for the grid dataset, which has the largest graph size (100~400). For this dataset, there is a maximum of 27 blocks, with an average of 18.625 blocks. With the default 40 steps we used, the total number of diffusion steps is 745. Note that we haven't conducted an ablation study to reduce the number of diffusion steps as shown in Table 6, and we expect that a smaller number of diffusion steps can be used without sacrificing performance.
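For clarity, the arithmetic behind the 745 figure on the grid dataset is simply

$$
\text{total diffusion steps} \;=\; \text{avg.\ number of blocks} \times \text{steps per block} \;=\; 18.625 \times 40 \;=\; 745.
$$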
- Some of the discussions are redundant and can be moved to the appendix.
Thank you for your suggestion; we fully agree that the parts related to DiGress and GRAN can be moved to the appendix, and we will make this adjustment. Please let us know if you think other interesting results should be moved to the main section.
- There is no discussion about the limitation in the manuscript.
We briefly discuss the limitation in the Neurips checklist section, but we would like to elaborate further here. One major limitation is that each block currently uses the same number of diffusion steps, which is not ideal and can result in unnecessary sampling steps. Empirical evidence shows that approximately 50% of blocks can be predicted directly without any sampling steps, as they do not encounter transformation problems (introduced in Sec. 3.3). It would be highly beneficial to develop a hardness measure (such as graph energy) for the symmetry problem and use this measure to determine the appropriate number of diffusion steps for each block. This approach could significantly reduce the total number of diffusion steps.
If you feel that we have addressed most of your concerns, please consider supporting us in achieving a better overall score. Thank you for your time.
Thank you for the response. I want to mention that one of the main contributions of this work is its improvement in efficiency. Given the discussion, the inference time of the proposed method highly relies on the number of blocks and does not necessarily offer efficiency on large graphs in the current version. However, the idea itself is interesting and the results are promising. Therefore, I have increased my score to 6 to support this work.
Thank you for your further support. We want to mention that for larger graphs, Pard should still outperform DiGress in both efficiency and performance. When considering larger graphs, each block's number of diffusion steps can be further reduced, as there is dramatically less symmetry inside. In fact, we have found that DiGress struggles to perform well on the Grid dataset (the largest dataset we considered) given the rich symmetry inside. Pard breaks symmetry easily with its AR mechanism, so later blocks have less and less symmetry, and many blocks do not need any diffusion steps (50% in the QM9 dataset). In the final version of the paper, we will add an ablation study on Grid to show the effectiveness in both efficiency and performance compared to the single-block, DiGress-like version. Finally, we appreciate the discussion you provided, and we are grateful for the feedback for improving our paper.
This paper proposes a graph generation method that combines AutoRegressive (AR) models and diffusion models. By utilizing a unique partial order, it addresses the issue of non-exchangeable probabilities in AR models and the efficiency problem in diffusion models.
Strengths
- The proposed block-wise AR diffusion model in this paper offers a new idea for graph generation, particularly by introducing the use of the weighted degree to differentiate blocks.
- The limitations of equivariant networks demonstrated in this paper also hold value for further exploration and resolution within the community.
- The overall structure and writing of the paper are relatively clear.
Weaknesses
- There is a part in the paper that I believe needs to be clarified more clearly to ensure logical coherence: why does diffusion based on an equivariant network solve the flaw in equivariant modeling? I think that beyond the analogy of tempering iron (or higher/lower energy), more mathematical proof is needed.
- An ablation of PPGN is necessary to demonstrate its effectiveness.
- Following the experimental settings of GDSS, NSPDK is also an important metric for QM9 and ZINC250K.
Questions
- Is there a reasonable explanation for the significant improvement of FCD on ZINC250K compared to other baselines? Similarly, why is there such a large difference in performance between Scaf. and baseline methods on MOSES?
Limitations
Yes
Thank you for the positive feedback on our paper; we address all your questions in detail below.
- Why does diffusion based on an equivariant network solve the flaw in equivariant modeling?
The underlying magic behind the randomness introduced by the diffusion process is relatively straightforward. During each training and inference step, noise is added to the current graph structure (assumed to have many symmetries) through resampling. This modifies the graph into a different structure with fewer symmetries. Consequently, the number of sampling steps is crucial in controlling the degree of symmetry breaking: a larger number of sampling steps increases the likelihood of successfully breaking the current symmetries, thereby making it easier to denoise to the target graph. This concept shares similarities with the findings of Abboud et al. in their paper "The surprising power of graph neural networks with random node initialization." We acknowledge the need for a formal proof of its connection to graph energy, and we have highlighted this in our paper to encourage future research. Note that ours is the first work to mention this issue for equivariant generative models and to point out the connection to diffusion. We hope this can motivate future research on equivariant generative modeling.
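As a toy illustration of this point (our own example, not code from the paper): orbit-equivalent nodes receive identical predictions from any deterministic equivariant model, but independently resampled noise almost surely makes their states differ, after which the denoiser is free to treat them differently.

```python
import numpy as np

rng = np.random.default_rng()

# Toy example: a 4-cycle, where all four nodes lie in the same orbit.
# A deterministic equivariant model must assign identical per-node marginals
# to orbit-equivalent nodes, so it cannot map this graph to an asymmetric target.
node_states = np.zeros(4, dtype=int)        # identical states -> full symmetry

# One "noising" step of a discrete diffusion process: each node's state is
# resampled independently, which almost surely makes the states differ and
# thus breaks the orbit symmetry before the next denoising step.
noisy_states = rng.integers(0, 3, size=4)

print("symmetric input:  ", node_states)
print("after resampling: ", noisy_states)
# With distinct states, an equivariant denoiser is no longer forced to treat
# these nodes identically; more sampling steps give more chances to break
# any remaining symmetries, which is why the number of steps matters.
```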
- Ablation of PPGN is necessary to demonstrate its effectiveness.
We have done an ablation study on the QM9 dataset. Please see the following table.
| QM9 | Pard w. Transformer | Pard w. PPGN | Pard w. PPGNTransformer |
|---|---|---|---|
| Maximum Hops | 3 | 3 | 3 |
| Total Diffusion Steps | 140 | 140 | 140 |
| Validity | 26.3 | 96.6 | 97.1 |
| Uniqueness | 94.5 | 96.3 | 96.0 |
| Mol stability | 17.6 | 84.67 | 86.2 |
| Atom Stability | 81.4 | 98.2 | 98.4 |
Interestingly, this indicates that PPGN is essential for diffusion with autoregression. As the transformer architecture is used in full-graph diffusion like DiGress, we hypothesize that the architecture matters less when diffusion is used directly without any autoregression. Hence, we ran another ablation setting the maximum hops to 0. At 0 hops, all nodes have a fixed degree of 1, resulting in a single block per graph, which is equivalent to full-graph diffusion (only) without autoregression.
| QM9 | Pard w. Transformer | Pard w. PPGN | Pard w. PPGNTransformer |
|---|---|---|---|
| Maximum Hops | 0 | 0 | 0 |
| Total Diffusion Steps | 140 | 140 | 140 |
| Validity | 93.1 | 93.3 | 93.8 |
| Uniqueness | 96.3 | 96.7 | 96.9 |
| Mol stability | 74.7 | 77.1 | 76.4 |
| Atom Stability | 97.6 | 97.9 | 97.7 |
The above result confirms that performance is not very sensitive to the architecture when no autoregression is employed.
- Following the experimental settings of GDSS, NSPDK is also an important metric for QM9 and ZINC250K.
We apologize for unintentionally omitting the NSPDK metric. We have now implemented it and computed it for QM9 and ZINC250K. Note that we do not have baseline numbers on QM9, as we use the harder QM9 setting with explicit hydrogens introduced in DiGress. For reference, SwinGNN achieves an NSPDK of 2.01e-04 in the simpler setting without hydrogens. We will update the table in the final version to include the NSPDK metric.
| NSPDK | QM9 | ZINC250K |
|---|---|---|
| Pard | 2.40e-04 | 7.10e-04 |
| DiGress | - | 8.75e-03 |
| SwinGNN | - | 1.64e-03 |
| GDSS | - | 1.80e-02 |
- Is there a reasonable explanation for the significant improvement of FCD on ZINC250K compared to other baselines? Similarly, why is there such a large difference in performance between Scaf. and baseline methods on MOSES?
Our method's superior performance on these metrics can be attributed to the synergistic combination of autoregressive (AR) and diffusion models, which allows us to better capture the underlying molecular distribution. For both ZINC250K and MOSES datasets, we used the standardized evaluation code provided in the MOSES repository to compute FCD and Scaf. metrics, ensuring consistency with baseline methods. We are confident in the reproducibility of our results and can provide model checkpoints.
However, we are not deeply familiar with some molecule-specific metrics such as Scaf., and we acknowledge that a deeper understanding of them would be beneficial. Future collaboration with domain experts in molecular science could provide valuable insights into these performance differences and help refine our approach further.
We hope we have addressed your concerns and questions, and thank you again for your detailed feedback. If you feel more confident with these additional ablations and explanations, please help us in achieving a better score. Thank you.
As the discussion deadline is approaching, we kindly ask whether you still have additional questions. Given that we have now answered all of your questions and demonstrated extensive ablations, we would like to hear back from you with your thoughts.
After reviewing the additional experimental results, I maintain my positive score.
The work proposes a new graph generative model based on an autoregressive procedure. It proposes an approach to deciding a partial order of graph nodes according to their degrees in a node-removal procedure. Based on the partial order, the work devises a new graph generative model.
Strengths
The graph algorithm of deciding a partial order of graph nodes would be interesting if such an algorithm does not exist in the literature of graph theory.
Weaknesses
The work lacks justification. As the field has moved to generative methods with discrete-diffusion models, which are already permutation-invariant, it is less clear about the advantage of designing a complex autoregressive model to satisfy the permutation-invariant property.
The advantage of the model is not obvious even considering only autoregressive models. Note that Chen et al. [9] have an approach of "optimizing" node orders for the generative model and show that the likelihood calculation is more accurate with their approach than a pre-determined order. How does the work justify its advantage over such an approach?
The analysis in 3.3 does not seem to be reasonable. The probability calculations are indeed the same for nodes in the same orbit, but they may get different connections in the sampling procedure and then break the symmetry. The analysis in 3.3 is well known, and it is not a concern for generative models. In some diffusion-based generative models, the starting graph is a graph with no edges, then all nodes are in the same orbit, but it is not an issue at all because the edge sampling process will break the symmetry.
Without clear justification, I don't know where performance improvements are from (maybe architecture improvement?). I feel that the work should have a thorough investigation of the model.
Questions
How do you justify the advantage of using an autoregressive model with partial order?
Limitations
The proposed model seems to have a long running time because it needs to run a diffusion model at the generation of each block.
Thank you for your questions. We have conducted extensive new ablation studies. We want to show that our analysis/motivation is not just a fancy story, but indeed the primary driver behind performance.
- Lack of justification: it is less clear about the advantage of designing a complex autoregressive model to satisfy the permutation-invariant property.
The reviewer questions why we introduce an autoregressive (AR) approach to a diffusion model, given that the default diffusion model already captures the permutation-invariant property. In DiGress, we observed that full-graph diffusion, while permutation-invariant, requires (1) many sampling/denoising steps to break symmetry and (2) additional features like eigenvectors to further break symmetry. This shows that directly capturing the FULL joint distribution and solving the transformation difficulty (in Sec. 3.3) via diffusion is challenging. Additionally, AR methods still dominate LLMs, indicating their potential to benefit diffusion models. As stated in the abstract, this paper aims to combine the strengths of both approaches: AR’s training and inference efficiency and diffusion’s data efficiency through permutation-invariance.
To verify our analysis quantitatively, we conduct an ablation study on the maximum number of hops, which controls the degree of autoregression. At 0 hops, all nodes have a fixed degree of 1, resulting in a single block per graph, which is equivalent to full-graph diffusion (only) without autoregression. Increasing the maximum hops produces more blocks with a smaller average block size, indicating more AR steps.
| QM9 | Pard, 0 hops (no AR) | Pard, 1 hop | Pard, 2 hops | Pard, 3 hops |
|---|---|---|---|---|
| Maximum hops | 0 | 1 | 2 | 3 |
| Average number of blocks | 1 | 4.3 | 5.6 | 7.75 |
| Diffusion steps per block | 140 | 32 | 25 | 20 |
| Total diffusion steps | 140 | 140 | 140 | 140 |
| Validity | 93.8 | 97.1 | 96.7 | 97.0 |
| Uniqueness | 96.9 | 96.5 | 96.2 | 96.1 |
| Mol stability | 76.4 | 86.1 | 85.4 | 86.3 |
| Atom Stability | 97.7 | 98.3 | 98.3 | 98.4 |
This ablation study maintains 140 total diffusion steps across all trials, using the same model architecture, diffusion algorithm, and training settings. The significant improvement from 0 hops (no AR) to 1 hop confirms that our enhancement stems from breaking the full joint distribution into several conditional distributions, effectively combining diffusion and AR approaches.
- The likelihood calculation is more accurate with Chen[9]’s approach than a predetermined order. How does the work justify its advantage over such an approach?
The reviewer is concerned that the predefined partial order is not optimal, given that Chen et al. [9] show that a learned order can be better than a predefined node order like BFS or DFS. We want to clarify that we are not arguing that our predefined partial order is optimal; instead, we claim that a partial order should replace the absolute order in an autoregressive approach, as it is fully deterministic and a canonical partial order can be easily defined, which is the cornerstone of the permutation-invariance we achieve. In contrast, an absolute node order contains randomness (finding a canonical node order is as hard as the graph isomorphism test), cannot achieve permutation invariance, and is hence data inefficient. Our experiments comparing with many AR methods already verify this.
Nevertheless, our predefined partial order may not be optimal. In the future, a reasonable follow-up would be to learn a task-dependent partial order based on graph structures, e.g., via the variational approach in Chen et al. [9]. We look forward to seeing better partial orders being discovered.
- The analysis in 3.3 is well known, and it is not a concern for generative models, as the sampling process will break the symmetry.
The analysis in 3.3 is indeed known in the link prediction setting, but this is the first time it has been mentioned for graph generation. If you know of a paper that has mentioned it for generation, we are happy to add references.
Indeed, sampling breaks the symmetry and makes the transformation achievable under certain conditions. We have clearly stated this in Sec. 3.4, which motivates us to use diffusion for capturing each block's conditional distribution. Nevertheless, since sampling is the only way to break symmetry, many sampling steps are needed in a symmetry-intense setting. Splitting the FULL joint distribution into several conditional distributions simplifies it, with each conditional distribution containing less symmetry and hence being easier to capture. As a result, we show not only a significant improvement in generation quality, but also a great reduction in sampling steps.
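Written out in generic notation (ours, not necessarily the paper's), this is the block-wise autoregressive factorization

$$
p(G) \;=\; \prod_{b=1}^{B} p\!\left(\mathcal{B}_b \,\middle|\, \mathcal{B}_1,\dots,\mathcal{B}_{b-1}\right),
$$

where each conditional is modeled by a shared discrete diffusion process and typically involves far less symmetry than the full joint distribution $p(G)$.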
At 0 hops, more diffusion steps break more symmetries, so one might argue that we simply need to increase the number of diffusion steps. We therefore conduct an additional ablation over the number of diffusion steps at 0 hops. As shown below, there is still a large gap to the diffusion+AR hybrid.
| Maximum Hops | 0 | 0 | 0 | 0 | 1 |
|---|---|---|---|---|---|
| Total Diffusion Steps | 70 | 140 | 280 | 490 | 140 |
| Validity | 92.1 | 93.8 | 94.3 | 95.2 | 97.1 |
| Uniqueness | 96.4 | 96.9 | 96.5 | 96.9 | 96.5 |
| Mol stability | 74.0 | 76.4 | 79.3 | 79.2 | 86.1 |
| Atom Stability | 97.4 | 97.7 | 97.9 | 98.0 | 98.3 |
- Pard seems to have a long inference running time.
In diffusion, inference time depends on (1) total number of diffusion steps and (2) resource usage per step. Pard outperforms full-graph diffusion in both aspects:
- Pard uses 1/4 of the steps (140 vs. 500) and significantly outperforms DiGress.
- DiGress denoises the whole graph at each step, while Pard only denoises the next block (significantly smaller) with less resource per step.
- Pard, with its AR mechanism, allows previously generated blocks to be cached, accelerating inference similarly to the KV cache in LLM inference (sketched below).
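To illustrate the caching point in the last bullet, here is a hypothetical sketch (placeholder names, not the actual implementation; for brevity the block sizes are treated as given rather than sampled): frozen blocks are encoded once and reused across the denoising of every later block.

```python
def generate_graph(block_sizes, encode_block, diffusion_model, steps_per_block):
    """Hypothetical AR inference loop with cached block representations."""
    cache = []   # per-block representations of frozen, already-generated blocks
    blocks = []
    for size in block_sizes:
        # Denoise only the next block (of `size` nodes/edges), conditioning on
        # the cached context; each step touches the new block, not the full graph.
        new_block = diffusion_model.sample_block(cache, size, steps=steps_per_block)
        blocks.append(new_block)
        # The new block is now frozen: encode it once and reuse it for all later
        # blocks, analogous to the KV cache in LLM inference.
        cache.append(encode_block(new_block, context=cache))
    return blocks
```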
We hope these ablations showcase the advantage of Pard and the intrinsic source of the performance improvement clearly. We are happy to address any concerns and hope you can consider reevaluating our work.
Can you further elaborate on how the proposed model uses fewer steps than DiGress? If my understanding is correct, the generation of each block would require multiple denoising iterations. Suppose there are 5 blocks, and the total number of iterations is 1/4 of DiGress, then on average each block takes 1/20 of DiGress iterations. Is this correct?
If the generated blocks are cached, does it mean that their representations are not updated by later generated structures?
Thank you for the timely response. For your additional two questions:
- Yes, each block only takes 1/20 of DiGress's iterations on average. Please see the first ablation study table: the 'Diffusion steps per block' row indicates the number of steps used for each block, and total diffusion steps = number of blocks × diffusion steps per block.
- Yes, all generated blocks are fixed and are conditioned on to generate the next block. Noise and randomness are only introduced in the target block. In Section 4.2, we leverage this property to enable parallel training of all blocks, similar to how next-token predictions are parallelized in LLM training.
Please let us know if you have any further questions!
It seems a little strange to me that the generated blocks are not updated. Initial blocks should have a lot of symmetry, and late blocks break the symmetry by adding new structures. If the representations of the initial blocks are not updated, they will provide less information than they should. Is that true?
You are correct: the symmetry is not evenly distributed across blocks. In fact, we have found that about 50% of blocks have no symmetry inside: directly predicting the next block gives 100% accuracy, and no noise is needed at all. This means that using 1/4 of DiGress's total diffusion steps is not optimal: one could use AT LEAST as few as a 1/8 fraction of DiGress's total diffusion steps and still outperform it (since about 50% of blocks do not need any diffusion steps). In fact, in the response to the last reviewer, we mentioned the following:
One major limitation is that each block currently uses the same number of diffusion steps, which is not ideal and can result in unnecessary sampling steps. Empirical evidence shows that approximately 50% of blocks can be predicted directly without any sampling steps, as they do not encounter transformation problems (introduced in Sec. 3.3). It would be highly beneficial to develop a hardness measure (such as graph energy) for the symmetry problem and use this measure to determine the appropriate number of diffusion steps for each block. This approach could significantly reduce the total number of diffusion steps.
However, since we haven't yet found a straightforward solution, we decided to assign the same number of diffusion steps to each block. If the reviewer has concerns about our results, we encourage you to run the provided source code, as our results are fully reproducible. If any issues arise, we are also able to provide checkpoints for further verification.
We are more than willing to provide any additional information you require and kindly ask that you reconsider our work. We believe that our paper does not exhibit "technical flaws, weak evaluation, inadequate reproducibility, or incompletely addressed ethical considerations."
We realized that you may also have another question: the parallel version of our algorithm shares representations across blocks, so earlier-generated blocks provide less information to future blocks because their representations cannot be updated. We have in fact addressed this question in Section 4.2 of the original paper, which explains why we designed a parallel version of our algorithm.
Initially, our algorithm was not parallel at all: representations were not shared across steps, and each step used its own graph representation to obtain the best performance. However, this approach suffers heavily in training time: if there are K blocks in total, training takes roughly K times longer than DiGress, since there are K times more input graphs when each conditional step's input graph is viewed as an individual graph. We solve this training-time issue with our proposed nontrivial, novel, and critical parallel training algorithm, in which all steps share the same block representations. Notice that this is an extremely important design: the advantage of a causal transformer over an RNN is the ability to train all next-token predictions in parallel. Of course, one gets less information from previous tokens for future token predictions; that is why BERT is preferred over decoder-only GPT for document embedding. Nevertheless, given GPT's efficiency in capturing the whole distribution and in parallel training, it remains the dominant approach for LLMs, and the problem you mention does not block its effectiveness.
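For concreteness, here is a rough sketch of the contrast described above (our own pseudocode-level illustration with placeholder names, not the actual training code): the naive scheme treats each conditional step as a separate input graph, while the parallel version processes all next-block predictions of a graph in one pass, analogous to next-token prediction in a causal Transformer.

```python
def naive_block_training(graphs, train_step):
    """Each conditional step is its own forward/backward pass: with K blocks per
    graph this is roughly K times more work than single-pass training."""
    for g in graphs:
        for b in range(1, g.num_blocks):
            train_step(context=g.blocks[:b], target=g.blocks[b])

def parallel_block_training(graphs, parallel_train_step):
    """All next-block predictions of a graph share one forward pass (Sec. 4.2):
    block representations are shared, and a causal-style mask ensures block b
    is predicted only from blocks < b, like next-token prediction in a GPT."""
    for g in graphs:
        parallel_train_step(blocks=g.blocks)
```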
We suggest the reviewer take a look at Section 4.2 for the detailed design and explanations.
As the discussion deadline is approaching, we kindly ask whether you still have additional questions. Given that we have now answered all of your questions and demonstrated extensive ablations, we would like to hear back from you with your thoughts.
Dear Reviewer adow, We hope this message finds you well.
As the discussion deadline is rapidly approaching (in approximately 10 hours), we are reaching out to respectfully request your final feedback on our rebuttal. You provided the lowest score with soundness 1 (poor) and contribution 2 in the category of "technical flaws, weak evaluation, inadequate reproducibility, or incompletely addressed ethical considerations." We greatly appreciate the thorough review and the numerous questions you raised, as they have helped us improve our work.
In response, we have diligently addressed all of your questions in detail in many rounds, providing extensive ablation studies to support our claims and clarify any potential misunderstandings. We believe this additional information directly addresses the concerns that led to the low score. Given the significance of your evaluation and the substantial effort we've put into addressing your queries, we kindly request that you review our responses and provide your updated assessment. Your feedback is crucial for a fair evaluation of our work and will greatly assist the area chair in making an informed decision.
We understand that time constraints can be challenging, but we would be incredibly grateful if you could take a moment to review our rebuttal and adjust your feedback accordingly. Your expertise and insights are invaluable to this process, and we look forward to your response. Thank you for your time and consideration. We appreciate your dedication to maintaining the high standards of our field.
This paper proposes to seamlessly integrate autoregressive models with diffusion models to harness the effectiveness and efficiency of the autoregressive model while maintaining permutation invariance without order sensitivity. It also proposes architectural improvements to make the model and algorithm efficient and scalable. The presentation is smooth, and the experimental results on both molecular and general graph generation demonstrate its effectiveness.
Strengths
It proposes a novel graph decomposition method that considers not individual nodes and their degrees but subsets of nodes with structural similarity. In this way, it removes node order sensitivity within the graph and only needs to maintain the order of the blocks. Within each block, the diffusion model focuses on a much smaller graph and thus can efficiently generate a denoised graph.
Weaknesses
It would be better if the authors could provide some insights about the hyperparameter controlling the maximum number of hops.
Questions
It would be better if the authors could provide some insights about the hyperparameter controlling the maximum number of hops.
Limitations
The authors have adequately mentioned several limitations of their work which sound quite reasonable.
Thank you for the positive feedback on our paper; we answer your question in detail below.
Provide some insights about the hyperparameter controlling the maximum number of hops
Let us first lay out the impact of the maximum hops on the model. There is a relation between the maximum hops and the block size (and hence the number of autoregression steps). As the maximum hops increases, the "weighted degree" of nodes diversifies with fewer equivalences, which induces a partial order with smaller blocks (i.e., fewer nodes and edges inside each block). Thus, the larger the maximum hops, the larger the number of blocks and hence the number of autoregression steps.
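To make this relation concrete, here is a simplified, self-contained sketch (our own illustration; the actual PARD procedure uses weighted degrees in a node-removal process and may differ in detail): nodes are grouped by a hop-limited structural signature, and at 0 hops every node falls into a single block, while larger hop limits give finer partitions, i.e., more and smaller blocks.

```python
def hop_signature(adj, node, max_hops):
    """Toy hop-limited signature of a node; adj maps node -> set of neighbors."""
    frontier, seen, sig = {node}, {node}, []
    for _ in range(max_hops):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
        sig.append(len(frontier))      # number of newly reached nodes per hop
    return tuple(sig)                  # max_hops = 0 gives () for every node

def blocks_by_signature(adj, max_hops):
    """Group nodes with identical signatures; each group acts as one block.
    Sorting the signatures yields a deterministic, permutation-invariant block
    order (PARD's real ordering is based on weighted degrees instead)."""
    groups = {}
    for node in adj:
        groups.setdefault(hop_signature(adj, node, max_hops), set()).add(node)
    return [groups[s] for s in sorted(groups)]

# Example: a path graph 0-1-2-3; 0 hops -> one block, 2 hops -> smaller blocks.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(blocks_by_signature(path, max_hops=0))   # [{0, 1, 2, 3}]
print(blocks_by_signature(path, max_hops=2))   # [{0, 3}, {1, 2}]
```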
Originally, we did not perform an ablation on the maximum hops during our experiments. Our intuition is that there is a tradeoff: the smaller the block size, the easier it is to learn the within-block diffusion, but the harder the autoregressive (AR) generation of the next block becomes due to exposure bias and the consequent error propagation of AR methods (if a block is mispredicted during inference, the error propagates and all future blocks conditioned on it are impacted; this is true for all AR methods, including LLMs). Since the maximum hops essentially controls the ratio between the diffusion model and the autoregressive approach, we choose it so that the number of blocks is 3 to 5 times smaller than the total number of nodes (i.e., each block has 3 to 5 nodes on average). Based on this intuition, we set one fixed value of the maximum hops for all molecular graphs and another for all non-molecular graphs, each chosen to provide a reasonable number of blocks.
Nevertheless, we would also like to answer the question about the influence of the maximum hops formally and quantitatively. Hence, we conducted an ablation study on the QM9 dataset, changing the maximum hops from 0 to 4. Note that at 0 hops, all nodes have a fixed degree of 1, so there is only a single block, which is equivalent to full-graph diffusion (only) without any autoregression. As the maximum hops increases, the block size tends to decrease (and the block count to increase), and we obtain the proposed diffusion-AR hybrid.
| QM9 | Pard, 0 hops | Pard, 1 hop | Pard, 2 hops | Pard, 3 hops | Pard, 4 hops |
|---|---|---|---|---|---|
| Maximum hops | 0 | 1 | 2 | 3 | 4 |
| Average number of blocks | 1 | 4.3 | 5.6 | 7.75 | 7.83 |
| Diffusion steps per block | 140 | 32 | 25 | 20 | 18 |
| Total diffusion steps | 140 | 140 | 140 | 140 | 140 |
| Validity | 93.8 | 97.1 | 96.7 | 97.0 | 97.1 |
| Uniqueness | 96.9 | 96.5 | 96.2 | 96.1 | 95.9 |
| Mol stability | 76.4 | 86.1 | 85.4 | 86.3 | 85.7 |
| Atom Stability | 97.7 | 98.3 | 98.3 | 98.4 | 98.3 |
As you can see from this ablation on the maximum hops, we have found:
- Combining autoregression and diffusion works significantly better than the diffusion-only (0 hops) approach.
- The performance seems to be robust with respect to the maximum hops (at least in the range 1-4).
We hope that we have fully addressed your concerns, and we look forward to your further input and your support of the paper with an improved score.
As the discussion deadline is approaching, we kindly ask whether you still have additional questions. Thanks again for supporting us.
We thank all the reviewers for their feedback and suggestions on our work. Here we first re-iterate the key contributions of our work, and then summarize the list of additional experiments we performed.
Why a Hybrid (AR+Diffusion) Approach for (Graph) Generation?
Our proposed Pard for graph generation is the first of its kind in bringing together autoregressive (AR) and diffusion-based generation. We want to emphasize that this approach stands on clear motivation; in other words, we have not simply "slapped" diffusion onto AR for no obvious reason.
Our contributions and the rationale/justification for PARD's effectiveness are: 1) We establish a novel partial order of nodes that is permutation-invariant. Thanks to this partial order, autoregression becomes permutation-invariant, which is not the case for ANY prior graph autoregressive method in the literature. This addresses the challenging pain point of AR being data inefficient and generalizing poorly. 2) With the order being partial, the graph is split into blocks, so that the complex joint distribution is decomposed into easier conditional probabilities. Next, we study how to model each block's conditional distribution. For the first time, we identify a fundamental problem of equivariant models for generation: general graph transformation without symmetry breaking is impossible for ANY equivariant model. This further motivates us to choose a diffusion approach to model each block's conditional distribution, relying on the magic of randomness and the annealing process. Following this line of reasoning step by step, we found that the combination of AR and diffusion is not accidental but an inevitable choice. Moreover, this combination successfully unites the strengths of both approaches while shedding their shortcomings:
- Being permutation invariant (+ advantage of diffusion), it generalizes better than AR and is much more data efficient (- downside of AR).
- By decomposing the hard joint probability into simpler conditional distributions (+ advantage of AR), it uses significantly fewer diffusion steps while dramatically outperforming the pure diffusion method (- downside of diffusion). Moreover, each inference step incurs less cost, as it does not access the full graph but only the generated part, and it can use a cache (like the KV cache in LLM inference) to avoid repeated computation (+ advantage of AR).
The impressive performance improvement (noted by Reviewer 4) strongly demonstrates the effectiveness of Pard, and many of its advantages were observed by the reviewers:
- Reviewer 3 has summarized it all well: “By utilizing a unique partial order, it addresses the issue of non-exchangeable probabilities in AR models and the efficiency problem in diffusion models.”
- Reviewer 1’s understanding is also spot-on: “... it removes node order sensitivity …. Within each block, the diffusion model focuses on a much smaller graph and thus has the efficiency to generate a denoised graph.”
- Reviewer 4 states: “successful showcase of the combination of autoregressive modeling with diffusion model”
Additional Ablations and Experiments
Nevertheless, we have performed extensive additional ablation studies to verify Pard's rationale and designs. There is an interplay between the number of AR vs. diffusion steps in Pard, which is indirectly controlled by the maximum hops. As the maximum hops increases, the weighted degrees of nodes diversify, and hence we obtain fewer node equivalences = smaller blocks = more AR steps. In contrast, at 0 hops we have only a single block, which corresponds to a diffusion-only model (no AR).
- In the first ablation on the QM9 dataset, we keep the total number of diffusion steps (across all AR steps) fixed, and study how performance changes as the maximum hops is varied from 0 to 4.
- In the second ablation, we increase the number of diffusion steps at 0 hops to see if more diffusion within a single block achieves better performance without AR, and compare that to the 1-hop setting (i.e., the diffusion+AR hybrid).
Based on reviewer feedback, we also run an ablation on the architecture. Recall that Pard combines a Transformer with PPGN (coined PPGNTransformer) in a novel design.
- We report performance results for Pard with PPGN only and with Transformer only, as compared to the original PPGN+Transformer.
- We also report results for an additional molecular metric (NSPDK), as requested by a reviewer.
We hope that the additional experiments, discussions, and clarifications lead to a better appreciation of the contributions of our work. Graph generation is a hard problem, gaining popularity only in the last 5 or so years, and we strongly believe that Pard sets a new milestone in this area with its rigorous design and SOTA empirical performance.
The paper introduces a novel framework that combines autoregressive (AR) and diffusion models for graph generation, effectively leveraging the strengths of both approaches while mitigating their inherent weaknesses. Reviewer SjHr praised the paper for its “novel graph decomposition method” that addresses node order sensitivity, noting the efficiency gains in graph denoising and describing the presentation as “smooth” with high impact contributions. Reviewer 2orD highlighted the innovative use of partial ordering to tackle “non-exchangeable probabilities in AR models,” deeming the work to be “technically solid” and offering a fresh perspective on graph generation. Reviewer KWRX echoed these sentiments, acknowledging the paper as a “successful showcase” of AR and diffusion models in tandem and underscored the “impressive experimental results” that demonstrate state-of-the-art performance on various datasets. Despite some concerns raised by Reviewer adow regarding the justification of combining AR with diffusion, the authors provided extensive ablation studies and clarifications, showing that their approach not only reduces the number of diffusion steps but also significantly outperforms existing models. The paper’s rigorous design, clear motivation, and robust empirical results make it a strong candidate for acceptance.