Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models
We introduce a universal graph structure augmentor for cross-domain data scaling.
Abstract
Reviews and Discussion
The authors propose UniAug, a universal graph structure augmentor based on a discrete diffusion model. It enables cross-domain pretraining on large-scale graph data and can be seamlessly plugged into downstream tasks to enhance performance on node-level, edge-level, and graph-level tasks through structural augmentation. The results demonstrate that graph data can also benefit from the "big data + big model" scaling effect, similar to what has been observed in the text and image domains.
Strengths and Weaknesses
Strengths:
- First cross-domain graph augmentation paradigm leveraging data scaling: demonstrates that large-scale heterogeneous graphs can be unified for pretraining and effectively transferred across domains.
- Structure-feature decoupling: focuses solely on modeling graph structure, addressing the challenge of inconsistent feature dimensions or semantics across different graphs.
- Self-conditioned discrete diffusion with guided generation: ensures efficient and stable learning on sparse graphs; the guidance head aligns generation with downstream tasks, mitigating noise.
- Plug-and-play compatibility: requires no modification to downstream GNNs and can be seamlessly integrated with existing graph data augmentation (GDA) methods.
Weaknesses:
- Due to computational constraints, the authors trained only a moderately sized diffusion model and did not systematically investigate how scaling model parameters or depth would affect performance.
- The discrete diffusion process involves many sampling steps, resulting in slow generation; the authors list "introducing fast sampling" as future work.
- UniAug does not model node/edge features at all, instead relying on "directly concatenating the original features back." This may lead to information loss or require extra processing in tasks that depend on attributes such as chemical bonds or textual properties.
- During pretraining, pseudo-labels are generated via clustering on graph statistics, but the authors did not conduct ablation studies on the quality of these pseudo-labels or the model's robustness to label noise.
- The authors manually filtered out "abnormal graphs" using thresholding, yet regions of "sparse distribution" still remain (as shown in Fig. 2).
- Although described as "plug-and-play," each task still requires training a guidance MLP head, and the experiments make use of supervision signals (e.g., node/graph labels, CN-based heuristics).
Questions
1. What were the thresholds and rules for filtering out "abnormal graphs"? Are the dataset licenses/publications compatible with large-scale public pretraining use?
2. Which graph statistics were used for clustering? How were hyperparameters (e.g., number of clusters K, distance metrics) chosen? Where are the experiments evaluating model sensitivity to pseudo-label noise? Can the authors report performance curves under varying noise levels?
3. In domains heavily reliant on node/edge attributes (e.g., molecular bonds, knowledge graph entity types), how do the authors ensure that modeling only structure does not omit critical information?
4. What are the number of layers, hidden dimensions, and total parameter count of the diffusion network? How does this compare to mainstream "foundation models for graphs"? Did the authors attempt to scale up the model or data size? If limited by compute, can they report scaling-law results across small, medium, and large setups?
5. On average, how many diffusion steps and how many seconds are required to generate one augmented graph? How does this compare to existing augmentors like GDA methods or EDGE in terms of speed? Has the feasibility of one-step or few-step sampling methods (e.g., consistency distillation) been explored?
6. How much labeled data is required to train a guidance head per task? What is the performance under low-label (<10%) or zero-shot settings? Has using a shared guidance head across multiple tasks been tested?
7. Why are recent generative models (e.g., GraphFM, GraphGPT) and attribute-aligned pretraining methods (e.g., GraphMVP, GraphMAE) not directly compared? For some tasks, only top-k accuracy is reported while ROC-AUC or MAE is omitted. Does this omission affect the generality of the conclusions?
Limitations
Please see above
Final Justification
I will keep my score
Formatting Issues
N/A
Q1. What were the thresholds and rules for filtering out “abnormal graphs”? Are the dataset licenses/publications compatible with large-scale public pretraining use?
We thank the reviewer for these important questions. To ensure the quality of our pre-training corpus, we applied a set of rules to programmatically filter out graphs with pathological or outlier structures. The thresholds we used are:
- Degree Mean: ≤3
- Degree Variance: ≤3
- Graph Density (for graphs with ≥100 nodes): ≤0.1
- Entropy: ≤10
These values were chosen based on an empirical analysis of the full data collection to remove clear outliers (e.g., star graphs, overly dense or sparse graphs) while retaining a diverse set of typical real-world networks.
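For illustration, a minimal sketch of such a rule-based filter (assuming networkx graphs; "entropy" is read here as the Shannon entropy of the degree distribution, which may differ from the exact statistic used):

```python
import math
import networkx as nx

def keep_graph(G: nx.Graph) -> bool:
    """Rule-based corpus filter using the thresholds listed above."""
    n = G.number_of_nodes()
    if n == 0:
        return False
    degrees = [d for _, d in G.degree()]
    mean_deg = sum(degrees) / n
    var_deg = sum((d - mean_deg) ** 2 for d in degrees) / n
    # Shannon entropy of the empirical degree distribution (assumed definition).
    counts = {}
    for d in degrees:
        counts[d] = counts.get(d, 0) + 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    if mean_deg > 3 or var_deg > 3 or entropy > 10:
        return False
    # The density threshold only applies to graphs with at least 100 nodes.
    if n >= 100 and nx.density(G) > 0.1:
        return False
    return True
```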
We have meticulously verified the license for each data source and confirm that all datasets are compatible with large-scale public pre-training use.
Q2. Which graph statistics were used for clustering? How were hyperparameters (e.g., number of clusters K, distance metrics) chosen? Where are the experiments evaluating model sensitivity to pseudo-label noise? Can the authors report performance curves under varying noise levels?
We thank the reviewer for these insightful questions about our methodology. We provide the requested details below:
- Graph Statistics Used: As detailed in Section 3.2, we generated graph-level representations using six standard graph properties: number of nodes, density, network entropy, average degree, degree variance, and scale-free exponent.
- Hyperparameter Selection: We employ the K-Means clustering algorithm on the graph statistics. The number of clusters (K) was chosen via a principled, two-step process. First, we identified candidate values for K that yielded good cluster separation. Then, from these candidates, we selected the final value of K=10 by maximizing the mean Silhouette Coefficient, ensuring the most stable and meaningful grouping in the data.
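For illustration, a minimal sketch of this selection procedure (a scikit-learn version, assuming one row of the six statistics per graph; not the exact implementation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def assign_pseudo_labels(stats: np.ndarray, candidate_ks=range(2, 21)):
    """Cluster per-graph statistic vectors; pick K by mean Silhouette Coefficient.

    `stats` holds one row per graph with the six properties listed above
    (num nodes, density, entropy, avg degree, degree variance, scale-free exponent).
    """
    X = StandardScaler().fit_transform(stats)  # put the statistics on a common scale
    best = (-1.0, None, None)  # (score, K, labels)
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best[0]:
            best = (score, k, labels)
    _, best_k, pseudo_labels = best
    return best_k, pseudo_labels
```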
Regarding the “label noise”, unlike in semi-supervised learning where pseudo-labels are noisy estimates of ground-truth classes, their purpose here is to provide a high-level structural conditioning signal. They group graphs with similar topological properties to encourage our diffusion model to learn a structured latent space. The goal is not to predict a "correct" cluster ID, but to leverage this grouping to enhance the model's ability to capture diverse structural patterns, a strategy also noted in [1].
Q3. In domains heavily reliant on node/edge attributes (e.g., molecular bonds, knowledge graph entity types), how do the authors ensure that modeling only structure does not omit critical information?
We thank you for this critical question, as it allows us to clarify a key detail of our method and highlight one of its surprising strengths. UniAug retains the original node attributes for downstream tasks but does not use edge attributes.
We recognize that edge attributes are vital in domains like molecular chemistry. However, our results in Table 3 show something remarkable: even on the edge-critical molecular regression task, UniAug outperforms other general-purpose baselines and achieves performance highly competitive with DCT, a model specifically pre-trained on molecules with their rich edge features. This result demonstrates that our structure-focused pre-training is effective enough to compensate for the absence of explicit edge features, likely by capturing their influence through topology alone.
Q4. What are the number of layers, hidden dimensions, and total parameter count of the diffusion network? How does this compare to mainstream “foundation models for graphs”? Did the authors attempt to scale up the model or data size? If limited by compute, can they report scaling law results across small, medium, and large setups?
We thank the reviewer for these important questions about model scale and its comparison to other large graph models.
- Model Size: Our diffusion network uses 4 layers, a hidden dimension of 128, and has approximately 2 million parameters.
- Comparison: The field of "foundation models for graphs" is still emerging, with few established benchmarks for model size. Compared to recent works, our model's parameter count is comparable to that of OFA [2] and significantly larger than that of GraphAny [3].
While a full scaling analysis for larger models was beyond our available computing resources, the layer-wise study below serves as a preliminary result. We evaluated generation quality for varying numbers of layers; as the table shows, performance on key structural metrics peaked at 4 layers and then began to degrade. This indicates a "sweet spot" in model capacity for this task, suggesting larger models may be prone to overfitting or optimization difficulties.
| Num Layers | Degree | Spectral | Clustering |
|---|---|---|---|
| 2 | 0.3851 | 0.4865 | 1.25 |
| 3 | 0.3344 | 0.4659 | 1.15 |
| 4 | 0.2648 | 0.4592 | 1.11 |
| 5 | 0.3378 | 0.4917 | 1.19 |
Q5. On average, how many diffusion steps and how many seconds are required to generate one augmented graph? How does this compare to existing augmenters like GDA or EDGE in terms of speed? Has the feasibility of one-step or few-step sampling methods (e.g., consistency distillation) been explored?
We thank the reviewer for these excellent questions about the efficiency of our augmentation process. Our standard process to generate one augmented graph uses 128 diffusion steps. On an NVIDIA A6000 GPU, this takes approximately 0.2 seconds per graph. In terms of speed, this is directly comparable to EDGE. As expected, this is slower than simple, non-diffusion methods, but our method provides much richer and more tailored structural augmentations.
Adapting complex techniques like consistency distillation, which are primarily designed for continuous data, to our discrete graph diffusion framework is a significant research challenge in its own right. However, we have explored a simpler approach to acceleration. Our preliminary experiments show that reducing the number of sampling steps to just 32 (a 4x speedup) still yields downstream performance that is highly competitive and comparable to using the full 128 steps.
Q6. How much labeled data is required to train a Guidance Head per task? What is the performance under low-label (<10%) or zero-shot settings? Has using a shared Guidance Head across multiple tasks been tested?
We thank the reviewer for these excellent questions about the data requirements and functionality of the guidance head. The guidance head is highly flexible and does not strictly require labeled data; its implementation depends on the downstream setting:
- Supervised setting: When task labels are available, the guidance head can be a simple, lightweight predictor trained on that data.
- Low-label / zero-shot setting: A key strength of our framework is its performance in label-scarce scenarios. In these cases, the guidance head can be implemented as a simple, unsupervised heuristic. As we demonstrate in our appendix (Table 18), using an established heuristic like Common Neighbors for link prediction, which requires zero labels, yields excellent results. This highlights our method's effectiveness in zero-shot contexts.
Regarding sharing a guidance head across tasks, we clarify that this would contradict the core design of our framework. The fundamental purpose of the guidance head is to be dataset-specific: it is a specialized, lightweight module that steers the universal, pre-trained diffusion model to generate augmentations tailored to the distribution of the target downstream graphs.
Sharing one head across different tasks would force it to learn a single, compromised guidance signal for distinct distributions. This would prevent our model from producing the specialized, high-quality augmentations that are key to its performance. The power of our approach lies in combining the generalization of the core diffusion model with the specialization of the guidance head.
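For illustration, a minimal sketch of a supervised guidance head (the architecture, shapes, and pooling choice here are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class GuidanceHead(nn.Module):
    """Lightweight MLP head trained on top of the frozen diffusion backbone.

    Illustrative sketch: `node_emb` is assumed to be the backbone's per-node
    representation of a (noisy) graph; only this head's weights are trained.
    """
    def __init__(self, hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, node_emb: torch.Tensor) -> torch.Tensor:
        # Mean-pool node embeddings for a graph-level prediction (graph tasks);
        # node-level tasks would skip the pooling.
        return self.mlp(node_emb.mean(dim=0, keepdim=True))

# Only the head is optimized; the pre-trained diffusion model stays frozen:
# optimizer = torch.optim.Adam(GuidanceHead().parameters(), lr=1e-3)
```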
Q7. Why are recent generative models (e.g., GraphFM, GraphGPT) and attribute-aligned pretraining methods (e.g., GraphMVP, GraphMAE) not directly compared? For some tasks, only top-k accuracy is reported while ROC-AUC or MAE is omitted. Does this omission affect the generality of the conclusions?
We thank the reviewer for these questions about our experimental setup. We aimed to provide a comprehensive comparison against relevant and established baselines.
- Recent Generative Models:
- GraphFM was not included as it is a concurrent work that was not publicly available or formally published at the time of our experiments.
- GraphGPT represents a different research paradigm focused on instruction-tuning large language models (LLMs) for graph-related tasks. A direct comparison is not meaningful due to the fundamentally different architectures, training methods, and objectives.
- Attribute-Aligned Pre-training: We want to clarify that we do compare extensively against state-of-the-art attribute-aligned methods. As shown throughout our results, our baselines include prominent models like GraphMAE, BGRL, JOAO, and D-SLA (Tables 2, 3, 4, 5).
Regarding evaluation metrics, we strictly adhere to the established benchmark to ensure a fair and direct comparison with prior work. The settings are clearly described in the experiment section.
[1] Gao, Shanghua, et al. "Large-scale unsupervised semantic segmentation."
[2] Liu, Hao, et al. "One for all: Towards training one graph model for all classification tasks."
[3] Zhao, Jianan, et al. "GraphAny: A foundation model for node classification on any graph."
We appreciate your feedback and would like to clarify any misunderstandings. We would sincerely appreciate your reconsideration of the score based on these contributions.
The authors present UniAug, which uses a diffusion model to learn fundamental graph properties, creating a general-purpose model that understands graph structures across a wide variety of domains. By training on cross-domain graphs and remaining agnostic to node features, UniAug can be applied to downstream tasks, where the original node features are preserved but new neighborhood structures are generated. The authors show performance improvements in node classification, link prediction, and graph property prediction.
Strengths and Weaknesses
Strengths: Datasets are comprehensive, domains are diverse, and improvements are mostly consistent across datasets. The authors also tackled all the major graph tasks. Both homophilic and heterophilic graph datasets are covered.
Weaknesses: Improvements are generally not large, which could be offset by the fact that there are so many domains and datasets. An analysis of hyperparameter sensitivity is missing, though this is called out as due to limited resources. The paper is also largely empirical; given the consistent improvements across homophilic and heterophilic graphs, a theoretical foundation for why UniAug works across these graph structures would bring confidence to the reader.
Questions
My main question is how much total time is spent on training UniAug and applying it to a specific graph task. I saw some data on the impact of pre-training on the tasks but was only able to gather the trend that more pre-training results in higher metrics. This is important because readers should be able to understand the computational cost of applying UniAug. While the diffusion model can be applied, we still have to train the MLP guidance head.
Limitations
Yes
Final Justification
Authors have sufficiently answered most of my questions to bump up to borderline accept. Some issues remain with metrics and theoretical foundation, so I won't go higher than that.
Formatting Issues
None
Q1. Improvements are generally not large, which could be offset by the fact that there are so many domains and datasets.
We thank the reviewer for raising this concern. We respectfully argue that the improvements from UniAug are both significant and a direct result of our core methodology.
First, the improvements are not marginal; they are dominant and consistent. While individual percentage gains may vary, the average rank across multiple datasets provides the clearest picture of robust performance:
- In Table 4, UniAug achieves an average rank of 1.43, far superior to the second-best method's 3.00.
- Similarly, in Table 5, our method's rank of 1.43 decisively beats the next-best self-supervised method (BGRL), which has a rank of 5.33.
This pattern of consistently securing the top rank (Tables 4, 5, 12) demonstrates a non-marginal, highly reliable performance gain.
Second, these significant gains are caused by our proposed mechanism, not just the data. The most direct proof is our ablation study in Table 7. This experiment shows that when we disable the diffusion guidance, performance drops significantly, leading to negative transfer. This proves that the success of UniAug is causally linked to its ability to perform task-specific augmentation, rather than being a simple side effect of a large and diverse pre-training set.
Q2. Understanding the sensitivity of hyperparameters is missing
We thank the reviewer for this excellent suggestion. We agree that a hyperparameter sensitivity analysis is important for understanding our model's robustness, and we have performed the requested analysis. We analyzed the effect of the guidance step size, a key hyperparameter. Below are the results for link prediction (MRR) on the Cora dataset. As the table shows, our method achieves strong performance across a reasonably wide range of step sizes (from 0.1 to 10), with performance peaking at 1.0. This demonstrates that our method is not overly sensitive to this hyperparameter and does not require expensive, fine-grained tuning to achieve good results.
| Step size | 0.01 | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|---|
| Cora MRR (×100) | 32.22 ± 8.23 | 34.22 ± 7.91 | 35.36 ± 7.88 | 34.22 ± 5.12 | 31.05 ± 10.21 |
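For context, the step size scales the guidance signal injected at each reverse-diffusion step. Schematically (an illustrative sketch, not the exact update rule):

```python
import torch

def guided_reverse_step(edge_logits: torch.Tensor,
                        guidance_grad: torch.Tensor,
                        step_size: float = 1.0) -> torch.Tensor:
    """Bias the diffusion model's edge logits toward the guidance objective.

    `edge_logits` come from the denoising network at the current step;
    `guidance_grad` is the gradient of the guidance head's score w.r.t. the
    edge variables; `step_size` is the hyperparameter swept in the table above.
    """
    return edge_logits + step_size * guidance_grad
```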
Q3. A theoretical foundation for why UniAug works across these graph structures would bring confidence to the reader
One of the key motivations of this manuscript is that GNN model architectures should be aligned with task-specific inductive biases [2]. For instance, GCN performs well on node classification for homophilic graphs, while GIN demonstrates superior performance on graph classification. This highlights the inherent difficulty in designing a single GNN that performs well across all types of tasks.
To address this, graph data augmentation (GDA) methods aim to mitigate the issue by directly modifying the data, effectively aligning data patterns with model designs. For example, Half-Hop [3] adds nodes on each edge to improve GCN's performance on node classification tasks for heterophilic graphs, achieving substantial gains. CFLP [4] generates counterfactual links to enhance link prediction. However, while GDA methods offer performance improvements, they often rely on handcrafted augmentation strategies tailored for specific tasks or datasets, limiting their generalizability to broader applications.
UniAug addresses this issue by leveraging a diffusion model. Intuitively, UniAug learns diverse data patterns by pre-training on graphs across domains, enabling it to adaptively augment downstream graphs with task-specific and data-specific guidance. This allows UniAug to "automatically" align data patterns with the downstream GNN architecture, resulting in empirical success across a wide range of tasks and datasets.
Q4. My main question is how much total time is spent on training UniAug and applying it to a specific graph task.
We thank the reviewer for raising the concern regarding the computation cost. Here, we provide a comparison with the best self-supervised baseline, GraphMAE. In our experiment, we have two settings for GraphMAE: (1) pre-training on the same set of graphs and fine-tuning on downstream graphs, and (2) self-supervised learning to obtain embeddings and train a classifier.
We first compare the pre-training time of GraphMAE and UniAug in the following table, where we see that the pre-training time of UniAug is longer than that of GraphMAE. As emphasized in Tables 2, 16, and 19, the pre-training version of GraphMAE leads to negative transfer, while UniAug consistently achieves positive transfer.
| Method | Pre-training time (minutes) |
|---|---|
| GraphMAE | 125 |
| UniAug | 378 |
Next, we perform the comparison under a self-supervised setting. The downstream application pipeline of UniAug consists of: (1) training the guidance head; (2) performing guided generation; and (3) training the downstream GCN. The self-supervised pipeline for GraphMAE includes: (1) self-supervised training; and (2) training a classifier on embeddings. For GraphMAE, we use the tuned hyperparameters provided by the authors. We report the total time of the full pipeline on a single A6000 GPU in the following table.
| Total time (seconds) | NCI1 | PROTEINS | IMDB-B | IMDB-M |
|---|---|---|---|---|
| GraphMAE | 1273.61 | 79.24 | 51.21 | 110.11 |
| UniAug | 378.44 | 90.15 | 45.56 | 85.80 |
We observe that in most datasets, the total time of GraphMAE and UniAug are comparable. Note that GraphMAE performs self-supervised training with 300 epochs for NCI1 and 100 epochs for Reddit-B, which is relatively time-consuming.
There are several reasons why the pipeline of UniAug is actually efficient:
- When training the guidance head, we freeze all other model parameters and only update the weights for the MLP-based head.
- The number of diffusion steps for UniAug is set to 128, which is much smaller than the standard 1000 steps for continuous diffusion models. It only takes 21 seconds to sample 1000 graphs for the IMDB-B dataset.
- We choose the diffusion kernel to be the absorbing kernel, which significantly reduces memory consumption during sampling.
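Schematically, one forward step under an absorbing kernel can be written as follows (the textbook masking form, shown for intuition; the exact parameterization may differ):

```python
import torch

def absorbing_forward_step(x_prev: torch.Tensor, beta_t: float,
                           absorb_state: int) -> torch.Tensor:
    """One forward step of an absorbing discrete diffusion kernel.

    Each entry independently jumps to the absorbing (mask) state with
    probability beta_t and stays there for the rest of the forward process.
    """
    jump = torch.rand(x_prev.shape) < beta_t
    return torch.where(jump, torch.full_like(x_prev, absorb_state), x_prev)
```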
In addition, we argue that the improvements provided by UniAug are not marginal. In the downstream application, UniAug achieves consistent positive transfer across tasks and datasets. As shown in Tables 4, 5, and 12, UniAug consistently attains the best average rank. It is worth noting that in Table 4, the average rank of UniAug is 1.43, far better than the second-best method, BGRL, at 3.00. Similarly, in Table 5, the average rank of UniAug + Half-Hop is 1.43, significantly surpassing the best self-supervised method, BGRL, which has an average rank of 5.33.
[1] Cao, Yuxuan, et al. "When to pre-train graph neural networks? from data generation perspective!.” KDD, 2023
[2] Mao, Haitao, et al. "Demystifying structural disparity in graph neural networks: Can one size fit all?." NeurIPS, 2024
[3] Azabou, Mehdi, et al. "Half-Hop: A graph upsampling approach for slowing down message passing." ICML, 2023.
[4] Zhao, Tong, et al. "Learning from counterfactual links for link prediction." ICML, 2022.
We understand your concerns and would like to emphasize the significant contributions of our work, which address key challenges for cross-domain positive transfer. We kindly ask you to reconsider your score, as we believe our findings offer valuable insights.
Thanks for running the experiments and collecting the data to answer Q2/Q4. I'm satisfied with the responses. For Q3, the response still leans towards an empirical/intuitive explanation, but I'm ok with it, as the theoretical foundation was never a primary concern of mine. Regarding Q1, I think the average rank is one signal, but there are also MRR, MAE, Hits, etc. We need to look at the whole collection of metrics when making claims that any model is dominant. A model can consistently outperform while the per-metric gains remain marginal, as is the case here. Therefore, while I'm willing to bump up my score to a 4, I'd like the authors to avoid overstating the metrics, for the reasons above.
We sincerely thank you for your thoughtful engagement, willingness to reconsider your score, and this important, nuanced feedback. Following your excellent suggestion, we will carefully revise the manuscript to moderate our claims.
This paper addresses the challenge of leveraging large-scale, cross-domain graph data for effective learning. It introduces UniAug, a universal augmentation framework that pretrains a discrete diffusion model to capture structural patterns from diverse graphs. The pretrained model is then used to guide structure-based augmentations for downstream tasks. Experiments demonstrate that UniAug benefits from increased data scale and improves performance across a range of graph learning tasks.
Strengths and Weaknesses
Strengths:
- The paper presents a novel structural augmentation approach using a diffusion model that focuses only on graph structure. By avoiding the use of node features, it handles the challenge of feature heterogeneity and can be applied across domains where features are inconsistent or missing.
- UniAug is model-agnostic and can be easily integrated with any downstream GNN, making it a flexible and practical approach that does not require modifications to existing model architectures.
- The paper includes robust evaluation across a wide range of tasks and datasets, which helps support the general effectiveness of the proposed method.
Weaknesses:
- The paper mainly evaluates UniAug using older models like GCN and GIN. While this helps isolate the effect of the augmentation, these models are no longer considered state-of-the-art. It remains unclear whether UniAug would still provide strong improvements when used with more recent models, such as GraphSAGE or H2GCN for node classification. It’s possible that the benefits of UniAug could be reduced—or even disappear—when tested with stronger and more modern graph architectures.
- The paper does not discuss several recent works that also address cross-domain graph generalization, such as GraphAny, GraphProp, and GraphFM. These models have already shown that cross-domain transfer is both feasible and beneficial. As a result, the claim that this is the first method to demonstrate cross-domain data scaling in graphs feels overstated. While this paper has its own clear and novel contributions, it is important to acknowledge recent progress in this area.
- While Figure 3 and Table 22 show that UniAug generally improves with more pretraining data (SMALL → FULL → EXTRA), the results often have high standard deviations and overlapping performance ranges. For example, on Enzymes, Proteins, and IMDB-M, the gains between FULL and EXTRA are small and likely not statistically significant. Similar issues appear in Table 2 (DD, IMDB-M), Table 4 (Pubmed, Power), and Table 5 (Actor, Chameleon).
Questions
- While Figure 3 and Table 22 show a general trend of improved performance with more pretraining data, many of the gains (especially between FULL and EXTRA) appear small and fall within overlapping confidence intervals. Could the authors provide formal statistical tests to confirm whether these improvements are significant?
- The paper uses older models like GCN and GIN as downstream backbones. Have the authors evaluated UniAug with more recent GNN architectures, such as GraphSAGE for node classification or GPS for graph classification?
- Have the authors explored the effects of model scaling, such as increasing the size of the diffusion model (e.g., number of layers or hidden dimensions)? Given the observed data and compute scaling behavior, it would be interesting to know whether UniAug also benefits from larger model capacity.
- Could the authors include a discussion of recent graph foundation model work such as GraphAny, GraphFM, GraphProp, GFSE, etc.? These methods also explore cross-domain generalization, and comparing or contrasting them with UniAug would help clarify the unique contributions of this work.
Sorry about the mixup again, and I would totally understand if the 2nd question is not feasible to answer in the remaining time. Just answering 1st, 3rd, and 4th should be enough.
Zhao, Jianan, et al. "Graphany: A foundation model for node classification on any graph." arXiv preprint arXiv:2405.20445 (2024).
Sun, Ziheng, et al. "GraphProp: Training the Graph Foundation Models using Graph Properties."
Lachi, Divyansha, et al. "Graphfm: A scalable framework for multi-graph pretraining." arXiv preprint arXiv:2407.11907 (2024).
Chen, Jialin, et al. "GFSE: A Foundational Model For Graph Structural Encoding."
Limitations
Yes, they discuss the data scaling limitation of the methods that could benefit from more diversity. To go further, introducing diversity on the decoding task, brain area studied, or species (e.g. human data for the BCI application) seems interesting directions. The pretraining task limitation is interesting but would benefit from more details such as information about staying in a generative and studying which masking scheme is the most appropriate (MtM or NEDS) or switching to a discriminative approach (Pop-t)? For the transfer of knowledge beyond cross modality taking for example the cross-brain region distillation. This direction goes towards specialized models specific for some brain region instead of a foundation model of the brain, are the authors more convinced by this direction?
Final Justification
My major concerns were regarding statistical significance of some reported improvements, regarding model scaling behavior, and regarding positioning with existing literature. All of them have been appropriately discussed by the authors in their rebuttal response.
Formatting Issues
None
We thank the reviewer for their time and effort in the review process.
After carefully reading the comments, we believe there may have been a misunderstanding, as the review appears to be for a different paper. The points discussed do not align with the content of our submission.
To resolve this, we have already contacted the Area Chair to bring this to their attention.
We appreciate your understanding and service to the conference.
I sincerely apologize for mistakenly pasting the wrong review! I had been working on multiple reviews simultaneously, and unfortunately I confused two papers with similar titles ("Cross-Domain..."). I recognize that this was careless and deeply regret the confusion caused.
I have posted the correct review for this paper, which is reflected in the review above. Since there is less time to respond, I have also noted down which questions are the most important to answer for me to increase the score.
Q4. Could the authors include a discussion of recent graph foundation model work such as GraphAny [1], GraphFM [3], GraphProp [2], GFSE [4], etc? These methods also explore cross-domain generalization, and comparing or contrasting them with UniAug would help clarify the unique contributions of this work.
We thank the reviewer for this excellent suggestion. We agree that positioning UniAug relative to recent graph foundation models is crucial for clarifying our unique contributions. We provide a detailed comparison below. While other models have explored cross-domain generalization, they do so with different approaches that come with specific trade-offs:
- GraphAny: This model focuses on generalization to unseen graphs. However, its approach is based on a supervised pre-training paradigm and is primarily designed for classification tasks, limiting its applicability to other types of graph learning problems.
- GraphProp: This method trains two foundation models sequentially. This design can be time-consuming. Furthermore, its reliance on a Large Language Model (LLM) for node feature encoding is computationally resource-intensive and does not generalize to the many real-world graphs that lack rich textual features.
- GraphFM: This work aims to compress domain-specific features into a common latent space. A key limitation is that it often requires end-to-end training for downstream node classification. It has also shown inconsistent performance improvements, especially on node classification tasks involving heterophilic graphs.
- GFSE: This model functions as a graph structure encoder using four distinct pre-training tasks, which requires careful balancing of multiple loss functions. Additionally, its application to text-attributed graphs requires LoRA fine-tuning of an LLM, which is resource-consuming.
In contrast, UniAug introduces a universal, plug-and-play data augmentation framework designed to overcome these challenges. Our approach is distinguished by:
- Universality through Structure: We operate purely on graph structure, making our method universally applicable to any graph, including those without rich or compatible node features.
- Plug-and-Play Modularity: As a decoupled data augmentation pipeline, UniAug can be seamlessly combined with any downstream GNN architecture for any graph learning task, avoiding the need for complex end-to-end training.
- Consistent Performance: Our diffusion-based augmentation provides consistent and significant performance improvements across a wide variety of tasks and domains.
We appreciate your feedback and believe that the innovative aspects of our approach and its practical effectiveness in diverse scenarios will bring significant insights into the field. We would be grateful if you could consider increasing your score to support our work.
I thank the authors for their responses to my questions and concerns. The statistical tests are great and I hope they will be included in the updated version of the paper. Also, adding a discussion of other related cross-domain work in the updated paper would improve the positioning. Overall, I am satisfied with the responses and will increase my score to a 5 (Accept).
Thank you very much for your positive feedback and for acknowledging our comprehensive responses! We greatly appreciate your constructive comments throughout the review process, which have significantly improved our work.
We noticed that your score still shows 4. Could you please update it to reflect your stated increase to 5? This would ensure your final evaluation is properly recorded.
Thank you again for your thorough and helpful review.
I have just updated my review and the score to 5. If the change is not visible, I think this year NeurIPS might be hiding score updates until the decision notifications.
Q1. While Figure 3 and Table 22 show a general trend of improved performance with more pretraining data, many of the gains (especially between FULL and EXTRA) appear small and fall within overlapping confidence intervals. Could the authors provide formal statistical tests to confirm whether these improvements are significant?
We thank you for this excellent suggestion. As requested, we have conducted pairwise t-tests comparing the performance of the FULL pre-training set against the EXTRA set. The results are presented below.
|  | Enzymes | Proteins | IMDB-B | IMDB-M |
|---|---|---|---|---|
| FULL | 71.33 ± 6.51 | 74.05 ± 4.82 | 73.11 ± 2.35 | 49.67 ± 2.41 |
| EXTRA | 71.17 ± 7.10 | 75.47 ± 2.50 | 73.50 ± 2.48 | 50.13 ± 2.05 |
| Pairwise t-test p-value | -- | 0.017 | 0.15 | 0.087 |
|  | Cora | Citeseer | Power | Yeast | Erdos |
|---|---|---|---|---|---|
| FULL | 32.81 ± 7.44 | 48.32 ± 6.00 | 32.97 ± 3.75 | 26.36 ± 4.62 | 36.07 ± 4.20 |
| EXTRA | 35.36 ± 7.88 | 54.66 ± 4.55 | 34.36 ± 1.68 | 27.52 ± 4.80 | 39.67 ± 4.51 |
| Pairwise t-test p-value | 0.065 | 0.012 | 0.038 | 0.078 | 0.034 |
This formal analysis provides a more nuanced picture that supports our claim:
- Statistically Significant Gains: On 4 of the 9 datasets (Proteins, Citeseer, Power, Erdos), the performance improvement is statistically significant (p < 0.05).
- Strong Positive Trend: On another 3 datasets (IMDB-M, Cora, Yeast), we observe a strong positive trend, with p-values between 0.05 and 0.10.
Overall, this statistical validation confirms that expanding the pre-training data leads to a consistent and often significant positive impact across the vast majority of tasks.
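For reference, such a test is straightforward to reproduce; a minimal scipy sketch (illustrative numbers only; the p-values above are computed from the actual per-run scores):

```python
from scipy import stats

# Per-run downstream scores for the FULL and EXTRA pre-training sets
# (illustrative numbers; a paired test over matched seeds is assumed,
# with `ttest_ind` as the unpaired alternative).
full_runs = [74.1, 73.2, 75.0, 74.8, 73.9]
extra_runs = [75.5, 75.1, 75.9, 75.2, 75.6]

t_stat, p_value = stats.ttest_rel(extra_runs, full_runs)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```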
Q2. The paper uses older models like GCN and GIN as downstream backbones. Have the authors evaluated UniAug with more recent GNN architectures, such as GraphSAGE for node classification or GPS for graph classification?
We agree that demonstrating UniAug's effectiveness on more recent and powerful architectures like GraphSAGE or GPS would be a valuable addition to strengthen our claims. Since our method improves the input data by generating more informative graph structures, we hypothesize that these benefits are orthogonal to the GNN backbone and would translate to stronger architectures as well.
Given the time constraints of the rebuttal period, we were unable to complete these new experiments. However, we are committed to adding results with a more recent backbone (e.g., GraphSAGE for node classification) to the final camera-ready version of the paper to further validate the generality of our approach.
Q3. Have the authors explored the effects of model scaling, such as increasing the size of the diffusion model (e.g., number of layers or hidden dimensions)? Given the observed data and compute scaling behavior, it would be interesting to know whether UniAug also benefits from larger model capacity.
We thank the reviewer for these important questions about model scale and its comparison to other large graph models. The choice of our model size was determined by an empirical analysis of model capacity versus graph generation quality. We found that simply making the model larger did not improve performance. As shown in the table below, we evaluated generation quality for varying numbers of layers. Performance on key structural metrics peaked at 4 layers and then began to degrade. This indicates a "sweet spot" for model capacity on this task, suggesting larger models may be prone to overfitting or optimization difficulties.
| Num Layers | Degree | Spectral | Clustering |
|---|---|---|---|
| 2 | 0.3851 | 0.4865 | 1.25 |
| 3 | 0.3344 | 0.4659 | 1.15 |
| 4 | 0.2648 | 0.4592 | 1.11 |
| 5 | 0.3378 | 0.4917 | 1.19 |
We agree this is an important research direction. While a full scaling analysis for larger models was beyond our available computing resources, the layer-wise study above serves as a preliminary result. We leave a fuller model-scaling analysis as future work.
The authors propose UniAug, a graph structure augmentor based on a discrete diffusion model. UniAug is first pre-trained on a large number of graphs across domains to learn generalizable structural patterns, and then used to augment the structure of graphs in downstream tasks. The paper claims that this approach enables effective scaling behavior and yields consistent improvements on downstream tasks.
Strengths and Weaknesses
S1. The paper shows that learning from cross domains can benefit different downstream tasks.
S2. The proposed method demonstrates consistent improvements over existing graph learning baselines.
S3. The use of discrete diffusion models for structure augmentation is novel.
W1. My major concern is about the node features. It seems that UniAug only learns to capture the adjacency matrices of the graphs in a collection, and during the data augmentation stage it simply reuses the same node features. From my perspective, the lack of consideration of node features is a limitation.
W2. The selection of the guidance head appears to rely on trial-and-error, which may be time-consuming and task-specific. This reliance reduces the practicality and ease of generalization of UniAug across diverse downstream tasks.
W3. From Table 3, DCT still has better results than UniAug. Does this mean that domain-specific pretrained models are still better than cross-domain models? This observation suggests that domain-specific models may remain more effective than cross-domain approaches in some settings, thereby weakening the paper’s central claim regarding the universal applicability and superiority of cross-domain pretraining.
W4. Pretraining on only thousands of graphs may be insufficient to support the claim of universality. It would be better to show UniAug scales to more graphs.
W5. What is the time cost for pretraining the discrete diffusion model? It could be critical when scaling to even larger datasets.
Questions
Refer to Strengths and weaknesses
Limitations
Yes
Formatting Issues
N/A
Q1. From my perspective, the lack of consideration of node features is a limitation.
It is important to note that node features exhibit great heterogeneity across datasets and domains. For instance, the atom characteristics of molecules are fundamentally different from the paper keywords in citation networks. Additionally, a large proportion of real-world networks lack corresponding node features altogether. This suggests that pre-training a feature-centric model to achieve positive transfer is extremely challenging.
Moreover, even for graphs that share the same feature space, feature-centric models can lead to negative transfer. For example, on graphs with textual node features, feature-centric pre-trained models like GraphMAE show a performance gap compared to a GCN trained from scratch [1]. This indicates that pre-training a feature-centric model may not be an adequate approach for achieving data scaling.
Graph structures adhere to a uniform construction principle, namely, the connections between nodes. We believe that our pre-training and augmentation paradigm can effectively utilize the diverse topological patterns across domains and provide positive transfer to various downstream tasks. During the downstream application, we assemble the augmented graphs with generated structures and original node features. This approach allows UniAug to bypass the feature heterogeneity problem and achieve consistent positive transfer.
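For illustration, this assembly step amounts to the following (a minimal PyTorch Geometric sketch; `sample_structure` is a hypothetical stand-in for the guided diffusion sampler):

```python
import torch
from torch_geometric.data import Data

def assemble_augmented_graph(original: Data, new_edge_index: torch.Tensor) -> Data:
    """Pair the generated topology with the untouched original node features."""
    return Data(x=original.x, edge_index=new_edge_index, y=original.y)

# new_edge_index = sample_structure(diffusion_model, guidance_head, original)
# `sample_structure` above is a hypothetical name for the guided sampler.
# augmented = assemble_augmented_graph(original, new_edge_index)
```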
Q2. The selection of the guidance head appears to rely on trial-and-error, which may be time-consuming and task-specific. This reliance reduces the practicality and ease of generalization of UniAug across diverse downstream tasks.
We thank the reviewer for raising this concern. We respectfully clarify that the selection of the guidance head is not based on trial-and-error, but follows a consistent and logical principle: to use the most direct supervisory signal available for the given downstream task.
- For supervised tasks (e.g., node or graph classification), the most direct signal is the ground-truth labels. Therefore, we systematically use a prediction head trained on these labels as the guidance.
- For self-supervised tasks where explicit labels are absent (e.g., link prediction), we leverage well-established, task-specific heuristics from the literature. For link prediction, we follow this principle by using a guide based on Common Neighbors (CN), a proven and effective heuristic for this task [2].
As shown in Table 18, using the CN-based guidance still achieves state-of-the-art performance, attaining the highest average rank when compared with the results in Table 4.
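For illustration, the CN guidance signal itself is simple to compute (a minimal dense-adjacency sketch, not the exact implementation):

```python
import torch

def common_neighbors_score(adj: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Common Neighbors heuristic: CN(u, v) = |N(u) ∩ N(v)|.

    `adj` is a dense {0,1} float adjacency matrix and `edges` a (2, E) index
    tensor of candidate links; the score serves as a label-free guidance signal.
    """
    cn = adj @ adj  # (adj @ adj)[u, v] counts the common neighbors of u and v
    return cn[edges[0], edges[1]]
```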
Q3. From Table 3, DCT still has better results than UniAug. Does this mean that domain-specific pretrained models are still better than cross-domain models?
The reviewer is correct that the domain-specific DCT model achieves a better result on this particular molecular regression task. However, this comparison highlights a fundamental difference in approach and, we argue, showcases the strength of our domain-agnostic framework.
- Different Inputs, Different Goals: It is critical to note that DCT is a specialized model designed exclusively for molecular graphs and leverages rich, domain-specific bond/atom features which carry significant chemical information. In contrast, UniAug is intentionally designed as a domain-agnostic framework that operates only on the graph's topological structure. This design choice is what allows UniAug to generalize across wildly different domains (e.g., social networks, citation networks, and molecular graphs) using a single, unified model.
- Highlighting the Power of Structural Learning: The fact that our purely structure-based UniAug achieves results that are highly competitive and remarkably close to a specialized model using extra, powerful domain features is a strong testament to our method's effectiveness. It demonstrates the significant extent to which meaningful patterns can be learned from topology alone.
We emphasize that the goal of UniAug is not to outperform every specialized model in its specific domain, but to provide a single, powerful pre-training framework that is broadly applicable.
Q4. Pretraining on only thousands of graphs may be insufficient to support the claim of universality. It would be better to show UniAug scales to more graphs.
We agree that scaling is an important dimension for pre-training, and we'd like to clarify our rationale. For a cross-domain model, we argue that the diversity of the pre-training data is at least as crucial as its raw quantity. Our primary goal was to build a corpus with the broadest possible coverage of different graph structures and domains.
- We chose to combine a subset of GitHub Star with the Network Repository precisely because it is the largest and most diverse public collection of graphs from disparate domains (social, biological, economic, etc.).
- Our own results support this strategy. As we demonstrate empirically in Figure 3, improving the coverage of different graph distributions directly correlates with better downstream performance. This finding suggests that at this scale, diversity is a primary driver of generalization.
While the resulting dataset of several thousand graphs may seem modest compared to huge, domain-specific corpora (e.g., billions of molecules in ZINC), it represents the state-of-the-art for a heterogeneous, multi-domain collection.
In Figure 5, we observe a clear trend of performance improvement as the size of the pre-training set increases. As part of our future work, we plan to explore the integration of other large-scale collections, such as SNAP, to continue pushing the boundaries of both scale and diversity.
Q5. What is the time cost for pretraining the discrete diffusion model? It could be critical when scaling to even larger datasets.
We thank the reviewer for this practical question. The entire pre-training process reported in our paper took approximately 26 hours, using four NVIDIA RTX A6000 GPUs. We consider this a very reasonable one-time cost to produce a powerful, universal model that can then be rapidly applied to numerous downstream tasks. In the context of large-scale pre-training, this is a computationally efficient result. Future work could also incorporate standard engineering optimizations (e.g., mixed-precision training) to further improve efficiency for larger pre-training corpus.
[1] Chen, Zhikai, et al. "Text-space graph foundation models: Comprehensive benchmarks and new insights."
[2] Wang, Xiyuan, et al. "Neural Common Neighbor with Completion for Link Prediction."
We appreciate your feedback and believe that the innovative aspects of our approach and its practical effectiveness in diverse scenarios will bring significant insights into the field. We would be grateful if you could consider increasing your score to support our work.
The paper proposes UniAug, a universal graph structure augmentation framework that pretrains a discrete diffusion model to capture structural patterns from diverse graphs. The motivation of the paper is reasonable. The datasets in the experiments are comprehensive and span diverse domains, and the improvements are largely consistent across all of them.
Overall, the research problem is important and the approach is novel and interesting. I vote for acceptance.