On the Scalability of GNNs for Molecular Graphs
We study scaling laws of Graph Neural Networks on 2D molecular graphs during pretraining as well as finetuning.
Abstract
Reviews and Discussion
This paper investigates the scaling of GNNs across various settings, including width, depth, number of molecules, number of labels, and diversity of the pretraining data across 12 datasets. Different conclusions are drawn from experiments conducted during both the pretraining and fine-tuning stages. Finally, the authors introduce a foundation model named MolGPS that integrates these findings.
Strengths
- The experiments are conducted extensively, taking into account various variables, resulting in some interesting findings.
- The problem addressed in this paper—the scaling law problem in the molecular domain—remains unresolved but is crucial for advancing molecular representation learning.
Weaknesses
- The pre-training strategy employed in this paper is a supervised method, overlooking a range of existing self-supervised approaches in molecular representation learning [1][2][3]. It is unconventional to explore scaling laws without considering established pre-training strategies like masking and denoising.
- The pre-training data in this paper consists of only 5 million molecules. While this scale may be constrained by the label requirements in supervised pre-training, it is insufficient for exploring scaling laws in the molecular domain. Many works use significantly larger datasets, often involving hundreds of millions or even billions of molecules for pre-training their models.
- The scaling of molecular data appears to perform poorly under the downstream fine-tuning and probing testing protocols. Are there any insights or analyses regarding this phenomenon?
- From the foundation-model perspective, it is also important to evaluate quantum molecular properties (e.g., on QM9 and MD17) beyond tasks related to biological ADME properties.
[1]: Uni-Mol: A Universal 3D Molecular Representation Learning Framework
[2]: Pre-training Molecular Graph Representation with 3D Geometry
[3]: Fractional Denoising for 3D Molecular Pre-training
Questions
See weakness
Limitations
The paper attempts to uncover the significant scaling law challenges in molecular representation learning. However, it is limited by its focus on a single supervised strategy and constrained data scaling. Despite conducting extensive experiments, these efforts fall short of fully addressing the issue and providing actionable conclusions to the community.
We thank the reviewer for providing detailed feedback on the paper which is of utmost value to our work. Below we address your concerns point-by-point. We also kindly invite the reviewer to refer to the general rebuttal where we share further information and a summary of the feedback we received.
The pre-training strategy [is] supervised [...], overlooking a range of existing self-supervised approaches [...]
Thank you for your valuable feedback. We agree that our work utilizes a less explored strategy of supervised pretraining, whereas unsupervised molecular models build on years of cutting-edge pretraining research [2,3,4] with vast empirical promise [6]. In fact, our work is, to the best of our knowledge, the first to successfully scale supervised pretraining for GNNs to the multi-billion parameter regime, setting new downstream task performance standards for the multiple benchmarks considered in our submission.
In Tables 1 and 2 of the rebuttal doc, we compare our approach to self-supervised strategies and find that our method maintains a clear upper hand. The unsupervised MolE [10] model variants clearly underperform the standard MolE model that leverages both unsupervised and supervised pretraining (Table 2), which is in turn outperformed by the listed MolGPS variants by large margins. Even our smaller MPNN++ models of comparable size to MolE (~100M parameters) both outperform the self-supervised variant.
In Table 1, we compare to the unsupervised GraphMVP [11] on MoleculeNet (only on tasks that were not part of our pretraining). Our MolGPS from the original submission outperforms [11] on 3/4 tasks, while the model variants that we recently pretrained with additional Phenomics data outperform it across all tasks. We acknowledge that the results reported by Uni-Mol [12] slightly surpass our score, but note that [12] uses a different data splitting approach compared to GraphMVP, which can have a significant impact in low-data regimes. We have checked the paper and code but have so far been unable to find the exact splitting recipe.
Lastly, the mentioned reference [13] focuses on 3D molecular modeling, which is outside the scope of the present work (as discussed below).
The pretraining data [...] of only 5 million molecules.
We would like to point out that, due to the supervised pretraining, the scale of the dataset is not directly comparable to that of self-supervised methods. We also note that previous works in the GNN literature scale to fewer than 5M graphs [14] and show interesting scaling trends.
In our case, it is important to characterize the diversity not only by the number of molecules, but instead paired with the number of labels per molecule. We recall our label scaling experiments that show the impact of reducing the data diversity by removing labels. The performance gains from incorporating Phenomics data further support this hypothesis (Figure 2 of rebuttal).
The scaling of molecular data appears to perform poorly [for] the downstream [...].
We indeed found no consistent trend in downstream task performance for an increased number of molecules in the pretraining dataset (Figure 2(a) and Appendix E.3 of our submission). That said, no molecule scale (except for the lowest 12.5%) had a notable negative effect on downstream tasks. We therefore hypothesize that our approach is robust to a smaller number of molecules during pretraining when the highly important label diversity per molecule is maintained (as discussed above).
[I]t is also important to evaluate quantum molecular properties such as QM9 and MD17.
Thank you for the suggestion. However, datasets such as QM9 and MD17 require the model to reason over graph geometry in 3D coordinate space. Evaluating a 2D model such as MolGPS on such 3D downstream tasks would therefore not be an even comparison.
We would like to also highlight the strong performance of MolGPS on the complementary gene inhibition tasks (e.g., the pkis2-* task series of Polaris benchmark) and the significant improvements of our most recent model variants that have been pretrained with additional Phenomics data (Figure 1 in rebuttal pdf).
Furthermore, ADME(T) tasks capture efficacy- and toxicity-related properties measured in bioassays, which play a crucial role in the screening and evaluation of likely drug candidates. These tasks remain the pinnacle of industrial pharmaceutical evaluation [15,16]. We thus prioritize our empirical evaluation with their real-world applications and biological importance in mind.
[2] Li et al, A knowledge-guided pre-training framework for improving molecular representation learning, Nature 2023
[3] Xia et al, Pre-training Graph Neural Networks for Molecular Representations: Retrospect and Prospect, ICML AI4Science Workshop 2022
[4] Li et al, KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction, ACM SIGKDD 2022
[6] Lu et al, Learning to Pre-train Graph Neural Networks, AAAI 2021
[7] Sun et al, Does GNN Pretraining Help Molecular Representation?, NeurIPS 2022
[8] Sun et al, MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph, ACM SIGKDD 2021
[10] Méndez-Lucio et al, MolE: a molecular foundation model for drug discovery, arXiv 2022
[11] Liu et al, Pre-training Molecular Graph Representation with 3D Geometry, arXiv 2021
[12] Zhou et al, Uni-Mol: A Universal 3D Molecular Representation Learning Framework, ChemRxiv 2023
[13] Feng et al, Fractional Denoising for 3D Molecular Pre-training, ICML 2023
[14] Chen et al, Uncovering Neural Scaling Laws in Molecular Representation Learning, NeurIPS 2024
[15] Shi et al, Fine-tuning BERT for Automatic ADME Semantic Labeling in FDA Drug Labeling to Enhance Product-Specific Guidance Assessment, Journal of Biomedical Informatics 2023
[16] Walter et al, Multi-task ADME/PK Prediction at Industrial Scale: Leveraging Large and Diverse Experimental Datasets, Molecular Informatics 2024
I thank the authors for their detailed feedback. However, I still believe that validating the scaling law using the available labeled data may make a trivial contribution to the community.
As highlighted in the abstract, "Scaling deep learning models has been at the heart of recent revolutions in language modeling and image generation." The validation of scaling laws in CV and NLP domains largely relies on unlabeled data and self-supervised tasks [1,2,3], as referenced in the paper. I argue that a more relevant approach for studying scaling is under an unlabeled setting, and there are already studies that explore this [4].
In the biological domain, obtaining labeled data is more expensive and challenging compared to the CV or NLP domains. Validating scaling laws in such a constrained setting seems unusual. The effectiveness of this approach may only depend on the relevance of supervised tasks between pre-training and fine-tuning.
[1]: OpenAI. Gpt-4 technical report
[2]: Llama 2: Open foundation and fine-tuned chat models
[3]: Language models are unsupervised multitask learners.
[4]: Uni-Mol2: Exploring Molecular Pretraining Model at Scale
We thank the reviewer for their response and are happy to provide further clarification of the concerns mentioned.
The validation of scaling laws in CV and NLP domains largely relies on unlabeled data and self-supervised tasks [1,2,3] [...]. [A] more relevant approach for studying scaling is under an unlabeled setting [e.g.,] [4].
Unsupervised approaches are an interesting avenue of research in certain areas of the molecular domain. As suggested, Uni-Mol models [4,5] show promising results for modeling molecules in 3D space and for finetuning to physics-based downstream tasks like QM9.
We would like to remind the reviewer that we follow a significantly different approach based on modeling 2D graphs with different applications & downstream tasks, where unsupervised models have not yielded comparable results.
In the context of CV/NLP [1,2,3], we also point out the lack of comparably label-rich datasets, which explains why supervised pretraining is not a focal point in those domains.
With respect to Uni-Mol2 [4], we note that it has only been finetuned on physics-based tasks (e.g., the QM9 homo-lumo gap) that highly depend on 3D coordinates and graph structure. [4] is specialized in understanding graph structure in 3D space, which leads the approach to perform well on such tasks.
"We hope that our work paves the way for an era where foundational GNNs drive pharmaceutical drug discovery." [abstract of our submission]
Our objectives are different: predicting properties that drive drug discovery like ADMET and binding predictions that strongly rely on understanding the biochemical space beyond graph structure. Uni-Mol2 provides no results on any shared or similar downstream task.
While the initial Uni-Mol model [5] provided some results for ADMET downstream tasks, the derivation of those results lacks transparency and is not reproducible (despite our best efforts).
Uni-Mol's finetuning compares to GraphMVP [6] (published one year before Uni-Mol), which we also compare to in our work, closely following their experimental setup. Uni-Mol outperforms GraphMVP but reports performance numbers taken directly from [6], despite clearly using a different dataset splitting technique in their own work. We reiterate that data splits have a significant impact on empirical results for molecular scaffold splits, especially in the low-sample regime of MoleculeNet. Further, [5] does not provide sufficient information for reproducibility. We have rigorously analyzed the papers and code of both works and provide details in a separate comment below.
Overall, the learnings from our scaling study (width, depth, #molecules, #labels per molecule, composition of pretraining data mixture) lead to the derivation of a foundational GNN that has set a new standard in the various competitive downstream task benchmarks considered here. We kindly request the reviewer to provide more context on their assessment of our work as “a trivial contribution” in light of the above discussion.
We would be happy to clarify any further questions the reviewer may have.
In the biological domain, obtaining labeled data is more expensive and challenging compared to the CV or NLP domains. [...] The effectiveness of this approach may only depend on the relevance of supervised tasks between pre-training and fine-tuning.
We respectfully disagree with this assessment. Adequate labeled data is readily and publicly available at large scale, with our pretraining using only a small fraction of the available data.
We recall that PCBA_1328 is only a small subset of the PubChem database (only considering datasets with binary tasks with at least 6k molecules) and larger alternatives for PCQM4M exist (e.g., PM6 dataset in [7] that is more than 20x bigger).
We also recall that our results show no signs of a data bottleneck. Instead, our molecule scaling suggests that downstream task performance does not deteriorate much when pretraining on a smaller fraction of molecules, e.g., 25% or 50% (Fig. 2 of submission). This relates back to the importance of the number of labels per molecule (i.e., label scaling in the submission) already discussed in our initial response.
The objective of our work, as in CV/NLP, is to obtain informative embeddings for domain-specific downstream applications. For drug discovery, our work suggests supervised pretraining is a suitable avenue, thanks to the availability of large labeled data sources. As there is no overlap between the pretraining and finetuning tasks, our downstream evaluation solely evaluates the information content of our learned representations (similar to other domains).
We are happy to provide more information if further questions arise.
[1,2,3,4] as referenced in the previous comment of Reviewer Lxfs
[5] Uni-Mol
[6] Liu et al, Pre-training Molecular Graph Representation with 3D Geometry, ICLR 2022
[7] Beaini et al, Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets, ICLR 2024
Although both papers mention that they use scaffold splits, the way they implement the splitting approach is considerably different, which makes their reported results incomparable. The specific differences are the following (an illustrative code sketch follows this list):
- GraphMVP: this model splits the data according to the scaffold splitting approach provided by DeepChem (here). The exact code can be found here. This function splits the data into train/validation/test sets given the specified split ratios.
More specifically, the scaffold splitting provided by DeepChem, which is used by GraphMVP and our work, is described as follows (reference):
“ScaffoldSplitter organizes molecules according to the Bemis-Murcko scaffold representation, which categorizes rings, linkers, frameworks (combinations of linkers and rings), and atomic characteristics like atom type, hybridization, and bond order within a molecular dataset. Next, it sorts these groups by the number of molecules they contain in descending order.”
- Uni-Mol: this model implements the scaffold splitting according to the GroupKFold approach provided by scikit-learn. The exact code can be found here.
More specifically, the scikit-learn GroupKFold splitting approach can be described as follows (reference):
“GroupKFold is a variant of k-fold cross-validation designed to prevent the same group from being included in both the training and test sets. GroupKFold helps to identify and avoid such overfitting by ensuring that each group is only present in either the training or test set, but not both.”
It is important to note that for the GroupKFold splitting, an important argument is the groups argument, which specifies the underlying group assignment of the data. In the Uni-Mol implementation, this argument is by default retrieved from a file (here); the exact file or how it is passed is not provided in the repository.
Moreover, we found that (during the fine-tuning) Uni-Mol reads splits from .lmdb files, but these files are not provided in the repository (please check here and here).
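To make the difference concrete, the following minimal Python sketch contrasts the two splitting recipes. It assumes the publicly documented DeepChem and scikit-learn APIs; the molecules, split ratios, and scaffold-group assignment are hypothetical placeholders and not taken from either repository.

```python
import numpy as np
import deepchem as dc
from sklearn.model_selection import GroupKFold

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccncc1",
          "CCCC", "CC(C)O", "c1ccc2ccccc2c1", "CCOC", "CNC"]  # placeholder molecules

# GraphMVP-style split: DeepChem ScaffoldSplitter with explicit split ratios.
dataset = dc.data.NumpyDataset(X=np.zeros((len(smiles), 1)), ids=smiles)
train, valid, test = dc.splits.ScaffoldSplitter().train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1
)

# Uni-Mol-style split: scikit-learn GroupKFold over precomputed scaffold groups.
# The `groups` assignment is read from a file in the Uni-Mol repository that is
# not provided, which is what prevents an exact reproduction of their splits.
groups = np.array([0, 1, 2, 0, 3, 4, 2, 5, 0, 6])  # hypothetical group labels
for train_idx, test_idx in GroupKFold(n_splits=5).split(
        np.zeros((len(smiles), 1)), groups=groups):
    print("train:", train_idx, "test:", test_idx)
```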
It is noteworthy that some other important studies, such as GPSE [8] or GraphLoG [9], also evaluate their results on MoleculeNet with scaffold splitting. However, they either adopt the OGB splits or the scikit-learn splitting approach, so their results are also not directly comparable.
[8] Cantürk et al, Graph positional and structural encoder, ICML 2024
[9] Xu et al, Self-supervised graph-level representation learning with local and global structure, ICML 2021
We hope our two comments above addressed the reviewer’s concerns. Please let us know if there are any additional clarifications that we can provide that could increase your support for the acceptance of the paper.
This paper investigates the scaling of GNNs on molecular tasks. In their setting, the authors pre-train a GNN on multiple molecule datasets and then fine-tune these GNNs on different datasets ("downstream tasks"). They investigate how the performance of different GNNs changes with parameter count, depth, number of training samples, and so on.
Strengths
- (S1 - Clarity) The paper is extremely well written and easy to follow.
- (S2 - Significance) Both foundation models for graphs and ML on molecules are very active research fields. As such, this paper can be important to a large number of people.
- (S3 - Novelty and Significance) To me, there are four interesting and novel contributions / observations in this paper:
- Scaling laws for GNNs.
- The investigation of how to train a pre-trained GNN on a downstream task. Here they investigate two options: fine-tuning the entire model and training only the final MLP (probing).
- A comparison of three different types of GNNs: MPNNs, Transformers and MPNNs+Transformers. It is particularly interesting that GNNs scale better with parameters than transformers.
- The observation that removing a dataset from the training data can improve performance on downstream tasks.
- (S4 - Quality) The evaluation seems solid; the authors compare three different models and observe mostly similar behavior for all models across datasets.
Weaknesses
- (W1 - Clarity) The presentation of the results in the main paper is done poorly. Figure 2 is too small and difficult to read / understand. I understand that this paper presents a lot of results, but putting all of them into a single figure is, in my opinion, a bad choice.
- (W2) Fine-tuning vs probing: you have experiments to compare these two approaches. However, I could not find an analysis of the results in the paper (maybe I have missed it).
Questions
- (Q1) How can you train a single model on multiple datasets if these datasets might have different features? Are all datasets pre-processed to have the same features?
- (Q2) Figure 2 - probing on Polaris / TDC: it seems that most models do not profit from an increase in the number of molecules. Does this not contradict the idea of foundation models? It seems to me like this indicates that when pre-training GNNs the size of the original dataset does not matter much to the downstream task.
- (Q3) In a similar vein, you notice that you can get better results when pretraining without the L1000 dataset. Does this imply that for graphs it is more important to have the right kind of data compared to having a lot of data (as for example LLMs or diffusion models)?
- (Q4) From (W2): how does fine-tuning compare to probing? Is one clearly better than the other?
To sum up, this is a good experimental paper that should be accepted.
Limitations
Yes
We thank the reviewer for providing detailed feedback on the paper which is of utmost value to our work. Below we address your concerns point-by-point. We also kindly invite the reviewer to refer to the general rebuttal where we share further information and a summary of the feedback we received.
(W1) Figure 2 is too small and difficult to read
We thank the reviewer for the comment and agree that Figure 2 can be better represented when split up into multiple larger figures. We will apply those changes in a revised version of our paper.
(W2 & Q4) Fine tuning vs probing: How does fine-tuning compare to probing?
We thank the reviewer for this helpful comment. We will further clarify this point when revising the paper.
Section 4.2 of the submission studies finetuning and probing side-by-side, establishing both as effective strategies for tackling downstream tasks, without conducting a direct comparison between them.
However, when deriving the foundation model MolGPS in Section 4.3, we find probing to be the overall stronger approach. The major advantage of probing is the ability to leverage multi-level information from the pretrained GNN. We recall that our pretraining is based on a supervised multi-task learning approach. As a result, different task heads capture task-specific information, while earlier layers that feed into the task heads carry more general information. When we combine fingerprints from various layers, we can think of aggregating knowledge from several “experts”. Our multi-fingerprint probing suggests this knowledge is additive as it clearly outperforms probing of any single fingerprint. Our foundation model MolGPS doubles down on this idea, combining fingerprints from multiple layers and various pretrained models. For finetuning, there is no straightforward way of taking advantage of this multi-level information, making probing the preferred approach.
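For illustration, a minimal probing sketch in the spirit described above (tensor shapes, layer choices, and the probe architecture are hypothetical; this is not the exact MolGPS implementation):

```python
import torch
import torch.nn as nn

# Hypothetical fingerprints extracted from a frozen, pretrained multi-task GNN:
# pooled hidden states of several late layers plus the input to one task head.
n_molecules = 256
fingerprints = [
    torch.randn(n_molecules, 768),   # e.g., pooled output of GNN layer 14
    torch.randn(n_molecules, 768),   # e.g., pooled output of GNN layer 15
    torch.randn(n_molecules, 512),   # e.g., representation feeding the PCBA task head
]

# Probing: the pretrained GNN stays frozen; only a small MLP is trained on the
# concatenation of the multi-level fingerprints, using the downstream task labels.
x = torch.cat(fingerprints, dim=-1)
probe = nn.Sequential(
    nn.Linear(x.shape[-1], 256),
    nn.ReLU(),
    nn.Linear(256, 1),               # single downstream target, e.g., an ADMET label
)
y_pred = probe(x)                    # shape: (n_molecules, 1)
```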
(Q1) How can you train a single model on multiple datasets if these datasets might have different features? Are all datasets pre-processed to have the same features?
All datasets are indeed pre-processed to have the same features. Each molecule is initially represented by a SMILES string [1], which our pipeline featurizes into a molecular graph (i.e., atoms and bonds as nodes and edges, respectively). The (node) features are also agnostic to the data source, e.g., atom type, row/column in the periodic table, etc. We will add more context on the pre-processing to the appendix.
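For concreteness, a minimal RDKit-based sketch of this kind of source-agnostic featurization (the atom and bond features shown here are illustrative and differ from the exact feature set in our pipeline):

```python
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string into simple node-feature and edge lists."""
    mol = Chem.MolFromSmiles(smiles)
    pt = Chem.GetPeriodicTable()
    # Node features: atomic number, number of outer-shell electrons, degree.
    node_feats = [
        [atom.GetAtomicNum(),
         pt.GetNOuterElecs(atom.GetAtomicNum()),
         atom.GetDegree()]
        for atom in mol.GetAtoms()
    ]
    # Edges: bond endpoints plus a simple bond-order feature.
    edges = [
        (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
        for bond in mol.GetBonds()
    ]
    return node_feats, edges

nodes, edges = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```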
(Q2) Figure 2 - probing on Polaris / TDC: it seems that most models do not profit from an increase in the number of molecules. Does this not contradict the idea of foundation models?
We indeed found no consistent trend in downstream task performance for an increased number of molecules in the pretraining dataset (Figure 2(a) and Appendix E.3 of our submission). That said, no molecule scale (except for the lowest 12.5%) had a notable negative effect on downstream tasks. We therefore hypothesize that our approach is robust to a smaller number of molecules during pretraining when the highly important label diversity per molecule is maintained, as observed in our label scaling study (e.g., Figure 2(a) and Appendix E.4 of our submission).
We would like to point out the importance of characterizing the diversity of the data not only by the number of molecules, but also by the number of pretraining labels/tasks per molecule. PCBA_1328 considers more than 1k different labels per molecule (albeit with high sparsity) and PCQM4M comes with 25 graph-level tasks and 4 node-level tasks (that are learned for each node of the ~4M molecules). This perspective is reinforced by our label scaling experiments that show the impact of reducing the data diversity by removing labels (e.g., Figure 2(a) of submission). Performance gains from recently incorporating Phenomics data into the pretraining data mix, which adds ~500k molecules with a highly informative set of ~6k labels per graph, further support this claim (Figure 2 of rebuttal).
(Q3) In a similar vein, you notice that you can get better results when pretraining without the L1000 dataset. Does this imply that for graphs it is more important to have the right kind of data compared to having a lot of data (as for example LLMs or diffusion models)?
This is an accurate observation, which relates back to the importance of the number of molecules and the number (and quality) of the labels. L1000 is exceptionally small (only 20k molecules) but features almost 1k labels for each molecule. However, in contrast to, e.g., PCBA_1328 (1328 binary classification tasks), it is a regression task that suffers from a low signal-to-noise ratio. Despite trying to stabilize learning by transforming it into a classification task via binning of the regression targets, the overall impact on downstream task performance was negative. Notably, the small number of molecules may lead to the absence of positive samples in most batches, as the batches are dominated by the larger PCBA_1328 and PCQM4M datasets.
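As a simple illustration of the binning attempt mentioned above (quantile binning is one plausible choice; the exact scheme used in our experiments may differ):

```python
import numpy as np

def bin_regression_targets(y: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Map noisy continuous labels to ordinal class labels via quantile bins."""
    # Interior bin edges at equally spaced quantiles of the targets.
    edges = np.nanquantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(y, edges)  # class indices in {0, ..., n_bins - 1}

y = np.array([-2.3, 0.1, 0.4, 1.7, 3.2, 0.0, -0.5])
print(bin_regression_targets(y, n_bins=3))  # one class index per molecule
```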
As pointed out by the reviewer, it is the quality (and label diversity) of the data that is highly relevant. A perfect example of this is the massively improved downstream task performance we report in the rebuttal doc after recently adding Phenomics data (with ~500k molecules and ~6k highly informative labels) to our previously used pretraining data mix (excluding L1000). We kindly refer the reviewer to the general rebuttal doc for more details.
Kindly let us know if our response above addressed your concerns. We will be happy to further discuss and answer any questions you may have.
[1] Weininger D, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences. 1988
I thank the authors for answering my questions and providing such a thorough rebuttal. I am still of the opinion that this is a good paper which should be accepted and will thus keep my score.
The paper examines the scaling behavior of GNNs in the context of molecular graph representation and prediction. The authors analyze various GNN architectures, including message-passing networks, graph Transformers, and hybrid models, on a large dataset of 2D molecular graphs. They find that increasing the scale of model depth, width, dataset size, and diversity significantly enhances performance on downstream tasks. They introduces MolGPS, a new graph foundation model, which demonstrates superior performance across numerous tasks in molecular property prediction.
Strengths
- The paper is well-written and easy to follow.
- Comprehensive experiments were performed on large scale datasets to demonstrate the scaling law.
- The empirical results of MolGPS on various datasets are promising.
- The insights on how to scale GNN models are meaningful to the community.
Weaknesses
- Another important aspect of scaling the parameters is regularization [1]. The authors did not explicitly discuss / validate the regularization methods they are using.
- The authors did not control the compute used to train the models. It would be nice to see some efficiency comparison as well.
[1] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers https://arxiv.org/abs/2106.10270
Questions
- How well do these insights generalize to other types of graph-structured data beyond molecular graphs?
Limitations
NA
We thank the reviewer for providing detailed feedback on the paper, which is of utmost value to our work. Below we address your concerns point-by-point. We also kindly invite the reviewer to refer to the general rebuttal where we share further information and a summary of the feedback we received.
Another important aspect of scaling the parameters is regularization [1]. The authors did not explicitly discuss / validate the regularization methods they are using.
Our model indeed uses regularization techniques, e.g., dropout in each GNN layer. We will include a detailed discussion of the methods used and their parameterization in the appendix of the revised paper. The need for more sophisticated regularization techniques is largely avoided through the use of rich pretraining data that prevents the models from overfitting (except for the MPNN++ model in a few instances).
The authors did not control the compute used to train the models. It would nice to see some efficiency comparison as well.
Thank you very much for the suggestion. Model performance tables with the training compute time will be added to the appendix. Training time of a single model varies from 24 GPU hours for the smallest models to ~1,120 GPU hours for the largest models. In total, ~13,400 GPU hours were used for computation in this paper.
The work of [1] pointed out by the reviewer presents a compelling analysis comparing the computational resources used for finetuning on a downstream task against training a model of the same size from scratch.
We note that such a study may be difficult to replicate in our case due to the high parametric scale of the GNNs in contrast to the low-data regime of the downstream tasks, which contain only a few hundred or thousand molecules per task and almost exclusively a single target label, so training from scratch would cause overfitting.
How well do these insights generalize to other types of graph-structured data beyond molecular graphs?
We thank the reviewer for their question on potential applications outside the biochemical domain. It is highly likely that models similar to MolGPS could be pretrained for other applications and domains given adequate pretraining data. However, the present work would not be suited for downstream tasks outside the biochemical domain. To obtain a domain-agnostic graph foundation model, unsupervised approaches that primarily learn from the graph structure may be a more natural choice.
That said, the proposed foundation model MolGPS is specifically targeted towards biochemistry, and drug discovery in particular, with the pretraining data chosen accordingly to learn domain-specific information. We provide comparisons to unsupervised pretraining approaches in the rebuttal doc (Tables 1 & 2) that indicate our supervised pretraining strategy is to be preferred for a "molecular" graph foundation model.
Kindly let us know if our response above addressed your concerns. We will be happy to further discuss and answer any questions you may have.
[1] Steiner et al, How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, arXiv 2021
I'm satisfied with the author's response and I'll keep my score.
The paper investigates the scalability of GNNs in molecular graph applications. It highlights the relationship between model size, dataset size, and performance across message-passing networks, graph Transformers, and hybrid architectures, using a large dataset of 2D molecular graphs. It shows that supervised pretraining on molecular graphs provides rich embeddings beneficial for downstream tasks, and it finds that the number of labels is crucial for fine-tuning performance.
Strengths
The experiment of MPNN and Transformer on pretrained and downstream tasks is very thorough and provides ample evidence to support the claim. The ablation studies are detailed.
Weaknesses
Many scaling results are unsurprising, such as the consistent improvement with increased depth and width during pretraining. However, the thorough experiments compensate for this predictability. I am curious if these trends will also occur for 3D graph datasets like QM9 or others. Conducting a few additional experiments on these datasets could strengthen this paper, though this isn't necessarily required in this rebuttal.
Questions
See weaknesses.
Limitations
The authors claim that this work uncovers additional aspects of GNN training such as the increasing complexity of aggregation functions and their effect on scaling properties.
We thank the reviewer for providing detailed feedback on the paper which is of utmost value to our work. Below we address your concerns point-by-point. We also kindly invite the reviewer to refer to the general rebuttal where we share further information and a summary of the feedback we received.
Many scaling results are unsurprising, such as the consistent improvement with increased depth and width during pretraining. However, the thorough experiments compensate for this predictability.
Thank you for appreciating the quality of our empirical analysis and finding our results interesting.
We agree that, especially given the unprecedented benefits of scale in other domains such as natural language, some results of this work may seem unsurprising at first. In the context of GNNs, however, our work explores largely uncharted territory in terms of parametric scale [1,2,3] and the possible implications for biochemistry and drug discovery in particular. We recall the finetuning and probing performance on downstream tasks, establishing a new state-of-the-art on 12/22 tasks of the highly competitive TDC benchmark and on all but two tasks of the Polaris and MoleculeNet benchmarks with our proposed foundation model (MolGPS). In our rebuttal we have expanded our empirical study, adding model variants of MolGPS that were scaled to the 3B parameter regime and pretrained on an improved data mix that further enhances our performance across the downstream task benchmarks. We kindly refer to the general rebuttal for more details.
I am curious if these trends will also occur for 3D graph datasets like QM9 or others. Conducting a few additional experiments on these datasets could strengthen this paper, though this isn't necessarily required in this rebuttal.
Thank you for the suggestion. We note that our proposed MolGPS model solely relies on learning from 2D graphs, while tasks in QM9 depend on molecules with 3D positions. This is one of the reasons why we limit our analysis of downstream tasks to 2D molecular tasks.
We would also like to highlight the importance of publicly available labeled data for our supervised pretraining setup. Our label scaling experiments (e.g., Figure 2a of submission) highlight the importance of many diverse labels for each molecule. Current databases of 3D molecules are still fairly small and would hardly allow for the data diversity present in our 2D pretraining data collection. However, we note that the scaling design decisions uncovered in our empirical analysis may prove useful for constructing and training foundational 3D GNNs in the future, once databases reach adequate scales. These models would further require the use of sophisticated training strategies and aggregation functions, a future direction which we explicitly mention in our limitations section.
Kindly let us know if our response above addressed your concerns. We will be happy to further discuss and answer any questions you may have.
[1]. Sun et al, Does GNN Pretraining Help Molecular Representation?, NeurIPS 2022.
[2]. Sun et al, MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph, ACM SIGKDD conference on knowledge discovery & data mining 2021.
[3]. Liu et al., Neural Scaling Laws on Graphs, arxiv 2024.
Thanks for your response! I will keep my positive score.
We thank the reviewers for providing detailed feedback on the paper and appreciating its presentation (YrHv, kaW1, FQ72, Lxfs), organization (kaW1, FQ72), and scientific contribution (FQ72, Lxfs). Overall, we have updated the paper and responded to the reviewers' concerns in the individual rebuttals. In the following, we discuss our updates and reviewer comments that might be relevant for all reviewers.
Incorporating Phenomics data
We further integrated an additional data type into our pretraining data mix. The added Phenomics dataset contains ~6k labels for ~500k molecules (compounds) that were derived from phenomic imaging [1] of cells perturbed with either a dose of a compound or a gene knockout. We conducted a similarity analysis between the obtained images (represented by vector embeddings, e.g., similar to [2]) subject to a compound perturbation on one side and images subject to a gene perturbation on the other side. The pretraining task is to predict, for each compound, whether it has a phenomically visible similarity to a gene knockout (indicating a biological relationship).
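Schematically, the labeling idea can be sketched as follows (embedding dimensions, the similarity measure, and the threshold are hypothetical; this illustrates the described task construction, not the exact pipeline):

```python
import numpy as np

def phenomics_labels(compound_emb: np.ndarray, gene_emb: np.ndarray,
                     threshold: float = 0.5) -> np.ndarray:
    """Binary label per (compound, gene knockout) pair: does the compound-perturbation
    image embedding show a phenomically visible similarity to the gene-knockout one?"""
    c = compound_emb / np.linalg.norm(compound_emb, axis=1, keepdims=True)
    g = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    cosine = c @ g.T                         # (n_compounds, n_genes) similarity matrix
    return (cosine > threshold).astype(np.int64)

# Toy sizes; the actual dataset has ~500k compounds and ~6k gene-derived labels.
labels = phenomics_labels(np.random.randn(1000, 128), np.random.randn(64, 128))
```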
Adding Phenomics data to our pretraining data mix (i.e., PCBA_1328 and PCQM4M) improved our downstream task performance across the board (Figures 1 and 2 of rebuttal doc). Comparing the scaling trends in Figure 2 of the rebuttal doc, MPNN++ w/ Phenomics pretraining exhibits a significant vertical upwards shift compared to the original MPNN++. Notably, we were also able to extend our scaling study to the 3B parameter regime. While we were previously unable to extend the scaling trend to that regime (Figure 2 of the rebuttal doc), MPNN++ w/ Phenomics maintains a positive scaling trend. We note that, in a slight deviation from the figure shown in our submission, Figure 2 of the rebuttal doc shows scaling trends for MPNN++ instead of GPS++. This is because our recent experiments with added Phenomics data were only conducted with MPNN++ for parameter scales <1B due to the availability of compute.
To better visualize the impact of our results on the TDC benchmark collection [3], we have renamed and added a few baselines compared to Figure 2(b) of the original submission. We renamed TDC SOTA to "Best model per task" (a collection of 8 different models that together establish SOTA across the benchmark collection) and added MapLight + GNN [1] (the best single model out of those 8 methods) evaluated on the benchmark collection, which falls significantly short of the newly added MPNN++ variant w/ Phenomics pretraining. We are also thrilled to report that our MolGPS w/ Phenomics even outperforms the best model per task at parameter scale 3B and performs on par at scale 1B. Lastly, we added a purely self-supervised MolE variant [4] to represent unsupervised pretraining strategies (discussed further below).
Comparison with unsupervised methods
In response to the feedback of Reviewer Lxfs, we added further comparisons to unsupervised pretraining approaches (Tables 1 and 2 of rebuttal doc). We observe that the unsupervised MolE model variants [4] clearly underperform the standard MolE model that leverages both unsupervised and supervised pretraining (Table 2), which is in turn outperformed by the listed MolGPS variants by large margins. Even our smaller MPNN++ models of comparable size to MolE (~100M parameters; see Figure 2 of rebuttal doc at the comparable parametric scale) both outperform the self-supervised variant.
In Table 1, we compare to the self-supervised GraphMVP model on MoleculeNet (which was already featured in the submission; we only consider the tasks that were not part of our pretraining). The MolGPS model from the original submission outperforms GraphMVP on 3/4 tasks, while the model variants that we recently pretrained with additional Phenomics data outperform it across all tasks.
We overall conclude that our pretraining strategy is a promising avenue for molecular pretraining and is novel at the parametric scale of the present work.
Dataset scale in the context of supervised molecular pretraining
We would like to point out that, due to our supervised pretraining approach, the scale of the dataset is not directly comparable to that of self-supervised methods, i.e., billions of molecules in some cases. We note, however, that previous works in the GNN literature scaled to fewer than 5M graphs [5] and observed interesting scaling trends.
In our case, it is important to characterize data scale not only by the number of molecules, but also by the number of pretraining labels per molecule. PCBA_1328 considers more than 1k different labels per molecule (albeit with high sparsity) and PCQM4M comes with 25 graph-level tasks and 4 node-level tasks (for each node of the ~4M molecules). This is further confirmed by our label scaling study that shows the impact of reduced data diversity when removing labels (e.g., Figure 2(a) of submission). The performance gains from incorporating Phenomics data into the pretraining data mix, adding ~500k molecules with ~6k highly informative labels per graph, also support this (Figure 2 of rebuttal).
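To illustrate how such sparse multi-label supervision is typically handled (a generic sketch, not our exact training code), missing labels can simply be masked out of the loss so that each molecule only contributes to the tasks it is annotated for:

```python
import torch
import torch.nn.functional as F

def masked_multitask_bce(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCE over a (batch, n_tasks) label matrix where missing entries are NaN."""
    mask = ~torch.isnan(labels)
    return F.binary_cross_entropy_with_logits(
        logits[mask], labels[mask], reduction="mean"
    )

# Toy example: 4 molecules, 6 binary tasks, most labels missing.
labels = torch.full((4, 6), float("nan"))
labels[0, 1], labels[2, 4], labels[3, 0] = 1.0, 0.0, 1.0
logits = torch.randn(4, 6)
print(masked_multitask_bce(logits, labels))
```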
Applicability to downstream tasks with 3D molecules like QM9 and MD17
As downstream tasks from 3D molecular modeling were mentioned by 2 reviewers, we would like to clarify that our modeling approach operates on 2D molecules, while datasets such as QM9 and MD17 require the model to reason over graph geometry in 3D coordinate spaces.
[1] Bray et al, Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes, Nat Protoc 2016
[2] He et al, Masked autoencoders are scalable vision learners, CVPR 2022
[3] Notwell et al, ADMET property prediction through combinations of molecular fingerprints, arXiv 2023
[4] Méndez-Lucio et al, MolE: a molecular foundation model for drug discovery, arXiv 2022
[5] Chen et al, Uncovering neural scaling laws in molecular representation learning, NeurIPS, 2024
The paper investigates scaling laws for GNNs on molecular graphs through an extensive set of experiments. They also present a new GNN architecture called MolGPS based on the insights from these experiments that outperform prior models on several tasks.
Strengths:
- Comprehensive set of experimental results.
- New insights into the scaling properties of GNNs that leads to a new architecture.
- Strong performance of MolGPS on multiple tasks.
Weaknesses:
- The main weakness of the paper is the small scale of the datasets used -- scaling behavior cannot be fully ascertained at such small scales, and it is still unclear how the scaling properties hold at larger scales.
- Focus on supervised learning, which is uncommon for studying scaling behaviors.
- Focus limited to 2D graphs, which again makes it unclear how generalizable the observed scaling properties are.
Recommendation: In spite of some weaknesses, the paper sheds light on some valuable insights into the scaling behavior of GNNs, and presents a new architecture that exhibits strong performance on multiple tasks. Overall, the strengths outweigh the limitations, and the paper is likely to be of strong interest to researchers applying GNNs to molecular tasks.