GlycoNMR: A Carbohydrate-Specific NMR Chemical Shift Dataset for Machine Learning Research
Abstract
Reviews and Discussion
The paper presents two databases of NMR data for carbohydrate structures: one extracted from a hand-curated database, the other generated by simulation.
Strengths
A database readily available for ML deployment is usually useful for ML research.
Weaknesses
Since it is a database paper, I am looking for its significance to the field. While the database can be used, it is very difficult for me to judge its importance, and the paper does not help much. The paper states that processing carbohydrate data is "more complicated than protein data, and requires substantial domain knowledge". The comparison with protein data, which is said to be "much easier" to work with, is not concrete enough. These claims require evidence to back them up; otherwise they are rather hollow.
The first database is extracted from Glycosciences.DB with "domain expertise and efforts". I think this is true of any database extraction, and not every extraction is that significant. The second, simulated database also needs to show its significance. Does it require massive computing power that costs millions of dollars, or can it be finished on a PC in a couple of hours? In the end, I am still not sure how much effort these databases took: is it absolutely necessary to generate them once and for all, or can it be done by most people in the field? How does this work help the field? Is it irreplaceable?
Another thing is to show an example of how useful the database is for some problem. The paper keeps talking about the chemical shift problem without providing background that would be useful for those who do not work on this problem.
Questions
N/A
We would like to thank reviewer dVzH for the comments and suggestions. Here we summarize and address reviewer dVzH’s concerns as follows.
1, Concerns on the significance of the dataset, insufficient description of protein difference.
Response:
The significance of the GlycoNMR dataset and benchmark is twofold.
1, For attracting glycoscience domain expert interest in this ML direction, we believe GlycoNMR is a good starting point for applying ML to glycoscience, where ML methods are relatively under-explored due to the lack of well-curated datasets and underdeveloped benchmarks. We do recognize that there are more high-resolution tools for studying protein structure than carbohydrate structure, with X-ray crystallography and cryogenic electron microscopy (cryo-EM) being the frontier [1], in addition to NMR. Carbohydrates cannot be easily crystallized or imaged in cryo-EM, so NMR is still the leading structure tool. For experimental glycoscience researchers, we would like to present our machine learning datasets and pipelines to showcase their effectiveness in addressing the challenges of NMR prediction within the glycoscience field. The larger, more diverse, and more publicly available the datasets that experimentalists help build, the more powerful ML tools can become. At the same time, collaboration with ML researchers is needed to make the datasets ML-friendly, so we have made substantial initial efforts in documentation and demonstrations in this direction. We hope our study encourages greater future collaboration.
2, For machine learning researchers: we would like to introduce this emerging subfield, which holds the potential to lead to both fundamental research progress and industrial applications. For example, the current collection of an NMR spectrum may cost hundreds or thousands of dollars for a single data point, not counting all the preparation time and cost needed for the development of purification protocols for novel compounds, for example see (https://www.aiinmr.com/nmr-spectroscopy-q-a-blog/How-much-does-an-Eft-NMR-Cost and https://nmr.science.oregonstate.edu/industry-price-list). A well-curated machine learning tool for NMR chemical shift prediction may potentially lead to substantial cost savings in compound quality verification (e.g. predict which shifts should result if the correct compound was generated, a tough experimental problem for larger carbohydrates currently), or through enhanced theoretical understanding of the link between structure and NMR spectra generally.
For further detail on the significance of GlycoNMR, please kindly take a look at the main GitHub page: (https://anonymous.4open.science/r/GlycoNMR-D381/README.md), which includes a brief tutorial on fitting our curated carbohydrate data into a machine learning pipeline and a road map figure summarizing what we have covered in this work and what the potential research directions for follow-up work could be. Our study can thus hopefully help guide future ML researchers to contribute to the glycoscience community, and vice versa. Data cleaning and preprocessing pages for GlycoNMR: (https://anonymous.4open.science/r/GODESS_preprocess-F9CD/README.md and https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/README.md). These two repositories record the general data preprocessing pipelines and our extensive work in aggregating the experimental results uploaded from various labs and preprocessing the simulation results. This can potentially aid future glycoscience researchers or industrial professionals who are willing to share their data with the ML community. We believe that future researchers using our dataset will achieve better performance than ours.
In addition, to address reviewer dVzH’s concerns in our manuscript, we have expanded the explanation and references related to the historical discussions about weaknesses in data standards and consistency in carbohydrate structure research relative to protein research, as well as the new data complexities that are present in carbohydrate data, in the introduction and appendix:
See these references in the Introduction: “Particularly, existing structure-related carbohydrate NMR spectra databases are less extensive and less accessible to ML researchers than databases for other classes of biomolecules and proteins, leading to recent calls for improvement in standards and quality [2, 3, 4, 5].”
Then, we added these paragraphs in Appendix D to address protein differences and potential applications of NMR shift prediction:
“A common problem in glycosciences is matching structure to NMR spectra. For example, a scientist may want to verify they have generated the correct structure in the laboratory, by examining a compound's spectra after synthesis. NMR spectral peak positions provide key features for carbohydrate structure identification, including the stereochemistry of monosaccharides, glycosidic linkage types, atomic interactions and couplings, and conformational preferences. Individual atoms (with net spin) in a carbohydrate generate the key spectral peaks for structure interpretation, which in practice in carbohydrates is typically the central ring carbon and hydrogen atoms, plus certain modification groups. Chemical shift values reported in ppm units are also independent of spectrometer frequency and thus comparable across labs and equipment settings. In carbohydrates, usually, only the hydrogen 1H and carbon 13C nuclei shifts are measurable, making spectra harder to interpret than protein spectra where nitrogen and phosphorus shifts are also accessible [5].
As another challenge specific to carbohydrates, carbohydrate NMR peaks are constrained to a much narrower region of spectra range than proteins, making them harder to separate and leading to an over-reliance on manual interpretation [5]. Development of theoretical and computational ML-based tools that can utilize large datasets to find and predict relationships between carbohydrate structure and its NMR parameters is a high priority for the field…”
To demonstrate the unique problem in carbohydrates in the manuscript, we added the following paragraph in Appendix D.
“Theoretically advancing NMR-based structural analysis approaches in ML directions requires having a comprehensive database where the same base monosaccharide units have various neighboring units or modification groups swapped out or removed across data entries, in order to see how the spectra changes as various components are combined or removed to better train models. Such comprehensive databases have been established and well-studied in protein ML research, but a lack of ML-friendly databases and poor open access data norms have hindered parallel progress in carbohydrates[3, 4, 5, 6]. While our database is certainly not comprehensive and complete, with carbohydrates being more diverse and varied than any other class of biomolecule, our 2,609 NMR spectra and structure files tailored for ease of use in ML pipeline is the first of its size for ML studies.”
“For additional ideas for boosting the data size and quality in future work: by our assessment, GODESS provides the best balance of accuracy, efficiency, and accessibility for the simulation of 1D NMR of carbohydrates. However, as with any simulation method, it likely has some biases and simplifications not seen in experimental data which are difficult to reveal without a large experimental dataset for comparison. Thus, it is important for future work to expand this dataset to include simulation datasets from other sources (e.g. CASPER), as well as to expand the experimental dataset for comparison to the theoretical predictions. The experimental dataset expansion will necessitate a serious and concentrated effort on the part of glycoscience researchers to improve the open data norms of their field.”
Thanks again for helping to improve the manuscript. We hope the above statements can address your concerns.
[1] Wang, Hong‐Wei, and Jia‐Wei Wang. "How cryo‐electron microscopy and X‐ray crystallography complement each other." Protein Science 26.1 (2017): 32-39.
[2] Ranzinger, Rene, et al. "GlycoRDF: an ontology to standardize glycomics data in RDF." Bioinformatics 31.6 (2015): 919-925.
[3] Paruzzo, Federico M., et al. "Chemical shifts in molecular solids by machine learning." Nature communications 9.1 (2018): 4501.
[4] Böhm, Michael, et al. "Glycosciences.DB: an annotated data collection linking glycomics and proteomics data (2018 update)." Nucleic Acids Research 47.D1 (2019): D1195-D1201.
[5] Toukach, Philip V., and Ksenia S. Egorova. "Source files of the Carbohydrate Structure Database: the way to sophisticated analysis of natural glycans." Scientific data 9.1 (2022): 131.
[6] Toukach, Filip V., and Valentine P. Ananikov. "Recent advances in computational predictions of NMR parameters for the structure elucidation of carbohydrates: methods and limitations." Chemical Society Reviews 42.21 (2013): 8376-8415.
2, Insufficient demonstration of unique efforts and significance in dataset curation for GlycoNMR.Exp and GlycoNMR.Sim:
Response:
We would like to thank reviewer dVzH for mentioning this point. Below we summarize and explain the efforts and significance of dataset curation of GlycoNMR.Exp.
In curating GlycoNMR.Exp from Glycosciences.DB, an important first point is that Glycosciences.DB, as an experimental database, was not designed to have structure files and NMR files that are well-matched and labeled for ML algorithms and pipelines. Structure files are generated by various software packages, while NMR shift files are created by individual laboratories without standard notation. Thus, this dataset was not at all ready for ML upon extraction. It would be very challenging for an ML researcher without chemistry knowledge to curate the raw data. However, our documentation and preprocessing GitHub repositories can now be used as a resource to substantially lower the barrier to curating experimental NMR data of carbohydrates. Our dataset and preprocessing protocols are both part of the contributions that we hope make our work useful and appealing, and they should generalize fairly well to common issues that could appear in any experimental carbohydrate data.
To further demonstrate our efforts in dataset curation for both datasets, we provide extensive details on our annotation and preprocessing protocol. We kindly invite reviewer dVzH to check our newly updated Appendix L.1, L.2 and L.3.
In Appendix L.1, we give a brief description of the data annotation and preprocessing steps of GlycoNMR.Exp and GlycoNMR.Sim. This summary includes an overview of, and pointers to, the annotation GitHub repositories linked to our paper, which provide extensive notes and scripts for preprocessing. In Appendix L.2 and L.3, we showcase how much human labor and domain expertise is involved by presenting an example of each dataset’s curation process.
For additional information on how domain expertise is extensively involved in creating GlycoNMR.Exp from Glycosciences.DB, please refer to our preprocessing GitHub repo: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/README.md).
The in-lab comments and annotations are documented under the folder: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/preprocess_manual/)
For a detailed description of the pipeline of data annotation and data preprocessing and how domain expertise is involved in GlycoNMR.Sim, please refer to our preprocessing Github repo: (https://anonymous.4open.science/r/GODESS_preprocess-F9CD/README.md).
For training the 2D and 3D-based GNN model and reproducing benchmark results from the above-preprocessed carbohydrates data, please refer to our main Github repos: (https://anonymous.4open.science/r/GlycoNMR-D381/README.md). Four notebooks are provided to demonstrate the reproducibility.
2D GNN on GlycoNMR.Exp: (https://anonymous.4open.science/r/GlycoNMR-D381/2D_example_Exp_GlycoNMR.ipynb)
2D GNN on GlycoNMR.Sim: (https://anonymous.4open.science/r/GlycoNMR-D381/2D_example_Sim_GlycoNMR.ipynb)
3D GNN on GlycoNMR.Exp: (https://anonymous.4open.science/r/GlycoNMR-D381/3D_example_Exp_GlycoNMR.ipynb)
3D GNN on GlycoNMR.Sim: (https://anonymous.4open.science/r/GlycoNMR-D381/3D_example_Sim_GlycoNMR.ipynb)
For more specific time estimates: in preprocessing step 2 of GlycoNMR.Exp, we reformulated all the PDB files (as well as the NMR label files) into an interpretable and consistent format, as they were uploaded from various labs. Even equipped with domain knowledge, we spent at least 15 minutes per carbohydrate to understand its structure file and NMR shift file completely and to document any inconsistencies between its structure and NMR shift entries. Thus, this step took us about 75 hours (0.25 hr x 300 glycans) for the experimental dataset. We then needed a few more weeks to formalize, validate, and standardize the mismatch search and annotation process. For GODESS, we had to write a browser-driving and web-scraping script to enter lists of glycan formulas in correct notation, one at a time, in the GODESS interface and then navigate various pages and options in an automated way to generate the NMR and structure files. The script took 2-3 weeks to write and debug due to the complexity of the pages that had to be navigated and scraped, and each simulation takes ~1-10 min in the GODESS interface (so ~2500-5000 min minimum for simulation of ~2600 glycans). We then had to manually check and assess each GODESS file for quality control, using an annotation protocol similar to the one used for curating and annotating the experimental NMR data.
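The arithmetic behind these estimates can be checked with a short sketch. The variable names are illustrative, and the inputs are only the approximate figures quoted above, not new measurements.

```python
# Rough effort estimate using only the approximate figures quoted above.
MINUTES_PER_GLYCAN_MANUAL = 15   # manual inspection time per experimental glycan
N_EXPERIMENTAL_GLYCANS = 300     # approximate size of the experimental set
N_SIMULATED_GLYCANS = 2600       # approximate size of the simulated set
SIM_MINUTES_MIN = 1              # fastest quoted GODESS run time per glycan

manual_hours = MINUTES_PER_GLYCAN_MANUAL * N_EXPERIMENTAL_GLYCANS / 60
sim_minutes_lower_bound = SIM_MINUTES_MIN * N_SIMULATED_GLYCANS

print(f"manual inspection: ~{manual_hours:.0f} hours")
print(f"simulation lower bound: ~{sim_minutes_lower_bound} minutes")
```

The lower bound at one minute per glycan falls inside the ~2500-5000 min range stated above; these figures exclude script development and manual quality control.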
3, Insufficient backgrounds of NMR shift:
We added this discussion to Appendix D, to clarify points which were originally more briefly discussed in the Background section related to NMR:
“In the context of carbohydrates, predicting 1D NMR chemical shift spectra shapes can reveal atomic and modification group identities and location for structure solving, including from interpretation of known spectra signatures of interactions between atoms within a monosaccharide unit or between a monosaccharide unit and modifications or other neighboring units. Chemical shift values reported in ppm units are independent of spectrometer frequency and thus comparable across labs and NMR equipment. [7]”
[7] Marion, Dominique. "An introduction to biological NMR spectroscopy." Molecular & Cellular Proteomics 12.11 (2013): 3006-3025.
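As an illustration of why shifts in ppm are comparable across instruments, here is a minimal sketch. The function name and the example offsets are hypothetical and do not come from the paper; the conversion itself is the standard definition of chemical shift.

```python
import math

def chemical_shift_ppm(offset_hz: float, spectrometer_mhz: float) -> float:
    """Convert a resonance offset from the reference (in Hz) into ppm.

    delta = (nu_sample - nu_reference) / nu_spectrometer * 1e6,
    and Hz divided by MHz is already parts per million.
    """
    return offset_hz / spectrometer_mhz

# Hypothetical proton observed on two instruments: the offset in Hz grows
# with the spectrometer frequency, but the shift in ppm stays the same.
shift_400 = chemical_shift_ppm(1800.0, 400.0)
shift_600 = chemical_shift_ppm(2700.0, 600.0)
print(shift_400, shift_600)
```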
Thanks again for the comments!
The paper explores the application of Machine Learning to carbohydrate studies using NMR by offering data sources and providing a benchmark. The main contribution of the paper is the "GlycoNMR" dataset, which contains two laboriously curated datasets. The authors made predictions of specific chemical properties on their proposed dataset using a tailored set of carbohydrate-specific features. The authors hope that this research helps ML researchers in carbohydrate research. The authors acknowledged some limitations in their dataset and highlighted the need for more comprehensive data in upcoming research.
Strengths
The paper introduces a dataset to expand the NMR studies on carbohydrates, an area that presents substantial challenges. The authors also presented a set of features for carbohydrates. The quality is acceptable; the methods used for data annotation and feature engineering are described, though there's room for refining the approach in places. The paper sets a dataset for more in-depth studies and it adds incremental value to the ongoing research in the domain.
Weaknesses
The paper provides statistics on the dataset but omits details for certain features like ring size, ring position, and atom type. Adding those statistics would add to the clarity of the features extracted. Additionally, a deeper validation of the introduced features would enhance the paper's value and credibility.
Questions
Feature Statistics: There seems to be a bit of missing statistics on some specific features. Could you share some more detailed stats about the ring size, ring position, and atom type? It'd be really helpful to get a fuller picture of the datasets!
Ring Position vs. Shift Relationship: Could you provide further elaboration on the relationship between the ring position and the shift? Specifically:
- How does the ring position influence the chemical shift value?
- Are there any trends or patterns observed based on different ring positions that affect the shift?
- Including some analysis, possibly graphs or charts that showcase this relationship, could enhance our understanding of the dataset.
We are grateful to the reviewer sFB8 for the valuable suggestions to investigate the feature statistics on both datasets further. In summary, we have added three appendix subsections on pages 17, 18, and 19 to address reviewer sFB8’s concerns. Here, we reply to the concerns and questions in the following.
1, Concern: Feature Statistics and deeper validation of features.
We would like to thank reviewer sFB8 for the suggestions, as the visualization can further illustrate the data statistics in Table 1. We provide a detailed feature analysis of both GlycoNMR.Exp and GlycoNMR.Sim in our Appendix A.3 and A.4. We kindly invite reviewer sFB8 to check our expanded version on page 17. In Appendix A.3, we visualize the feature statistics in six pie charts for atom-level and monosaccharide-level features.
For the atom-level features (first row), we included the dataset statistics of atom type, carbon atom position, and hydrogen atom position. According to the pie chart for atomic identity (top left), of all the 27267 atoms in GlycoNMR.Exp, 13010 are Hydrogen, 7663 are Carbon, 6257 are Oxygen, and 337 are other types of atoms, including Nitrogen, Phosphorus, and Sulfur.
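The pie-chart proportions follow directly from these counts. A minimal sketch (variable names are illustrative) that also confirms the four categories sum to the reported total:

```python
# Atom counts reported above for GlycoNMR.Exp.
atom_counts = {"Hydrogen": 13010, "Carbon": 7663, "Oxygen": 6257, "Other": 337}

total = sum(atom_counts.values())
fractions = {name: count / total for name, count in atom_counts.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```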
To describe carbon atom position (top middle), ‘Other’ indicates the off-ring carbons. Similarly, for describing the hydrogen atom position (top right), ‘Other’ indicates off-ring hydrogens.
For the monosaccharide-level features, we included Anomer (bottom left, indicates the orientation of the hydroxyl group at the anomeric carbon), Configuration (bottom middle, indicates Fischer projection information), and Ring Size (bottom right, number of in-ring carbons). The ‘N/A’ of each figure indicates that the information is not contained in the PDB file.
In addition, we added Appendix C to analyze the feature contributions by calculating the Shapley values. Here we again present the Shapley value table:
| GlycoNMR.Exp | Ring position | Modification | Stem type | Anomer | Configuration | Ring size |
|---|---|---|---|---|---|---|
| Hydrogen | 0.457 | N/A | 0.088 | 0.061 | 0.009 | 0.008 |
| Carbon | 16.852 | N/A | 2.640 | 0.515 | 0.257 | 0.085 |

| GlycoNMR.Sim | Ring position | Modification | Stem type | Anomer | Configuration | Ring size |
|---|---|---|---|---|---|---|
| Hydrogen | 0.387 | 0.014 | 0.112 | 0.051 | 0.014 | 0.003 |
| Carbon | 13.007 | 0.321 | 3.619 | 0.465 | 0.199 | 0.055 |
We observe that the Shapley values across all features are positive. Notably, the ring position of carbon and hydrogen atoms is a critical feature in predicting NMR shifts. Moreover, integrating the stem type of monosaccharides into the 2D Graph Neural Network can marginally reduce the prediction error. Other features, like the modification group, anomer, configuration, and ring size, exert lesser influence.
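For readers unfamiliar with Shapley values, the sketch below computes them exactly for a small feature set. It is not the procedure used in the paper, which is not described in this response: the value function here is a hypothetical additive toy, and the numbers are placeholders rather than results. In a real setting, the value of a feature subset would presumably be the reduction in prediction error when the model is given only those features.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Hypothetical additive gains per feature (placeholders, not paper results).
toy_gain = {"ring_position": 3.0, "stem_type": 1.0, "anomer": 0.5}

def toy_value(subset):
    """Toy value function: total gain of the features in the subset."""
    return sum(toy_gain[f] for f in subset)

phi = shapley_values(list(toy_gain), toy_value)
print(phi)
```

For an additive value function each feature's Shapley value equals its own gain, which makes the sketch easy to verify; the exact enumeration grows exponentially with the number of features, so it is practical only for small feature sets such as the six in the table above.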
2, Question: Ring Position vs. Shift Relationship
We provide the violin plot of the Ring position versus the NMR shift for both datasets in Appendix A.4. In each subplot, the x-axis indicates the ring position of the atom (Carbon / Hydrogen), and the y-axis indicates the NMR shift of the corresponding atoms.
We notice that the distributions of NMR shift values at ring positions C1 and C6 differ markedly from those at C2, C3, C4, and C5, and the hydrogen shifts show a similar pattern. In addition, the distributions of NMR shift values by ring position are similar between GlycoNMR.Exp and GlycoNMR.Sim. This consistency supports the value of using simulated NMR data to help develop ML approaches for studying carbohydrates.
For a general explanation of the ring position, a fundamental factor that determines the NMR shift value is the atom’s electronic environment, especially bonded or non-bonded interactions within 1-3 atom distances away from the atom of interest.
In glycoscience specifically, it is known that, within monosaccharide units, the index of a carbon or hydrogen in the ring is one of several major factors that define the electronic environment, which is reflected in our results [1]. These basic position tendencies are further modulated by a set of factors that can also alter the electrochemical environment, including the other features tested in our model, like anomer, configuration, temperature, and solvent, and more complicated modification or conjugation groups attached to glycans. This domain knowledge prompted us to integrate ring position as one of the node features in our 2D-based and 3D-based Graph Neural Network models.
[1] Fontana, Carolina, and Goran Widmalm. "Primary structure of glycans by NMR Spectroscopy." Chemical Reviews 123.3 (2023): 1040-1102.
Thanks again for the valuable suggestions!
The Manuscript presents a benchmark data set for NMR shift prediction in carbohydrates. It then also introduces some features based on domain knowledge that can be used to build classifiers based on the benchmark data.
Strengths
The data set seems to be a valuable contribution to the presented domain (NMR shift prediction in carbohydrates).
The engineered features might enhance future prediction models.
Weaknesses
I have a problem with reproducibility. The introduction is very extensive (the authors' own contributions basically start on page 5), but in Section 3 on page 5 it is only mentioned that substantial domain expertise is required to annotate and process the data from Glycosciences.DB. This makes it hard to reproduce the methods and also to evaluate the contribution (Glycosciences.DB -> GlycoNMR.Exp). Also, the sentence in the middle of page 6, "we had to utilize domain knowledge to reduce such ambiguities as much as possible when handling...", does not really describe a procedure in a reproducible manner. It is clear that not all domain knowledge can be described, but one could spend more space on the authors' own contributions and less on the general introduction (currently 4 pages).
This is apparently addressed better after the revision, but still I think this contribution should be part of the manuscript and one could cut some of the content of the 4 pages introduction.
Furthermore, the simulated part of the data base is a little bit more problematic, since one model (GODESS) is used directly. It is questionable if ML models trained on the simulated data can learn important parameters that are not already known from GODESS. For a performance comparison of different models the data set might still be valuable, with the remark that GODESS might produce a biased view of the real world in which some methods perform worse than they would perform on real data (and we cannot judge, because we do not know the biases of the individual methods).
The authors addressed some points in the revision, but the problem is still that the potential biases of GODESS make it hard to decide how useful such a data set based on this one simulation software is.
Questions
It would be good to concisely define the NMR shift prediction problem in the main manuscript.
SHAP values should be included into the manuscript for comparison.
I am not sure whether the Glycosciences.DB data could be used directly under a CC license, because it is not accessible and therefore I cannot check the license. There should be a statement regarding making this data available in the manuscript.
Details of Ethics Concerns
I am not sure whether the Glycosciences.DB data could be used directly under a CC license, because it is not accessible and therefore I cannot check the license. There should be a statement regarding making this data available in the manuscript.
We would like to express our sincere gratitude to reviewer ozW2 for the detailed comments and suggestions we can use to improve the paper. Here, we address reviewer ozW2’s concerns and questions in the following.
1, Concerns on reproducibility and domain expertise involved:
We would like to thank ozW2 for pointing out that the domain expertise contribution is not introduced clearly enough in the manuscript, which may result in an insufficient demonstration of reproducibility. Here, we address reviewer ozW2’s concern as follows:
- We summarize the data annotation and preprocessing steps of GlycoNMR.Exp and GlycoNMR.Sim. This summary includes an overview of, and pointers to, the annotation GitHub repositories linked to our paper, which provide extensive notes and scripts for preprocessing.
- We present how domain expertise is involved in annotating the raw glycan data.
- We present an example glycan for the data annotation that demonstrates the ambiguities.
Please see below; we have also updated our Appendix L in response to ozW2’s suggestions.
For GlycoNMR.Exp:
We first carefully and manually inspected the NMR chemical shift and structure files for every carbohydrate in the dataset. We documented all inconsistencies between NMR and structure files (files that many different labs uploaded across decades) and used the list of inconsistencies to build a formulaic and reproducible data curation/annotation protocol that could be extended to any experimental dataset of carbohydrates. Mismatches arose mainly from inconsistent sequence ordering between monosaccharide IDs in the PDB structure files and the NMR shift label files: the files are uploaded by various chemistry labs, structure files can be generated by separate software unrelated to the NMR shifts, and formats can change over the years. There were no automatic methods or well-specified routines for solving this matching problem, and significant glycoscience domain expertise was required to establish the initial data annotation protocol. However, by the end of our process, we were able to reduce our curation protocols to understandable and reproducible annotation procedures. The complete protocol and corresponding preprocessing scripts are listed in this manuscript’s supplemental data preprocessing repository (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/README.md), and we’ve moved an abbreviated version of the steps into Appendix L.2 and L.3.
Using the protocol, we reformulated all the PDB and label files we retrieved from Glycosciences.DB into an interpretable, consistent format and applied our initial data cleaning to the scraped carbohydrates. This step mainly unifies the PDB (Protein Data Bank) format of the experimental carbohydrate data uploaded from various labs in different time periods. Each PDB file should include the atom type, the atom's 3D coordinates, and the monosaccharide that each atom belongs to. Carbohydrates like (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/experimental_data/FullyAnnotatedPDB_V2/DB26874.pdb) are dropped due to insufficient information (missing atom names that describe the ring position).
GlycoNMR.Exp preprocessing example:
In the carbohydrate file DB26380, we need to manually annotate the PDB file: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/experimental_data/FullyAnnotatedPDB_V2/DB26380.pdb) by assigning to each central ring carbon and hydrogen atom its corresponding shift value, which is stored in the NMR label file: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/experimental_data/FullyAnnotatedPDB_V2/DB26380.csv). To achieve this, we need to associate the atoms’ parent monosaccharide IDs/names between the two files. We first draw a sketch of the carbohydrate structure, consisting of the basic monosaccharide components from the CSV file, using the linkage information from the linkage variable column in the NMR shift files. Atoms with the same linkage belong to the same monosaccharide. For example, atoms from lines 13-19 belong to monosaccharide B-D-GLCPN. We use linkage information rather than monosaccharide names such as ‘B-D-GLCPN’ to identify monosaccharide components because, in some scenarios, the same monosaccharide name may refer to different monosaccharide components (i.e. there can be multiple monosaccharide units with the same name in a carbohydrate, but the linkage information can be used to tell them apart for NMR shift matching purposes). For example, lines 62-67, 68-73, and 74-79 of the label file (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/experimental_data/FullyAnnotatedPDB_V2/DB26380.csv) refer to three separate monosaccharide unit components that are parents of different sets of atoms and appear in different locations of the carbohydrate chain, but still have the same monosaccharide chemical name. DB26380’s sketch plot can be found on the 8th page (# 23) of our annotation document for branched carbohydrates: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/preprocess_manual/nonlinear_process_doc.pdf).
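The idea of keying monosaccharide units by linkage rather than by name can be sketched as follows. The rows, linkage strings, and column layout are hypothetical placeholders and do not reproduce the actual schema or contents of DB26380.csv.

```python
from collections import defaultdict

# Hypothetical label-file rows: (linkage, residue name, atom name).
rows = [
    ("4", "RES_A", "C1"),
    ("4", "RES_A", "C2"),
    ("4,3", "RES_B", "C1"),
    ("4,6", "RES_B", "C1"),
    ("4,6", "RES_B", "C2"),
]

def group_by_linkage(rows):
    """Group atoms into monosaccharide units keyed by linkage, not by name."""
    units = defaultdict(list)
    for linkage, residue, atom in rows:
        units[linkage].append((residue, atom))
    return dict(units)

units = group_by_linkage(rows)
distinct_names = {residue for atoms in units.values() for residue, _ in atoms}
print(len(units), "units,", len(distinct_names), "distinct residue names")
```

In this toy input, two separate units share the name `RES_B`; grouping by name would merge them, while grouping by linkage keeps them apart.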
Second, we again inspect the PDB file and match the monosaccharide components with the help of the SWECON information which provides additional secondary linkage information at the bottom of Glycosciences.DB PDB files (line 306-315 of file: https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/experimental_data/FullyAnnotatedPDB_V2/DB26380.pdb) and our domain expertise. Then, for another example of a common issue causing mismatches between monosaccharide shift and structure, in DB26380, we noticed that the Phosphoryl group ‘PO3’ (lines 39-42, 64-67) is treated as a monosaccharide component in the PDB file despite not being a monosaccharide, therefore the monosaccharide shift file ID ordering 3 and 13 should be disregarded when comparing to the PDB structure file, and the 4th monosaccharide residue in the PDB file should instead be matched with the 3rd monosaccharide parent and its atom components in the NMR label file. A detailed match is presented in our PDF document mentioned above. Then last, when all parent monosaccharides are correctly matched between structure and shift files, we assign the corresponding monosaccharide atoms’ shift from the label file to the PDB file by their atom names.
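The PO3 offset described above can be illustrated with a simplified sketch. It assumes that, once non-monosaccharide residues are skipped, the remaining PDB residues and the label-file units appear in the same order; the real protocol relies on linkage information and manual checks instead. All residue names, unit identifiers, and the function name are hypothetical.

```python
# Residues that appear in the PDB residue list but are not sugar units.
NON_MONOSACCHARIDE = {"PO3"}

def match_residues(pdb_residues, label_units):
    """Pair 1-based PDB residue indices with label-file units in order,
    skipping non-monosaccharide residues such as phosphoryl groups."""
    sugar_indices = [
        index
        for index, name in enumerate(pdb_residues, start=1)
        if name not in NON_MONOSACCHARIDE
    ]
    if len(sugar_indices) != len(label_units):
        raise ValueError("residue counts differ; manual inspection needed")
    return dict(zip(sugar_indices, label_units))

# Hypothetical ordering in which the 3rd PDB residue is a phosphoryl group.
pdb_residues = ["SUGAR_1", "SUGAR_2", "PO3", "SUGAR_3"]
label_units = ["unit_1", "unit_2", "unit_3"]

mapping = match_residues(pdb_residues, label_units)
print(mapping)
```

Here the 4th PDB residue is paired with the 3rd label-file unit, mirroring the offset described for DB26380.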
For a detailed description of the data cleaning, annotation, and preprocessing, and documentation of how domain expertise is involved in creating GlycoNMR.Exp from Glycosciences.DB, please refer to our preprocessing GitHub repo for the experimental dataset: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/README.md), where we have summarized our data processing steps in detail. The in-lab comments and annotations are documented under the folder: (https://anonymous.4open.science/r/GlycoscienceDB_preprocess-B678/preprocess_manual/).
For GlycoNMR.Sim:
As the NMR and structure files of GODESS were all consistently produced by the same simulation software, far fewer annotation issues existed for GODESS, though some did, as discussed in the documentation in the attached GitHub repositories. Instead of laboriously aligning the monosaccharide components between the PDB file and the label file, we conducted the data annotation with the help of the atom connection section provided in the GODESS PDB files. In this section, every atom in the entire carbohydrate has a unique ID number and lists the IDs of the neighboring atoms it is bonded to. This information was not present in the Glycosciences.DB files, and it allowed us to develop a semi-automated data annotation pipeline informed by our domain expertise. To ensure the correctness of the annotations, we manually checked the annotations of each carbohydrate.
GlycoNMR.Sim preprocessing example:
In glycan: ‘aDXylp(1-6)bDGlcp(1-4)[aLFucp(1-2)bDGalp(1-2)aDXylp(1-6)]bDGlcp(1-4)[aLFucp(1-2)bDGalp(1-2)aDXylp(1-6)]bDGlcp(1-4)xDGlca’ (https://anonymous.4open.science/r/GODESS_preprocess-F9CD/dataset/Godess_carbon/2/PDB.pdb), the monosaccharide bond linkage '(1-4)' indicates that the carbon at position 1 is connected to the carbon at position 4 via a dehydration synthesis reaction, and ‘xDGlca’ is the precursor monosaccharide (in other words, the ‘root’). From line 223 of the PDB file, we see that atom 1 is connected to atoms 28 and 2. This indicates that the monosaccharide with ID 2 is connected to a neighboring monosaccharide through the bonds (C1 - O4 - C4), where C indicates carbon, O indicates oxygen, and the number that follows indicates the ring position. In this case, from the 3rd line of the label file (https://anonymous.4open.science/r/GODESS_preprocess-F9CD/dataset/Godess_carbon/2/c_tsv_stat.txt), we can match the monosaccharide residue ‘b-D-Glcp’ from the label file to monosaccharide ID 2 in the PDB file using the linkage information ‘, 4’, which indicates the bonds (C1 - O4 - C4). Then, again, we assign each monosaccharide atom's shift from the NMR label file to the PDB file by its atom name.
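The use of the atom connection section can be illustrated with a short sketch. It assumes the section follows the standard fixed-width PDB `CONECT` record layout (a 6-character record name followed by 5-character atom serial fields), which may not match the GODESS files exactly. The function names and the toy atom-to-residue map are hypothetical and are not taken from our pipeline.

```python
def parse_conect(lines):
    """Return {atom_serial: set of bonded atom serials} from PDB CONECT records."""
    bonds = {}
    for line in lines:
        if not line.startswith("CONECT"):
            continue
        body = line.rstrip()
        # Serial numbers occupy fixed 5-character fields starting at column 7.
        fields = [body[i:i + 5].strip() for i in range(6, len(body), 5)]
        serials = [int(f) for f in fields if f]
        if not serials:
            continue
        source, neighbours = serials[0], serials[1:]
        bonds.setdefault(source, set()).update(neighbours)
        for n in neighbours:
            # Store the bond in both directions so lookups are symmetric.
            bonds.setdefault(n, set()).add(source)
    return bonds


def inter_residue_bonds(bonds, atom_residue):
    """List bonds whose two atoms belong to different residues (candidate linkages)."""
    found = set()
    for a, neighbours in bonds.items():
        for b in neighbours:
            if atom_residue[a] != atom_residue[b]:
                found.add((min(a, b), max(a, b)))
    return sorted(found)


# Toy records mirroring "atom 1 is connected to atoms 28 and 2".
records = [
    "CONECT" + "".join(f"{s:5d}" for s in (1, 2, 28)),
    "CONECT" + "".join(f"{s:5d}" for s in (2, 1, 3)),
]
bonds = parse_conect(records)
# Hypothetical residue membership: atom 28 sits in a different monosaccharide.
linkages = inter_residue_bonds(bonds, {1: 2, 2: 2, 3: 2, 28: 1})
```

Bonds that cross a residue boundary are the ones that encode glycosidic linkages, so they are the natural key for matching PDB residues to the linkage column of the label file.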
For a detailed description of the data annotation and preprocessing pipeline, and of how domain expertise is involved in GlycoNMR.Sim, please refer to our preprocessing GitHub repo for GlycoNMR.Sim: (https://anonymous.4open.science/r/GODESS_preprocess-F9CD/README.md).
To reproduce the 3D-based GNN benchmark results from the preprocessed carbohydrate data above, please refer to our main GitHub repo: (https://anonymous.4open.science/r/GlycoNMR-D381/README.md), where four notebooks are provided to demonstrate reproducibility:
2D GNN on GlycoNMR.Exp: (https://anonymous.4open.science/r/GlycoNMR-D381/2D_example_Exp_GlycoNMR.ipynb)
2D GNN on GlycoNMR.Sim: (https://anonymous.4open.science/r/GlycoNMR-D381/2D_example_Sim_GlycoNMR.ipynb)
3D GNN on GlycoNMR.Exp: (https://anonymous.4open.science/r/GlycoNMR-D381/3D_example_Exp_GlycoNMR.ipynb)
3D GNN on GlycoNMR.Sim: (https://anonymous.4open.science/r/GlycoNMR-D381/3D_example_Sim_GlycoNMR.ipynb)
We hope the above information can address your concerns.
2, Concerns on the simulation dataset and the simulation software:
We completely agree with reviewer ozW2’s point that training the machine learning model on the simulation dataset could be a bit problematic as the simulation method (GODESS software) itself may contain some unknown bias and inevitable errors. Although there could be some gap between the simulation data and the experimental data, it is common in practice that scientists use simulated NMR spectra to help their interpretations of experimental NMR spectra.
A simulation dataset and the benchmark results on it can provide useful information, which we summarize as follows:
1, Functions as a proof-of-concept to demonstrate the usefulness of machine learning algorithms in the field of glycoscience.
We aim to demonstrate that models such as 3D-based Graph Neural Networks, when provided with extensive carbohydrate data, can attain relatively good performance in core glycoscience tasks, including the prediction of NMR shifts. If an approach does not work well on a large simulation dataset, it is likely to be much harder for it to perform well on an experimental dataset. In addition, in the field of glycoscience, the underdevelopment of machine learning methods stems mainly from the insufficient amount of open-access in-lab (experimental) data, where a single data point may cost hundreds or thousands of dollars to generate. For example, see these NMR price lists: (https://www.aiinmr.com/nmr-spectroscopy-q-a-blog/How-much-does-an-Eft-NMR-Cost) and (https://nmr.science.oregonstate.edu/industry-price-list). We anticipate that, with the help of this demonstration on the simulation dataset, more laboratories will be motivated to share their private experimental data towards the creation of a collaborative ML-chemistry platform and data repository in glycoscience.
2, We believe that using a large simulation dataset could be a good starting point to advance the development of AI methods in glycoscience.
Typically, the advancement of AI methods in a scientific domain begins with the use of simulated datasets. In physics and chemistry domains, simulations are fairly accurate and powerful and can aid in providing large datasets for initial ML model training - especially in topics where publicly available experimental data has low-to-moderate statistics at best (as is the case in carbohydrate NMR datasets).
For example, the QM7 and QM9 datasets [1] were initially proposed to analyze the atomization energies and thermodynamic properties of small molecules, and the datasets were constructed via density functional theory (DFT) modeling. QM7 and QM9 were later integrated into the MoleculeNet platform [2] to facilitate the development of machine learning and deep learning models that learn better representations of molecules. In recent years, the datasets have served as an auxiliary resource for molecule generation and drug discovery [3], [4], and as an important benchmark.
We adopted reviewer ozW2’s suggestions and added the following section to Appendix D: Customized models for carbohydrate data, to clarify the potential for bias in simulation:
By our assessment, GODESS provides the best balance of accuracy, efficiency, and accessibility for the simulation of 1D NMR of carbohydrates. However, as with any simulation method, it likely has some biases and simplifications not seen in experimental data which are difficult to reveal without a large experimental dataset for comparison. Thus, it is important for future work to expand this dataset to include simulation datasets from other sources (e.g. CASPER), as well as to expand the experimental dataset for comparison to the theoretical predictions. The experimental dataset expansion will necessitate a serious and concentrated effort on the part of glycoscience researchers to improve the open data norms of their field.
[1] Rupp, Matthias, et al. "Fast and accurate modeling of molecular atomization energies with machine learning." Physical review letters 108.5 (2012): 058301.
[2] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical Science 9.2 (2018): 513-530.
[3] Bongini, Pietro, Monica Bianchini, and Franco Scarselli. "Molecular generative graph neural networks for drug discovery." Neurocomputing 450 (2021): 242-252.
[4] Luo, Youzhi, and Shuiwang Ji. "An autoregressive flow model for 3d molecular geometry generation from scratch." International Conference on Learning Representations (ICLR). 2022.
3, Question: It would be good to define the NMR shift prediction problem concisely.
In the context of graph-based molecular representation learning, NMR shift prediction can be formulated as a node regression task. The input to an ML algorithm/model is a graph representing a carbohydrate molecule, where each node represents an atom and each link represents a relationship (e.g., a chemical bond) between two atoms. The output is a continuous NMR shift value for each node that represents an atom of interest to users.
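To make the formulation concrete, here is a minimal sketch of the input and output of such a node regression task. The `MoleculeGraph` container, the `masked_rmse` helper, and the shift values are illustrative assumptions rather than the data structures or metric of our benchmark code. The point is only that targets exist for a subset of nodes and the loss is evaluated on those nodes alone.

```python
import math
from dataclasses import dataclass


@dataclass
class MoleculeGraph:
    elements: list  # one entry per node (atom), e.g. "C", "H", "O"
    edges: list     # (i, j) node index pairs, one per relationship such as a bond
    shifts: list    # target shift in ppm per node; None where no label exists


def masked_rmse(graph, predictions):
    """Root-mean-square error over labeled nodes only."""
    squared = [
        (pred - target) ** 2
        for pred, target in zip(predictions, graph.shifts)
        if target is not None
    ]
    if not squared:
        raise ValueError("graph has no labeled nodes")
    return math.sqrt(sum(squared) / len(squared))


# Toy fragment: a carbon bonded to an oxygen and a hydrogen; the oxygen is unlabeled.
graph = MoleculeGraph(
    elements=["C", "O", "H"],
    edges=[(0, 1), (0, 2)],
    shifts=[100.0, None, 5.0],
)
error = masked_rmse(graph, predictions=[101.0, 0.0, 4.0])
```

Masking unlabeled atoms keeps atoms such as oxygen in the graph as structural context without requiring a shift label for them.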
In the context of carbohydrates, predicting 1D NMR chemical shift spectra can reveal the identities and locations of atoms and modification groups for structure solving, for instance by interpreting known spectral signatures of interactions between atoms within a monosaccharide unit, or between a monosaccharide unit and its modifications or neighboring units. Chemical shift values reported in ppm units are independent of spectrometer frequency and thus comparable across labs and NMR equipment. [5]
We added discussion to appendix D, clarifying these points, which were originally more briefly discussed in the Background section related to NMR.
[5] Marion, Dominique. "An introduction to biological NMR spectroscopy." Molecular & Cellular Proteomics 12.11 (2013): 3006-3025.
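The frequency independence of ppm values mentioned above follows from the standard definition of the chemical shift, written out here for readers outside NMR. The symbols are our notation for this reply: $\nu_{\mathrm{sample}}$ is the resonance frequency of the nucleus of interest and $\nu_{\mathrm{ref}}$ is the resonance frequency of the reference compound measured on the same spectrometer.

```latex
\delta\,[\mathrm{ppm}]
  = \frac{\nu_{\mathrm{sample}} - \nu_{\mathrm{ref}}}{\nu_{\mathrm{ref}}} \times 10^{6}
```

Both frequencies scale linearly with the magnetic field strength, so the field dependence cancels in the ratio and the same atom yields the same $\delta$ on different instruments.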
4, Question: Justification for the chosen thresholds for atoms bound to construct a graph representation.
Thanks for pointing out this problem. We have cited the following sources in our manuscript to justify why these cutoffs are reasonable for carbohydrates (e.g., for C-C bonds [6], C-H bonds [7], and X-X bonds [8, 9, 10]). Also, see the new summary of the abundance of each atom type in our dataset in Appendix A.3. We also present general chemical bond information here: https://www.webassign.net/question_assets/wertzcams3/bond_lengths/manual.html. We agree that future work should adjust the bond cutoff distances according to the elements of the atoms involved.
[6] Liu, Sai, et al. "Theoretical predictions of two new chiral solid carbon oxides." Physics Letters A 385 (2021): 126941.
[7] Guzmán-Afonso, Candelaria, et al. "Understanding hydrogen-bonding structures of molecular crystals via electron and NMR nanocrystallography." Nature Communications 10.1 (2019): 3537.
[8] Zhang, Huaiyu, et al. "Electron conjugation versus π–π repulsion in substituted benzenes: why the carbon–nitrogen bond in nitrobenzene is longer than in aniline." Physical Chemistry Chemical Physics 18.17 (2016): 11821-11828.
[9] Gunbas, Gorkem, et al. "Extreme oxatriquinanes and a record C–O bond length." Nature Chemistry 4.12 (2012): 1018-1023.
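As a rough illustration of the future-work direction mentioned in this answer, element-pair-specific cutoffs could be expressed as below. The cutoff values are placeholders chosen for the example and are NOT the thresholds used in the manuscript. The `CUTOFFS` table, `DEFAULT_CUTOFF`, and `build_edges` are hypothetical names.

```python
import math
from itertools import combinations

# Illustrative cutoffs in angstroms, keyed by the unordered pair of elements.
# frozenset(["C"]) stands for a C-C pair. These numbers are placeholders only.
CUTOFFS = {
    frozenset(["C"]): 1.65,
    frozenset(["C", "H"]): 1.15,
    frozenset(["C", "O"]): 1.55,
    frozenset(["O", "H"]): 1.05,
}
DEFAULT_CUTOFF = 1.6  # fallback for element pairs not listed above


def build_edges(elements, coords):
    """Connect two atoms when their distance is within the cutoff for their element pair."""
    edges = []
    for i, j in combinations(range(len(elements)), 2):
        cutoff = CUTOFFS.get(frozenset([elements[i], elements[j]]), DEFAULT_CUTOFF)
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.append((i, j))
    return edges


# Toy geometry: two bonded carbons, one hydrogen on the first carbon, one distant oxygen.
elements = ["C", "C", "H", "O"]
coords = [(0.0, 0.0, 0.0), (1.53, 0.0, 0.0), (0.0, 1.09, 0.0), (5.0, 5.0, 5.0)]
edges = build_edges(elements, coords)
```

Keying the table on an unordered pair means a C-H lookup and an H-C lookup hit the same entry, which avoids having to list every pair twice.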
5, Question: Why did you not look at SHAP values for the feature contributions (top of page 8)?
As we could not find an existing software package to calculate feature contributions for Graph Neural Networks, we calculated Shapley values from scratch for our 2D Graph Neural Network following [10]. Please see the tables below.
| GlycoNMR.Exp | Ring position | Modification | Stem type | Anomer | Configuration | Ring size |
|---|---|---|---|---|---|---|
| Hydrogen | 0.457 | N/A | 0.088 | 0.061 | 0.009 | 0.008 |
| Carbon | 16.852 | N/A | 2.640 | 0.515 | 0.257 | 0.085 |

| GlycoNMR.Sim | Ring position | Modification | Stem type | Anomer | Configuration | Ring size |
|---|---|---|---|---|---|---|
| Hydrogen | 0.387 | 0.014 | 0.112 | 0.051 | 0.014 | 0.003 |
| Carbon | 13.007 | 0.321 | 3.619 | 0.465 | 0.199 | 0.055 |
The Shapley values indicate that, for our 2D-based graph neural network, the atom's ring position and the stem type of its parent monosaccharide are the key predictors of NMR shift values. We have also added a SHAP table containing the above information to Appendix C.
[10] Štrumbelj, Erik, and Igor Kononenko. "Explaining prediction models and individual predictions with feature contributions." Knowledge and information systems 41 (2014): 647-665.
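For readers unfamiliar with the approach in [10], the sampling approximation it describes can be sketched as follows. This is a generic illustration rather than our actual implementation: the function name, its arguments, and the toy linear model standing in for the GNN are assumptions made for the example.

```python
import random


def shapley_sampling(predict, x, background, j, n_samples=1000, seed=0):
    """Monte Carlo estimate of the contribution of feature j to predict(x).

    Each sample draws a random feature ordering and a random background instance z,
    then compares the prediction with and without feature j taken from x, while the
    features that precede j in the ordering are taken from x in both cases.
    """
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_samples):
        order = list(range(n_features))
        rng.shuffle(order)
        z = rng.choice(background)
        preceding = set(order[:order.index(j)])
        with_j = [x[k] if (k in preceding or k == j) else z[k] for k in range(n_features)]
        without_j = [x[k] if k in preceding else z[k] for k in range(n_features)]
        total += predict(with_j) - predict(without_j)
    return total / n_samples


def toy_model(features):
    """Stand-in for the trained network: a fixed linear function of two features."""
    return 2.0 * features[0] + 3.0 * features[1]


background = [[0.0, 0.0]]
phi_0 = shapley_sampling(toy_model, x=[1.0, 1.0], background=background, j=0, n_samples=200)
phi_1 = shapley_sampling(toy_model, x=[1.0, 1.0], background=background, j=1, n_samples=200)
```

For a linear model with an all-zero background, every sample contributes exactly the coefficient times the feature value, which makes this toy case a convenient sanity check before applying the estimator to a real model.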
6, Question: There should be a statement regarding making this data available in the manuscript.
We added the following disclaimer to Appendix F:
Disclaimer for GlycoNMR.Exp: According to [11], all glycan-related scientific data of the GLYCOSCIENCES.de [now called Glycosciences.DB] portals are freely accessible via the Internet following the open access philosophy: "free availability and unrestricted use."
[11] Toukach, Philip, et al. "Sharing of worldwide distributed carbohydrate-related digital resources: online connection of the Bacterial Carbohydrate Structure DataBase and GLYCOSCIENCES.de." Nucleic acids research 35.suppl_1 (2007): D280-D286.
Thank you very much!
In their manuscript, the authors introduce a comprehensive database dedicated to carbohydrate nuclear magnetic resonance. The reviewers commend the substantial effort invested in curating this dataset and the inclusion of baselines. However, it is important to note that empirical papers submitted to ICLR are subject to specific criteria: The process of dataset collection should be framed as a research challenge, detailing the professional methods employed to overcome various obstacles encountered. The aspect of feature engineering must be articulated as a research issue, encompassing an exploration of alternative approaches and their comparative study. It is expected that the experimental section of the paper not only relies on novel algorithms but also frames benchmarks as a research problem. This section should offer conclusive insights and guidance for future practitioners. It is pertinent to recognize that the criteria for dataset inclusion at ICLR differ from those at NeurIPS, which is more focused on fostering new baselines and datasets. While the current paper demonstrates sufficient quality through the proposal of a new dataset or benchmark, it may be more appropriately suited for the Datasets and Benchmarks track at NeurIPS.
Why not a higher score
The paper promotes a novel database but there is no new algorithm. The decision is based on serious discussions between AC, SAC, and PCs.
Why not a lower score
Not applicable.
Reject