Neural Graph Matching Improves Retrieval Augmented Generation in Molecular Machine Learning
We develop a retrieval-augmented generation approach with neural graph matching for mass spectrum prediction, achieving a 45% relative improvement in top-1 retrieval accuracy.
Abstract
Reviews and Discussion
This paper introduces a method, MARASON, for predicting a mass spectrum from molecular graphs. MARASON extends an existing deep learning framework (ICEBERG) by integrating retrieval with neural graph matching. The model retrieves reference molecules with known spectra, then aligns fragments between the target and reference molecules to guide its prediction of the target’s spectrum. Experimental comparisons show that MARASON outperforms current state-of-the-art methods in terms of accuracy and retrieval performance.
Questions For Authors
- In lines 209–211, you state that “in our experiments, we use the training dataset to mitigate concerns about data leakage.” It is unclear how this prevents data leakage. I suggest rewriting this sentence to explain how the procedure prevents leakage.
- In lines 217–219, you mention retrieving up to three reference spectra with collision energies similar to the target. How was the number three selected, and have you investigated how performance or computational efficiency might change if this limit were increased?
- In section 3.4, when predicting the intensities, you rely on learned embeddings (H, Hr), reference intensities (Tr), and the matching matrix (X̄), along with a Tanimoto similarity measure. However, in real-world mass spectrometry, many experimental factors (e.g., ionization mode, instrument settings) can affect fragment intensities. Does this framework explicitly account for these variables, or do you assume consistent conditions across the NIST dataset?
- NIST released an updated tandem mass spectral library (NIST23), which includes around 60% more compounds than NIST20. I understand that the NIST database is not freely available, but, if possible, would you consider re-running the experiments on NIST23 to see how the results generalize with a larger dataset?
Claims And Evidence
Most of the paper’s claims are backed by evidence. However, the reference to retrieval augmented generation (RAG) is misleading because the paper does not actually describe a “G (generation)” process. It is unclear whether this approach truly utilizes RAG.
Methods And Evaluation Criteria
Please report separately the results on the FT-HCD dataset and the FT-CID dataset.
Theoretical Claims
N/A
Experimental Designs Or Analyses
Mostly
Supplementary Material
Supplementary Notes A, B, and C were reviewed. They all appear appropriate. However, it is recommended that the Supplementary Material should be elaborated a bit more to improve completeness. See Weakness #4
Relation To Broader Scientific Literature
N/A
Essential References Not Discussed
The paper is motivated by the assumptions that “similar structures tend to have similar fragmentation patterns in chemistry, and similar fragments tend to have similar response factors that relate their abundance to the observed intensity”, citing only (Shahneh et al., 2024). However, there is a lot of literature in computational mass spectrometry and cheminformatics that specifically discusses the correlations between structural similarity and fragmentation outcomes. I recommend citing additional references to support this idea.
MoMS-Net and 2DMolMS are missing.
Other Strengths And Weaknesses
Strengths:
- Clear discussion of background, previous work, and motivation.
- Clear explanation of relevant concepts and description of the modeling approach.
- The tables presenting results are well-constructed and straightforward. The ablation studies are comprehensive.
- The manuscript is clearly written and easy to follow.
Weaknesses:
- The paper presents a framework for mass spectrum simulation but does not clarify which specific type of mass spectrometry it supports. While ICEBERG is trained on NIST’s tandem mass spectral (MS/MS) data, the reference to “NIST20” in this paper does not explicitly state whether it is electron ionization (EI) data, tandem MS, or another type of mass spectral library. It would be helpful if the authors clarified the applicability of their method to different mass spectrometry modes (such as MS/MS, EI, or GC-MS) and provided evidence or discussion on its performance across these platforms.
- There is a lack of discussion about the architecture of the GNN and other layers used in the framework. It is unclear whether they are identical to the GNN block in ICEBERG. Do all the GNN blocks in FIG. 2 share the same structure? Section 4.2.3 suggests that separating these two GNNs leads to better spectrum similarity, implying that they may have distinct architectures. To address this weakness, I recommend including a more detailed description of the GNN modules, MLP blocks, and matching layer blocks in the Supplementary Information and updating FIG. 2 accordingly.
Other Comments Or Suggestions
None
We thank the reviewer for recognizing the state-of-the-art accuracy of MARASON and the organization of this paper. We believe there are several misunderstandings and we clarify them as follows.
The paper does not describe a generation process and does not utilize RAG.
The problem MARASON aims to tackle is generating MS/MS from a given molecular structure. On the technical side, our first-stage model is an auto-regressive generator that predicts one bond-breaking event at each step.
The paper does not clarify which specific type of mass spectrometry it supports.
We will emphasize that MARASON is developed with ESI-MS/MS in the main text. As discussed in section 4.4.1, “We trained our models on the NIST 2020 dataset with 530,640 high-energy collision-induced dissociation (HCD) spectra and 25,541 unique molecular structures.” We believe we made it clear that our experiment setting contains MS/MS spectra (more specifically, Orbitrap spectra) with an instrument type label “HCD” in the NIST database.
Please report separately the results on the FT-HCD dataset and the FT-CID dataset.
To clarify, as discussed in section 4.4.1, we only use spectra labeled as “HCD” instrument type in NIST. We do not find any entries with instrument type labels CID or FT-CID in NIST. We believe we are working with an open-source training and testing framework that supports most baselines and all methods are trained and tested on the same dataset, ensuring fair comparison.
Questions on model architecture details and GNN layers.
We would like to clarify that MARASON adopts the same backbone architecture as ICEBERG, including both the GNN and transformer layers.
Regarding the GNN blocks shown in Figure 2: the first and third GNNs operate on molecular graphs and are structurally identical to their counterparts in ICEBERG. The second GNN, discussed in Section 3.3.2 under “DAG Hierarchical Embedding Learning,” serves as the neural graph matching module. While all GNN blocks share the same architecture, our strategy involves using separate weights depending on the module's function—i.e., whether it is used for intensity prediction or for graph matching.
We will include more detailed discussions and highlight such differences in Figure 2 in future revisions.
I recommend citing additional references for the correlation between spectrum and structural similarities.
Thank you for the suggestion. We plan to include the following references on molecular networking. Please also let us know if you have any specific references in mind.
- Aron AT et al. Nature protocols. 2020
- Wang M et al. Nature biotechnology. 2016
MoMS-Net and 2DMolMS are missing.
We do not find any open-source implementation for MoMS-Net and the dataset statistics in Table A1 are different from ours, making it challenging to include MoMS-Net for comparison. We do not find any references to 2DMolMS—if you meant 3DMolMS, it is the first entry in Figure 3 and Table 1.
How to prevent data leakage is unclear.
Thanks for the suggestion. One possible concern with running RAG on MS/MS simulation is that the retrieved spectra might be considered a source of data leakage that simplifies the problem. Therefore, in numerical evaluations, we restrict retrieval to spectra from the training data; in real-world use, any available reference spectra can be supplied to MARASON. We will elaborate in future revisions.
Why retrieve three reference spectra with different collision energies?
Since the reference spectra may not have the exact collision energy of interest, MARASON tries to interpolate from three collision energies. We did not try including more references because the more reference spectra we have, the more computational overhead there will be.
Does this framework explicitly account for experimental variables?
Our framework handles ionization mode by different adduct types. We incorporate collision energy as an important instrument variable. For others, we assume they are the same across Orbitrap instruments. If any important factors are missing, please let us know and we would be more than glad to address them in future work.
Testing on NIST23 for results with a larger dataset
We purchased the NIST23 license, but the raw data file (.SDF) we used for NIST20 is not available at least in the NIST23 distribution we purchased. We will certainly explore NIST23 after figuring out how to extract training data from it.
Within the timeframe of rebuttal, we try to address this concern from another direction by showing that MARASON achieves state-of-the-art retrieval accuracy on MassSpecGym (please refer to the table in response to reviewer VQAR).
Recommend to elaborate a bit more on Supplementary Material.
As we cannot edit the supplementary materials at this time, we will try to include more details in future revisions. If there are any specific points that feel unclear to you, please let us know.
This paper proposes a modification of a method for generating mass spectra from molecular structures. Inspired by recent successes in retrieval augmented generation (RAG), the authors decided to apply this technique for querying similar molecules from the training set and to use them as references for generating spectra with a model that extends the previously proposed ICEBERG. The proposed model, named MARASON, retrieves the nearest reference molecule along with three of its spectra to construct a representation vector used for predicting peak intensities. A neural graph matching algorithm is introduced to align the fragment graphs in the fragmentation DAG of both the reference and target molecules. MARASON surpasses other methods on both random and scaffold splits in predicting mass spectra. Additionally, it demonstrates excellent performance in compound retrieval using mass spectra.
update after rebuttal
The Authors addressed all my comments. I decided to maintain my positive score.
Questions For Authors
- Section 3.2.2 says: "All reference intensities at the same collision energy are processed by a set transformer, followed by an average pooling layer that merges intensity embeddings per fragment from three collision energies." I understand that there are three different energies for the reference compound, yet this passage describes processing intensities at the same collision energy. Could you elaborate on this process in more detail? What vectors serve as input to the set transformer?
- How is this model trained? Are there any new components in the loss function? To train neural graph matching, do you use any additional training steps, or is it trained alongside intensity prediction as part of the vector in Equation 7?
Claims And Evidence
The claims made in the paper are supported by experimental evidence. The method can generate mass spectra more accurately, even in the scaffold-based split setup (Figure 3). Moreover, Table 1 shows that retrieval accuracy is also improved over earlier methods. The ablation study confirms the effectiveness of both RAG and neural graph matching.
Methods And Evaluation Criteria
The methods and evaluation criteria are adequate to solve the described problem. However, some parts of the method description could be more detailed, especially given that the code is not yet available. I highlighted these parts in the Questions for Authors.
Theoretical Claims
There are no theoretical claims that need formal proofs.
Experimental Designs Or Analyses
The experiments answer the posed research questions. Conducting statistical tests for the results described in Section 4.2.1 would make the claims in the paper stronger. It would also be interesting to see examples of predicted spectra compared to the original spectra and the spectra of the reference compound. For reference, similar qualitative results were presented in earlier works. Such plots would demonstrate how similar the predicted and reference spectra are.
Supplementary Material
I read the whole supplementary material.
Relation To Broader Scientific Literature
This paper not only presents a significant contribution to mass spectra prediction, surpassing earlier models, but also demonstrates that RAG can be effectively utilized in the molecular domain. Additionally, neural graph matching is shown to be more effective than classical matching methods for processing fragmentation DAGs.
Essential References Not Discussed
All the key references have been discussed.
Other Strengths And Weaknesses
Most of the comments have been addressed in the other sections. Furthermore, I appreciate that Figure 2 offers a clear overview of the proposed method. To enhance this paper, an evaluation on a second dataset, such as NPLIB1, is recommended.
Other Comments Or Suggestions
N/A
We truly appreciate your recognition of our contribution to the mass spectrometry field and our technical novelty of introducing neural graph matching. We conduct statistical tests and perform preliminary MassSpecGym experiments following your suggestions. We will work actively to complete the new results in future revisions.
Conducting statistical tests for the results described in Section 4.2.1 will make claims stronger.
Thank you for the suggestion. We generated results with 3 random seeds for the scaffold split and performed a t-test between ICEBERG (w/ collision energy) and MARASON. The p-values are 0.005 and 0.002 for the random split and scaffold split, respectively, both below the 0.05 threshold, so the improvements are statistically significant at the 95% confidence level.
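For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a two-sample t-statistic over per-seed scores. The rebuttal does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the score lists are made-up placeholders rather than results from the paper.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder per-seed cosine similarities (hypothetical, NOT the paper's numbers).
baseline_scores = [0.710, 0.712, 0.711]
candidate_scores = [0.725, 0.727, 0.729]

t_stat, dof = welch_t(candidate_scores, baseline_scores)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

Turning the statistic into a p-value needs the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one common way to get it.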
It would also be interesting to see examples of predicted spectra compared to the original spectra and the spectra of the reference compound. For reference, similar qualitative results were presented in earlier works. Such plots would demonstrate how similar the predicted and reference spectra are.
Thank you so much for the suggestion. Unfortunately, we do not have the option to include visualization results on OpenReview. We will add the visualization of reference spectra, predicted spectra, and ground-truth spectra in the appendix in future revisions.
To enhance this paper, an evaluation of a second dataset is recommended.
We truly appreciate your suggestion. We have run our method on the recently developed MassSpecGym benchmark [1], which is a publicly accessible library with spectra collected from MoNA, MassBank, and GNPS. We share an initial result of its performance here:
| Top-k accuracy (%) | 1 | 5 | 20 |
|---|---|---|---|
| FraGNNet | 31.93 | 63.20 | 82.70 |
| MARASON | 34.03 | 64.04 | 85.39 |
MARASON outperforms FraGNNet, the state-of-the-art on MassSpecGym, in terms of retrieval accuracy. We obtained this result within the tight rebuttal time frame, and we will continue working on this benchmark toward a comprehensive study in future revisions.
[1] Bushuiev et al. MassSpecGym: A benchmark for the discovery and identification of molecules. NeurIPS 2024
Section 3.2.2 says: "All reference intensities at the same collision energy are processed by a set transformer, followed by an average pooling layer that merges intensity embeddings per fragment from three collision energies." I understand that there are three different energies for the reference compound, yet this passage describes processing intensities at the same collision energy. Could you elaborate on this process in more detail? What vectors serve as input to the set transformer?
You are right about how we process reference spectra at each collision energy. We use three reference spectra because the reference database may not contain a spectrum whose collision energy closely matches the target collision energy, so multiple reference spectra are considered together to interpolate. We first concatenate spectral peaks with their corresponding collision energies and the target collision energy. We then feed the concatenated spectral vectors into a set transformer followed by a linear layer, where the linear layer is designed to “move” the peak intensities to the target energy level. Finally, we use average pooling to collect information from the same peaks at different energy levels, generating a reference spectrum embedding that is a learned interpolation from three energies.
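The data flow described in this answer can be sketched as follows. This is only an illustration of the input construction and the final average pooling. The learned set transformer and linear layer are replaced by an identity placeholder applied per peak, and the fragment names, field layout, and function names are assumptions made for the example, not the authors' code.

```python
from collections import defaultdict

def build_inputs(reference_spectra, target_energy):
    """Return (fragment_id, [intensity, reference_energy, target_energy]) rows."""
    rows = []
    for ref_energy, peaks in reference_spectra.items():
        for fragment_id, intensity in peaks.items():
            rows.append((fragment_id, [intensity, ref_energy, target_energy]))
    return rows

def encode(vector):
    """Identity placeholder for the learned set transformer + linear layer."""
    return list(vector)

def pool_per_fragment(rows):
    """Average the encoded vectors of each fragment over all reference energies."""
    grouped = defaultdict(list)
    for fragment_id, vector in rows:
        grouped[fragment_id].append(encode(vector))
    return {
        fragment_id: [sum(col) / len(col) for col in zip(*vectors)]
        for fragment_id, vectors in grouped.items()
    }

# Hypothetical reference spectra: collision energy -> {fragment id: intensity}.
reference_spectra = {
    20.0: {"frag_a": 0.2, "frag_b": 1.0},
    40.0: {"frag_a": 0.4},
    60.0: {"frag_a": 0.9},
}
pooled = pool_per_fragment(build_inputs(reference_spectra, target_energy=30.0))
print(pooled)
```

In the model described above, the encoder sees all peaks of one collision energy as a set, so each peak's embedding can depend on the others; the per-peak stub here does not capture that.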
How is this model trained? Are there any new components in the loss function? To train neural graph matching, do you use any additional training steps, or is it trained alongside intensity prediction as part of the vector in Equation 7?
The first-stage model that generates fragments is trained in the same way as ICEBERG-Generate. The second-stage model that predicts intensities is trained end-to-end, where the cosine loss between predicted spectra and ground truth spectra is the only supervision. We are able to make such a design choice because being fully differentiable is one of the major advantages of neural graph matching.
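As a rough illustration of the supervision signal mentioned here, the sketch below computes a cosine loss between a predicted and a ground-truth intensity vector. It assumes the loss is one minus the cosine similarity of two equal-length binned spectra; the exact form used in MARASON is not given in this thread.

```python
import math

def cosine_loss(pred, target, eps=1e-12):
    """One minus the cosine similarity of two equal-length intensity vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm, eps)  # eps guards against all-zero spectra

# Identical spectra give a loss near 0; spectra with no shared peaks give 1.
same = cosine_loss([0.0, 1.0, 0.5], [0.0, 1.0, 0.5])
disjoint = cosine_loss([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Because the loss depends only on the predicted intensities, gradients can flow back through any differentiable module that produced them, which is the property the answer above relies on for training the matching module end-to-end.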
Authors present MARASON, which augments a previously-developed framework ICEBERG by retrieving the most similar molecules in a database to a target structure based on Tanimoto similarity. Both target and reference structures are fragmented using ICEBERG; a GNN is used to construct a matching matrix to predict which reference fragments are matches; these are then used to help predict peak intensities for the original target molecule. Authors evaluate their framework on retrieval accuracy, where the closest spectrum to the generated spectrum is retrieved based on cosine similarity, and show that their method outperforms previous baselines that do not use retrieval-based augmentations.
update after rebuttal
Thank you to the authors for their response. I am glad the authors were able to evaluate their method on MassSpecGym and the performance gain is notable given that the benchmark has harder train/test splits.
We treat spectra with different collision energies as distinct spectra.
Based on my understanding, baseline methods first combine spectra from different collision energies into a single spectrum and work with that -- I think it would be good for the authors to disentangle this effect, or at least discuss it if relevant.
I am happy with the rebuttal and will increase my score.
Questions For Authors
[1] How did you choose what information to include in the fragment-level hierarchical embeddings? It seems like a lot of information is concatenated, but it is not clear why (see Experimental Designs Or Analyses for other notes).
[2] Why is random splitting done on 3 seeds but scaffold/RAG ablations only on 1?
[3] Is it possible to compare on a second benchmark like MassSpecGym? (It's not necessary to do this for me to change my evaluation of the paper, but I think it would strengthen the paper a lot.)
[4] Can the authors clarify how they incorporated spectra of different collision energies? Also, the authors improved the ICEBERG baseline by incorporating multiple collision energies; is it also possible to improve other baselines this way?
If authors can clarify their experimental protocol/choices I will be happy to revisit my evaluation.
Claims And Evidence
The claims presented in the paper are well-supported in my opinion. Authors use the ICEBERG framework and test whether or not adding a module that incorporates information from fragments that are predicted to be matching from other molecules in a known database. They show that on the same data splits from NIST, incorporating this module increases the performance on both spectral similarity and retrieval.
Methods And Evaluation Criteria
Authors evaluate their method on NIST, which is a standard benchmark in the field. Authors use two data splits for training/eval -- random and scaffold, which is good. Authors use 3 random seeds for the random split but it seems like only 1 seed for the scaffold split -- why is this the case? Authors could also evaluate their method on more recent benchmarks such as MassSpecGym [1], which would show the utility of the RAG component in harder split settings.
Theoretical Claims
Authors don't make any theoretical claims.
Experimental Designs Or Analyses
In Section 3.3.2, authors describe how they build a fragment embedding for a fragment F_i from a molecule M, although they concatenate a lot of information and it's not clear why it's all needed. For instance, why did the authors decide to incorporate the embedding difference between M and F_i as well as the chemical formula difference between M and F_i? Secondly, in the hierarchical embedding, why are both the forward and reverse graphs needed, as opposed to using a bi-directional GNN?
Otherwise, I think authors did a good number of ablations for their RAG strategy and also made the direct baseline comparison more fair to the baseline. However, could authors clarify how they incorporated the collision energy information? Were the spectra for each collision energy treated as separate entries, or were they combined into a single spectrum? The authors improved the ICEBERG baseline, but based on my understanding, it's also possible to condition other models like NEIMS and MassFormer on this information.
Supplementary Material
Yes, I reviewed all the supplementary material. Authors write that they pulled FraGNNet values directly from the paper -- are the data splits etc. identical to be able to make this comparison?
Relation To Broader Scientific Literature
The problem of MS prediction is an important problem for improving the characterization of molecules, especially given the sparsity of experimental characterizations. The performance metrics indicate that the problem is challenging (the best top-1 performance is 27%). While the method cannot be used to predict the structure of an unknown molecule, it can be used for this indirectly by predicting spectra for unannotated molecules and then using cosine similarity.
Essential References Not Discussed
I am not 100% familiar with SOTA models/works in this domain, but to my knowledge authors are not missing essential references.
Other Strengths And Weaknesses
Strengths:
- Incorporating a RAG module into the ICEBERG framework had a positive effect on all metrics considered.
- RAG for MS has not been previously explored to my knowledge (but it has been explored in other molecule settings)
- Multiple data splits done (random/scaffold)
- The paper is well-written and easy to follow for the most part; figure 2 is helpful for clarifying the method
- The paper was submitted to the applications track which I think is appropriate; I think it's an interesting application of existing methods and empirically shows a nice improvement.
Weaknesses:
- Evaluation on 1 dataset
- Unclear explanation of how collision energies were incorporated
- Only 1 seed for scaffold and RAG ablations
Other Comments Or Suggestions
- page 3: "non-bond" → should it be "non-bold"?
- Authors write that this represents "neutral losses" but don't explain what that is; it would be helpful to define this concept.
- Why did the authors choose 0.1 Da? Is it possible to demonstrate the method on a more fine-grained binning?
Thank you for agreeing with the novelty and technical solidity of our paper. We provide more random seeds, new MassSpecGym results, and MassFormer baseline with collision energy as per your comments. Please find our reply to your questions as follows.
Why does the scaffold split experiment have only 1 random seed?
In the ICEBERG paper, there is only one random seed with scaffold split. Following your suggestion, we run MARASON on 3 random seeds on scaffold split with the following updated results:
| cosine similarity | top-1 acc | top-5 acc | top-10 acc |
|---|---|---|---|
| 0.727±0.002 | 0.284±0.001 | 0.705±0.006 | 0.856±0.004 |
From the results, MARASON’s variance to random seeds is less significant compared to the accuracy improvement (for handy reference, ICEBERG w/ collision energy has a cosine similarity of 0.711) and the trend is consistent with either random or scaffold split. We will update the results in future revisions.
Request to evaluate on MassSpecGym
Although the rebuttal time frame was quite short, we retrained our model and were able to get the following retrieval accuracies on MassSpecGym. We demonstrate the superiority of MARASON on MassSpecGym for candidates with the same formula:
| Top-k accuracy (%) | 1 | 5 | 20 |
|---|---|---|---|
| FraGNNet | 31.93 | 63.20 | 82.70 |
| MARASON | 34.03 | 64.04 | 85.39 |
We will perform more benchmarking and update results in future revisions.
Questions on model design choices and the "neutral loss" embedding.
We would like to clarify that the model architecture was adopted from ICEBERG, and we hypothesized that the GNN architecture of ICEBERG learns the structural information essential for graph matching. To elaborate on encoding the “neutral loss”: a precursor splits into an ionized fragment and a neutral loss, so subtracting the embedding of the ionized fragment from the embedding of the precursor represents the embedding of the neutral loss. We will elaborate on these details in the revised manuscript.
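The same precursor-minus-fragment idea applies to chemical formulas, which the reviewer also asked about. The sketch below computes the elemental composition of a neutral loss as the formula difference between a precursor and a fragment. The function names are hypothetical, and the parser assumes simple formulas with no charges, adducts, parentheses, or isotope labels.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def neutral_loss(precursor, fragment):
    """Element counts lost when `precursor` fragments into `fragment`."""
    diff = parse_formula(precursor)
    diff.subtract(parse_formula(fragment))
    if any(count < 0 for count in diff.values()):
        raise ValueError("fragment is not a sub-formula of the precursor")
    return {element: count for element, count in diff.items() if count > 0}

# Losing water from a hexose: C6H12O6 -> C6H10O5 + H2O.
loss = neutral_loss("C6H12O6", "C6H10O5")
print(loss)
```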
Why are both the forward and reverse graphs needed in DAG embedding?
When embedding the fragmentation DAG, the forward graph handles parent-to-children message passing and the reverse graph handles children-to-parent. We believe the update functions of both directions should be distinct to avoid degeneration into an undirected graph. Empirically, this strategy was more powerful than using bi-directional GNNs during our model development.
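A toy illustration of the two directions is given below. It assumes scalar node states and two hand-written update functions standing in for the separately parameterized GNN layers, so it shows only the direction-specific updates and not the actual MARASON architecture. The DAG, weights, and names are invented for the example.

```python
def directional_pass(order, incoming, states, update):
    """Visit nodes in `order`; combine each node's state with the mean of its incoming states."""
    out = dict(states)
    for node in order:
        sources = incoming.get(node, [])
        if sources:
            message = sum(out[s] for s in sources) / len(sources)
            out[node] = update(out[node], message)
    return out

def forward_update(own, message):
    """Parent-to-child update (stand-in for one set of learned weights)."""
    return own + 0.5 * message

def reverse_update(own, message):
    """Child-to-parent update (stand-in for a second, distinct set of weights)."""
    return own - 0.25 * message

# Hypothetical fragmentation DAG: root -> a, b; a -> c; b -> c.
children = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
topo_order = ["root", "a", "b", "c"]
h0 = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.0}

h_forward = directional_pass(topo_order, parents, h0, forward_update)
h_reverse = directional_pass(list(reversed(topo_order)), children, h0, reverse_update)
print(h_forward, h_reverse)
```

If both directions shared one update function, the two passes would differ only in traversal order, which is the degeneration toward an undirected graph that the answer above warns about.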
How are different collision energies incorporated?
We treat spectra with different collision energies as distinct spectra. The collision energy value becomes another input dimension that is concatenated to the GNN input. In NIST retrieval experiments, we compute spectra at all collision energies, compare each one with its corresponding real spectrum, and compute the average cosine similarity over all recorded collision energies. The preliminary MassSpecGym experiments follow the official benchmark setting, which we will elaborate upon in the revised version.
Are the data splits identical to FraGNNet?
As discussed in the FraGNNet paper (Appendix I), their NIST benchmark follows MassFormer and ICEBERG to ensure fair comparison, which should align with our benchmark.
It's also possible to improve other models with RAG.
We absolutely agree with that; we select ICEBERG as the base model because it is the current open-source state-of-the-art. Another conclusion is that neural graph matching is vital for the success of RAG—just concatenating the reference spectrum to the neural network even harms the test performance, as shown in Table 2. Ultimately, we have demonstrated a RAG approach that can be readily adapted to other modeling tasks in the molecular spectroscopy domain.
Is it possible to improve other baselines with collision energy?
Yes. We applied the same design to MassFormer and validated changes to retrieval accuracies on NIST’20:
| Top-k accuracy (%) | 1 | 5 | 10 |
|---|---|---|---|
| MassFormer | 19.1 | 55.0 | 71.6 |
| MassFormer (w/ collision energy) | 20.9 | 59.6 | 76.4 |
| MARASON | 27.8 | 68.5 | 82.7 |
It shows a marginal improvement but still does not outperform MARASON.
Why did the authors choose 0.1 Da? Is it possible to demonstrate with more fine-grained binning?
Since we want to compare all methods under the same metric, every model’s output should be transformed into the same mass resolution. 0.1 Da is the result of balancing between the mass resolution of HCD/Orbitrap spectra and the feasibility of implementing binned-prediction baselines. Empirically, experimentalists also find that rounding to 0.1 Da resolution is sufficient in practice. MARASON (and ICEBERG) are adaptable to any mass resolution and can be run with higher resolution; however, selecting a finer-grained resolution would make binned-prediction methods infeasible, as a 0.01 Da resolution with a maximum mass of 1500 Da would require a 150,000-dim output in their neural networks.
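The dimensionality argument can be checked with a short sketch. The binning function below is an assumption about how peaks might be discretized (intensities falling in the same bin are summed, and peaks at or above the maximum mass are dropped); the individual baselines may use different conventions.

```python
import math

def num_bins(max_mass, bin_width):
    """Output dimension of a binned spectrum covering [0, max_mass)."""
    return round(max_mass / bin_width)

def bin_spectrum(peaks, max_mass=1500.0, bin_width=0.1):
    """Convert (m/z, intensity) peaks into a fixed-length intensity vector."""
    size = num_bins(max_mass, bin_width)
    vector = [0.0] * size
    for mz, intensity in peaks:
        index = math.floor(mz / bin_width + 1e-9)  # small offset absorbs float error
        if 0 <= index < size:
            vector[index] += intensity
    return vector

coarse = num_bins(1500.0, 0.1)    # 15,000 output dimensions
fine = num_bins(1500.0, 0.01)     # 150,000 output dimensions
binned = bin_spectrum([(100.04, 1.0), (100.06, 0.5), (250.33, 2.0)])
print(coarse, fine)
```

Going from 0.1 Da to 0.01 Da multiplies the output size by ten, which is the cost for binned-prediction baselines that the answer above refers to.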
Typo on page 3
Thanks for pointing this out. It has been fixed.
This study introduces MARASON, an advanced computational framework that enhances RAG in mass spectrum prediction through neural graph matching. It evolves from the ICEBERG framework through a synergistic integration of graph-based neural architectures and spectral alignment mechanisms.
Questions For Authors
- Under circumstances where the mass spectral database lacks structural analogs, could the RAG strategy potentially lose efficacy due to failed similarity-based retrieval, thereby reverting to a purely de novo generation mode dependent on deep learning architectures?
- The MARASON framework demonstrates significant performance enhancement over the baseline (ICEBERG) in mass spectrometric analysis. I wonder whether the authors could elucidate the respective contributions of its dual modules. Specifically, does the observed enhancement primarily stem from additional information introduced by spectral retrieval or from the structural alignment capability enabled by neural graph matching?
Claims And Evidence
The claims are clearly supported.
Methods And Evaluation Criteria
The method is technically sound.
Theoretical Claims
The paper focuses on the application of an AI method, and no proof is needed.
Experimental Designs Or Analyses
The experimental designs are sound.
Supplementary Material
I've checked the appendix, which provides further experimental results.
Relation To Broader Scientific Literature
The contributions of this paper are novel and original.
Essential References Not Discussed
References are sufficient.
Other Strengths And Weaknesses
Strengths:
- The integration of RAG and neural graph matching under the ICEBERG framework synergistically enhances both prediction accuracy and generalization capability, establishing a new state-of-the-art performance benchmark.
- The paper is clear and easy to follow.
Weaknesses:
- The model's training and evaluation were exclusively conducted using the NIST dataset, with no validation performed on alternative publicly accessible mass spectral libraries (e.g., MoNA) or laboratory-curated in-house datasets.
Other Comments Or Suggestions
See questions.
Ethical Review Concerns
N/A
Thank you for recognizing our state-of-the-art performance and our technical soundness. We update with experiments on the recently developed MassSpecGym benchmark, evaluation with non-similar reference structures, and elaboration on the ablation study to address your concerns. We are more than happy to clarify any further questions.
The model's training and evaluation were exclusively conducted using the NIST dataset, with no validation performed on alternative publicly accessible mass spectral libraries (e.g., MoNA).
We truly appreciate your suggestion. We have run our method on the recently developed MassSpecGym benchmark [1], which is a publicly accessible library with spectra collected from MoNA, MassBank, and GNPS. We share an initial result of its performance here:
| Top-k accuracy (%) | 1 | 5 | 20 |
|---|---|---|---|
| FraGNNet | 31.93 | 63.20 | 82.70 |
| MARASON | 34.03 | 64.04 | 85.39 |
MARASON outperforms FraGNNet, the state-of-the-art on MassSpecGym, in terms of retrieval accuracy. We obtained this result within the tight rebuttal time frame, and we will continue working on this benchmark toward a comprehensive study in future revisions.
A bit more clarification on why we focused on NIST: as discussed under “software and data” on page 9, NIST is the only database where all spectra have collision energy annotations. In most open-source MS/MS libraries, collision energy labels are at least partially missing, making it challenging to develop a RAG model with them. It is still a reasonable assumption to have access to collision energy values for prospective use because it is an experimental variable set on MS/MS instruments.
[1] Bushuiev et al. MassSpecGym: A benchmark for the discovery and identification of molecules. NeurIPS 2024
Under circumstances where the mass spectral database lacks structural analogs, could the RAG strategy potentially lose efficacy due to failed similarity-based retrieval?
Thank you for bringing up this important point. MARASON benefits from RAG when the Tanimoto similarity is > 0.3 (90.7% of the testing set). In low Tanimoto similarity (< 0.3) regimes, the performance of MARASON is still close to the non-RAG baseline.
We group test instances based on the Tanimoto similarity between the retrieved structure and the target structure. We report cosine similarities (higher is better) as follows:
| Tanimoto similarity | (0, 0.1] | (0.1, 0.2] | (0.2, 0.3] | (0.3, 0.4] | (0.4, 0.5] | (0.5, 0.6] | (0.6, 0.7] | (0.7, 0.8] | (0.8, 0.9] | (0.9, 1] |
|---|---|---|---|---|---|---|---|---|---|---|
| MARASON | N/A | 0.550 | 0.614 | 0.690 | 0.741 | 0.789 | 0.815 | 0.808 | 0.805 | 0.824 |
| MARASON (non-RAG) | N/A | 0.566 | 0.611 | 0.682 | 0.727 | 0.768 | 0.791 | 0.780 | 0.759 | 0.784 |
All results are from a random split on NIST20. MARASON (non-RAG) is from the first entry in Table 2. This study also highlights a strategy for further performance improvements: use standard MARASON when the retrieved structure has Tanimoto similarity > 0.3 and use the non-RAG version otherwise. There is a trend of decreased performance for non-RAG MARASON when the Tanimoto similarity drops, because a lower Tanimoto similarity means there are fewer similar structures in the training set, making it a more challenging out-of-distribution instance. We will incorporate new results and discussions in future revisions.
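The fallback strategy suggested in this answer can be sketched as a simple gate. The example assumes fingerprints are given as collections of on-bit indices and uses hypothetical names (`tanimoto`, `choose_model`); only the 0.3 threshold comes from the discussion above.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def choose_model(target_fp, reference_fps, threshold=0.3):
    """Pick 'rag' when the best reference is similar enough, else 'non_rag'."""
    best = max((tanimoto(target_fp, fp) for fp in reference_fps), default=0.0)
    return ("rag" if best > threshold else "non_rag"), best

# Toy fingerprints (hypothetical bit indices).
close_choice, close_sim = choose_model({1, 2, 3, 4}, [{1, 2, 3, 5}, {9}])
far_choice, far_sim = choose_model({1, 2}, [{2, 3, 4, 5, 6, 7, 8}])
print(close_choice, close_sim, far_choice, far_sim)
```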
Does the observed enhancement primarily stem from additional information introduced by spectral retrieval or from the structural alignment capability enabled by neural graph matching?
Thank you for the thought-provoking question. We are happy to elaborate based on our ablation study presented in Table 2.
Simply retrieving the reference spectrum and concatenating it with the input to the neural network—a naive RAG-style approach—does not improve performance for MS/MS generation. In fact, it slightly degrades it: the resulting cosine similarity is 0.737, representing a 0.3% decrease compared to the non-RAG baseline.
Our findings suggest that effective fragment-level matching is crucial for realizing the benefits of retrieval. For instance, applying a simple Hungarian matching algorithm raises the cosine similarity to 0.746, which is a 0.9% improvement over the baseline. Further, introducing a learnable neural graph matching module with carefully designed architecture yields a cosine similarity of 0.757, amounting to a 2.4% relative improvement over the non-RAG model.
While gains in retrieval accuracy are even more substantial—partly because the cosine similarity metric is more saturated—the top-1 retrieval accuracies indicate that there remains considerable room for improvement.
In summary, the structural alignment capability enabled by the neural graph matching module is the primary driver of MARASON's performance improvement.
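To make the non-learned matching baseline in this ablation concrete, the sketch below solves a small fragment assignment problem exactly. It is a brute-force search over permutations, used as a stand-in for the Hungarian algorithm since both return a maximum-score one-to-one assignment; it is only practical for a handful of fragments, and the score matrix is invented for illustration.

```python
from itertools import permutations

def best_assignment(score):
    """Maximum-score one-to-one assignment of target rows to reference columns.

    score[i][j] is the similarity between target fragment i and reference
    fragment j; requires len(score) <= len(score[0]).
    """
    n_targets, n_refs = len(score), len(score[0])
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n_refs), n_targets):
        total = sum(score[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total

# Greedy matching would pair target 0 with reference 0 (0.9) and score 1.0 overall;
# the optimal assignment swaps them and scores 1.65.
similarity = [
    [0.90, 0.80],
    [0.85, 0.10],
]
assignment, total_score = best_assignment(similarity)
print(assignment, total_score)
```

For realistic sizes, `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time. Unlike the neural graph matching module, a hard assignment of this kind is not differentiable, so it cannot be trained end-to-end with the intensity predictor.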
Most of my concerns have been addressed, and I would like to update my score to 3.
The rebuttal clarified a few concerns of the reviewers and provided further experimental results. After considering these, all reviewers recommend accepting the paper and particularly appreciate the comprehensive ablation study.