PaperHub

Rating: 5.3 / 10 · Poster · 6 reviewers (lowest 3, highest 7, standard deviation 1.5)
Individual ratings: 5, 6, 7, 4, 7, 3
Confidence: 3.0 · Correctness: 2.7 · Contribution: 2.5 · Presentation: 2.7

NeurIPS 2024

AdaNovo: Towards Robust \emph{De Novo} Peptide Sequencing in Proteomics against Data Biases

Submitted: 2024-05-11 · Updated: 2024-11-06
TL;DR

We propose a novel framework that calculates conditional mutual information (CMI) between the mass spectrum and each amino acid or peptide, using CMI for robust model training against data biases in proteomics.

Abstract

Keywords

De Novo Peptide Sequencing · Proteomics · Mass Spectrum

Reviews and Discussion

Review
Rating: 5

The authors introduce a novel approach for the robust de novo sequencing of peptides from mass spectrometry experiments. The novelty on the application side is the focus on post-translationally modified peptides, which are commonly ignored in pure de novo sequencing approaches yet are of the highest biological relevance in many applications. Methodologically, this is achieved by computing the Conditional Mutual Information between mass spectra and amino acid sequences. Evaluation is carried out against commonly used learning-based tools in the field on a commonly used benchmark dataset for de novo sequencing.

Strengths

• The authors identify a highly relevant biological question, the robust identification of post-translationally modified peptides, which is only very rarely studied in the context of de novo sequencing
• The proposed solution based on conditional mutual information fits the problem at hand well
• The authors benchmark against state-of-the-art tools on a commonly used data set and show improvements there

Weaknesses

• The initial hypothesis (that Post-Translational Modifications, due to their lower frequency in training data compared to canonical amino acids, result in unsatisfactory peptide sequencing) is somewhat flawed. PTMs are rare: even the most commonly occurring ones such as phosphorylation (while common on the protein level) end up occurring extremely sparsely (<1%) on the peptide level under natural conditions. Missing PTMs is suboptimal, but their effect on the overall peptide sequencing level should be negligible.
• What is commonly done (e.g. https://www.nature.com/articles/s42256-022-00467-7) is to use machine learning models to identify the location of PTMs without necessarily requiring the sequence. This is a different problem than de novo sequencing, and the authors may want to acknowledge these two different schools of thought and consider benchmarking accordingly.
• The benchmark data set is likely not adequate for learning tasks and is potentially subject to major data leakage. A high proportion of peptides is shared between highly related organisms such as mouse and human (e.g. https://www.researchgate.net/figure/Human-and-Mouse-Proteins-with-Identical-Amino-Acid-Sequences_tbl1_14316492). This is even more pronounced considering that highly expressed proteins (such as housekeeping proteins) tend to be even better conserved. Thus, many peptides of one species are contained in the training set but also, as part of another related species, in the test set. This does not cover the problem of identifying novel proteins, particularly in the settings where a user would typically apply de novo sequencing (usually underexplored organisms without closely related ones with available genomes).
• The lack of error bars (due to computational complexity) makes it hard to judge the claim that the method provided by the authors outperforms other approaches. The standard deviations given actually indicate that this may not be the case and that there may be no statistically significant advance.

Questions

  • While I understand that computing standard deviations for all experiments can be computationally prohibitive, could you also provide them at least for Mouse and Human for all tools and indicate why you believe that AdaNovo outperforms other tools?
  • Can you quantify how many of the peptides in the 9 species datasets are actually identical between species or how you protect against data leakage?
  • Can you indicate absolute numbers for peptides with/without PTMs in the 9 species benchmark?
  • Can you delineate your approach from existing approaches to identify PTMs and argue why it is beneficial to include this step in de novo sequencing rather than running it separately?

Limitations

The authors indicate limitations in identifying never observed PTMs (they may want to explicitly acknowledge that some open search tools actually are able to do that).

Author Response

Response to Reviewer DL85 (2/2)

Could you provide standard deviations for Mouse and Human for all tools?

Thanks for your helpful reviews! Following your valuable advice, we reproduce the performance of DeepNovo and PointNovo with standard deviations over 5 different random initializations. The results shown in Table Re4 indicate that AdaNovo outperforms previous methods with statistically significant margins.

Table Re4: Empirical comparison of previous models on Mouse and Human datasets in terms of amino acid-level/peptide-level precision (with standard deviations).

| Models | DeepNovo | PointNovo | CasaNovo | AdaNovo |
| --- | --- | --- | --- | --- |
| Mouse | 0.627 ± 0.009 / 0.293 ± 0.012 | 0.625 ± 0.010 / 0.352 ± 0.014 | 0.612 ± 0.015 / 0.449 ± 0.010 | 0.667 ± 0.018 / 0.493 ± 0.015 |
| Human | 0.603 ± 0.012 / 0.290 ± 0.019 | 0.596 ± 0.014 / 0.342 ± 0.010 | 0.585 ± 0.010 / 0.343 ± 0.016 | 0.618 ± 0.013 / 0.373 ± 0.012 |

The authors indicate limitations in identifying never observed PTMs.

Thank you for your professional reviews! Open search tools such as MSFragger [6] can identify PTMs that have already been discovered and are included in the UniMod database [7]. However, what we mean are PTMs that have not yet been discovered. We will further clarify this distinction in the revised version.

[1] Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides (Altenburg et al., Nature Machine Intelligence)
[2] Large-scale prediction of protein ubiquitination sites using a multimodal deep architecture (He et al., BMC Systems Biology)
[3] De novo peptide sequencing by deep learning (Tran et al., PNAS)
[4] Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices (Qiao et al., Nature Machine Intelligence)
[5] De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model (Yilmaz et al., ICML 2022)
[6] MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics (Kong et al., Nature Methods)
[7] Unimod: Protein modifications for mass spectrometry (Creasy et al., Proteomics)


We greatly appreciate your insightful and helpful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified any ambiguities, we respectfully hope that you consider raising the score. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript.

Comment

Thank you for the detailed explanations. This is very helpful. However, I wonder if we have a misunderstanding regarding PTMs here. The motivation of the paper and the answers given seem to imply that the authors focus on biologically meaningful PTMs (such as phosphorylation); however, the frequency-of-occurrence numbers rather imply that the authors may find modifications such as carbamidomethylation or oxidation that result from the mass spectrometry acquisition process (but do not carry any biological relevance and are not actual PTMs in the biological sense). The frequency numbers the authors describe are significantly above what is commonly described in the literature for biological modifications. This is relevant for the discussion here, since one usually filters out fixed or very frequent modifications differently than one searches for true PTMs. Also, frequency matters a lot for the discussion of whether to train specific models, and for the leakage discussion I wonder whether this was considered (are only the directly 1:1 identical peptides removed, or also those that may carry an Ox(M)?). Thanks for clarifying.

Comment

Missing PTMs' effect on the overall peptide sequencing has to be negligible. Can you indicate absolute numbers for peptides with/without PTMs in the 9 species benchmark?

Thanks for your insightful and to-the-point reviews! We provide the absolute numbers for peptides with/without PTMs of the 9 species benchmark in Table Re1. As can be observed, peptides with PTMs account for 11.8% to 31.1% across different species. These proportions are relatively large and have significant impacts on the overall performance of peptide sequencing. This is because if there is a single incorrect prediction of a PTM within a peptide, the entire predicted peptide will be erroneous.

Furthermore, we remove peptides with PTMs from the test set and compare the results with those obtained before the removal. As shown in Table Re2, the model's peptide-level precision significantly improves after removing the peptides with PTMs from the test set, which verifies that PTMs have significant impacts on the overall de novo peptide sequencing performance.

Table Re1: The absolute numbers for peptides with/without PTMs in 9 species datasets.

| Species | Rice bean | Honeybee | Bacillus | Clam bacteria | Human | Mouse | M. mazei | Yeast | Tomato |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Peptides with PTMs | 10512 | 64488 | 34474 | 46846 | 36008 | 10725 | 36472 | 24849 | 76620 |
| Peptides without PTMs | 27263 | 250083 | 257309 | 103765 | 94575 | 26296 | 127949 | 86463 | 213430 |
| Ratio of peptides with PTMs | 27.8% | 20.5% | 11.8% | 31.1% | 27.6% | 29.0% | 22.2% | 22.3% | 26.4% |

Table Re2: The peptide-level precision of models before and after removing peptides with PTMs from the test set.

| Species | Bacillus | M. mazei | Clam bacteria |
| --- | --- | --- | --- |
| Casanovo (before removing peptides with PTMs) | 0.513 | 0.474 | 0.347 |
| Casanovo (after removing peptides with PTMs) | 0.572 | 0.520 | 0.385 |
| AdaNovo (before removing peptides with PTMs) | 0.561 | 0.523 | 0.397 |
| AdaNovo (after removing peptides with PTMs) | 0.598 | 0.545 | 0.423 |

Can you delineate your approach from existing approaches to identify PTMs and argue why it is beneficial to include this step in de novo sequencing rather than running it separately?

Thanks for your comments! AdaNovo can predict both the locations and the types of PTMs, while previous methods can only predict the locations of one specific PTM type such as phosphorylation [1] or ubiquitination [2]. Therefore, the task studied here is more challenging. If we conducted PTM identification and de novo sequencing separately, we would need to train distinct models to predict the various types of PTMs one by one, which is computationally complex and prone to errors. More importantly, as verified in Table Re2, PTMs have significant impacts on the overall performance of peptide sequencing. Therefore, it is necessary to design novel methods to enhance PTM identification in de novo peptide sequencing.

The benchmark data set is potentially subject to data leakage. Can you quantify how many of the peptides in the 9 species datasets are actually identical between species or how you protect against data leakage?

Another good point! The nine species datasets are widely used for de novo sequencing methods [3,4,5] and we follow the same experimental settings. Specifically, we employ a leave-one-out cross-validation framework where the peptides in the training set are almost completely disjoint from the peptides of the held-out species. To illustrate this point, among the ~26,000 unique peptide labels associated with the human spectra in the test data, only 7% overlap with the ~250,000 unique peptide labels associated with spectra from the other eight species. The majority (93%) of peptides are not present in the training set, ensuring that the dataset is suitable for evaluating de novo sequencing methods. To further mitigate your concerns, we remove the test samples whose peptide labels overlap with the training set. The results in Table Re3 verify that AdaNovo also exhibits significant advantages on such datasets.

More importantly, in de novo sequencing, we can regard the mass spectra as samples and peptides as labels. The same peptide labels will yield different mass spectra under different experimental conditions. Data leakage occurs when test samples are present in the training set. In our experiments, the training and test spectra come from different experiments, making it impossible for test spectra to be present in the training set. This segregation ensures that there is no data leakage in the benchmark data.
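The overlap check described above can be reproduced with plain set operations (a minimal sketch; the peptide strings and the helper name below are hypothetical illustrations, not from our codebase):

```python
def peptide_overlap(test_peptides, train_peptides):
    """Fraction of unique test peptide labels that also occur in the training set."""
    test_set = set(test_peptides)
    shared = test_set & set(train_peptides)
    return len(shared) / len(test_set)

# Toy example: 1 of 4 unique test peptides also appears in training.
train = ["PEPTIDE", "SEQVENCE", "MASSPEC"]
test = ["PEPTIDE", "NOVELAAA", "NOVELBBB", "NOVELCCC"]
print(peptide_overlap(test, train))  # → 0.25
```

Filtering the test set then amounts to dropping every spectrum whose peptide label falls in the shared set.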

Table Re3: The peptide-level precision on the test set without peptides overlapping with the training set.

| Species | Bacillus | M. mazei | Clam bacteria |
| --- | --- | --- | --- |
| Casanovo | 0.492 | 0.465 | 0.328 |
| AdaNovo | 0.549 | 0.515 | 0.376 |

The second part of our response can be found in the next block.

Comment

Thanks very much for your professional, insightful, and helpful reviews! We greatly enjoy such in-depth discussions and believe they will significantly enhance the quality of our work.

Q1: I wonder if we have a misunderstanding regarding PTMs here. The motivation of the paper seem to imply that the authors only focus on biological meaningful PTMs ... The frequency numbers the authors describe is significantly above what is commonly described in literature for biological modification...

The motivation of our work lies in the fact that all PTM types (instead of one specific PTM type) have significant impacts on the overall de novo sequencing performance, which we have explained and experimentally verified in our original manuscript and response above.

Following your previous advice, we provide the ratios/frequency of peptides with PTMs in Table Re1, which can be formulated as $\frac{\text{number of peptides with any PTM type}}{\text{number of peptides}}$. However, the frequency numbers of PTMs you mention now can be formulated as $\frac{\text{number of a specific type of PTM}}{\text{number of amino acids}}$. We show the statistics of the frequency of 3 PTM types in Table Re5, from which we can observe that the frequency of PTMs is low (0.051% to 0.914%) in the datasets, just as commonly described in the literature for biological modifications. However, the frequency of peptides with PTMs is relatively high (see Table Re1), resulting in low peptide-level precision of previous de novo sequencing methods, because if a PTM in a peptide sequence is predicted incorrectly, the entire predicted peptide sequence is incorrect in the de novo sequencing task. In other words, PTMs have significant impacts on the overall peptide-level performance of de novo sequencing even though their frequency is lower than that of canonical amino acids.
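The difference between the two ratios can be made concrete with a toy computation (a sketch only; the peptide strings are hypothetical, with a lowercase letter marking a modified residue):

```python
def peptide_level_ratio(peptides):
    """Fraction of peptides carrying at least one PTM of any type."""
    return sum(any(ch.islower() for ch in p) for p in peptides) / len(peptides)

def residue_level_frequency(peptides, ptm="m"):
    """Fraction of all residues carrying one specific PTM type."""
    total_residues = sum(len(p) for p in peptides)
    return sum(p.count(ptm) for p in peptides) / total_residues

# Four toy peptides; 'm' marks an oxidized methionine.
peps = ["PEPmTIDE", "SEQVENCE", "MASSPECm", "ANALYSIS"]
print(peptide_level_ratio(peps))       # → 0.5    (2 of 4 peptides modified)
print(residue_level_frequency(peps))   # → 0.0625 (2 of 32 residues modified)
```

A small residue-level frequency can therefore coexist with a large peptide-level ratio, which is exactly the situation in Tables Re1 and Re5.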

Table Re5: The frequency of 3 PTM types in 9 species datasets.

| PTM types | Rice bean | Honeybee | Bacillus | Clam bacteria | Human | Mouse | M. mazei | Yeast | Tomato |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M(+15.99) | 0.698% | 0.077% | 0.218% | 0.914% | 0.238% | 0.617% | 0.871% | 0.389% | 0.900% |
| N(+.98) | 0.304% | 0.217% | 0.536% | 0.441% | 0.424% | 0.328% | 0.462% | 0.385% | 0.356% |
| Q(+.98) | 0.051% | 0.118% | 0.112% | 0.090% | 0.143% | 0.103% | 0.061% | 0.134% | 0.065% |

Q2: Frequency matters a lot for the discussion of training specific models or not and also for the leakage discussion I wonder whether this was considered ...

In some biological applications, we may only care about one specific type of PTM. For example, in the PTM prediction task, previous methods usually train one model for one specific PTM type. However, to evaluate de novo sequencing models, it is unreasonable to keep only one specific PTM type and remove the other PTM types (such as Ox(M)) from the datasets, because all the PTM types have significant impacts on the overall performance of de novo sequencing models, as we have explained and experimentally verified in our manuscript and responses above. Additionally, de novo sequencing is primarily applied to identify the proteomic "dark matter" caused by PTMs and other alterations, so it is preferable that de novo sequencing models can identify more PTM types. Therefore, it is reasonable that the widely used 9 species datasets for evaluating de novo sequencing models contain multiple PTM types. There is NO data leakage in these datasets, as we have thoroughly clarified before.

Furthermore, regarding training specific models for each PTM type in the PTM prediction task, some recent works in PTM prediction also target the prediction of multiple PTM types. One is CapsNet_PTM, which uses a CapsNet for the prediction of seven PTM types [1]. More recently, MusiteDeep has been extended to incorporate the CapsNet with ensemble techniques for the prediction of more PTM types [2].

[1] Capsule network for protein post-translational modification site prediction (Wang et al., Bioinformatics)
[2] MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization (Wang et al., Nucleic Acids Research)


Thank you once again for your thorough, insightful, and constructive reviews! We truly appreciate these detailed discussions and believe they will help resolve any misunderstandings between us. If our responses have addressed your concerns fairly, we respectfully hope you might consider raising your score to support our work. If you have any other questions or concerns, we are very willing to engage in further discussion. Thank you for the time you have dedicated to our work, which will undoubtedly help us improve the quality of our manuscript.

Comment

Thank you for clarifying and resolving our misunderstanding. However, please note that there is a significant difference between "natural" PTMs such as phosphorylation, which carry biological meaning, and accidental modifications such as those you list in Table 1. Oxidation of M or deamidation of N and Q are artifacts of suboptimal sample handling in mass spectrometry acquisition. Missing them means missing a peptide, I agree, but it is not the same as missing a natural PTM, such as a phosphorylation, which means missing an important biological signal. There are many tools out there already for cleaning mass spectrometry artifacts, and that is easier (since it is a very limited space) than actually finding PTMs. To me, the storytelling of this contribution is rather misleading, since it appears to solve a holy grail in MS proteomics (finding true PTMs), but the evidence provided appears geared towards technical artifact removal. Don't get me wrong. This is also valuable, but it is not the same as finding natural PTMs.

I have adjusted my score here, but still have doubts regarding this contribution.

Comment

Another good point! Actually, our model AdaNovo is also applicable to the identification of biologically meaningful PTMs, depending on the training dataset used. If the dataset contains various "natural" or biologically meaningful PTMs (such as phosphorylation), AdaNovo can also be used to identify these important biological signals. We conducted some experiments to support our claims. Specifically, we utilized a dataset that encompasses 21 distinct PTMs, referred to as the 21PTMs dataset, as detailed in [1]. For illustrative purposes, we selected data for 5 "natural" or biologically meaningful PTMs (tyrosine phosphorylation, tyrosine nitration, proline hydroxylation, lysine methylation, and arginine methylation) to train and test Casanovo and AdaNovo (train:test = 9:1). As shown in Table Re6, AdaNovo outperforms Casanovo in identifying these natural or biologically meaningful PTMs, which further leads to AdaNovo's superiority in overall de novo sequencing performance. We will emphasize this good point in the final version.

Table Re6: The models' performance on the datasets with natural PTMs.

| Models | PTM-level precision | Amino acid-level precision | Peptide-level precision |
| --- | --- | --- | --- |
| Casanovo | 0.413 | 0.604 | 0.336 |
| AdaNovo | 0.518 | 0.620 | 0.377 |

[1] ProteomeTools: Systematic characterization of 21 post-translational protein modifications by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using synthetic peptides (Zolg, D. P., et al., Molecular & Cellular Proteomics)


We sincerely appreciate the points you have raised. They have helped us identify issues we had not previously considered, particularly regarding AdaNovo's capability to identify natural or biologically meaningful PTMs. We sincerely hope that our response can address your concerns and further enhance your positive recommendation of our manuscript. If you have any additional questions or concerns, we would be more than happy to engage in a deeper and more thorough discussion. Thank you once again!

Comment

Dear Reviewer DL85,

We have provided detailed responses to your reviews. Considering that the deadline for the discussion phase is approaching, we would like to know if our responses have adequately addressed your concerns. If you have any additional concerns or questions, we are more than happy to engage in further discussion. Thank you again for your time and effort in reviewing our manuscript. Your feedback has been instrumental in improving our research:)

Best,
Authors

Comment

Dear Reviewer DL85,

We would like to express our sincere gratitude for dedicating your time to reviewing our paper. Your insightful comments on the difference between natural PTMs and accidental PTMs are particularly valuable, inspiring us to apply AdaNovo in the identification of natural or biologically meaningful PTMs.

We have thoroughly considered your feedback and carefully responded to each of your questions with extensive experiments. We would greatly appreciate your feedback on whether our responses have addressed your concerns to your satisfaction.

Once again, we sincerely thank you for your invaluable contribution to our paper. As the deadline is approaching, we eagerly await your post-rebuttal feedback.

Best regards,
Authors.

Review
Rating: 6

The paper proposes a new method for protein sequencing from tandem mass spectra, especially for handling post-translational modifications.

Specifically, two decoders (one conditional and one unconditional) are designed for protein sequence generation, from which conditional mutual information is calculated and further used to weight the training samples and calculate losses.

Strengths

  1. The proposed CMI weighting framework is interesting and seemingly fits the scenario.

  2. Good empirical evidence for model effectiveness.

  3. The method may be generalized to other scenarios with imbalanced data distribution.

Weaknesses

  1. The studied problem is interesting but may not be of interest to the major audience at NeurIPS. It is also hard for readers who are not familiar with the domain to evaluate this work.

  2. Casanovo appears to be the most important baseline in this paper. The authors should better articulate the relation between AdaNovo and Casanovo.

Questions

  1. Is "AdaNovo w/o decoder #2 and any reweighting" actually Casanovo? If not, please include this as a part of the ablation in Table 2-5.

  2. Is 72% accuracy on AA level enough in real applications? Maybe some comparisons with database-searching or physics-based methods help answer this question.

  3. Section 3.3.1 and 3.3.2 have confusing names. Consider changing them into amino-acid-level / peptide-level weighting.

  4. Eq (5) and (7) seem somewhat arbitrary. Is there any evidence that the setup is optimal?

Limitations

Limitations are discussed in Section 5, which is somewhat formalistic.

Author Response

Q1: Is "AdaNovo w/o decoder #2 and any reweighting" actually Casanovo?

Thanks for your helpful reviews! Yes, "AdaNovo w/o decoder #2 and any reweighting" is indeed Casanovo. Therefore, we have already performed this ablation study in the original version (Tables 2-5).

Q2: Is 72% accuracy on AA level enough in real applications? Maybe some comparisons with database-searching methods help answer this question.

The accuracy of AdaNovo is sufficient for practical applications. In fact, the earliest deep learning model for de novo peptide sequencing, DeepNovo [1], has already been integrated into the commercial software PEAKS. Additionally, what we want to emphasize is that database search methods can only identify proteins already present in the database, whereas de novo peptide sequencing can identify new proteins outside the database. Therefore, database search methods are not suited to our scenarios, where the peptide sequences in the test set do not exist in the database.

Q3: Section 3.3.1 and 3.3.2 have confusing titles. Consider changing them into amino-acid-level / peptide-level weighting.

Thanks for your valuable advice! We will mitigate this issue in the revised version.

Q4: The normalizations in Eq (5) and (7) seem somewhat arbitrary. Is there any evidence that the setup is optimal?

Thanks for your insightful reviews! The normalization in Eq (5) and (7) is carefully designed. Specifically, we employ the widely used Z-score normalization to standardize variables by transforming them into a standard normal distribution with a mean of 0 and a standard deviation of 1. The coefficients $s_1$ and $s_2$ are designed to control the effects of amino acid-level and peptide-level adaptive training, respectively. Additionally, the inclusion of +1 in the formula ensures that the resulting weight is non-negative and that each PSM is fully utilized for training.
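The reweighting described above can be sketched as follows (an illustration of the verbal description only, assuming per-sample CMI estimates are available; the values and the scale `s` are toy numbers, not the paper's settings):

```python
import statistics

def adaptive_weights(cmi_values, s=0.3):
    """Z-score the CMI estimates, scale them by s, and add 1 so that
    every sample keeps a positive training weight (the mean weight is 1)."""
    mu = statistics.mean(cmi_values)
    sigma = statistics.stdev(cmi_values)
    return [1 + s * (c - mu) / sigma for c in cmi_values]

weights = adaptive_weights([0.2, 0.5, 0.8])
print(weights)  # low-CMI samples are down-weighted, high-CMI samples up-weighted
```

As long as |s · z| stays below 1 (as for these toy values), every weight remains positive, matching the stated purpose of the +1 shift.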

Q5: The authors would better articulate the relation between AdaNovo and Casanovo

Casanovo [2] first employed the Transformer encoder-decoder architecture to predict the peptide sequence for observed spectra. In comparison to Casanovo, AdaNovo's innovation lies in its training strategy, specifically tailored to spectrum-peptide matching data to mitigate data biases (various noise and missing peaks in mass spectra, and low-frequency PTMs), rather than in the Transformer encoder-decoder architecture. In fact, the de novo peptide sequencing task we explore, transitioning from mass spectra to amino acid sequences, shares similarities with fields like image captioning [5] (translating images into descriptive texts) and protein inverse folding [6] (deriving amino acid sequences from protein structures), where the encoder-decoder architecture is widely adopted. In these domains, innovation often stems from training strategies rather than from the model architectures themselves.

Q6: The studied problem is interesting, while may not be of interest for the major audience in NeurIPS.

Many works [2,3,4] in computational mass spectrometry, including Casanovo [2], have recently been published at top-tier machine learning conferences including NeurIPS and ICML, because the tasks in this field can be readily formulated as machine learning problems.

Moreover, the initial phase of drug discovery involves pinpointing disease biomarker proteins or drug target proteins. De novo sequencing serves as a pivotal step in identifying these proteins within the realm of proteomics. Despite remarkable progress on AI for drug design under known targets at top-tier AI conferences, the primary bottleneck in the drug discovery pipeline remains uncovering these crucial protein targets. Our objective is to encourage researchers in the AI community to focus more attention on this task, thereby advancing the development of AI-driven drug discovery and development (AIDD).

[1] De novo peptide sequencing by deep learning (Tran et al., PNAS 2017)
[2] De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model (Yilmaz et al., ICML 2022)
[3] Efficiently predicting high resolution mass spectra with graph neural networks (Murphy et al., ICML 2023)
[4] Prefix-tree decoding for predicting mass spectra from molecules (Goldman, Samuel, et al., NeurIPS 2023)
[5] From Show to Tell: A Survey on Deep Learning-Based Image Captioning (Stefanini et al., TPAMI 2023)
[6] ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics (Gao et al., NeurIPS 2023)


We greatly appreciate your insightful and helpful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified any ambiguities, we respectfully hope that you consider raising the score. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research.

Comment

Thank you for the rebuttal. As my original scoring is optimistic, I retain my scoring to this paper.

Comment

Dear Reviewer xGNa,

Thanks for your swift response! Your feedback has been invaluable in improving our research. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion to enhance your positive impression of and confidence in our manuscript.

Once again, we sincerely appreciate your time and effort in reviewing our manuscript!

Best,
Authors

Review
Rating: 7

In the field of proteomics, tandem mass spectrometry has been crucial for analyzing protein composition in biological tissues. However, existing methods struggle to identify amino acids with Post-Translational Modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, leading to suboptimal peptide sequencing performance. Additionally, noise and missing peaks in mass spectra reduce the reliability of the training data (Peptide-Spectrum Matches, PSMs). To address these challenges, the authors introduce AdaNovo, a novel framework that uses conditional mutual information to mitigate these biases. AdaNovo outperforms existing methods on a widely used benchmark, showing significant improvements in identifying PTMs.

Strengths

  1. The motivation behind this work is strong, addressing the unique challenges of high noise levels in mass spectrometry data and the difficulty in identifying PTMs. The method is novel and simultaneously alleviates two key issues.

  2. The experimental results are promising, particularly in accurately identifying PTMs. Moreover, the authors provide the code for reproducing these results.

  3. The article is well-presented and easy to follow. For instance, the authors first introduce the process of protein identification based on mass spectrometry, which aids understanding for researchers in the AI field.

Weaknesses

1. In the inference stage, the decoder predicts the highest-scoring amino acid for each peptide sequence position. However, beam search has been verified to be a more effective way to decode text sequences. Have the authors tried this decoding strategy in AdaNovo?

2. What is the meaning of FDR = 1% in line 176? How is the FDR calculated?

3. Minor issue: The spacing between citations and the main text is not consistent across the manuscript.

Questions

See Weaknesses.

Limitations

The authors have addressed the limitations and potential societal impact.

Author Response

Q1: Have the authors tried beam search in AdaNovo?

Thanks for your helpful reviews! Following your valuable advice, we apply the beam search (beam size = 5) in AdaNovo and observe consistent improvements over greedy search in Table Re3 (peptide-level) and Table Re4 (amino acid-level). Thanks again for such helpful advice!

Table Re3: Comparison between greedy search and beam search in terms of peptide-level precision.

| Models | Mouse | Human | Yeast |
| --- | --- | --- | --- |
| AdaNovo (greedy search) | 0.493 | 0.373 | 0.612 |
| AdaNovo (beam search) | 0.523 | 0.395 | 0.637 |

Table Re4: Comparison between greedy search and beam search in terms of amino acid-level precision.

| Models | Mouse | Human | Yeast |
| --- | --- | --- | --- |
| AdaNovo (greedy search) | 0.667 | 0.618 | 0.825 |
| AdaNovo (beam search) | 0.690 | 0.643 | 0.836 |
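For readers unfamiliar with the two decoding strategies compared above, generic beam search over a next-token scoring function can be sketched as follows (a minimal illustration; `toy_log_probs` is a stand-in for a trained decoder, not AdaNovo's actual API):

```python
def beam_search(log_probs, vocab, max_len, beam_size=5, eos="$"):
    """Keep the beam_size highest-scoring prefixes at each decoding step."""
    beams = [("", 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq.endswith(eos):          # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for token, lp in log_probs(seq, vocab).items():
                candidates.append((seq + token, score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy "decoder": always prefers earlier vocabulary entries, stops after 3 tokens.
def toy_log_probs(prefix, vocab):
    if len(prefix) >= 3:
        return {"$": 0.0}
    return {tok: -float(i + 1) for i, tok in enumerate(vocab)}

print(beam_search(toy_log_probs, ["A", "G", "S"], max_len=4))  # → "AAA$"
```

Greedy search is the beam_size=1 special case; a wider beam can recover sequences whose first tokens are individually less likely.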

Q2: What is the meaning of FDR = 1% in line 176? How to calculate the FDR?

FDR (False Discovery Rate) is a measure used to assess the reliability of peptide or protein identifications in database search. The typical method to calculate FDR involves using a decoy database. A simplified overview of the process:

1. Database search: peptide sequences are identified by matching observed mass spectra to theoretical mass spectra derived from a protein sequence database.
2. Decoy database: a decoy database is created by reversing or shuffling the protein sequences in the original database. This decoy database contains sequences that mimic the characteristics of the real database but do not correspond to actual proteins.
3. Scoring: peptide identifications from the real and decoy databases are scored based on parameters such as peptide mass accuracy, retention time, and fragmentation pattern.
4. FDR calculation: the FDR is calculated based on the number of accepted identifications from the decoy database compared to the accepted identifications from the real database. This ratio provides an estimate of the proportion of false identifications among all accepted identifications. The FDR calculation can be expressed as:

$$\text{FDR} = \frac{\text{Number of decoy identifications}}{\text{Number of target identifications}}$$

A common approach is to set a threshold for the accepted FDR, such as 1% or 5%, to control the rate of false identifications. This method allows researchers to estimate the proportion of false identifications among their identifications, providing a measure of the reliability of their results.
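As a concrete sketch of the thresholding step, the cutoff selection can be implemented in a few lines (a toy illustration of the target-decoy idea; `fdr_threshold` is a hypothetical helper, not part of any particular proteomics toolkit):

```python
def fdr_threshold(target_scores, decoy_scores, fdr_max=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= fdr_max.

    At a cutoff s, FDR is estimated as (#decoys >= s) / (#targets >= s),
    following the target-decoy strategy described above.
    """
    # Scan candidate cutoffs from high (strict) to low (permissive)
    cutoffs = sorted(set(target_scores) | set(decoy_scores), reverse=True)
    best = None
    for s in cutoffs:
        n_target = sum(1 for t in target_scores if t >= s)
        n_decoy = sum(1 for d in decoy_scores if d >= s)
        if n_target == 0:
            continue
        if n_decoy / n_target <= fdr_max:
            best = s  # keep lowering the cutoff while FDR stays acceptable
        else:
            break  # simplification: stop at the first violation
    return best
```

Lowering the score cutoff admits more identifications until the decoy-based FDR estimate first exceeds the chosen limit (e.g., 1%).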

Q3: Minor issue: The space between the citation and main text is not consistent across the manuscript.

Thanks for your careful reviews! We will mitigate this issue in the revised version.


We greatly appreciate your insightful and helpful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified any ambiguities, we respectfully hope that you consider raising the score. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research.

Comment

Thanks. My issues have been addressed, so I raise the score to 7.

Comment

Dear Reviewer #sJoV,

Thank you for your insightful and constructive review! Your feedback has been instrumental in enhancing our research, particularly the beam search method you suggested, which promises to enhance the overall performance of the AdaNovo model.

Best regards,
Authors

Review
4

This work introduces AdaNovo, a framework for improving peptide sequencing by addressing biases in training data. It calculates conditional mutual information (CMI) between mass spectra and amino acids/peptides, enhancing robustness against noise and improving PTM identification. Besides, the model consists of a Transformer-based architecture and employs adaptive training strategies. Extensive experiments show AdaNovo outperforms existing methods on a 9-species benchmark, with significant gains in PTM identification.

Strengths

  1. The introduction of AdaNovo, which leverages conditional mutual information (CMI) to tackle biases in training data, is an effective approach demonstrated by the superior performance in experiments.

  2. The paper is well-written and clearly structured.

Weaknesses

  1. The extra cost of memory compared to Casanovo is significant (~40%) as also indicated by the authors, which could limit the max length of predicted peptide sequences.
  2. AdaNovo is built on the Transformer encoder-decoder architecture, which was initially employed by Casanovo to predict the peptide sequence for the observed spectra. Thus I feel the novelty of the proposed model is limited.

Questions

  1. As shown in Table 1, PointNovo performs on par with Casanovo, what are the costs of computing and storage of PointNovo? The authors only compare the costs of computing and storage of Casanovo and their AdaNovo in Table 5.

  2. For the results in Table 4, why using focal loss would give worse results compared to cross-entropy loss? Besides, what is the detailed number of amino acids in each category in the dataset? It would be better to provide more statistical details of the dataset in the Appendix.

Limitations

The authors have addressed the limitations of the work.

Author Response

Q1: The extra cost of memory compared to Casanovo is significant, which could limit the max length of predicted peptide sequences.

Thanks for your insightful reviews! In mass spectrometry, proteins are enzymatically broken down into peptides for analysis, with peptide lengths typically ranging from 2 to 50 amino acids. This is considerably shorter than full protein sequences, which can contain hundreds or thousands of amino acid residues. Therefore, the memory constraints mentioned might not be a limiting factor for AdaNovo in predicting peptide sequences.

Moreover, we can reduce memory consumption by scaling down Decoder #2. During the rebuttal process, we scaled Decoder #2 down to Transformers with 3 layers and 6 layers. The experiments shown in Table Re1 indicate a marginal performance drop while significantly reducing memory consumption. Thank you very much for your comments! They have helped us identify issues we had not considered before, significantly improving the efficiency of AdaNovo. We will update the experimental results in the revised version.

Table Re1: The performance of scaling down Decoder #2 from 9 layers to 6 and 3 layers on the Clam bacteria dataset.

| Models | #Params (M) | Peptide-level precision | Amino acid-level precision |
|---|---|---|---|
| Casanovo | 47.35 | 0.347 | 0.617 |
| AdaNovo with 3-layer Decoder #2 | 53.67 | 0.389 | 0.642 |
| AdaNovo with 6-layer Decoder #2 | 59.99 | 0.392 | 0.648 |
| AdaNovo with 9-layer Decoder #2 | 66.31 | 0.397 | 0.656 |

Q2: What are the costs of computing and storage of PointNovo?

Thanks for your valuable advice! We compare the costs of computing and storage for PointNovo, Casanovo, and AdaNovo in Table Re2. As can be observed, PointNovo contains fewer parameters than Casanovo and AdaNovo, with similar training and inference speed. However, as shown in Table 1 of the paper, its performance is significantly inferior to Casanovo and AdaNovo.

Table Re2: Costs of Computing and Storage on the same device (Nvidia A100 GPU).

| Models | #Params (M) | Training time (h) | Inference time (h) |
|---|---|---|---|
| PointNovo | 4.78 | 57.49 | 7.56 |
| Casanovo | 47.35 | 56.52 | 7.14 |
| AdaNovo | 66.31 | 60.17 | 7.09 |

Q3: What is the detailed number of amino acids in each category in the dataset? In Table 4, why does using focal loss give worse results than cross-entropy loss?

Thanks for your insightful reviews! We provide the detailed number of amino acids in each category in Table Re3 (due to the limited space, we only show the numbers for 6 canonical amino acids + 3 PTMs; we will show all the numbers in the revised version). Additionally, focal loss is designed to address moderate class imbalance. However, in our case, canonical amino acids vastly outnumber PTMs, so focal loss may not be able to effectively handle the imbalance, leading to suboptimal performance. Also, a previous work [3] reports that focal loss is inferior to cross-entropy loss in de novo sequencing. Thanks once again for your valuable advice!

Table Re3: The detailed number of amino acids in each category (counts per dataset, in the order G, A, S, P, V, T, M(+15.99), N(+.98), Q(+.98)).

Rice bean: 45578414532155034618345862684436041572265
Honey bee: 3181843303292851952691623300572522583567100965460
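To make the comparison concrete, the two losses differ only by a modulating factor (these are the standard textbook definitions, shown as a sketch; `gamma` is the usual focusing parameter):

```python
import math

def cross_entropy(p_true):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross-entropy down-weighted by (1 - p)^gamma, so
    well-classified (high-probability) tokens contribute little."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

For an easily predicted canonical amino acid (p = 0.9), the modulating term shrinks the loss by a factor of 100, while for a rare PTM token predicted with p = 0.1 the factor is only about 1.23; under the extreme imbalance shown above this rescaling may still be too weak, which is consistent with the Table 4 results.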

Q4: AdaNovo is built on the Transformer encoder-decoder architecture. Thus I feel the novelty of the proposed model is limited.

AdaNovo's innovation lies in its training strategy (conditional mutual information-based adaptive training) specifically tailored for spectrum-peptide matching data to mitigate the data biases (various noise and missing peaks in mass spectra and low-frequency PTMs), rather than the Transformer encoder-decoder model architecture. The de novo peptide sequencing task we have explored, transitioning from mass spectra to amino acid sequences, shares similarities with fields like image captioning [1] (translating images into descriptive texts) and protein inverse folding [2] (deriving amino acid sequences from protein structures), where the encoder-decoder architecture is widely adopted. In these domains, innovation often stems from training strategies rather than the model architectures themselves [1,2].

[1] From Show to Tell: A Survey on Deep Learning-Based Image Captioning (Stefanini et al., TPAMI 2023)
[2] ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics (Gao et al., NeurIPS 2023)
[3] De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model (Yilmaz et al., ICML 2022)


We greatly appreciate your insightful and helpful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified any ambiguities, we respectfully hope that you consider raising the score. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research.

Comment

Dear Reviewer XCqV,

We appreciate your insightful and helpful reviews! Your points on memory consumption are particularly valuable, inspiring us to reduce the size of Peptide Decoder #2 to further enhance AdaNovo's efficiency. If our response has resolved your concerns and clarified any ambiguities, we respectfully hope you might consider raising the score. Should you have further questions or need additional clarification, we would be happy to discuss them. Thank you again for your time and effort in reviewing our manuscript. Your feedback has been instrumental in improving our research.

Best,
Authors

Comment

Dear Reviewer XCqV,

We appreciate your insightful and helpful reviews! Your points on memory consumption are particularly valuable, inspiring us to reduce the size of Peptide Decoder #2 to further enhance AdaNovo's efficiency. Considering that the deadline for the discussion phase is approaching, we would like to know if our responses have adequately addressed your concerns. If our response has resolved your concerns and clarified any ambiguities, we respectfully hope you might consider raising the score. Should you have further questions or need additional clarification, we would be happy to discuss them. We truly appreciate your time and effort, which have been crucial in refining our research.

Best regards,
Authors

Review
7

The paper presents AdaNovo, a novel framework designed to address the biases in training data used for de novo peptide sequencing in proteomics. The main contribution is the calculation of Conditional Mutual Information (CMI) between mass spectra and amino acids, enabling robust training that mitigates the negative impacts of Post-Translational Modifications (PTMs) frequency bias and spectral noise. Extensive experiments demonstrate that AdaNovo outperforms existing methods, especially in identifying amino acids with PTMs, thus enhancing peptide sequencing performance.

The math seems clearly described to me, and well motivated. The results in Table 1 highlight the improved performance.

For this review, please note that I'm not familiar with bioinformatics, and in particular the field of predicting protein sequences from mass spec.

Strengths

The use of Conditional Mutual Information (CMI) for addressing training data biases in de novo peptide sequencing is novel. The experimental design is thorough, with extensive benchmarks across multiple species datasets, demonstrating significant improvements over state-of-the-art methods. The paper is well-organized, with clear sections on background, methodology, experiments, and conclusions. The results show substantial improvements in PTM identification, which is critical for proteomics research and applications in drug discovery and precision medicine.

Weaknesses

This is a narrow field of proteomics, there are other problems that could be considered to truly showcase the power of the proposed methodology.

Questions

For Figure 4, can you supply the dataset size and the performance? Or perhaps include performance in Table 1. How does the amount of PTMs impact the performance vs. other classifiers? Can you train models where you gradually remove the amount of PTMs and showcase the performance dropoff of your model vs. Casanovo?

Limitations

I'm not familiar enough with the topic to understand if there are ethical concerns.

Author Response

Q1: For Figure 4, can you supply the dataset size and the performance? How does the amount of PTMs impact the performance of Casanovo and AdaNovo?

Thanks for your helpful comments! We provide the dataset size and performance in Table Re1. We will add these results to Figure 4 in the revised version.

Additionally, following your valuable advice, we remove 50% and 100% peptides with PTMs and report the results in Table Re2, from which we can observe that the performance (peptide-level precision) of Casanovo and AdaNovo significantly improves after removing the peptides with PTMs in the test set. These results indicate that the amounts of PTMs have a significant impact on the performance of de novo sequencing models and AdaNovo is less sensitive to the amounts compared with Casanovo.

Table Re1: The absolute numbers for peptides with/without PTMs in 9 species datasets.

| Species | Rice bean | Honeybee | Bacillus | Clam bacteria | Human | Mouse | M. mazei | Yeast | Tomato |
|---|---|---|---|---|---|---|---|---|---|
| Number of PSMs | 377753 | 145712 | 91783 | 150611 | 130583 | 370211 | 64421 | 111312 | 290050 |
| Ratio of peptides with PTMs | 27.8% | 20.5% | 11.8% | 31.1% | 27.6% | 29.0% | 22.2% | 22.3% | 26.4% |
| PTMs precision of AdaNovo | 69.2% | 52.4% | 56.4% | 57.8% | 48.1% | 54.8% | 54.6% | 66.6% | 59.7% |

Table Re2: The peptide-level precision of Casanovo and AdaNovo before and after removing peptides with PTMs in test set.

| Species | Bacillus | M. mazei | Clam bacteria |
|---|---|---|---|
| Casanovo (before removing peptides with PTMs) | 0.513 | 0.474 | 0.347 |
| Casanovo (after removing 50% of peptides with PTMs) | 0.536 | 0.491 | 0.356 |
| Casanovo (after removing 100% of peptides with PTMs) | 0.572 | 0.520 | 0.385 |
| AdaNovo (before removing peptides with PTMs) | 0.561 | 0.523 | 0.397 |
| AdaNovo (after removing 50% of peptides with PTMs) | 0.573 | 0.529 | 0.415 |
| AdaNovo (after removing 100% of peptides with PTMs) | 0.598 | 0.545 | 0.423 |

Q2: This is a narrow field of proteomics, there are other problems that could be considered to truly showcase the power of the proposed methodology.

The initial phase of drug discovery involves pinpointing disease biomarker proteins or drug target proteins. De novo sequencing serves as a pivotal step in identifying these proteins within the realm of proteomics. Despite remarkable progress in AI for drug design under known targets, the primary bottleneck in the drug discovery pipeline remains the discovery of these crucial protein targets. Our objective is to encourage researchers in the AI community to focus more attention on this task, thereby advancing the development of AI-driven drug discovery and development (AIDD).

Moreover, many works [1,2,3] in computational mass spectrometry, including Casanovo [2], have been published in top-tier machine learning conferences including NeurIPS and ICML, because the tasks in this field can be readily formulated as machine learning problems.

[1] Prefix-tree decoding for predicting mass spectra from molecules (Goldman, Samuel, et al., NeurIPS 2023)
[2] De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model (Yilmaz et al., ICML 2022)
[3] Efficiently predicting high resolution mass spectra with graph neural networks (Murphy et al., ICML 2023)


We greatly appreciate your insightful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified any ambiguities, we respectfully hope that you consider raising the score/confidence. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research.

Comment

Dear Reviewer xCSd,

We appreciate your insightful and helpful reviews! Your points on the impact of the number of PTMs on the performance of Casanovo and AdaNovo were particularly enlightening, and it’s an aspect we had not previously considered. We have conducted a detailed experimental analysis and discussion on this issue in our response above.

If our response has resolved your concerns and clarified any ambiguities, we respectfully hope you might consider raising the (confidence) score. Should you have further questions or need additional clarification, we would be happy to discuss them. Thank you again for your time and effort in reviewing our manuscript. Your feedback has been instrumental in improving our research!

Best regards,
Authors

Comment

Dear Reviewer xCSd,

We have provided detailed responses to your reviews. Considering that the deadline for the discussion phase is approaching, we would like to know if our responses have adequately addressed your concerns. If you have any additional concerns or questions, we are more than happy to engage in further discussion. Thank you again for your time and effort in reviewing our manuscript. Your feedback has been instrumental in improving our research!

Best,
Authors

Comment

I would like to see a plot where you progressively remove 5-10%.

Comment

Dear Reviewer xCSd,

Thank you for your concise and insightful feedback! Based on your valuable advice, we have progressively removed peptides with PTMs in the test dataset, ranging from 0% to 100% (kindly note that the earlier range of 5% - 10% might not accurately reflect the trend). The updated results, available at this anonymous link (https://anonymous.4open.science/r/Remove_PTM/Figure_Re1.png), show that the peptide-level precision of both Casanovo and AdaNovo improves significantly as the proportion of removed peptides with PTMs is increased. This demonstrates that the presence of PTMs notably affects the performance of de novo sequencing models, with AdaNovo showing less sensitivity compared to Casanovo.

If our response addresses your concerns and clarifies any uncertainties, we would greatly appreciate your consideration in adjusting the score or confidence. Should you have any further questions or require additional information, please do not hesitate to reach out. Thank you once again for your time and constructive review, which has been invaluable in enhancing our research!

Best regards,
The Authors

Review
3

The paper introduces AdaNovo, a novel framework for de novo peptide sequencing that significantly improves the identification of post-translational modifications (PTMs) and enhances robustness against data biases in proteomics. AdaNovo utilizes Conditional Mutual Information (CMI) to reweight training losses based on the dependence of target amino acids on mass spectrum data, addressing common issues like noise and missing peaks in mass spectra. The framework shows superior performance on a 9-species benchmark, especially in identifying amino acids with PTMs, compared to previous methods. The paper also discusses the computational efficiency and scalability of AdaNovo, suggesting it as a potent tool for advancing proteomic research.

Strengths

  1. AdaNovo significantly improves the identification of Post-Translational Modifications (PTMs), crucial for detailed proteomic analysis.
  2. It effectively handles noise and biases in mass spectra data, ensuring reliable peptide sequencing results.

Weaknesses

  1. Eq. 3 uses MI(X, Z; Yj | Y<j) but the model uses CMI(X, Z; Yj | Y<j); also, italic and bold random variables should not be distinguished.
  2. CMI(X, Z; Yj) = MI(X, Z; Yj | Y<j)? Please provide a more detailed derivation.
  3. There is a lack of verification that the Transformer's predictions align with the model's hypothesized outputs. Additionally, although CMI can technically be calculated, the inputs used in the model do not conform well to standard data formats.
  4. The paper does not provide comparisons with other methods that utilize strategies for handling repetitive long-tail data.

Questions

see weakness

Limitations

see weakness

Author Response

Q1: In eq.3 use MI (X , Z; Yj | Y<j ) but in model it is CMI(X , Z; Yj | Y<j ), and should not distinguish between italic rv and bold rv.

Thanks for your careful review! Kindly note that the conditional mutual information in the model is formulated as CMI(X, Z; Yj) or MI(X, Z; Yj | Y<j), rather than CMI(X, Z; Yj | Y<j). In Eq. 3, we can rewrite CMI(X, Z; Yj) as MI(X, Z; Yj | Y<j) because conditional mutual information is defined as the mutual information between two random variables (X, Z) and Yj conditioned on a third, Y<j. In other words, CMI is a kind of mutual information. Here, we use the well-known concept of mutual information to help readers understand the calculation process of conditional mutual information.

Additionally, the bold rv denotes specific values while the italic rv denotes the variables. It is necessary to distinguish between values (bold rv) and variables (italic rv) because the conditional mutual information is defined as a measure of the mutual dependence between two variables instead of specific values.

Q2: CMI(X , Z; Yj ) = MI (X , Z; Yj | Y<j )? Please provide a more detailed process.

Yes, CMI(X , Z; Yj ) = MI (X , Z; Yj | Y<j ). Here, CMI(X , Z; Yj) denotes the conditional mutual information between (X, Z) and Yj (conditioned on Y<j). This step can be derived using the definition of Conditional Mutual Information, which is the mutual information of two random variables (X , Z) and Yj given the value of a third Y<j. In other words, CMI is a kind of mutual information. Here, we use the well-known concept of mutual information to help readers understand the calculation process of conditional mutual information.
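To spell the step out, the identity can be written using standard information-theoretic definitions (a standard derivation, not quoted verbatim from the paper):

```latex
I(X, Z;\, Y_j \mid Y_{<j})
  = H(Y_j \mid Y_{<j}) - H(Y_j \mid X, Z, Y_{<j})
  = \mathbb{E}\!\left[\log \frac{p(y_j \mid \mathbf{x}, \mathbf{z}, \mathbf{y}_{<j})}{p(y_j \mid \mathbf{y}_{<j})}\right]
```

That is, the conditional mutual information is an ordinary mutual information in which every term is additionally conditioned on $Y_{<j}$.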

Q3: The paper does not provide comparisons with other methods that utilize strategies for handling repetitive long-tail data.

We have compared against such methods in Section 4.6 of the original paper.

Q4: There is a lack of verification whether the Transformer's predictions align with the model's hypothesized outputs. Additionally, although CMI can technically be calculated, the inputs used in your model do not conform well to standard data formats.

We have elaborated on how we feed the mass spectrum to the Transformer in Appendix A. More specifically, we regard each mass spectrum peak (m_i, I_i) as a word/token in natural language processing and obtain its embedding by individually encoding its m/z value m_i and its intensity value I_i before combining them through summation. The entire mass spectrum can be regarded as a sentence. The precursor can be encoded in a similar way. More details can be found in Appendix A.
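A rough, untrained sketch of that peak encoding (the sinusoidal m/z scheme and the additive intensity term here are our assumptions based on the description above, not the exact Appendix A implementation):

```python
import math

def peak_embedding(mz, intensity, d_model=8, max_wavelength=10000.0):
    """Embed one spectrum peak (m/z, intensity) as a d_model-dim vector.

    The m/z value gets a sinusoidal encoding (as in Transformer positional
    encodings) and the intensity contributes a simple additive term standing
    in for a learned projection; the two are summed, mirroring the
    summation-based combination described above.
    """
    emb = []
    for k in range(d_model // 2):
        freq = 1.0 / (max_wavelength ** (2 * k / d_model))
        emb.append(math.sin(mz * freq) + intensity)  # intensity acts as the
        emb.append(math.cos(mz * freq) + intensity)  # summed second embedding
    return emb
```

The whole spectrum then becomes a "sentence" of such peak vectors that the MS Encoder attends over.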

Additionally, we feed the mass spectrum $\mathbf{x}$, the precursor $\mathbf{z}$, and the previous sequence $\mathbf{y}_{<j}$ to the MS Encoder and Peptide Decoder #1 to predict the next amino acid $y_j$. Therefore, the output of Peptide Decoder #1 is $p(y_j|\mathbf{x}, \mathbf{z}, \mathbf{y}_{<j})$. Similarly, Peptide Decoder #2 takes the previous sequence $\mathbf{y}_{<j}$ as input and outputs $p(y_j|\mathbf{y}_{<j})$. We can then use $p(y_j|\mathbf{x}, \mathbf{z}, \mathbf{y}_{<j})$ and $p(y_j|\mathbf{y}_{<j})$ to calculate the conditional mutual information using Eq. (4).
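Assuming Eq. (4) takes the log-ratio form implied by these two decoder outputs, the per-step estimate can be sketched in a few lines (`pointwise_cmi` is a hypothetical helper name, not the authors' code):

```python
import math

def pointwise_cmi(p_cond_spectrum, p_lm):
    """Pointwise estimate of I(X, Z; y_j | y_<j) for one decoding step.

    p_cond_spectrum: p(y_j | x, z, y_<j) from Peptide Decoder #1
    p_lm:            p(y_j | y_<j)       from Peptide Decoder #2
    A large value means the amino acid y_j depends strongly on the mass
    spectrum rather than on sequence statistics alone.
    """
    return math.log(p_cond_spectrum) - math.log(p_lm)
```

Per-token training losses can then be reweighted by a function of this quantity, up-weighting amino acids (e.g., rare PTMs) whose prediction genuinely relies on the spectrum.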


We greatly appreciate your helpful comments, as they will undoubtedly help us improve the quality of our article. If our response has successfully addressed your concerns and clarified the ambiguities, we respectfully hope that you consider raising the score. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research.

Comment

Dear Reviewer XeYm,

Thanks for your prompt and insightful response! We agree that the misunderstanding between us lies in the notation of the conditional mutual information. Following your valuable advice, we will update $CMI(X, Z; Y_j)$ to $I(X, Z; Y_j|Y_{<j})$ in Eq. (3) and Eq. (4) to avoid misunderstandings.

Furthermore, we would like to inquire if our response adequately addresses your concerns. Should you have any further questions or require additional clarification, we would be delighted to engage in further discussion. Once again, we sincerely appreciate your time and effort in reviewing our manuscript. Your feedback has been invaluable in improving our research!

Best,
Authors

Comment

It is recommended that you write your formula according to the format provided in https://www.jmlr.org/papers/volume24/21-0482/21-0482.pdf.

Regarding the misunderstanding between us, you directly denote $CMI(X, Z; Y_j)$ as $I(X, Z, Y_j | Y_{<j})$. However, CMI involves the mutual information between two variables given a third variable or set of variables. This could potentially lead to a misinterpretation as $I(X, Z | Y_j)$, particularly given that $X, Y_j, Z$ are the only variables under consideration.

In terms of formula notation, there remains significant room for enhancement. Therefore, the original score will be retained, with the expectation that the author will make further refinements.

Comment

Dear Reviewer XeYm,

Thanks very much for your feedback on the formula notation! Your suggestions help standardize the mathematical notations in our manuscript and avoid potential misunderstandings. Following your valuable advice, we have updated the notations and uploaded the revised manuscript to the anonymous link: https://anonymous.4open.science/r/NeurIPS24_AdaNovo_Revision-7C65/AdaNovo_NeurIPS2024_Revision.pdf.

Specifically, we replace $CMI(X, Z; Y)$ and $MI(X, Z; Y)$ with $I(X, Z; Y|Y_{<j})$ and $I(X, Z; Y)$, respectively. Also, $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, $\mathcal{Y}_{<j}$ are replaced with $X, Y, Z, Y_{<j}$.

These changes do not require substantial modifications to the original manuscript. Sincerely hope that they would resolve your concerns regarding the formula notations. Please let us know if our response has adequately addressed your concerns or if further clarification is needed. We appreciate your time and invaluable feedback in improving our research!

Best,
Authors

Comment

Dear Reviewer XeYm,

We have provided detailed responses to your reviews. Considering that the deadline for the discussion phase is approaching, we would like to know if our responses have adequately addressed your concerns. If you have any additional concerns or questions, we are more than happy to engage in further discussion. Thank you again for your time and effort in reviewing our manuscript. Your feedback has been instrumental in improving our research!

Best,
Authors

Comment

Dear Reviewer XeYm,

We would like to express our sincere gratitude for dedicating your time to reviewing our paper. Your suggestions help standardize the mathematical notations in our manuscript and avoid potential misunderstandings.

We have thoroughly considered your feedback and carefully responded to each of your questions. We would greatly appreciate your feedback on whether our responses have addressed your concerns to your satisfaction.

Once again, we sincerely thank you for your invaluable contribution to our paper. As the deadline is approaching, we eagerly await your post-rebuttal feedback.

Best regards,
Authors.

Comment

We thank all the reviewers for their insightful and constructive reviews of our manuscript. We are encouraged to hear that the reviewers find that our idea is novel or interesting (5/6: Reviewers XeYm, xCSd, sJoV, xGNa, DL85); our motivation/research problem is good/interesting (5/6: Reviewers xCSd, XeYm, sJoV, xGNa, DL85). Also, they think that our experimental results are comprehensive or promising (6/6: Reviewers XeYm, xCSd, XCqV, sJoV, xGNa, DL85); our work is well-presented (5/6: Reviewers xCSd, XCqV, sJoV, xGNa, DL85).

We have carefully reviewed all the suggestions and provided detailed responses to each of the points. If there are any further questions or concerns, please don’t hesitate to let us know. We are actively engaged in the discussion and are committed to improving the quality of this work. Thanks once again for the valuable input!

Final Decision

There were highly divergent scores and opinions about this paper. On the one hand, there were negative comments regarding the unclear notation of conditional mutual information, the novelty compared to existing approaches like Casanovo, and the high memory complexity. On the other hand, there were several positive comments about the convincing motivation, promising experimental results, and good presentation. In the end, to me as an AC, this is a typical "borderline" case, with good arguments both for and against this paper. However, in this case, I think that some points of criticism (particularly those regarding the problematic notation of CMI) could be addressed in a convincing way in the rebuttal, and that in the end, the positive aspects outweigh the negative ones. Therefore, I finally recommend acceptance of this paper.