Retrieval Augmented Zero-Shot Enzyme Generation for Specified Substrate
Abstract
Reviews and Discussion
This paper combines a retrieval method with a discrete diffusion model to achieve zero-shot enzyme design. Specifically, the authors first retrieve possible functional enzyme sequences according to substrate similarity, measured by Tanimoto similarity. Then the authors apply a discrete diffusion model built on the MSA Transformer, with the retrieved enzyme sequences serving as the MSA. The authors also curated an enzyme-substrate pair dataset to validate their model.
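For readers unfamiliar with the retrieval step, a minimal sketch of substrate retrieval by Tanimoto similarity is shown below; the RDKit-based fingerprinting and the toy substrate list are illustrative assumptions, not the authors' code.

```python
# Sketch: rank known substrates by Tanimoto similarity to the target substrate,
# then retrieve the enzymes of the top-scoring substrates as MSA context.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

target_smiles = "CC(=O)Oc1ccccc1C(=O)O"                       # hypothetical target substrate
database = {"substrate_a": "CC(=O)O", "substrate_b": "c1ccccc1O"}  # hypothetical known substrates

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

target_fp = fingerprint(target_smiles)
scores = {name: DataStructs.TanimotoSimilarity(target_fp, fingerprint(smi))
          for name, smi in database.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```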
Strengths
The contributions are listed as follows:
- The paper is well-written and easy to follow!
- The idea of applying a diffusion MSA Transformer to design functional enzymes is interesting and doable! Modeling MSAs in diffusion models will definitely improve design accuracy while keeping design diversity and novelty at the same time.
Weaknesses
The weaknesses are listed as follows:
- The application scenario of the proposed method is a little unclear to me! If the target enzyme is a natural enzyme, we can always find the EC class in Brenda and the corresponding records in PDB and Uniprot. Then using a supervised method will definitely be better than zero-shot inference. If the target enzyme is a non-natural enzyme, I don't think the proposed retrieval method could find related enzymes in the database just according to substrate similarity, because non-natural enzymes usually have their special properties.
- Some experimental settings and results are not reasonable or are unclear to me:
(1) For the baseline models ProGen2 and ProtGPT2, which are unconditional models, how did they realize controllable enzyme design, since they just follow autoregressive generation? If the authors just used random generation, then I don't think the comparison is fair. To achieve controllable design, why not use ProGen? There are several available versions of ProGen2; which size is utilized? The pLDDT of ProGen2 is too low; why is that?
(2) The pLDDT of all models is pretty low. For AF2, a pLDDT of 80 indicates a stable design; I know ESMFold usually achieves a slightly lower pLDDT. But in Figure 2(b), none of the cases achieves a pLDDT higher than 80, indicating the proposed model cannot design foldable enzymes.
(3) In Table 3, the novelty should be evaluated through BLASTp in Uniprot instead of just Swiss-Prot.
- For the guidance training method, how can one guarantee that a scoring function can score enzyme-substrate pairs it never saw during training? It's hard for me to trust it.
Questions
- In data construction, the authors said "we select the least common reactant among all reactants in the database". What is the definition of "the least common reactant"?
- What is the accuracy of the scoring function in guided training?
- Can the authors clearly state the inference process of all baseline models?
- For the success rate, what is the definition of "successfully generated sequences"? To me, it should satisfy both the functional constraint and structural stability. It seems none of the designed enzymes passed structural selection.
Dear Area Chair and Reviewers,
We sincerely thank you for your thoughtful and thorough evaluation of our paper. Below are our detailed responses to your comments:
W1 - The application scenario is unclear. I don't think the proposed retrieval method could find related enzymes in the database just according to substrate similarity. Our model aims to generate non-natural enzymes, and the retrieval works because the substrate reflects enzyme properties.
Thank you for the comment. Our model aims to generate non-natural enzymes, and the retrieval works because the substrate reflects enzyme properties.
- Generate non-natural enzymes: The enzymes we aim to generate are indeed non-natural, which is why we designed the zero-shot inference approach.
- Substrates reflect enzymes' properties: Substrates reflect the desired properties of the retrieved enzymes. This is based on the observation that enzymes catalyzing highly similar substrates may also share some similarities [1].
- Same physics and chemistry rules for natural or non-natural enzymes: While non-natural enzymes may have unique properties, they still adhere to the same physical and chemical principles as natural enzymes. Therefore, their folding and interaction with substrates are governed by the same fundamental rules.
- Substrate-based retrieval keeps proteins' properties: The retrieval method searches for related enzymes based on the substrates’ similarity to the target substrate rather than similarity to the expected generated enzyme. This approach ensures that the differences between natural and non-natural enzymes do not adversely affect the retrieval results, while the generated enzyme will incorporate beneficial features from the retrieved ones.
Consequently, relevant enzymes can still be identified to assist in generating non-natural enzymes.
[1] Samuel Goldman, Ria Das, Kevin K. Yang, and Connor W. Coley. Machine learning modeling of family wide enzyme-substrate specificity screens. PLOS Computational Biology, 18(2):1–20, 02 2022.
W2(1) - Unconditional generation is not fair. Why not use ProGen? Which size of ProGen2 is utilized? Why is the pLDDT of ProGen2 too low? Unconditional generation baselines are for demonstration; ProGen cannot take the substrate condition; ProGen2 is 6.4B; the low pLDDT is due to a lack of fine-tuning.
Thank you for the comment. We will discuss them here.
- Unconditional generation baselines are for performance demonstration only: Both ProGen2 and ProtGPT2 are indeed unconditional models, and as such, they do not have explicit modules for controllable enzyme design. We include these models as baselines to demonstrate how unconditional generation performs in the context of enzyme design.
- ProGen cannot take substrate condition: Regarding ProGen, it does allow for controllable generation, but this is achieved by inputting control tags (e.g., Immunoglobulin, Chorismate mutase, Glucosaminidase, Phage lysozyme), which represent prior knowledge about the protein being designed. However, these tags are specific to predefined proteins, meaning ProGen cannot generate enzymes for new substrates without such tags. Consequently, ProGen does not support zero-shot enzyme generation for novel substrates, which is a key goal of our approach.
- ProGen2 in baseline is the 6.4B parameters version: The version of ProGen2 we used is the ProGen2-xlarge with 6.4B parameters.
- ProGen2's low pLDDT is because of the lack of fine-tuning: The low pLDDT observed in ProGen2 is likely because it was not fine-tuned on any specific protein families, making its output more random compared to fine-tuned models that specialize in particular types of proteins.
W2(2) - In Figure 2(b), none of the cases achieve a pLDDT higher than 80, indicating the proposed model cannot design foldable enzymes. Enzymes with low pLDDT can be foldable and functional.
Thank you for the comment. We will introduce why enzymes with low pLDDT can be foldable and functional.
- Even natural proteins do not always have pLDDT over 80%: The pLDDT values in our study should be interpreted in the context of the ESMFold paper as well. The ESMFold paper [1] provides a distribution of pLDDT for natural protein sequences (Fig. 3B), showing that many natural proteins have pLDDT scores lower than 0.8. In that study, the authors define a "good confidence" pLDDT as greater than 0.5 and "high confidence" as greater than 0.7, suggesting that lower pLDDT proteins can still be functional. Therefore, a high pLDDT indicates that a protein is more likely to fold into a stable structure, while a lower pLDDT does not automatically imply that the protein is unfit to fold or perform its function.
- Our model generates some enzymes with pLDDT over 80%: In Figure 2(b) of our paper, while the average pLDDT is below 0.8, we observe regions with pLDDT above 0.8, indicating that several of our designed cases achieve a higher pLDDT, reflecting a stable and foldable structure. The table below shows the distribution of pLDDT of proteins generated by the basic version of our model (the worst version in terms of pLDDT).
| pLDDT | 0.8 | 0.7 | 0.6 | 0.5 |
|---|---|---|---|---|
| The portion of proteins over the pLDDT | 0.23 | 0.42 | 0.59 | 0.72 |

- Protein structural changes lower pLDDT: It is also important to note that enzymes in solution are dynamic and undergo structural changes to interact with their substrates. These structural changes, particularly in the regions involved in substrate binding, can lower pLDDT values. However, this flexibility is essential for enzyme functionality. For example, proteins that consist primarily of stable secondary structures, like alpha helices, may struggle to perform enzymatic functions if they do not allow for the necessary conformational changes.
- Tolerance for low pLDDT: Additionally, some regions of the enzyme may not be directly involved in catalysis and can tolerate intrinsically disordered regions (IDRs). For substrates that interact with such enzymes, the presence of IDRs far from the catalytic site may not negatively affect performance. Therefore, a lower pLDDT, particularly in non-functional regions, can still be acceptable.
- Artificial proteins with low pLDDT may still be foldable: It has been reported that an artificial protein with a pLDDT of 50.22 could be produced in the wet lab, as reported in Fig. 6(G) of [2].
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[2] Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, and Kevin K. Yang. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, 2023.
W2(3) - In Table 3, the novelty should be evaluated through BLASTp in Uniprot instead of just Swiss-Prot. Resource limitation.
Thank you for the comment. Due to practical constraints, we used Swiss-Prot for the novelty evaluation, as the full UniProt database is too large to manage within our current storage capacity. The zipped package of UniProt data exceeds 1.3TB, and we do not have the resources to unzip and utilize the entire dataset. As a result, we were limited to using Swiss-Prot locally. However, we acknowledge the importance of a more comprehensive evaluation, and we plan to extend the experiment to include comparisons with UniProt in future work.
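For concreteness, a hedged sketch of how such a local Swiss-Prot BLASTp novelty check might be scripted is given below; the file names are placeholders, not the authors' actual setup.

```python
# Sketch: build a local BLAST database from Swiss-Prot, then report the identity
# of the closest hits for a generated sequence as a novelty measure.
import subprocess

# One-time database build from a local Swiss-Prot FASTA file (placeholder path)
subprocess.run(["makeblastdb", "-in", "swissprot.fasta", "-dbtype", "prot",
                "-out", "swissprot_db"], check=True)

# Search a generated sequence; tabular output with percent identity of the best hits
subprocess.run(["blastp", "-query", "generated_enzyme.fasta", "-db", "swissprot_db",
                "-outfmt", "6 qseqid sseqid pident evalue", "-max_target_seqs", "5",
                "-out", "novelty_hits.tsv"], check=True)
```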
W3 - For the guidance training method, how to guarantee a scoring function can be used to score enzyme-substrate pairs that it never sees during training? The scoring function has generalization ability.
Thank you for the comment. The generalization ability of our guidance model has been demonstrated in [1], where it showed the capability to predict enzyme-substrate relationships for unseen pairs during training. This ability allows the model to evaluate enzyme-substrate interactions for new molecules and proteins, even those it has not encountered before. Therefore, the scoring function is designed to generalize well to novel enzyme-substrate pairs by leveraging the underlying biochemical principles learned during training.
[1] Kroll A, Ranjan S, Lercher MJ (2024) A multimodal Transformer Network for protein-small molecule interactions enhances predictions of kinase inhibition and enzyme-substrate relationships. PLOS Computational Biology 20(5): e1012100.
Q1 - What is the definition of "the least common reactant" in data construction? The reactant molecules with the least frequency in all reactions.
Thank you for the comment. To define “the least common reactant,” we first collect all the reactants from the reactions in the database and then count the frequency of each reactant’s occurrence. For example, in a chemical reaction A + B → C + D, if A appears more frequently than B as a reactant, then B would be considered the “least common reactant” among A and B. We select this least common reactant, B, as the substrate for the enzyme targeting that reactant. This approach helps to focus on less common molecules, enhancing the specificity of the enzyme design while avoiding the bias of more frequently occurring reactants.
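A minimal sketch of this counting procedure, using hypothetical reaction records, is given below.

```python
# Sketch: count how often each reactant appears across all reactions,
# then pick the rarest reactant of a given reaction as its substrate label.
from collections import Counter

reactions = [
    {"reactants": ["A", "B"], "products": ["C", "D"]},   # hypothetical reaction records
    {"reactants": ["A", "E"], "products": ["F"]},
]

counts = Counter(r for rxn in reactions for r in rxn["reactants"])

def least_common_reactant(reaction):
    # The reactant with the lowest global frequency becomes the target substrate
    return min(reaction["reactants"], key=lambda r: counts[r])

print(least_common_reactant(reactions[0]))  # "B": A appears twice, B only once
```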
Q2 - What is the accuracy of the scoring function in guided training? 0.809.
Thank you for the comment. In testing, the accuracy of the discriminative guidance function is 0.809. Other metrics on the discriminator are shown in the table below.
| Precision | Recall | F1score | Accuracy |
|---|---|---|---|
| 0.290 | 0.363 | 0.322 | 0.809 |
It’s important to note that the testing dataset is imbalanced, with a ratio of positive to negative samples of 1:7. The F1 score indicates the effectiveness of the guidance function in distinguishing relevant enzyme-substrate pairs.
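As a quick consistency check, the reported F1 score can be recomputed from the precision and recall above:

```python
# F1 is the harmonic mean of precision and recall
precision, recall = 0.290, 0.363
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.322, matching the table
```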
Q3 - Clearly state the inference process of all baseline models. As follows.
Thank you for the comment. We introduce the baselines as follows.
- ProGen2 and ProtGPT2: These protein language models generate sequences using pre-trained weights, with a maximum sequence length of 1024.
- ZymCTRL: This model uses pre-trained weights and takes the Enzyme Commission (EC) number as input, which serves as a prompt for the autoregressive model. It is important to note that the EC number corresponding to the test set substrate provides more information than just the substrate alone.
- NOS: The NOS is trained as described in the original paper. It uses a discriminator to score the binding affinity between an antibody and an antigen (two protein sequences), and this score is used as a loss after the generative model. In the first step of inference, NOS updates the generator's parameters using the current test set input for 10 iterations. After this, the inference and sampling follow the typical discrete diffusion model process. This is repeated iteratively until the full sequence is generated. In our modification, we replace the original discriminative model with an enzyme-substrate probability scoring model, which is also used in our approach. Additionally, we replace the original target protein sequence input with the target substrate molecule, while the rest of the process remains the same.
- LigandMPNN: LigandMPNN is an inverse folding model designed to generate protein sequences from a given protein-ligand complex structure. To generate the compound structure, we start by randomly generating a protein sequence of length 1024, which is then used to predict its structure with ESMFold [1]. The target substrate structure is generated using RDKit, and we employ NeuralPLexer [2] to create the complex structure of the substrate docked with the protein. This complex structure is input into LigandMPNN to obtain the redesigned protein sequence.
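For clarity, the LigandMPNN baseline pipeline described above can be sketched at a high level as follows; the three stub functions are placeholders standing in for ESMFold, RDKit plus NeuralPLexer, and LigandMPNN, not the tools' real APIs.

```python
# High-level sketch of the LigandMPNN baseline pipeline (stubs, not real APIs).
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_protein(length: int = 1024) -> str:
    return "".join(random.choice(AMINO_ACIDS) for _ in range(length))

def esmfold_stub(sequence: str) -> str:
    return "seed_structure.pdb"          # placeholder: ESMFold would write a PDB here

def neuralplexer_stub(protein_pdb: str, substrate_smiles: str) -> str:
    return "complex_structure.pdb"       # placeholder: RDKit conformer + NeuralPLexer docking

def ligandmpnn_stub(complex_pdb: str) -> str:
    return random_protein()              # placeholder: LigandMPNN would redesign the sequence

def baseline_ligandmpnn(substrate_smiles: str) -> str:
    seed_sequence = random_protein()                                 # 1. random 1024-residue seed
    protein_pdb = esmfold_stub(seed_sequence)                        # 2. predict its structure
    complex_pdb = neuralplexer_stub(protein_pdb, substrate_smiles)   # 3. dock the target substrate
    return ligandmpnn_stub(complex_pdb)                              # 4. redesign the sequence

print(baseline_ligandmpnn("CP(=O)(O)[O-]")[:30])  # illustrative SMILES input
```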
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[2] Zhuoran Qiao, Weili Nie, Arash Vahdat, Thomas F. Miller, and Animashree Anandkumar. State-specific protein–ligand complex structure prediction with a multiscale deep generative model. Nature Machine Intelligence, 6(2):195–208, February 2024.
Q4 - For the success rate, what is the definition of "successfully generated sequences"? To me, it should satisfy both functional constraints and structural stability. It seems none of the designed enzymes passed structural selection. A successfully generated sequence is a protein within the length limit, and the pLDDT of our model-generated proteins is acceptable.
Thank you for the comment. We will introduce the definition and details of our generated enzymes' pLDDT.
- Successfully generated protein: In our work, a "successfully generated sequence" is defined as a sequence of up to 1024 amino acids that does not encounter any program errors during the inference process. The following cases are counted as unsuccessful:
- Program Error.
- No stop token.
- Protein with unknown amino acids.
- Our model-generated enzymes show good foldability.
- ESMFold regards proteins with pLDDT over 0.5 as good: Regarding structural stability, we acknowledge that while a high pLDDT score typically correlates with stable protein folding, a low pLDDT does not necessarily imply that the protein is not foldable. The ESMFold paper [1] illustrates that many natural proteins have a pLDDT lower than 0.8, yet they can still fold into stable structures. The paper classifies pLDDT values above 0.5 as indicative of good confidence and values above 0.7 as high confidence. Therefore, while a higher pLDDT is more likely to indicate a stable structure, low pLDDT proteins may still fold successfully.
- Enzymes' structural changes lower pLDDT: It is important to consider that enzymes in solution are dynamic and often undergo conformational changes to interact with substrates. This structural flexibility may lower the pLDDT but is crucial for enzymatic function.
- Our model generates some enzymes with pLDDT over 80%: As noted in Figure 2(b) of our paper, although the average pLDDT is below 0.8, many designed sequences achieve pLDDT values above 0.8, indicating that a portion of the designs have structural confidence. The table below shows the distribution of pLDDT of proteins generated by the basic version of our model (the worst version in terms of pLDDT).

| pLDDT | 0.8 | 0.7 | 0.6 | 0.5 |
|---|---|---|---|---|
| The portion of proteins over the pLDDT | 0.23 | 0.42 | 0.59 | 0.72 |

- Candidate models with diverse pLDDT tendencies: In enzyme design practice, experts typically test enzyme candidates based on performance in various metrics, and they may prioritize different criteria based on the specific task at hand. Thus, we did not establish a rigid bar for structural selection. Instead, in the ablation study, we offer multiple models with different retrieved sequences and their performance on metrics such as kcat and foldability, allowing experts to select the most appropriate candidates based on their design goals.
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
Thanks for the authors' response! After going through the rebuttals, I still have the following concerns:
W1: About the goal of designing non-natural enzymes: If the author intends to find novel but natural enzymes, I can accept that the proposed method is doable. If it's non-natural enzymes, it's hard for me to trust. Even if the author said the substrates should share some physical constraints, non-natural enzymes have very unique properties that cannot be found in the current database. Usually, people targeting the design of non-natural enzymes should carefully design the enzyme pocket with strictly designed physical and biochemical rules from domain experts, like luciferases [1]. I can't imagine these rules can be extracted from the current database. Besides, the reported results showed low pLDDT, further validating my concerns.
[1] De novo design of luciferases using deep learning. Nature. 2023.
W2: ProGen cannot take substrate condition: Even though ProGen cannot take the substrate condition, you can describe the corresponding enzyme category and properties to promote conditional generation.
W2 (2): Enzymes with low pLDDT can be foldable and functional. I agree some cases with low pLDDT can still fold stably, but these cases should be validated in the wet lab, as the authors mentioned in [2]. However, in most cases, the general trend is that higher pLDDT leads to candidates that have enzymatic activity with higher probability. If the authors check the experimentally confirmed enzymes in PDB, they should find most enzymes with an AF2 pLDDT higher than 80.
[2] Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, and Kevin K. Yang. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, 2023.
Overall, I think I still have significant concerns about this paper.
Thank you for the comment. We want to describe how our model fits into the practice of enzyme design.
- Steps of the full pipeline of enzyme design.
- Generated enzymes as candidates: In the practice of enzyme design, deep learning generated enzymes are used as a pool of protein sequence candidates for further filtering.
- Metrics to score proteins: In the next step, these proteins go through in silico (computation-based) filtering to obtain scores for different properties, including pLDDT for foldability, together with task-specific metrics like kcat in our task.
- Wet lab to evaluate consistency of metrics: Then, in vitro (wet lab) experiments are conducted to evaluate whether the metrics reflect real-world performance. Some metrics may prove to be irrelevant to the goal. If pLDDT is found to be not very dominant at this step, its importance will be weighted lower in future steps.
- Repeat and select candidates based on helpful metrics: Candidates are selected with higher weight on the important metrics. The wet lab experiments are repeated, and the protein with the best real-world performance is chosen as the final product.
- The goal of the protein generation model.
- Provide candidates with different good metrics: The model should generate proteins with different strengths and leave the selection to other steps.
- Our model has different versions balancing kcat and pLDDT: The different versions provide models with different strengths, which enables further filtering.
- Some facts about pLDDT.
- pLDDT is the folding models' confidence: pLDDT is given for each position of a protein sequence by a folding model (structure prediction model). pLDDT estimates the confidence in the predicted position of each atom in the protein structure. A pLDDT reported for the whole protein is the average of each position's pLDDT.
- pLDDT tends to be low for new sequence patterns: Since pLDDT is the folding models' confidence in the position prediction given by themselves, if the protein includes some part that is rare in natural proteins, the confidence will tend to be low.
- Generated proteins not targeting structure have a relatively lower pLDDT: Although we aim to generate natural-like enzymes, the generation target is to fit the substrate, not to generate a natural-like structure directly. There are probably some parts the folding model has not learned. Therefore, it is common for the folding model to be less confident in generated sequences.
- Even natural proteins do not always have pLDDT over 80%: The pLDDT values in our study should be interpreted in the context of the ESMFold paper as well. The ESMFold paper provides a distribution of pLDDT for natural protein sequences (Fig. 3B), showing that many natural proteins have pLDDT scores lower than 0.8. In that study, the authors define a “good confidence” pLDDT as greater than 0.5 and “high confidence” as greater than 0.7, suggesting that lower pLDDT proteins can still be functional. Therefore, a high pLDDT indicates that a protein is more likely to fold into a stable structure, while a lower pLDDT does not automatically imply that the protein is unfit to fold or perform its function.
- Protein structural changes lower pLDDT: It is also important to note that enzymes in solution are dynamic and undergo structural changes to interact with their substrates. These structural changes, particularly in the regions involved in substrate binding, can lower pLDDT values. However, this flexibility is essential for enzyme functionality. For example, proteins that consist primarily of stable secondary structures, like alpha helices, may struggle to perform enzymatic functions if they do not allow for the necessary conformational changes.
- pLDDT is an average value, and IDRs will influence overall pLDDT: pLDDT is given for each position. A pLDDT reported for the whole protein is the average of each position's pLDDT. If some parts of the protein are intrinsically disordered regions (IDRs) and have a low pLDDT, the overall pLDDT will be lower, even though this part is not involved in performing the function. For example, some regions of the enzyme may not be directly involved in catalysis and can tolerate IDRs. For substrates that interact with such enzymes, the presence of IDRs far from the catalytic site may not negatively affect performance. Therefore, a lower pLDDT, particularly in non-functional regions, can still be acceptable.
- Our model generates some enzymes with pLDDT over 80%: The table below shows the distribution of pLDDT of proteins generated by the basic version of our model (the worst version in terms of pLDDT). If experts find that pLDDT is the dominant criterion, which is not always the case in the wet lab, they can still use roughly a quarter of the enzymes generated by our model.
| pLDDT | 0.8 | 0.7 | 0.6 | 0.5 |
|---|---|---|---|---|
| The portion of proteins over the pLDDT | 0.23 | 0.42 | 0.59 | 0.72 |
Thanks for the response. It seems the authors themselves are not quite clear about the goals. Initially they were saying the goal is to design non-natural enzymes, but now they are saying the goal is to design natural enzymes. I feel the paper is not well-prepared.
Additionally, I see the novelty score (identity ratio to the ground truth) in Table 6 is 12.544%, while only 23% of generated enzymes have a pLDDT score higher than 0.8, and 72% of generated enzymes even have a pLDDT score of 0.5. This provides obvious evidence that the designed enzymes may not be foldable, and the sequences just fall into some uncertain areas considering the natural protein distribution, thus leading to enzymes with a high novelty score but low generation quality.
Overall, my concerns about this paper still remain. I will keep my current score.
Thank you for your comment. We would like to make a clarification.
1 - Natural or non-natural? Natural-like enzymes, not recorded to appear in nature.
We will introduce the goal of enzyme design in general and then make further clarifications. We did not say "generate natural enzymes"; we said "generate natural-like enzymes" in the reply to the new comments.
- The goal for enzyme design: Generate new enzymes for specific desired functions, which none of the enzymes in nature can perform ideally.
- Natural-like: Enzymes in nature serve as examples and knowledge. From them, the new enzyme is designed. The example of [1] illustrates the process and relation of designed new enzymes and enzymes in nature, as "starting from the sequences of 2,000 naturally occurring NTF2s, we carried out Monte Carlo searches in sequence space" [1].
- Our first response to W1
- Original comment on weakness: "The application scenario of the proposed method is a little unclear to me! If the target enzyme is a natural enzyme, we can always find the EC class in Brenda and the corresponding records in PDB and Uniprot. Then using a supervised method will definitely be better than zero-shot inference. If the target enzyme is a non-natural enzyme, I don't think the proposed retrieval method could find related enzymes in the database just according to substrate similarity, because non-natural enzymes usually have their special properties."
- Reason to use "non-natural": We want to stress that the desired generated enzymes do not exist in nature.
- Context of the comments: "Natural" stands for "existing in nature and recorded by humans", which can be concluded from "find the EC class in Brenda and the corresponding records in PDB and Uniprot". "Non-natural" is the opposite.
- Our second response to W1, which is the reply to the new comment.
- The new comment on weakness: "If the author intends to find novel but natural enzymes, I can accept that the proposed method is doable. If it's non-natural enzymes, it's hard for me to trust. Even if the author said the substrates should share some physical constraints, non-natural enzymes have very unique properties that cannot be found in the current database."
- Reason to use "natural-like": We want to stress that the generated enzymes are not completely different from any existing enzyme.
- Context of the comments: "Non-natural" stands for "totally new enzymes, unlike existing ones", which can be concluded from "non-natural enzymes have very unique properties that cannot be found in the current database."
[1] Yeh, A.HW., Norn, C., Kipnis, Y. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
Thank you for your comment. We want to introduce the meaning of the pLDDT distribution table.
2 - 72% of generated enzymes have a pLDDT score of 0.5. 72% of them have a pLDDT OVER 0.5.
The table demonstrates "the portion of proteins over the pLDDT," which is the row header's content. Therefore, the table is to be interpreted as follows:
- 23% of the generated have a pLDDT over 0.8.
- 42% of the generated have a pLDDT over 0.7.
- 59% of the generated have a pLDDT over 0.6.
- 72% of the generated have a pLDDT over 0.5.
We demonstrate the same distribution in the table as follows:
| pLDDT | [1, 0.8) | [0.8, 0.7) | [0.7, 0.6) | [0.6, 0.5) |
|---|---|---|---|---|
| The portion of proteins in the pLDDT range | 0.23 | 0.19 | 0.17 | 0.13 |
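To make the relation between the two tables explicit, each per-bin fraction is simply the difference between consecutive cumulative values:

```python
# Cumulative fractions of generated proteins with pLDDT above each threshold (first table)
cumulative = [(0.8, 0.23), (0.7, 0.42), (0.6, 0.59), (0.5, 0.72)]
upper, prev = 1.0, 0.0
for threshold, frac_over in cumulative:
    # Difference of consecutive cumulative fractions gives the per-bin fraction (second table)
    print(f"pLDDT in [{threshold}, {upper}): {frac_over - prev:.2f}")
    upper, prev = threshold, frac_over
# Prints 0.23, 0.19, 0.17, 0.13
```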
Thank you for your comment. We want to introduce evidence of the natural-likeness of generated enzymes.
2 - The sequences just fall into some uncertain areas considering the natural protein distribution. The generated enzymes closely resemble natural enzymes despite their novelty.
- Being different from ground truth enzymes does not suggest "falling into some uncertain areas": Table 6 demonstrates that the generated enzymes differ only from the ground truth enzymes of the same substrate. It does not suggest that the generated enzymes have very low similarity to any existing proteins. Therefore, it is not shown that they "fall into some uncertain areas".
- Generated enzymes exhibit the highest resemblance to natural enzymes in identity distribution among baselines.
- Wasserstein distance (WD) of identity distribution: For each substrate, the WD is calculated between the two distributions:
- the identities (similarity/novelty) of ground truth enzymes compared to their most similar natural proteins (excluding exact matches), and
- the identities (similarity/novelty) of generated enzymes compared to their most similar natural proteins.
The results for each baseline are shown as follows. The table below is extracted from a column of Table 2. Each value is the average WD across all substrates.
| Model name | ProtGPT2 | ProGen2 | ZymCTRL | NOS | LigandMPNN | Ours |
|---|---|---|---|---|---|---|
| WD to ground truth | 31.5 | 26.7 | 23.0 | 36.5 | 33.6 | 20.8 |

- Highest similarity to ground truth identity distribution: As shown above, our model achieves the lowest WD, indicating that the enzymes generated by our model most closely resemble the ground truth enzymes in terms of identity distribution. This shows that our model's generated enzymes resemble the ground truth in terms of "novelty relative to other natural proteins", reflecting natural-likeness.
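For illustration, such a per-substrate WD between identity distributions could be computed with SciPy as sketched below; the identity values are made-up placeholders, not the paper's data.

```python
# Sketch: Wasserstein distance between two identity distributions for one substrate
from scipy.stats import wasserstein_distance

# Hypothetical identities (%) versus the closest natural protein
ground_truth_identities = [34.1, 47.5, 29.8, 52.0]   # ground-truth enzymes of the substrate
generated_identities = [31.0, 44.2, 36.7, 50.3]      # enzymes generated for the substrate

wd = wasserstein_distance(ground_truth_identities, generated_identities)
print(f"WD for this substrate: {wd:.1f}")  # the reported table averages WD over all substrates
```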
Thanks for your reply! We will address your concerns.
W1 - About the goal of designing non-natural enzymes. The goal is to find natural-like but novel enzymes.
Thank you for your insightful comment. We would like to clarify that the goal of our work is to generate natural-like enzymes—designed by humans yet adhering to physical and biochemical constraints present in naturally occurring enzymes.
- Novel natural-like enzyme.
- Natural-like but novel: We would like to clarify that the goal of our work is to generate natural-like enzymes. These enzymes do not exist in current protein databases but share some similarities with the existing known natural enzymes.
- Not completely non-natural enzymes: Our model does not target the creation of completely non-natural proteins that are totally different from any known ones.
- New enzymes derived from the old: To be specific, the ideal case is to automatically find base proteins in the database and apply slight modifications to them. The base should already have valid pocket structures, and the modification mainly adapts the catalytic sites to the new target substrates.
- Patterns can be learned: Although it is hard for the model to learn the physical and biochemical rules behind enzymes, some sequence patterns can be learned. For example, to bind a specific small molecule, certain amino acids should be present because only their side chains form hydrogen bonds or π-interactions with the target.
- Why our model generates natural-like enzymes.
- Gathering from retrieved enzymes: Our model retrieves known enzymes as bases for the generation. These enzymes have some general properties of certain enzyme families, like pockets and catalytic sites for their own substrates. Since their substrates share similarities with the target substrate, a "recombination" of these retrieved enzymes mixes their pockets, binding sites, and catalytic sites and generates possible solutions for the target substrate.
- Similar to the human expert rational design practice: In the rational design of enzymes for new substrates, experts start from an existing protein, like a heme protein, and further modify this base [1] [2]. Our model retrieves enzymes as candidates and references, and then the generator derives the result from the retrieved ones.
[1] Kalvet, I., Ortmayer, M., Zhao, J., Crawshaw, R., Ennist, N. M., Levy, C., ... & Baker, D. (2023). Design of heme enzymes with a tunable substrate binding pocket adjacent to an open metal coordination site. Journal of the American Chemical Society, 145(26), 14307-14315.
[2] Chaturvedi, S. S., Bím, D., Christov, C. Z., & Alexandrova, A. N. (2023). From random to rational: improving enzyme design through electric fields, second coordination sphere interactions, and conformational dynamics. Chemical Science, 14(40), 10997-11011.
W2 - ProGen cannot take substrate condition. ProGen does not fit the zero-shot setting.
Thank you for the comment. The reason for not including ProGen is that it does not fit the zero-shot setting.
- Zero-shot enzyme generation.
- Targeting substrates not recorded: Our task is to generate enzymes for the target substrate. This target substrate should never have been seen by the model before testing.
- Reason for the setting: Some small molecules do not currently have good enzymes. We want the model to be able to generate enzymes for those substrates that lack enzymes.
- Insights on molecule condition versus natural language condition.
- ProGen's tag inputs must be seen in training: The input tags for ProGen are natural language words. For example, if ProGen is asked to generate enzymes for L-glutamate but it has never seen the word "glutamate", it cannot correctly embed the ground truth tags like "glutamate formimidoyltransferase" and "glutamate N-acetyltransferase."
- ProGen is good in natural language conditioned generation: The ability to take advantage of word tags makes ProGen generate proteins similar to the corresponding natural enzyme families.
- Molecule condition provides zero-shot ability: Take L-glutamate as the target substrate, for example. If the SMILES representation `[NH3+][C@@H](CCC([O-])=O)C([O-])=O` is provided, the model can learn from other molecules with similar atoms and bonds to leverage the input.
We acknowledge the contribution of ProGen and introduced it in the related work, as we highlighted it in red font in the section Related Work in the appendix.
The paper addresses the challenge of zero-shot enzyme generation tailored for specific target substrates. It introduces a substrate-specified enzyme generator that retrieves enzymes with structurally similar substrates to the target molecule and employs a discrete diffusion model to generate new enzyme sequences guided by a classifier. Experiments reveal that SENZ outperforms current enzyme generation approaches across metrics including catalytic activity and structural stability. Additionally, SENZ contributes a dataset for substrate-specified enzyme generation.
Strengths
- The paper proposes a zero-shot enzyme design task using a retrieval-augmented method. Given that the retrieved enzymes cannot directly catalyze the target substrate, the retrieval results are used to guide the generation of novel enzymes that can catalyze the target substrate.
- The model demonstrated higher performance in generating enzymes with high catalytic activity than the baseline model.
- The model additionally trained a discriminator to guide the generator. Experimental results showed that with the guidance of the discriminator, the model can generate enzymes with higher catalytic activity.
Weaknesses
- The model's foundation lacks innovation, as it uses the existing method EvodiffMSA-OADM. The model only considers the sequence level without extending to the structural level.
- The selection of the baselines should be improved, e.g., the protein language models for unconditional generation and sequence generation models are relatively outdated. Moreover, the details of the baseline are not explained in the paper.
- In the catalytic activity evaluation, seven specific tasks were included, but the reasons for their selection were not provided. Furthermore, the results are inconsistent, in some cases, the unconditional generation results outperform the conditional generation results.
- The model's foldability and catalytic performance of sequences should be further analyzed and balanced, as a decrease in foldability may hinder the successful design of proteins in the wet lab.
- The paper does not include a reproducibility statement.
Questions
- Can you explain how the model generates enzyme sequences in extreme cases such as when there are no retrieval results, and how to ensure the quality of the generation?
- During the model’s sampling phase, how to align the retrieval results to target generation? If all retrieval results are aligned with the first retrieval result, could this potentially introduce errors?
- Could you explain the details of how you modified the NOS in the baselines?
- For the other questions, please refer to the weaknesses section.
Q1 - How the model generates enzyme sequences in extreme cases such as when there are no retrieval results, and how to ensure the quality of the generation. The molecule encoder can help, and discriminative models filter the generated enzymes.
Thank you for the comment. The molecule encoder can help when the retrieved results are not helpful, and discriminative models filter the generated enzymes for quality control.
- Relying on molecule encoder.
- Fixed number of retrieval results: We ensure the model retrieves a fixed number of enzymes, maintaining consistency with the training setup. Different model variations can retrieve varying numbers of enzymes, but even if the retrieved results are unhelpful—or effectively equivalent to “no retrieval results”—our model leverages the molecule encoder.
- Molecule encoder: This encoder directly processes the target substrate, enabling the generative model to produce enzyme sequences even without meaningful retrieval results.
- Quality control.
- kcat: We utilize a discriminative model to predict the catalytic capability of the generated proteins for the target substrate. Filtering can be done based on the value.
- Foldability: We predicted the enzyme structure with a folding model. Filtering can be done based on the structure and pLDDT.
- Docking check: We dock the generated enzymes and the target molecule. The docking score and docked structure provide a reference for filtering.
These evaluation steps assess and refine the quality of the generation.
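A simple illustration of such post-generation filtering is sketched below; the thresholds and record fields are placeholders, not the cut-offs used in the paper.

```python
# Illustrative filtering of generated candidates by predicted kcat, pLDDT, and docking score
def filter_candidates(candidates, min_log_kcat=0.5, min_plddt=0.7, max_vina=-3.0):
    kept = []
    for c in candidates:
        if (c["log_kcat"] >= min_log_kcat      # predicted catalytic activity
                and c["plddt"] >= min_plddt    # foldability confidence
                and c["vina"] <= max_vina):    # docking score (lower is better)
            kept.append(c)
    return kept

candidates = [{"sequence": "MKT...", "log_kcat": 0.9, "plddt": 0.74, "vina": -3.4},
              {"sequence": "MAS...", "log_kcat": 0.2, "plddt": 0.81, "vina": -2.1}]
print(len(filter_candidates(candidates)))  # 1
```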
Q2 - During the model's sampling phase, how to align the retrieval results to target generation? If all retrieval results are aligned with the first retrieval result, could this potentially introduce errors? The retrieved results are treated equally in the alignment algorithm.
Thank you for the comment. The retrieved results are not aligned to a single sequence. They are treated equally in the alignment algorithm.
- Align by ClustalW algorithm.
- ClustalW: It begins by computing pairwise alignments to create a distance matrix, then builds a guide tree to iteratively align sequences or groups. It optimizes scoring with weighted gaps and substitution matrices to achieve biologically meaningful alignments.
- Not aligned to a single sequence: The retrieved results are treated equally within the alignment algorithm. ClustalW ensures that the order of sequences in the input does not affect the final alignment. This is achieved through the guide tree, which determines the alignment order based only on sequence similarity.
- Generation process: To initialize the generation process, a fully masked sequence is added at the top of the MSA, serving as the starting state of the sequence to be generated. The MSA, including the retrieval results and the masked sequence, is then fed into the generative model. The generative model iteratively fills in some of the masked positions in the first sequence during each iteration.
Since the retrieval results are not aligned to one specific sequence, this approach avoids introducing errors from unreasonable alignment to a single sequence.
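Conceptually, seeding the generation with a fully masked row on top of the aligned retrieved sequences can be sketched as follows; the mask token and sequences are illustrative, not the actual implementation.

```python
# Sketch: build the MSA input with a fully masked row as the sequence to be generated
MASK = "#"  # placeholder mask token

aligned_retrieved = [
    "MKT-LLVAG",   # hypothetical ClustalW-aligned retrieved enzymes
    "MKTALL-AG",
]
msa = [MASK * len(aligned_retrieved[0])] + aligned_retrieved

# At each diffusion step, the model would replace some masked positions in msa[0]
# with amino acids, conditioned on the retrieved rows and the substrate embedding.
print(msa[0])  # "#########"
```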
Q3 - The details of how you modified the NOS in the baselines. Only replace the original guidance function with the substrate-enzyme relation scorer in our model.
Thank you for the comment. We only replace the original guidance function with the substrate-enzyme relation scorer in our model.
- Original NOS.
- Protein-protein binding affinity scorer: The original NOS is trained as described in its paper, using a discriminator to score the binding affinity between an antibody and an antigen (two protein sequences). This discriminator is appended after the generative model, and the score is used as a loss during training.
- Inference: During inference, NOS updates the generator's parameters with the current testing set input for ten iterations. Subsequently, inference and sampling are conducted as in a typical discrete diffusion model. This process is repeated iteratively until the full sequence is sampled.
- Our modification.
- Substrate-enzyme relation scorer: We replaced the original discriminative model with the enzyme-substrate probability scoring model used in our approach. Additionally, the input for the generator was changed from the target protein sequence to the target substrate molecule.
- Keep others the same: All other steps, including training and inference processes, remained the same.
W3 - The reasons for the selection of the 7 tasks were not provided. The unconditional generation results outperform the conditional generation results. The 7 target substrates are important in metabolism, and massively pre-trained methods may outperform unsuitably conditioned methods.
Thank you for the comment. The seven tasks were chosen because they involve designing enzymes for seven well-known small molecules that play critical roles in metabolic pathways. We will also introduce why unconditional generation outperforms conditional generation.
- Task selection.
- Relevance: These substrates are significant in metabolism and have practical importance in biochemical research and applications.
- Independence: They are not included in the training or validation sets of any of the models, ensuring the evaluation is unbiased and represents a true zero-shot scenario.
- Ground Truth: Known enzymes for these substrates exist, providing ground truth for comparison with the enzymes generated by the models.
- Reasons for superiority in unconditional methods.
- Unconditional methods with diverse training sets: For unconditional generation baselines such as ProGen2 and ProtGPT2, we used the weights provided by their authors. The pre-training datasets of ProGen2 and ProtGPT2 are extensive and diverse, which may lead these models to generate proteins with functional properties suitable for catalysis, even without explicit guidance. These large-scale protein sequence datasets for pre-training include ground truth enzymes for some test substrates. They are therefore likely to generate proteins that score well from many perspectives.
- Some conditional methods only generate protein binders: Some conditional methods, including LigandMPNN, generate protein binders, not enzymes, for the target. Therefore, they perform badly in kcat.
- kcat variability: The catalytic activity evaluation (via kcat) is highly sensitive. A single or double point mutation in the catalytic site can lead to significant changes, sometimes resulting in a 5-fold improvement. This inherent variability in kcat measurements contributes to the observed inconsistencies.
- Multiple functional sites in one protein: The unconditionally generated proteins sometimes have multiple catalytic sites and pockets and thus a high kcat. When the conditionally generated enzymes cannot leverage the target well, the generated protein will have no function.
Considering the massive pre-training in unconditional methods and the weakness in modeling substrate conditions for conditional baselines, the result is not unexpected.
W4 - The model's foldability and catalytic performance of sequences should be further analyzed and balanced. We provide versions of the model, and wet lab practitioners can do the selection.
Thank you for the comment. We provide versions of the model in the ablation study, and wet lab practitioners can select the model.
- Metrics should be evaluated in wet lab: In wet lab, practitioners prioritize multiple metrics depending on the specific application. The designed enzymes are tested in initial experiments, and the importance of each metric is then tailored to the task at hand. Therefore, we do not know the importance of the metrics before the wet lab, and it is hard to select the best version of the balanced model.
- Candidate models as the solution: We provide a range of models trained with different numbers of retrieved sequences. These models allow domain experts to choose the most suitable designs for their experimental objectives. For example, in applications where catalytic activity is paramount, sequences with higher Kcat values may be prioritized. Conversely, for tasks requiring robust folding in physiological conditions, foldability may take precedence. Moreover, as shown in Figure 3(f), the structures of our designed enzymes demonstrate satisfactory docking positions, further supporting their functional viability.
W5 - The paper does not include a reproducibility statement. We append it after the conclusion section.
Thank you for the comment. We have added a Reproducibility Statement section between the Conclusion and References in the revised manuscript. The content of the section is as follows:
We have described all necessary details to ensure reproducibility, including dataset information, model architectures, hyperparameters, and evaluation protocols. The code is available at https://anonymous.4open.science/r/SENZ-2BE1/.
We also highlighted it in the manuscript.
W2 - The protein language models for the unconditional generation and sequence generation models are relatively outdated. Moreover, the details of the baseline are not explained in the paper. The baselines are selected based on their importance and relevance, and we appended the details of baselines.
Thank you for the comment. We will introduce the selection of baselines and their details.
- Select mainstream approaches as baselines.
- Unconditional generation models: The protein language models, including ProGen2 (Nov. 2023 on Cell systems) and ProtGPT2 (July 2022 on Nature Communications), are unconditional protein generation models with high impact. The protein language models are used as baselines to evaluate unconditional generation performance.
- Conditional generation models: ZymCTRL is selected because it is designed for enzyme generation, although it needs the Enzyme Commission (EC) number.
- Details.
- ProGen2 and ProtGPT2: We utilized the pre-trained weights for both models to generate sequences with a maximum length of 1024. These models serve as benchmarks for the capability of protein language models to generate sequences without specific functional guidance.
- ZymCTRL: This model employs pre-trained weights and uses the Enzyme Commission (EC) number as a prompt for the autoregressive generation process. It is worth noting that the EC number provides more detailed information about enzymatic function compared to the substrate alone, offering this baseline an advantage in generating enzyme sequences for the given tasks.
- NOS: We trained NOS following the methodology of its original paper. The original NOS framework uses a discriminator to score the binding affinity between an antibody and an antigen (two protein sequences). We replaced the original discriminator with our enzyme-substrate probability scoring model in our adaptation. Furthermore, we replaced the target protein sequence input with the target substrate molecule. During inference, the NOS generator is updated iteratively for 10 steps using the test set input before sampling, following a discrete diffusion model for sequence generation, as described in the original paper. These adjustments allow NOS to generate enzymes in our setting while preserving its original generative framework.
- LigandMPNN: This reverse folding model generates a protein sequence based on a protein-ligand complex structure. To adapt it for our task, we randomly generated protein sequences (length: 1024) and predicted their structures using ESMFold [1]. Using RDKit, we generated the structure of the target substrate, and NeuralPLexer [2] was employed to dock the substrate with the predicted protein structure, creating a complex structure. The resulting complex was then input into LigandMPNN for sequence redesign.
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[2] Zhuoran Qiao, Weili Nie, Arash Vahdat, Thomas F. Miller, and Animashree Anandkumar. State-specific protein-ligand complex structure prediction with a multiscale deep generative model. Nature Machine Intelligence, 6(2):195–208, February 2024.
Dear Area Chair and Reviewers,
We sincerely thank you for your thoughtful and thorough evaluation of our paper. Below are our detailed responses to your comments:
W1 - The model's foundation lacks innovation, as it uses the existing method EvodiffMSA-OADM. The model only considers the sequence level without extending to the structural level. The novelty is to generate enzymes from the substrate, and our model considers structure by MSA.
Thank you for the comment. We will introduce the novelty of our model and its consideration of structure.
- Innovation.
- EvoDiff is protein-sequence-conditioned only: EvoDiff-MSA is indeed capable of generating new sequences, but it does so either without specific conditions or with only partial sequence information as input. EvoDiff per se cannot generate substrate-specific enzymes because it does not take the substrate as input.
- Our model is molecule-conditioned: Our approach enables the direct generation of enzymes from the target substrate in a zero-shot setting, which is a significant difference. We have introduced a retrieval module, molecule encoder, and guidance mechanism to achieve it.
- Consideration of structure.
- Structure information is in MSA: As ESMFold [1] showed, MSA can provide enough information for protein structure prediction. In our model, we collect the MSA, from which the generator extracts the structure information. Moreover, since protein structures can be predicted from their sequences using folding models, it has been demonstrated that a protein’s sequence inherently carries information about its structure.
- Results have valid structure: As evidence of our model’s effectiveness, the generated enzyme shows favorable docking positions, as shown in Figure 3(f), which indicates a valid structure.
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
Thank you for the comment. We want to introduce the novelty of our paper further from the perspective of task setting.
- Zero-shot enzyme generation.
- Conditioned on substrate: Our model is designed to generate an enzyme given only a small molecule, without any natural language description or specified tags for the desired protein family.
- Target substrate is new: This target substrate should never have been seen by the model before testing.
- Reason for the zero-shot setting: Some small molecules do not currently have good enzymes. We want to make the model able to generate enzymes for those substrates without enzymes.
- The reason why our model can achieve zero-shot enzyme generation.
- Molecule input provides more learnable information: Compared with label-conditioned generation, the molecule condition provides more sharable information. Take L-glutamate as a target substrate, for example. Its SMILES representation `[NH3+][C@@H](CCC([O-])=O)C([O-])=O` contains the atom and bond data, enabling the model to learn from other molecules with similar atoms and bonds to leverage the input. Label-conditioned models heavily rely on the information learned from each label because labels have less inner relation with each other.
- Retrieved proteins provide references for generation: Although the input is only a target substrate, we can further leverage the known enzymes more than in training. The retrieved enzymes help to represent the features of the target molecule indirectly because they are enzymes of substrates that share similarities with the target. The retrieved enzymes and the desired output are all proteins, making the generation as easy as extracting from references.
Thanks for the comprehensive response. I have also reviewed the questions and rebuttals from other reviewers. Overall, I now have a better understanding of the model's implementation. But I still have concerns regarding the validity of the evaluation metrics and the quality of the generated outputs. The Kcat predicted by UniKP is a relatively easy-to-trick metric. Did the authors consider using metrics related to enzyme and substrate binding for evaluation? I would like to maintain my score given the current quality of the paper.
Thank you for your comment. We would like to introduce our evaluation of the binding in the case study.
Steps to evaluate binding between enzyme and substrate
- All the baselines generate enzyme sequences for methylphosphonate(1-).
- The structures of the generated enzyme sequences are predicted by ESMFold [1].
- The enzyme structures are docked with the substrate by NeuralPLexer [2].
- AutoDock Vina optimizes the docking position, and a Vina score is provided.
- Lower (more negative) Vina scores suggest stronger predicted binding between the enzyme and substrate. The scores are listed as follows.
| Model name | ProtGPT2 | ProGen2 | ZymCTRL | NOS | LigandMPNN | Ours |
|---|---|---|---|---|---|---|
| Vina score | -2.607 | -2.445 | -2.726 | -2.574 | -2.133 | -3.075 |
Figure 3 in our manuscript also demonstrates the optimized structure of binding.
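For reference, a hedged sketch of obtaining such a Vina score via the AutoDock Vina command line is shown below; the file names and box parameters are assumptions for illustration, not the actual settings used.

```python
# Sketch: run AutoDock Vina on a prepared receptor/ligand pair; the output log
# reports binding affinities in kcal/mol (lower, i.e., more negative, is better).
import subprocess

subprocess.run([
    "vina",
    "--receptor", "generated_enzyme.pdbqt",
    "--ligand", "methylphosphonate.pdbqt",
    "--center_x", "10.0", "--center_y", "12.5", "--center_z", "-3.0",
    "--size_x", "20", "--size_y", "20", "--size_z", "20",
    "--exhaustiveness", "8",
    "--out", "docked_pose.pdbqt",
], check=True)
```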
[1] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[2] Zhuoran Qiao, Weili Nie, Arash Vahdat, Thomas F. Miller, and Animashree Anandkumar. State-specific protein–ligand complex structure prediction with a multiscale deep generative model. Nature Machine Intelligence, 6(2):195–208, February 2024.
This paper proposes a zero-shot enzyme generation method. This approach includes two parts. First, it retrieves the enzymes according to the substrate similarity. Second, it generates novel enzymes by a classifier-guided conditional generative model, given the substrate and retrieved proteins as input.
Strengths
Overall, this is a sound paper. The motivation is clear. The method is reasonable. Empirical results show the improvement of their method over several baselines in the zero-shot setting.
Weaknesses
(1) evaluation metrics
As shown in Table 2, the scores of generated enzymes are much higher than the ground truth in some cases. For example, the log10(k_cat) for ground truth is only 0.247 for Sepiapterin, while the generated enzyme could achieve 0.705. This significant difference raises concerns about the reliability of the evaluation metrics.
(2) Confusing annotations about the probability p and the variable x.
For example, on page 5, line 268, the notation suggests that a probability p is set equal to a variable x, which is conceptually unclear. Similarly, in Equations (11) and (14), there appears to be a conflation between the variable x and the probability p.
Questions
(1) As mentioned in the "weakness" part, could you explain why your score is much higher than the ground truth? Additionally, is there a way to directly compare the generated enzymes to the ground truth?
(2) In Section 4.4, do you re-train the model with #Retrieved = 0, or just apply this in the generation stage? This might be unfair because of the mismatch between training and testing. I have a similar question regarding the "with and without guidance" ablation study. Is this ablation also applied in training?
Dear Area Chair and Reviewers,
We sincerely thank you for your thoughtful and thorough evaluation of our paper. Below are our detailed responses to your comments:
W1 - The scores of generated enzymes are much higher than the ground truth. It is common for artificial enzymes to outperform natural counterparts in kcat.
Thank you for the comment. We will explain the high variance in kcat and discuss why generated enzymes can outperform the ground truth, which are natural enzymes.
- The high variance in kcat is not unexpected.
- Sensitive: kcat measurements are known to be sensitive, and it is common to observe a significant (5-fold) improvement with just 1-2 point mutations.
- Factors influencing kcat: If the catalytic site of an enzyme is replaced with amino acids that do not support catalysis, the enzyme will lose its catalytic ability, resulting in a much lower kcat value.
- Ground truth enzymes are not necessarily the best.
- Natural enzymes acquired their function through evolution: Natural enzymes arrived at their current function and sequence through evolutionary processes that filter out non-functional variants. There is always room for designing more efficient alternatives.
- Artificial enzymes may outperform their natural counterparts: Enzyme engineering methods have produced better enzymes [1]. Frances H. Arnold was awarded the 2018 Nobel Prize in Chemistry "for the directed evolution of enzymes." Such enzyme-optimization experiments are evidence that enzymes are optimizable.

Therefore, the high variance in kcat results is not unexpected, and natural proteins are not necessarily the most efficient enzymes for their respective substrates.
[1] Karl-Erich Jaeger, Thorsten Eggert. Enantioselective biocatalysis optimized by directed evolution. Current Opinion in Biotechnology, Volume 15, Issue 4, 2004, Pages 305-313.
W2 - Confusing annotations about probability p and variable x. x is the extreme case of p.
Thank you for the comment. We will explain why we sometimes deduce p to x.
- x: Variable x is a 2D matrix where each row represents a position in the protein sequence, and each column is a one-hot representation of the amino acids, as we highlighted in red on Page 3, lines 115-118.
- p: The probability matrix p has the same shape as x, but instead of one-hot encoding, each column in p represents the probability distribution of selecting each amino acid at a given position.
- Relation between x and p: x is a deterministic matrix with 1 at the specific amino acid for each position, while p represents a probabilistic distribution. When p corresponds to a known distribution from the training set, we deduce x from p as the extreme case, where each position in x contains either 0 or 1.
To summarize, x and p have the same shape, with x representing the deterministic case where each position is restricted to a single amino acid choice (either 0 or 1), while p encodes the probabilistic distribution over amino acids.
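For concreteness, a minimal numpy sketch of this relation is shown below; the layout (rows as sequence positions, columns as the 20 amino-acid types) and all variable names are illustrative assumptions, not code from our repository.

```python
import numpy as np

rng = np.random.default_rng(0)
num_positions, num_amino_acids = 5, 20

# p: one probability distribution over amino acids per sequence position (rows sum to 1).
p = rng.random((num_positions, num_amino_acids))
p /= p.sum(axis=1, keepdims=True)

# x: the deterministic "extreme case" of p, i.e. a one-hot matrix of the same shape
# that places all probability mass on a single amino acid at each position.
x = np.zeros_like(p)
x[np.arange(num_positions), p.argmax(axis=1)] = 1.0

assert x.shape == p.shape
assert np.allclose(x.sum(axis=1), 1.0)
```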
Q1 - Why is the score much higher than the ground truth? A way to directly compare the generated enzymes to the ground truth. Compare sequence similarity.
Thank you for the comment. The reason why the generated protein is better than the ground truth can be found in response to W1, and we will introduce how to compare the generated protein to the ground truth directly.
- The reason the generated enzymes can be better than the ground truth.
- Response to W1: Please refer to the explanation in response to W1.
- Directly compare the generated protein to the ground truth.
- Comparing proteins themselves is unnecessary: Regarding direct comparisons with the ground truth, we aim to generate enzymes with catalytic activity, which does not necessarily imply high sequence similarity to the ground truth enzyme.
- Sequence identity for direct comparison: To directly compare the generated enzymes to the ground truth, we align the two protein sequences and calculate the identity, which measures the ratio of positions with identical amino acids. For targeting methylphosphonate(1-), the sequence identities between the ground truth enzyme and those generated by models are shown in the following table in percentage.
| Model | ProtGPT2 | ProGen2 | ZymCTRL | NOS | LigandMPNN | SENZ |
|---|---|---|---|---|---|---|
| Identity (%) | 14.14 | 15.27 | 11.47 | 19.57 | 15.10 | 20.17 |
Typically, a sequence identity below 30% indicates significant differences between two proteins.
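For reference, identities of this kind can be computed from a global pairwise alignment, e.g. along the lines of the sketch below using Biopython; the scoring parameters and the choice of denominator (alignment length) are illustrative assumptions, not necessarily the exact settings behind the table above.

```python
from Bio import Align

def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns in a global alignment where both residues match."""
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    # Illustrative scoring so that gaps are penalised; not tuned for any particular study.
    aligner.match_score = 1
    aligner.mismatch_score = -1
    aligner.open_gap_score = -2
    aligner.extend_gap_score = -0.5
    aln = aligner.align(seq_a, seq_b)[0]
    a, b = str(aln[0]), str(aln[1])  # aligned sequences with '-' for gaps
    matches = sum(ca == cb and ca != "-" for ca, cb in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("MKTAYIAKQR", "MKTAHIAKQR"))  # toy sequences; prints 90.0
```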
Q2 - Re-train the model or just change the generation stage in the ablation study? Re-train the model.
Thank you for the question. Yes, the model has been re-trained with each setting in the ablation study. Additionally, we have trained the model both with and without guidance to assess its impact on performance.
Thanks for your response. I will maintain my score.
Dear Reviewer JbBV,
Thank you for your timely response to the rebuttal. Can you please provide a motivation for maintaining your score after seeing the rebuttal and the other reviews? It is absolutely fine to maintain your score, but either way it is necessary to explain if and how the authors' response and the other reviews influenced your stance.
Many thanks and kind regards,
AC
Hi AC,
Thank you for the reminder.
While kcat does not seem to be a strong metric in this context, the comparison of sequence similarity has addressed some of my concerns. Considering the paper’s reasonable contribution to zero-shot enzyme generation, I believe it justifies maintaining my current score.
This paper introduces SENZ, a retrieval-augmented method for zero-shot enzyme design conditioned on a specific substrate. Overall, the paper is very well-written, impactful, and an interesting contribution to the field of de novo enzyme design.
Strengths
- Very good review of prior work on protein design; given the page limit it is hard to be extremely thorough but I found it a good selection and well-structured.
- It is a nice contribution that the authors provide a substrate-enzyme relationship dataset as part of the work.
- Good details provided in the paper for the different methods used, from the dataset curation to the model training and evaluation.
- I liked the comparison to a diverse set of methods as it helps to understand the respective strengths of SENZ.
- The paper is well-written and well-structured as a whole.
Weaknesses
- I think just minimizing the docking score or optimizing k_cat for a given substrate and designed protein are not by themselves good enough metrics to assess the designs. In lieu of experimental validation, I would suggest rediscovery tasks to see if SENZ can design a known enzyme that is not present in the training set (i.e., still treating it as a zero-shot generation problem, but having a known solution, for validation).
- It was not fully clear to me how SENZ sets itself apart from existing work in enzyme design. This could be made more clear. For reference, I work in molecular design, but I am not an expert in enzyme design and am not as familiar with the field, so some things may be obvious to you but did not necessarily come across to me from reading the paper, and the respective advantages/limitations of SENZ could be made more explicit.
- No code made available, as far as I can tell. Without this, I cannot make the score higher than 5, so please let me know if I have simply missed it and I can update my score to a 6 (there are ways to make code available anonymously).
- Recommend to proofread the paper more carefully; minor grammatical mistakes here and there but nothing too distracting nor that made it impossible to understand the meaning, so I did not take off points for this. However, I wanted to point it out as in most cases they were mistakes that any grammar checker could catch with little time investment on behalf of the authors.
Questions
- For the same substrate, how different are the proteins generated via the different benchmarked methods? Did the authors analyze this? This would be really interesting to see, both in terms of sequence and structural similarity.
- How do the authors handle potential post-translational modifications which could lead to improved/decreased enzyme activity relative to the reference enzyme?
- It seems to me that there is a lot of information that the authors are currently not using in generating potential new proteins via SENZ (e.g., just the substrate, but not information on the generated products, any intermediates/transition states, 2D structures, etc, where known). I would imagine in the zero-shot setting it is quite important to include as much information as possible; how are the authors planning to incorporate this in future work to improve data efficiency?
- What are the failure modes of SENZ?
Details of Ethics Concerns
n/a
Thank you to the authors for revising the manuscript and providing a link to the code base.
While it is great to provide the code as-is, and certainly better than nothing, I found it very bare-bones and lacking in documentation. There is not even a filled in README. That makes it very difficult to reproduce the results, even if the src/ is available. I recommend the authors to add documentation, including instructions, so that potential users could actually benefit from having the code and perhaps even re-run some of the models trained in the paper.
I also see that there are significant concerns from myself and other reviewers that have only been addressed in the comments here, but not in the main text. I think it is very valuable to consider that if things are not clear to us, perhaps they could be clarified further in the paper, and I would recommend to make changes in the main text to include some of the relevant additional information you have given us below.
I have not updated my score.
Dear reviewer rqNp,
Thank you for engaging in discussions with the authors. Your effort is much appreciated.
I'd like to clarify that sharing an anonymized version of the code is not a mandatory requirement for acceptance at ICLR. Therefore, the absence of shared code by the authors should not be used as a reason to recommend rejection, nor should incomplete documentation of shared code be a reason for rejection. The only relevant guidelines that I am aware of are the author guidelines on reproducibility statements. These guidelines state that a reproducibility statement is strongly encouraged, but optional, and can potentially include an anonymized link to code. Therefore, please try to judge reproducibility aspects as much as possible based on the content and information provided in the paper, and where appropriate see if the shared code can help alleviate any specific concerns that you have based on the paper.
Many thanks,
AC
Thanks AC for clarifying it is not an official requirement. I can understand that for some papers the code may not be necessary for reproducibility, but for this particular work I do believe well-documented code is necessary to ensure its reproducibility and hence my insisting.
Dear reviewer rqNP,
Thank you for your timely response. Can you please elaborate why you think that for this particular paper well-documented code is absolutely necessary? Which particular aspects of the paper make you question the reproducibility?
Sure!
The reason I question the reproducibility is because I do not feel that I could reproduce the results of the paper with the information presented in the paper, and I work in this field.
It starts even with the dataset. The first contribution mentioned in the paper is
We formally define the task of substrate-specified enzyme generation and present a curated dataset. This dataset consists of the substrate-enzyme pairs that are extracted from the known enzymes. We further partition it into training and test subsets without overlap in terms of proteins and small molecules to secure the zero-shot setting.
However, I cannot find any dataset anywhere in the paper, nor enough instructions for how to re-create it. In the code, there are functions called dataset.py and dataset_3D.py but they all point to files with data which is not present in the repository (I would also like to point out that the repo is not anonymized very well, for future reference to the authors, as all the path names include usernames and the like).
Preprocessing and standardization details are also lacking. Then, the rest of the code is not something which is easy to run, there are many paths and commented out lines in the code, and it is not easily readable nor possible to know what one is doing without significant effort. There is no README, and little to no documentation for most functions. IMO it is not of sufficient quality for publication yet, even though the idea and preliminary results are nice. Both the paper, experiments, and code could use refinement.
Please let me know if I am missing something. I would happily change my score if I have missed some critical points that make the work suddenly reproducible. It is not entirely impossible that I missed something, but I've spent some time staring at this paper and the code and I simply feel it is vague and even leaves out critical reproducibility details.
Thank you for your suggestion. We would like to introduce the changes we made to the code repo and the manuscript.
W1 - More documentation. We wrote a new README with instructions in the anonymous repository.
Thank you for the comment. We have updated the README with instructions, and more code for data construction has been added. Directly running the code reproduces the results reported in the paper. If there are any specific concerns or unclear parts, we are happy to clarify.
W2 - Make changes in the paper. We highlighted the changes in red font.
Thank you for the comment. We have incorporated the changes into the paper and highlighted them in red font. The changes are mainly in the Experiments section, where the results from this forum have been added.
Thank you for the updates to the code. It is a big improvement from the previous version.
I have increased my score, though I recommend still doing some final polishing of the repo and the accompanying documentation (there are a lot of typos, which can reduce the trust potential users may have in the code).
Thank you for your time and interest. We are currently revising the code repository. The documentation is not yet finalized, and we are working to make it easier to use and read. Thank you.
Q1 - For the same substrate, how different are the proteins generated via the different benchmarked methods? We compared them in terms of sequence and structural similarity.
Thank you for the comment. In Section 4.6, CASE STUDY TARGETING METHYLPHOSPHONATE(1-), proteins are generated by different models, all targeting the same substrate, methylphosphonate(1-). We will compare their sequence and structure similarity.
- Sequence similarity.
  - Metric: To directly compare the proteins generated by the baseline methods, we align their sequences and calculate the identity, which is the ratio of positions with identical amino acids.
  - Result: Generally, an identity lower than 30% indicates no significant similarity in sequence. The sequence identities (in percent) between the proteins targeting methylphosphonate(1-) are shown in the table below.

|  | ProtGPT2 | ProGen2 | ZymCTRL | NOS | LigandMPNN | SENZ |
|---|---|---|---|---|---|---|
| ProtGPT2 | - | | | | | |
| ProGen2 | 21.86 | - | | | | |
| ZymCTRL | 21.03 | 19.13 | - | | | |
| NOS | 20.18 | 20.61 | 16.44 | - | | |
| LigandMPNN | 20.23 | 21.65 | 20.68 | 21.81 | - | |
| SENZ | 20.00 | 20.16 | 15.63 | 21.82 | 18.76 | - |
- Structure similarity.
  - Metric: For structural comparison, we use TM-align to compare the 3D protein structures. TM-align aligns the Cα atoms of protein backbones to find the best structural match, even when sequence identity is low.
  - Result: The TM-scores between each pair of generated enzymes targeting methylphosphonate(1-) are provided in the table below. For the value in row A, column B, the TM-score is normalized by the length of protein A. TM-scores range from 0 to 1, where a score between 0.0 and 0.30 typically indicates no structural relationship.

|  | ProtGPT2 | ProGen2 | ZymCTRL | NOS | LigandMPNN | SENZ |
|---|---|---|---|---|---|---|
| ProtGPT2 | - | 0.23789 | 0.12233 | 0.12550 | 0.18619 | 0.12544 |
| ProGen2 | 0.22836 | - | 0.11656 | 0.13564 | 0.18303 | 0.13386 |
| ZymCTRL | 0.26331 | 0.25968 | - | 0.26866 | 0.26289 | 0.33121 |
| NOS | 0.20648 | 0.22723 | 0.20648 | - | 0.24652 | 0.23912 |
| LigandMPNN | 0.20929 | 0.21419 | 0.14044 | 0.17088 | - | 0.15116 |
| SENZ | 0.21737 | 0.24737 | 0.27572 | 0.25447 | 0.23319 | - |
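For reference, such TM-scores can be obtained from the TM-align executable and parsed as in the sketch below; the file names are hypothetical, and `TMalign` is assumed to be on the PATH.

```python
import re
import subprocess

def tm_score(pdb_a: str, pdb_b: str) -> float:
    """TM-score of pdb_a vs. pdb_b, normalised by the length of the first structure."""
    out = subprocess.run(
        ["TMalign", pdb_a, pdb_b], capture_output=True, text=True, check=True
    ).stdout
    # TM-align reports two scores; the first is normalised by the length of Chain_1.
    scores = re.findall(r"TM-score=\s*([0-9.]+)", out)
    return float(scores[0])

# Example with hypothetical file names:
# print(tm_score("SENZ_methylphosphonate.pdb", "ZymCTRL_methylphosphonate.pdb"))
```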
Q2 - How to handle potential post-translational modifications that could lead to improved/decreased enzyme activity relative to the reference enzyme? It is handled indirectly by the oracle discriminator.
Thank you for the insightful question. Post-translational modifications (PTMs), such as phosphorylation, can introduce structural changes and further affect stability. Currently, there are no specific predictors for the effects of PTMs on enzymatic kinetics. As a result, we rely on an available oracle that has been trained on the substrate-enzyme relationship dataset, which indirectly accounts for the effects of PTMs on enzyme activity.
Q3 - How to incorporate currently unused information in future work? The 2D structure of the substrate is already handled, and the full reaction can be encoded in the same way.
Thank you for the comment. We will describe how we leverage the 2D structure of the substrate and how future versions of the model can take advantage of more information.
- We already leverage the 2D structure of the substrate.
- Directly: The 2D structure of the substrate is embedded using a graph neural network in the generator.
- Indirectly: It is implicitly considered in the retrieval process, as the Morgan fingerprint captures some aspects of local 2D structure (see the sketch after this list).
- Integrate information from the full reaction in the future.
- Retrieval by reaction: For future work, we plan to design a more sophisticated retrieval strategy that retrieves enzymes based not only on the substrate but also on all molecules in the reaction, including products and intermediates.
- Full reaction encoding: The current substrate encoder can also encode the other molecules of the full reaction.
These enhancements can be integrated directly into the current SENZ model to improve data efficiency and overall performance.
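As referenced above, the sketch below illustrates the fingerprint-based substrate similarity that underlies the retrieval step; the fingerprint radius, bit-vector size, and the toy database are illustrative assumptions rather than the exact settings used in SENZ.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, radius: int = 2, n_bits: int = 2048):
    """Morgan (ECFP-like) bit-vector fingerprint for a substrate SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    return DataStructs.TanimotoSimilarity(morgan_fp(smiles_a), morgan_fp(smiles_b))

# Rank database substrates by similarity to the target and keep the enzymes of the
# closest ones as the retrieved references (database structure here is hypothetical).
target = "[NH3+][C@@H](CCC([O-])=O)C([O-])=O"                   # L-glutamate
database = {"C(CC(=O)O)C(=O)O": ["MKT..."], "CCO": ["MSA..."]}   # SMILES -> enzyme sequences
ranked = sorted(database, key=lambda s: tanimoto(target, s), reverse=True)
```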
Q4 - What are the failure modes of SENZ? Invalid SMILES input.
Thank you for the comment. One of the primary failure modes of SENZ arises from invalid SMILES (simplified molecular-input line-entry system) input. The substrate's SMILES representation must be valid to ensure meaningful enzyme generation. If the user lacks knowledge of SMILES and inputs a string that does not correspond to a real molecule, SENZ may produce enzymes that have no functional or structural relevance.
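A simple guard against this failure mode, sketched below with RDKit, is to reject inputs that cannot be parsed before generation starts; this is an illustrative check, not code from the SENZ repository.

```python
from rdkit import Chem

def check_substrate_smiles(smiles: str) -> None:
    """Raise before generation if RDKit cannot parse the substrate SMILES."""
    if Chem.MolFromSmiles(smiles) is None:
        raise ValueError(f"Invalid SMILES string, refusing to generate: {smiles!r}")

check_substrate_smiles("CP(=O)([O-])O")   # methylphosphonate(1-): parses fine
# check_substrate_smiles("C((O")          # malformed input: would raise ValueError
```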
Dear Area Chair and Reviewers,
We sincerely thank you for your thoughtful and thorough evaluation of our paper. Below are our detailed responses to your comments:
W1 - Docking score and kcat are insufficient metrics, and rediscovery tasks are recommended. kcat is important for enzyme generation, and exactly recovering a known protein is both unlikely and unnecessary.
Thank you for the comment. Regarding the proposed rediscovery task, we acknowledge its potential value. We will discuss why the kcat metric is essential and what role a rediscovery task plays in our setting.
- kcat evaluates catalytic capability and is thus crucial for the enzyme generation task.
- Definition: The catalytic constant, kcat, measures the maximum number of substrate molecules an enzyme can convert into product per unit time under substrate saturation. It reflects an enzyme's turnover number and efficiency in catalyzing reactions.
- Relation to enzyme generation: kcat assesses how efficiently model-generated enzymes catalyze the target substrate, and thus the underlying model's performance.
- The rediscovery task has a low likelihood of a hit and is unnecessary.
- Low likelihood of a hit: This is because of the vastness of protein space. For instance, a protein sequence of 1000 amino acids has a theoretical space of $20^{1000}$ possible combinations. Thus, the chance of precisely replicating a small set of known enzymes for a given substrate is minimal.
- Unnecessary: Functional enzymes are often not confined to a single sequence, as variations at non-critical positions typically do not significantly impact performance. Therefore, a valid enzyme is not necessarily restricted to a few known solutions.
W2 - It was not fully clear how SENZ sets itself apart from existing work in enzyme design. SENZ differs in its task setting and in how it leverages information.
Thank you for the comment. We will introduce the novelty of SENZ from the perspective of task setting and information utilization.
- Innovation in task setting.
- No need for protein-related information: For instance, ZymCTRL [1] is a GPT-2-based model that generates enzymes using the Enzyme Commission (EC) number as a prompt. In contrast, SENZ uses only the target substrate as input, eliminating the reliance on prior knowledge such as EC numbers. This allows SENZ to design enzymes without predefined enzymatic information.
- Specified for enzymes with catalyzing capability, not binding: Other conditional generation models, such as LigandMPNN [2], focus on designing binders for small molecules or antibody-antigen interactions. These models aim to create proteins that bind tightly to their targets rather than accelerating a specific biochemical reaction. SENZ, however, directly addresses the enzyme design problem, where the objective is to catalyze reactions rather than simply bind substrates.
- Innovation in leveraging information.
- Retrieve MSA from substrates: Most protein models collect multiple sequence alignments (MSAs) based on protein sequence similarity, typically by searching for sequences similar to a target protein. SENZ, however, constructs MSAs based on substrate similarity, offering a unique perspective in modeling enzyme activity.
- Guidance to model catalytic relationship: SENZ integrates discriminative guidance models during the training phase to describe the catalytic relationship to its generative diffusion model, as existing works primarily apply guidance only during the inference stage of diffusion models.
[1] Geraldene Munsamy, Sebastian Lindner, Philipp Lorenz, and Noelia Ferruz. Zymctrl: a conditional language model for the controllable generation of artificial enzymes. In Machine Learning for Structural Biology Workshop. NeurIPS 2022, 2022.
[2] Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, Linna An, Ivan Anishchenko, Cameron Glasscock, and David Baker. Atomic context-conditioned protein sequence design using ligandmpnn. Biorxiv, pp. 2023–12, 2023.
W3 - No code made available. Anonymize repo at https://anonymous.4open.science/r/SENZ-2BE1/
Thank you for bringing this to our attention. We have uploaded the code to an Anonymous GitHub repository, which can be accessed using the following link: https://anonymous.4open.science/r/SENZ-2BE1/.
We also highlighted it in the Reproducibility Statement section between the Conclusion and References in the revised manuscript.
W4 - Minor grammatical mistakes. We refined the manuscript.
Thank you for highlighting this issue. We have thoroughly reviewed the entire paper again and corrected several grammatical errors with the assistance of a grammar checker. We appreciate your suggestion.
Thank you for your patience. We will introduce the anonymized code repo.
- Dataset construction: We updated the README with more detailed dataset-construction instructions. Data preprocessing and data preparation for training are also documented there.

  The dataset split is in `enzyme_dataset/full_RHEA_finetune_ProSmith/split/split30_ECnumber_fair_val_test0094400943/`. Retrieved MSAs are in `enzyme_dataset_new/MSA/coldDB/`.

  Get raw data
  - `cd enzyme_dataset` (the working directory for the raw data).
  - Download the index map of all enzymes: `wget https://ftp.expasy.org/databases/rhea/tsv/rhea2uniprot%5Fsprot.tsv`. This yields `./rhea2uniprot_sprot.tsv`.
  - Run `get_uniport.py` to download FASTA files for the sequences from UniProt into `./data/`. Each FASTA file in `./data/` contains a single sequence, and the file name is the sequence ID.
  - Run `rhea-seq_new.ipynb` to merge all sequences with the reaction IDs in `./rhea2uniprot_sprot.tsv`. This yields `./full_seq_from_rhea2uniprot_sprot_tsv.csv`.
  - Get the raw EC file (used for baselines only): `wget https://ftp.expasy.org/databases/rhea/tsv/rhea2ec.tsv`.
  - Get the reaction SMILES: `wget https://ftp.expasy.org/databases/rhea/tsv/rhea-reaction-smiles.tsv`.
  - Get the substrate of each reaction with `./seq_smiles.ipynb`. This yields `sub_from_rhea-reaction-smiles_tsv.csv`.

  Perform the data split
  - `mkdir full_RHEA_finetune_ProSmith`, then `cd full_RHEA_finetune_ProSmith`; `enzyme_dataset/full_RHEA_finetune_ProSmith` is the working directory for the data split.
  - Run `enzyme_dataset/full_RHEA_finetune_ProSmith/finetune_mmseq.ipynb` to get `fintune_dataset.csv`, which consists of substrate-enzyme pairs.
  - Run `finetune_mmseq.ipynb` to get `finetune_all.fasta`, which contains all enzyme sequences.
  - Run `mmseqs easy-cluster finetune_all.fasta full_finetune_mmseq_30/full_finetune_mmseq_30 tmp --min-seq-id 0.3 -c 0.8 --cov-mode 1`. This yields `enzyme_dataset/finetune_mmseq/finetune_mmseq_30/finetune_all_cluster.tsv` and `enzyme_dataset/finetune_mmseq/finetune_mmseq_30/full_finetune_mmseq_30_all_seqs.fasta`, i.e., the sequence clusters at 30% identity.
  - Run `./full_RHEA_finetune_ProSmith/gen_full_finetune_fair_val_test.ipynb` to generate negative samples and the split in `./full_RHEA_finetune_ProSmith/split/split30_ECnumber/`. (This folder will be renamed later.)

  Retrieval in advance
  - Run `enzyme_dataset_new/makeMSAfinal_groups.py` to perform enzyme retrieval for all substrates. The results are in `enzyme_dataset_new/MSAgroup/coldDB/`, one FASTA file per substrate containing the retrieved sequences. `enzyme_dataset_new/retrieve.ipynb` is an example of the retrieved similar substrates.
  - Run `enzyme_dataset_new/makeMSA.py` to build MSAs from the retrieved sequences in `enzyme_dataset_new/MSA/coldDB`.
- Training and sampling: We also cover this part in the README. The main functions are in these three files.
  - Guidance function training: `mcgp/ProSmith_finetune_DDP_batch_neg.py`.
  - Generator training: `mcgp/TarDiff2loss_DDP_strict_step.py`.
  - Sampling: `mcgp/TarDiff_test_all_param.py`.
Dear reviewers,
Tomorrow (Nov 26) is the last day for asking questions to the authors. With this in mind, if you have not already done so, please read the rebuttal provided by the authors earlier today, as well as the other reviews. Then please explicitly acknowledge that you have read the rebuttal and reviews, provide your updated view accompanied by a motivation, and raise any outstanding questions for the authors.
Timeline: As a reminder, the review timeline is as follows:
- November 26: Last day for reviewers to ask questions to authors.
- November 27: Last day for authors to respond to reviewers.
- November 28 - December 10: Reviewer and area chair discussion phase.
Thank you again for your hard work, Your AC
This paper received a mix of recommendations of borderline above and below the acceptance threshold. As positives, reviewers highlight the review of prior work on protein design and the contribution of providing a substrate-enzyme relationship dataset as part of the work. The authors have shared anonymized code with the reviewers to ensure reproducibility of their work and among other things the creation of the dataset as provided in the work. This has increased the score of one reviewer from a 5 to a 6.
Several reviewers have shared concerns about the metrics used to assess the enzyme designs, in particular about these metrics being insufficient, unreliable, or easy to trick. Furthermore, one reviewer raised concerns about the foldability of the generated designs. Unfortunately, the rebuttal and the author-reviewer discussions do not seem to have sufficiently addressed these concerns, and I therefore recommend rejecting this paper.
Additional Comments on Reviewer Discussion
See above.
Reject