Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Generation
We propose FoldFlow++, a new sequence-conditioned protein structure generative model based on flow matching that can be fine-tuned for motif scaffolding.
Abstract
Reviews and Discussion
The authors proposed FOLDFLOW++, which is built on top of FOLDFLOW [ICLR 2024]. It adds a joint structure and sequence representation and a transformer-based geometric decoder, enabling folding and inpainting applications.
Strengths
The tasks the authors are attempting to solve seem very interesting and important for drug discovery.
Weaknesses
Considering that the theoretical novelty of the paper is somewhat limited and its main contribution lies in introducing certain architectures, the experiments conducted for the new tasks (other than unconditional generation) are also somewhat limited. Please note that I am not very familiar with the topic and do not know what potential experiments could be included.
Questions
- I was wondering why it is not possible to calculate diversity and novelty for other tasks.
- It is not clear what the main component is that helps the model improve in terms of diversity and novelty compared to FOLDFLOW on the unconditional task.
- Can this be used for inverse folding as well?
- I was wondering if the inpainting task is not similar to seed optimization or editing tasks, for example, LaMBO-2 [NeurIPS 2023] or GFNSeqEditor [GenBio 2023]. If so, why can one not compare with those kinds of models?
- What are the oracles to evaluate the generated samples?
--- after rebuttal ---
I have read the authors' responses to my questions. They have addressed most of my concerns, and I believe the paper has merit to be accepted at NeurIPS.
Limitations
I think the model's performance might depend on various components such as ESM-2 and the structure encoder.
We thank the reviewer for their time and effort in reviewing our paper. We are heartened to hear that the reviewer views FoldFlow++ as tackling a problem that is “very interesting and important for drug discovery”, which was our primary aim with this new state-of-the-art protein structure generative model. We next answer all the important questions raised by the reviewer, while new experiments are presented in the global response to all reviewers.
the experiments conducted for the new tasks (other than unconditional generation) are also somewhat limited.
We appreciate the reviewer's concern regarding the completeness of our experimental protocol beyond unconditional generation. We would like to politely push back against this characterization as we test FoldFlow++ on a large suite of conditional generation tasks including motif-scaffolding, protein folding, and zero-shot equilibrium conformation sampling.
We highlight that the latter two tasks are included to showcase FoldFlow++ on tasks for which it was not originally trained. For instance, FoldFlow++ is not trained on any dynamics data yet is able to sample protein conformations at the level of ESMFlow—a model built for this task.
For motif-scaffolding we found that FoldFlow++ saturates the performance of the previous benchmark and we strived to introduce an even more challenging VHH nanobody task. In our 1pg PDF, we have included a quantitative diversity measurement of our motif-scaffolding designs and provided visual samples of the distribution of generated structures conditioned on a fixed motif. We believe that including such a challenging benchmark makes our conclusions more robust for the motif-scaffolding problem—which is perhaps the most relevant task considered in this paper for computational drug design.
I was wondering why it is not possible to calculate diversity and novelty for other tasks.
We thank the reviewer for this suggestion which we implement as additional results included in the 1 pg PDF. We report diversity of generated samples for the motif-scaffolding task as well as visual generations from the conditional distribution. We find that FoldFlow++ again outperforms RFDiffusion in terms of scaffold diversity. We encourage the reviewer to kindly read our global response for more details.
It is not clear what the main component is that helps the model improve in terms of diversity and novelty…
The reviewer raises an important question. There are several distinct features of FoldFlow++ that differentiate it from the original FoldFlow in terms of the improved diversity and novelty of generated structures. The most important is the use of sequence information through the large pretrained protein language model ESM2. We note that ESM2 is trained on evolutionary-scale protein sequence data, a corpus significantly larger than the PDB. Furthermore, it is well established in biology that a protein's sequence heavily determines its structure, and we expect that using this biological inductive bias aids structure generation. We note that MultiFlow further showed that using sequence information—albeit without a large pre-trained LM—aids generated structure diversity and novelty. We conducted an ablation study on the architecture of FF++ in Table 14, showing the effect of adding sequence embeddings even for unconditional generation.
We hope that this fully clarifies the question raised by the reviewer and we have included more discussion on this aspect in the updated version of the manuscript.
Can this be used for inverse folding as well?
FoldFlow++ is a sequence-conditioned protein structure generative model. Consequently, it can only generate structures, not sequences, and thus cannot perform inverse folding.
I was wondering if the inpainting task is not similar to seed optimization or editing tasks…
That's an interesting question! Inpainting in FoldFlow++ happens in structure space, in the sense that we provide the model with a partial sequence and structure and generate the missing structure. The editing approaches suggested by the reviewer operate on the (output) space of protein sequences—i.e. they generate amino acid sequences. Consequently, we restrict our comparisons of FoldFlow++ to SOTA structure-based models, as there is no fair way to compare to sequence-based models without performing folding/inverse-folding using a third-party model, which may itself introduce further error.
What are the oracles to evaluate the generated samples?
In-silico evaluation of structure models is an important aspect of understanding the capabilities and limitations of current SOTA structure generative models. Our evaluation pipeline closely follows the literature, and the exact schematic, including oracles, is depicted in Figure 8 of the appendix. We recall that the unconditional evaluation consists of inverse folding the generated designs to obtain protein sequences, then refolding with an oracle folding model; in our paper we use ESMFold. Among the metrics we compute is the RMSD between FF++ designs and their ESMFold-refolded counterparts, where we define generated proteins that achieve RMSD < 2 Å as designable. On designable proteins we further measure diversity and novelty using a function of the pairwise TM-score (exact details in Appendix B.5).
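To make this pipeline concrete, below is a minimal sketch of the unconditional evaluation loop. The callables `inverse_fold`, `refold`, `rmsd`, and `tm_score` are placeholders for the real tools (e.g. ProteinMPNN, ESMFold, TM-align), and the settings shown are illustrative rather than our exact configuration.

```python
import itertools
from statistics import mean

def evaluate_unconditional(designs, inverse_fold, refold, rmsd, tm_score,
                           n_seqs=8, cutoff=2.0):
    """Return (designability fraction, mean pairwise TM-score among designable)."""
    designable = []
    for backbone in designs:
        seqs = inverse_fold(backbone, n=n_seqs)              # design -> sequences
        best_rmsd = min(rmsd(backbone, refold(s)) for s in seqs)
        if best_rmsd < cutoff:                               # scRMSD < 2 Angstroms
            designable.append(backbone)
    pairs = list(itertools.combinations(designable, 2))
    mean_tm = mean(tm_score(a, b) for a, b in pairs) if pairs else 1.0
    return len(designable) / len(designs), mean_tm
```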
For the motif scaffolding evaluation, we fix the amino acids corresponding to the motifs. In this case we also look at the RMSD between motifs, scaffolds, and the entire proteins. We will make the description of our evaluation pipeline clearer in the updated version of the manuscript.
Closing comment
We once again appreciate your time and effort in this rebuttal period. If the reviewer deems our responses detailed enough and satisfactory we encourage the reviewer to potentially consider a fresher evaluation of our paper with these responses in context and potentially upgrade their score.
Dear reviewer,
We are very grateful for your thorough review of our paper, which allowed us to provide additional clarifications and experiments in the rebuttal on the important points raised---including new diversity metrics on the motif scaffolding task as well as generated samples. We hope our rebuttal and the global response have allowed the reviewer to clear any remaining doubts about our paper; if not, we would love to engage further in the time remaining (< 24 hours) before the rebuttal period closes. Please note that our rebuttal strived to highlight what we consider the main contributions of FoldFlow++, which include the new architecture, masked training procedure, and conditional generation tasks.
We again appreciate the reviewer's time and would love to answer any further questions. We would also kindly request the reviewer to potentially consider updating their score if our rebuttal and global response have succeeded in addressing all the great points raised in the review.
The paper proposes a new model, FOLDFLOW++, for conditional protein backbone generation. It incorporates several techniques, including a sequence model, finetuning strategies, and high-quality synthetic structures, to improve performance on various tasks. The experimental results suggest the method achieves SOTA performance on various protein-related generation tasks.
Strengths
- The paper is well-written, and the method can be used in many real-world scenarios.
- The proposed method achieves SOTA performance on unconditional generation, motif scaffolding, folding, fine-tuning to improve secondary structure diversity, and equilibrium conformation sampling from molecular dynamics trajectories.
- Incorporating the sequence embedding and ReFT is a reasonable way to improve the base model's performance.
Weaknesses
- This paper has limited technical novelty. The core components are mostly proposed by previous works.
- Fusing a well-trained sequence model (ESM) into the backbone generation method (FoldFlow) can intuitively improve structure generation [1]. Therefore, we do not see insightful discussion or surprising conclusions in the paper.
- Some baselines concerning MD may be missing: EigenFold [2], Str2Str [3], ConfDiff [4].
[1] A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design
[2] EigenFold: Generative Protein Structure Prediction with Diffusion Models
[3] Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling
[4] Protein Conformation Generation via Force-Guided SE(3) Diffusion Models
Questions
- Is the model framework the same as FoldFlow if the sequence input is fully masked?
- Can the model be compared with AlphaFold2/3?
Limitations
Yes
We would like to thank the reviewer for their detailed feedback and constructive comments, which allowed us to significantly improve our submission with new experiments and results that can be found in our global response. In addition, we appreciate that the reviewer found our paper to be “well-written” and that FoldFlow++ may be employed in “many real-world scenarios”. We also thank the reviewer for agreeing that FoldFlow++ achieves “sota performance” on a variety of tasks by leveraging key algorithmic and architectural components such as ReFT and a protein language model. We now address the main concerns raised by the reviewer.
This paper has limited technical novelty…
We value the reviewer's opinion on the novelty of FoldFlow++. We kindly invite the reviewer to read our global response, which summarizes our main contributions and details the principal innovations FoldFlow++ introduces at the architectural, algorithmic, and task levels.
- Architectural novelty: FoldFlow++ is the first protein structure generative model that successfully integrates a large pre-trained protein sequence language model, ESM2, at a much larger scale than prior sequence-aware models.
- Algorithmic novelty: FoldFlow++ introduces Reinforced Finetuning (ReFT) for increasing secondary structure diversity, as well as a new sequence-masking flow matching training procedure that unlocks motif-scaffolding applications.
- Task novelty: FoldFlow++ is applied to important downstream tasks beyond unconditional generation, such as motif-scaffolding, and we introduce a new, more challenging VHH dataset since FoldFlow++ saturates the previous benchmark.
Fusing the well-trained sequence model (ESM) into the backbone generation method … we cannot see the insightful discussion and surprising conclusion from the paper.
We appreciate the reviewer's concern regarding the unsurprising improvement to protein structure generation from incorporating embeddings from a protein language model. We would like to politely disagree. We note that prior to FoldFlow++ only MultiFlow and Chroma explored incorporating sequence information—without even using a pre-trained language model—to aid protein structure generation, and they were unable to achieve significant performance improvements (see Table 1). We argue FoldFlow++ is the first structure generative model able to leverage this biological inductive bias and show measurable improvement over structure-only models such as RFDiffusion, FrameFlow, and FrameDiff (see Tables 1-4).
As a result, we argue it remained an open research question to what extent sequence embeddings help structure generation, as ESM2 is pre-trained on a much larger sequence dataset that is drastically different from the PDB.
Moreover, unlike prior models that use sequence, FoldFlow++ uses masked training, which enables us to frame actual drug design applications as conditional generative modeling problems. For example, given a known motif sequence where the scaffold is masked, we can generate the 3D structure of the scaffold directly. More precisely, we can train for this task, as opposed to MultiFlow, which only performs it at inference. Finally, we note that a surprising finding of FoldFlow++ is that we did not require pretrained folding model weights as employed in RFDiffusion; instead, through masked training, FoldFlow++ learns how to perform folding (Table 4).
Some baselines concerning MD may be missing: EIGENFOLD[2], STR2STR[3],CONFDIFF[4].
We have added comparisons on MD tasks for Eigenfold and Str2Str methods in the 1pg PDF. We do not compare to ConfDiff as it does not yet have publicly available code to our knowledge, but have also added discussion of these related works in section 4.5 (conformation sampling). Overall we find FoldFlow++ slightly outperforms EigenFold (on 3/4 performance metrics) and performs comparably to Str2Str (2/4 performance metrics) as shown in Table R1 of the attached PDF. We note that FoldFlow++ is not finetuned for sampling yet performs competitively to the suggested baselines which are more purpose-built for this task. In addition, we are excited by the possibilities of combining FoldFlow++ and improved inference methods for conformation generation as developed in Str2Str and ConfDiff in future work.
Eval Protocol: We evaluated both Eigenfold and Str2Str using the setting in section 4.5. For both models we use the default parameters and model in the public code.
Is the model framework the same as FoldFlow If the sequence input is fully masked?
That's a great question. The model framework of FoldFlow++ is not the same as the original FoldFlow model even if the sequence is fully masked. We give a more detailed account of the architectural differences in our global response, but, in summary, FoldFlow++ uses the EvoFormer block where FoldFlow does not, and it has an inductive bias from sequence embeddings, which improves generation (even for unconditional generation with fully masked sequence input). However, the reviewer's intuition is correct in that the loss function—i.e. flow matching over $\mathrm{SE}(3)^N$—used in FoldFlow++ is the same as in FoldFlow when the sequence is fully masked.
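For concreteness, this shared objective takes the standard conditional flow matching form (notation ours and schematic; the paper's exact formulation, time schedule, and Riemannian metric may differ):

```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_{\mathrm{data}}}
\left\| v_\theta\big(t, x_t \mid \mathrm{seq}\big) - u_t\big(x_t \mid x_0, x_1\big) \right\|_{\mathrm{SE}(3)^N}^{2}
```

With a fully masked sequence, the conditioning carries no information, so the regression target coincides with FoldFlow's.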
Can the model be compared with AlphaFold2/3?
FoldFlow++ and AlphaFold2/3 are different classes of models that are not directly comparable aside from protein folding. On the folding task FoldFlow++ underperforms ESMFold, which itself underperforms AlphaFold2/3 as they utilize multiple sequence alignment inputs. AlphaFold2/3 is not designed for our other tasks such as unconditional structure generation.
Closing comment
We thank the reviewer again for their review and detailed comments that helped strengthen the paper. We believe we have answered to the best of our ability all the great questions raised by the reviewer. We hope our answer here and in the global response allows the reviewer to consider potentially upgrading their score if they see fit. We are also more than happy to answer any further questions.
Dear reviewer,
We are very appreciative of your time and constructive comments. As the end of the rebuttal period is fast approaching we would like to have the opportunity to answer any lingering questions or doubts that may remain. We would like to note that in our rebuttal we followed your great suggestions and included new baselines for the zero-shot MD experiments. We also tried to highlight in both our global response and the rebuttal response the main technical novelty introduced in FoldFlow across architectural, training, and task novelty.
We would be happy to engage in any further discussion on these points or any other salient points that the reviewer finds important, please let us know! We thank the reviewer again for their time and if the reviewer finds our rebuttal and new experimental findings satisfactory we also would appreciate it if the reviewer could potentially consider revising their assessment of our paper.
The paper presents FoldFlow++, a protein generative model that augments FoldFlow with protein language model embeddings. The model is trained with sequence and structure information to learn embedding projections in SE(3) space. Experiments on unconditional generation show favorable performance of FoldFlow++ over the SOTA method RFDiffusion. FoldFlow++ can be aligned to arbitrary rewards, such as secondary structure diversity, through Reinforced Finetuning, and is also capable of motif scaffolding and conformation sampling.
Strengths
Originality
The paper is an excellent piece of work implementing a protein language model embedding-guided protein flow matching model. The design of the network is reasonable and novel. Training with sequence masking half of the time introduces the capabilities of protein folding and design at the same time. Overall the model is carefully designed and shows wonderful protein generative modeling potential.
Quality
The submission is technically sound, with much of the mathematical foundations explained in the previous FoldFlow paper. Various aspects of protein generative models are tested, e.g. unconditional sampling, protein folding, motif scaffolding, conformation sampling.
Clarity
The paper is easy to comprehend and figures are well-designed and clear.
Significance
This work integrates a protein language model into a protein flow matching framework to make its protein modeling and design abilities more versatile. It can perform various kinds of protein design tasks and has great potential as a foundational model for protein researchers.
Weaknesses
- What is the diversity of structures for the motif scaffolding task? Please include some visualization of generated structures for motif scaffolding benchmarks and statistical results.
- AlphaFold2 also has conformation sampling ability. Did you benchmark it?
Questions
- Are the results reported for the model trained with the synthetic dataset or the PDB dataset?
- Could you explain how the loss function is implemented in the code?
- For the protein folding and inpainting tests, what kind of noise is the structure input?
- How are the AF2 high-confidence structures further distilled?
Limitations
Although FoldFlow++ has various types of protein generation capabilities, performance on some of the tasks like conformation sampling and protein folding is not impressive, though I believe further finetuning on more specific datasets can benefit the model on this.
We thank the reviewer for their enthusiastic review and positive appraisal of our work! We are heartened to hear that the reviewer found our paper to be an “excellent piece of work” with a “novel” architecture, and that the overall model shows “wonderful protein generative modeling potential”. We are also glad that the reviewer views our work as “technically sound” and the writing as “easy to comprehend” with well-designed figures. Finally, we are thrilled that the reviewer finds our work to have “great potential as a foundational model for protein researchers”. We now address the key clarification points raised in the review below.
What is the diversity of structures for the motif scaffolding task? Please include some visualization of generated structures for motif scaffolding benchmarks and statistical results.
We acknowledge the reviewer's comment regarding the diversity of structures for motif scaffolding. These additional experimental findings (along with RFDiffusion) are included in the 1pg global PDF, along with a discussion in the larger global response. Summarizing these findings here: FoldFlow++ has a larger diversity of designable structures in comparison to RFDiffusion, which is in line with expectations, as we find that FoldFlow++ produces much more diverse secondary structures—especially for unconditional generation of short proteins, as observed in Fig 3 in the main paper.
Alphafold2 also has conformation sampling ability. Did you benchmark it?
Throughout the manuscript, we compared against methods that do not rely on multiple sequence alignments (e.g. RFDiffusion, ESMFold). While MSA-based methods such as AlphaFold2 provide meaningful single and pair representations, they are computationally expensive as they require querying a database for each sequence. However, we do compare with finetuned AlphaFlow-MD, a model that improves on AlphaFold for MD tasks (see [1]).
Are the results reported for model trained with synthetic dataset or PDB dataset?
We understand this point was not sufficiently clear in the main text. For a fair comparison between models and methods, the main table results (Tables 1-4) are based on models trained only on PDB. The utility of synthetic data for improving diversity is presented in Figure 4e (main text) and in Appendix C (Figures 12 & 13 and Table 14). We note that improving diversity using synthetic data is complementary to using ReFT to finetune for diversity.
For protein folding and inpainting test, what kind of noise is the structure input?
The noise can be broken down by input modality, which consists of protein structures and protein sequences.
- Protein Structures: The noise is determined by the underlying manifold, which in the case of protein structures is the manifold $\mathrm{SE}(3)$ repeated across the $N$ residues, i.e. $\mathrm{SE}(3)^N$. Since $\mathrm{SE}(3)$ can be decomposed into rotations and translations, the noise can also be decomposed into separate noise for each rotation and translation component. For rotations on $\mathrm{SO}(3)$ we use the isotropic Gaussian distribution on $\mathrm{SO}(3)$ introduced in FrameDiff (Yim et al., 2023) and FoldFlow (Bose et al., 2023), while for translations we use the familiar Gaussian distribution on $\mathbb{R}^3$.
- Protein Sequences: Since sequences are discrete, noise corresponds to replacing amino acids in the sequence with a specialized mask token. The percentage of mask tokens in a sequence corresponds to the amount of noise added, with a fully masked sequence being pure noise and hence useful for unconditional structure generation. Partial masking of a sequence allows FoldFlow++ to perform inpainting tasks such as motif scaffolding.
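As a purely illustrative sketch of the structural noise source (our own code, not the paper's implementation; the IGSO(3) series is truncated and the variance convention may differ):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def igso3_angle_density(omega, eps, n_terms=200):
    """Truncated series for the IGSO(3) density over the rotation angle."""
    l = np.arange(n_terms)[:, None]
    series = ((2 * l + 1) * np.exp(-l * (l + 1) * eps ** 2)
              * np.sin((l + 0.5) * omega) / np.sin(omega / 2))
    return (1 - np.cos(omega)) / np.pi * series.sum(axis=0)

def sample_se3_noise(n_res, eps=1.0, rng=None):
    """Per-residue noise: IGSO(3) rotations and Gaussian R^3 translations."""
    rng = np.random.default_rng() if rng is None else rng
    trans = rng.standard_normal((n_res, 3))      # Gaussian on R^3
    grid = np.linspace(1e-3, np.pi, 1024)        # inverse-CDF sampling of the angle
    pdf = igso3_angle_density(grid, eps)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    angles = np.interp(rng.random(n_res), cdf, grid)
    axes = rng.standard_normal((n_res, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    rots = Rotation.from_rotvec(axes * angles[:, None])
    return rots, trans                           # elements of SE(3)^N
```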
How is the AF2 high-confidence structures further distilled?
We thank the reviewer for raising this important technical point, which may not have been sufficiently clear in the original manuscript. After filtering the AF2 structures for high confidence, we compute per-residue masks for the loss based on the pLDDT of each residue. Finally, we use a simple feature-based model to identify the remaining high-confidence but low-quality structures (see Figure 7 in the text for some examples). Section 3.2 in the main paper and Appendix B.1.2 give further details on the precise methodology for filtering the AF2 synthetic structures.
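Schematically, the filtering and masking step looks like the following (thresholds here are placeholders, not the values used in the paper; the learned quality filter runs afterwards):

```python
import numpy as np

def filter_af2_structures(structures, mean_plddt_min=80.0, res_plddt_min=70.0):
    """Keep high-confidence structures; mask low-pLDDT residues for the loss."""
    kept = []
    for s in structures:                   # s["plddt"]: per-residue confidences
        plddt = np.asarray(s["plddt"])
        if plddt.mean() < mean_plddt_min:  # drop low-confidence structures
            continue
        s["loss_mask"] = plddt >= res_plddt_min  # per-residue loss mask
        kept.append(s)
    return kept
```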
Closing comments
We hope that our responses were sufficient in clarifying all the great questions asked by the reviewer and we would love to engage further if the reviewer has any further comments. We thank the reviewer again for their time and we politely encourage the reviewer to consider updating their score if our responses in this rebuttal merit it.
References
[1] Jing, B., Berger, B., & Jaakkola, T. (2024). AlphaFold meets flow matching for generating protein ensembles. arXiv preprint arXiv:2402.04845.
Thank you for the careful rebuttal and I keep my recommendation for acceptance of this great work!
We thank the reviewer again for their time and positive endorsement of our work!
The paper introduces FoldFlow++, a sequence-conditioned SE(3)-equivariant flow matching model designed for protein structure generation. FoldFlow++ builds upon previous FoldFlow models by incorporating a protein language model to encode sequences, a multi-modal fusion trunk to integrate structure and sequence representations, and a geometric transformer-based decoder. The model is trained on a large dataset of both known proteins and high-quality synthetic structures, demonstrating substantial improvements over previous state-of-the-art models in terms of designability, diversity, and novelty. FoldFlow++ also excels in conditional design tasks, such as designing scaffolds for VHH nanobodies.
Strengths
- The paper includes a detailed ablation study examining architectural components, different flow matching schedules, and more.
- The model surpasses previous state-of-the-art generative models in terms of designability, diversity, and the novelty of generated protein structures.
- The authors propose a large and diverse dataset, including high-quality synthetic structures, which enhances the model's generalizability and robustness.
- The paper explores several meaningful settings, such as Reinforced FineTuning, Motif Scaffolding, and Zero-shot Equilibrium Conformation Sampling.
Weaknesses
- The proposed pipeline appears to be a special case of Multi-Flow.
- The model architecture remains very similar to previous work (e.g., Genie, FrameFlow/FrameDiff), leaving it unclear which specific parts of the algorithm drive the observed improvements.
Questions
- In section 3.1, it’s claimed that the invariant point attention encoder (IPA) is SE(3)-equivariant. Shouldn’t it be invariant?
- When the sequence is chosen to be masked in training, how is the mask applied? Is it applied to the input space or the embedding space (after ESM2)?
- The common embedding in ESM2 is a single representation. What is the pair representation in ESM2 mentioned in Section 3.1?
- In Section 3.1, it’s claimed that adding a skip-connection between the encoder and decoder is essential for good performance. Is there any ablation study regarding this point?
- In synthetic dataset processing, how is the “masking low confidence residue” handled? Is the masking applied to both structure and sequence? Does the model leave a position for the masked residue or just ignore it?
Limitations
N/A
We thank the reviewer for their time and detailed feedback, which gave us an opportunity to strengthen our manuscript with additional ablations and results. We are encouraged that the reviewer felt that our paper includes a “detailed ablation” study of the different architectural components that enable the model to surpass “previous state-of-the-art generative models” in terms of designability, diversity, and novelty of generated protein structures, and appreciated our use of synthetic structures to improve the model's “generalizability and robustness”. We further appreciate that the reviewer views our paper as exploring “several meaningful settings” such as Reinforced FineTuning, Zero-Shot Equilibrium Conformation Sampling, and Motif Scaffolding—the last of which we believe is an important biologically relevant task.
We now answer the key questions raised by the reviewer, while our global response contains detailed description of new experimental findings and shared concerns.
Proposed pipeline is a special case of MultiFlow
We appreciate the reviewer's concern that FoldFlow++ follows a similar approach to MultiFlow. We respectfully disagree with this assertion for several reasons, the most important of which is that MultiFlow and FoldFlow++ have key architectural as well as training differences that make the pipelines distinct. We outline these in detail in our global response, which we encourage the reviewer to kindly read.
The model architecture remains very similar to previous work
We acknowledge the reviewer's concern that the FoldFlow++ architecture may appear similar to past work. We would like to respectfully push back against this characterization. While our structure encoder is indeed the same network used in FrameFlow, FrameDiff, and FoldFlow, we enrich our representations with ESM2. The encoder representation is then processed with two Trunk blocks to learn more meaningful single and pair representations, a key element for obtaining more designable, diverse, and novel proteins. Specifically, FoldFlow++ uses EvoFormer blocks from ESMFold [1]---a module that is not used in the original FoldFlow or FrameDiff architectures. Because FoldFlow++ accepts both protein structures and sequences, it further requires intermediate multi-modal fusion blocks. We wish to highlight that neither FoldFlow, FrameFlow, nor FrameDiff can consume protein sequences as inputs, and hence the architecture of FoldFlow++ is substantially different. Finally, in Table 14 we empirically show through an ablation how each architectural component benefits the overall performance of FoldFlow++. In short, we found that enriching the structure embedding in the encoder with a sequence embedding leads to improved results, but the key architectural inclusion of the Folding blocks has the greatest impact on final performance.
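To summarize the data flow, here is a purely illustrative skeleton (module names, dimensions, and signatures are ours, not the actual implementation):

```python
import torch.nn as nn

class FoldFlowPPSketch(nn.Module):
    """Illustrative only: IPA encoder -> ESM2 fusion -> trunk -> decoder."""
    def __init__(self, struct_encoder, trunk, decoder,
                 d_single=384, d_pair=128, d_esm=1280, n_attn_maps=660):
        super().__init__()
        self.struct_encoder = struct_encoder          # IPA-based, as in FrameDiff
        self.trunk = trunk                            # EvoFormer-style blocks
        self.decoder = decoder                        # geometric transformer decoder
        self.proj_s = nn.Linear(d_esm, d_single)      # fuse ESM2 single rep
        self.proj_z = nn.Linear(n_attn_maps, d_pair)  # fuse ESM2 attention maps

    def forward(self, frames, t, esm_single, esm_attn):
        s_enc, z = self.struct_encoder(frames, t)     # structure-only features
        s = s_enc + self.proj_s(esm_single)           # multi-modal fusion
        z = z + self.proj_z(esm_attn)
        s, z = self.trunk(s, z)
        # Encoder-to-decoder skip connection (Section 3.1).
        return self.decoder(s + s_enc, z, frames)     # predicts the vector field
```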
In section 3.1, it’s claimed that the invariant point attention encoder (IPA) is SE(3)-equivariant. Shouldn’t it be invariant?
Yes, thank you for pointing this out. We have fixed it in the manuscript.
When the sequence is chosen to be masked in training, how is the mask applied? …
The mask is applied to the input space of ESM2, which matches how ESM2 was trained with the Masked Language Modeling objective. Note that by masking the sequence, FoldFlow++ is simultaneously trained for unconditional generation as well as for protein folding. This point may not have been sufficiently clear in the original manuscript, and we will update it with a more precise description of the masking process.
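A minimal sketch of this input-space masking (the mask probability schedule and handling of special tokens are simplified; fair-esm's alphabet provides the actual `<mask>` index):

```python
import torch

def mask_sequence(tokens: torch.Tensor, mask_idx: int, p: float) -> torch.Tensor:
    """Replace a fraction p of tokens with <mask> before the ESM2 forward pass.
    p = 1.0 yields unconditional generation; partial masking trains folding-
    and inpainting-style conditioning. (In practice BOS/EOS/padding positions
    should be excluded from masking.)"""
    noised = tokens.clone()
    noised[torch.rand(tokens.shape, device=tokens.device) < p] = mask_idx
    return noised
```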
… What is the pair representation in ESM2 mentioned in Section 3.1?
Following ESMFold [1], to define a pair representation from ESM2, we use the attention matrices from the protein language model. We will make this clearer in the updated manuscript.
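As an illustrative sketch using the public fair-esm API (the `need_head_weights=True` flag returns per-layer, per-head attention maps; stacking them as pair features follows the ESMFold recipe, while the exact projection used in FoldFlow++ is not shown):

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("prot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIE")])

with torch.no_grad():
    out = model(tokens, repr_layers=[33], need_head_weights=True)

single = out["representations"][33]  # (B, L+2, 1280) per-residue embedding
# attentions: (B, n_layers, n_heads, L+2, L+2); stack layer x head as channels
pair = out["attentions"].flatten(1, 2).permute(0, 2, 3, 1)  # (B, L+2, L+2, 660)
```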
In Section 3.1, it’s claimed that adding a skip-connection between the encoder and decoder is essential for good performance. Is there any ablation study regarding this point?
This finding is intended as an anecdote from our model development experience to help practitioners avoid some pitfalls we encountered when developing encoder-decoder flow matching models rather than an empirical fact. We didn’t conduct an ablation study of this particular choice since skip connections are extremely common in the ML literature, including flow matching. For example, the U-Net architecture commonly used in image flow matching contains skip connections from each “contracting” block to its equivalent “expanding” block. We found something analogous to be very useful.
In synthetic dataset processing, how is the “masking low confidence residue” handled? …
We appreciate the reviewer's question on a technical point that may have lacked clarity. The low-confidence residue masking is applied directly to the final loss. Operationally, this means zeroing out the loss from low-confidence residues, while not affecting those residues' structure or sequence inputs to the model. This prevents the loss for low-confidence residues from contributing to the model's weight update, while avoiding unexpected inputs to the model such as fragmented structures or sequences. This is standard practice when working with predicted structures, see e.g. [2]. We will incorporate a comment regarding low-confidence residue masking in our updated manuscript.
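Operationally, this amounts to something like the following (tensor names and shapes are ours for illustration):

```python
import torch

def masked_fm_loss(pred: torch.Tensor, target: torch.Tensor,
                   loss_mask: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, N, d) per-residue regression targets;
    loss_mask: (B, N) bool, False for low-pLDDT residues."""
    per_res = ((pred - target) ** 2).sum(dim=-1)  # (B, N)
    per_res = per_res * loss_mask                 # zero out low-confidence residues
    return per_res.sum() / loss_mask.sum().clamp(min=1)
```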
Closing comments
We thank the reviewer again for their valuable feedback and great questions. We hope that our rebuttal addresses their questions and concerns and we kindly ask the reviewer to consider upgrading their score if the reviewer is satisfied with our responses. We are also more than happy to answer any further questions that arise.
[1] Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science
[2] Ahdritz et al (2024). OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods.
We thank the reviewer again for their time and feedback, which allowed us to strengthen the paper with new experiments and clarifications during this important rebuttal period. As the end of the rebuttal period is fast approaching, we were wondering whether our answers were sufficient to address the important concerns raised regarding 1) the technical novelty that distinguishes our proposed model FoldFlow++ from MultiFlow, and 2) the main architectural and training schemes that differentiate it from past works. We highlight that our global response includes new ablations, including additional baselines and visualizations.
We would be happy to engage in any further discussion that the reviewer finds pertinent, please let us know! Finally, we are very appreciative of your time and effort in this rebuttal period and hope our answers are detailed enough for the reviewer to consider a fresher evaluation of our work with a potential score upgrade if it's merited.
We thank all reviewers for their time and thorough reviews. We are glad that the reviewers found that FoldFlow++ has high potential impact as a new SOTA with important applications in real-world scenarios (R VRjh, R ffQt), such as drug discovery (R pz4y). We are also grateful that reviewers appreciated the clarity of presentation, noting the paper is “well-written” with “well-designed and clear figures” (R VRjh, R ffQt). Finally, the reviewers agree that FoldFlow++ explores several biologically relevant protein tasks (R PUPL, R VRjh, R ffQt) with high-quality results on motif-scaffolding, protein folding, and zero-shot equilibrium conformation generation, and has detailed ablations of both new architectural (e.g. the ESM2 language model) and algorithmic (e.g. Reinforced Finetuning) components. We now address the main shared concerns of the reviewers.
Summary of new experiments and ablations
We are grateful for the reviewers' suggestions of additional experiments to enhance our empirical results. Please see our discussion below as well as the 1pg rebuttal PDF of the new results.
Diversity on Motif-Scaffolding (R VRjh, R pz4y)
We computed the same clustering diversity metric for motif scaffolding as for our unconditional generations (i.e., # of unique clusters in designable samples / # designable samples). We find that FoldFlow++ has excellent diversity in comparison to RFDiffusion, which is in line with expectations given FoldFlow++'s improved diversity in the unconditional results section. Furthermore, in Fig R1 of the 1 pg PDF we visualize the cluster representatives for some of the motif-scaffolding examples and qualitatively observe a high degree of diversity among the samples.
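For reference, this metric can be computed along the following lines (the clustering backend and cutoff here are illustrative; in practice dedicated structure-clustering tools are used):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_diversity(tm: np.ndarray, tm_cutoff: float = 0.5) -> float:
    """tm: symmetric pairwise TM-score matrix among designable samples.
    Returns (# unique clusters) / (# designable samples)."""
    dist = squareform(1.0 - tm, checks=False)  # TM-score -> distance
    labels = fcluster(linkage(dist, method="single"),
                      t=1.0 - tm_cutoff, criterion="distance")
    return len(np.unique(labels)) / tm.shape[0]
```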
Baselines for Equilibrium conformation sampling (R ffQt)
For our zero-shot equilibrium sampling experiments, we include baselines of Str2Str and EigenFold as suggested. As these methods are purpose-built for this task, they represent strong baselines. In Table R1, we find that FoldFlow++ outperforms Str2Str and EigenFold on pairwise RMSD and global RMSF, while being marginally worse than EigenFold on per-target RMSF. We also observe that Str2Str is the best method when measuring the PCA $W_2$ distance. We stress that FoldFlow++ achieves these competitive results without any task-specific training -- combining molecular dynamics methods with FoldFlow++ is beyond the scope of this paper but remains an exciting direction for future work.
Technical Novelty and Contributions of FoldFlow++(R PUPL, R ffQt, R pz4y)
We acknowledge the reviewers' concern that FoldFlow++ may appear similar to prior work such as the original FoldFlow and Multi-Flow. However, we believe that these similarities are only high-level and our paper makes a number of novel contributions, in particular across 3 different dimensions: model architecture, training and algorithms, and evaluation tasks.
Architectural novelty (R PUPL)
Firstly, we note that FoldFlow++ is a sequence-conditioned generative model, which separates it from structure-only generative models such as FoldFlow or RFDiffusion. The closest comparison is MultiFlow; however, we distinguish these models by noting that FoldFlow++ is the first backbone generation model to use and demonstrate the effectiveness of a large pre-trained protein language model (ESM2), while MultiFlow trains a discrete diffusion model over sequences. Consequently, FoldFlow++ requires substantially different architecture choices than MultiFlow -- or FoldFlow, RFDiffusion, FrameDiff, and similar models -- to support encoding and jointly representing the structure and sequence modalities.
Training and algorithmic novelty
We outline the key differences below. Our training procedure uses masking of protein sequences during flow matching training, which is a novel approach to structure generation and allows us to perform tasks such as motif scaffolding and folding in addition to unconditional generation. In contrast, the most similar model, MultiFlow, trains both a structural model and a sequence generative model (from scratch on PDB), and only does sequence masking at inference. As a result, the sequence component of MultiFlow is significantly less expressive as a conditioning signal than that of FoldFlow++. Indeed, in Table 4 of our paper we observe better protein folding performance for FoldFlow++ vs. MultiFlow, which suggests that masked training is a key training strategy.
Additionally, FoldFlow++ is trained on filtered synthetic data using a new and robust filtering strategy, which leads to a training set larger than those of previous structure generation models such as FoldFlow, MultiFlow, or RFDiffusion (without considering the RoseTTAFold model's pretraining). This filtering methodology represents a new advance in the use of synthetic data for training protein generation models.
Finally, we demonstrate that FoldFlow++ can be finetuned using a new Reinforced Finetuning (ReFT) strategy to align against scalar reward functions, in our case to increase secondary structure diversity, a current challenge for protein backbone generation models.
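At a high level, one iteration of such reward finetuning can be sketched as below (all names are ours and the `sample_with_logprob` API is assumed; the exact estimator and baseline used for ReFT may differ):

```python
import torch

def reft_step(model, optimizer, reward_fn, batch_size, baseline, ema=0.9):
    """One REINFORCE-style update against a scalar reward, e.g. a
    secondary-structure diversity score, with a moving-average baseline."""
    samples, logp = model.sample_with_logprob(batch_size)  # assumed API
    rewards = torch.tensor([reward_fn(s) for s in samples])
    loss = -((rewards - baseline).detach() * logp).mean()  # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return ema * baseline + (1 - ema) * rewards.mean().item()
```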
Task novelty
A core contribution of FoldFlow++ is demonstrating that a single model is capable of performing many protein design tasks including: unconditional generation, protein folding, motif-scaffolding, and equilibrium conformation sampling. We stress FoldFlow++ is the only current model that is validated on all tasks. For instance, FrameFlow cannot be used for folding, Multi-Flow was not evaluated on motif-scaffolding or conformation sampling.
We also introduce a new benchmark for the highly relevant motif-scaffolding task with VHH nanobodies. We believe the saturation of previous benchmarks by RFDiffusion and FoldFlow++ necessitated a more challenging testbed for conditional generation, and demonstrate how in-domain knowledge can improve motif-scaffolding.
FoldFlow++ is a sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation, incorporating a protein language model, a multi-modal fusion trunk, and a geometric transformer-based decoder. The model achieves state-of-the-art performance in various protein-related generation tasks, including folding and inpainting applications.
The reviewers found the paper well written and easy to read, and highlighted the strong empirical results, which are SOTA in terms of designability and diversity. Three of the reviewers expressed concern with the technical novelty of the method, identifying considerable overlap with earlier methods such as MultiFlow. In their rebuttal, the authors explicitly listed the advances compared to these earlier methods. While some reviewers maintained reservations about the technical novelty post-rebuttal, the performance improvements reported in the paper are likely to have impact on the community, and all reviewers ultimately recommended acceptance of the paper.