An All-Atom Generative Model for Designing Protein Complexes
This work presents a foundational model for all-atom protein complex design.
Abstract
Reviews and Discussion
This paper proposes APM, an all-atom protein sequence and structure co-generation model. The model includes three parts: a Seq & BB module, a sidechain module, and a refine module. Learning is based on flow matching and proceeds in two stages. The paper presents extensive experiments on both single-protein design and multi-chain protein generation.
Questions For Authors
Please see above.
Claims And Evidence
No. The paper claims to design protein complexes; however, it reads more like a general protein design model to me:
- In the abstract, the paper states that "APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch." However, for complex design, the model should condition on the target protein to design a high-affinity binder, right? Otherwise, how can you control the designed interface type?
- The related work section is not about protein complex design at all; everything the authors mention concerns protein design itself.
- In the method section, from the notation to model learning, the authors present everything in a single-protein manner; complex design is mentioned only once, at lines 235-237.
- In the experiments, the authors evaluate both single-chain and multi-chain folding and inverse-folding tasks. On the single-chain tasks the model performs well, achieving the best results on most metrics; on the multi-chain folding task, however, the RMSD is far too high. The model seems more suitable for single-chain all-atom protein generation than for complex design.
Therefore, even though the authors claim "APM is specifically designed for modeling multi-chain proteins," this paper reads more like a general protein design model, and judging from the reported performance it is better suited to single-chain protein design.
Methods And Evaluation Criteria
There are some issues in the methods and evaluation parts:
- The biggest problem in the methods part is that after reading the methods section, I still do not know the architectures of the three modules: the Seq & BB module, the sidechain module, and the refine module. All I know is that they are built upon IPA and Transformer encoders; without reading the appendix, the architectures remain unclear. This is the most vital information, and I do not think it should be relegated to the appendix.
- Another issue is the design of the sidechains. It seems the model only predicts the torsion angles based on the predicted , and does not add more information for reconstructing S and T. Why not just use a prediction model at the last step to predict the angles? That is in fact what this method does at inference time. To me, the sidechain prediction here does not add information for the follow-up module, so I am somewhat doubtful that the refine module is really useful. There is also no ablation study on the different modules, so I cannot settle this point.
- In Section 4.1 (data curation), line 294: why filter out the high-quality samples in Swiss-Prot and train the model on the low-quality ones? Is there any overlap between the Swiss-Prot samples and AFDB, since AFDB predicts structures for all sequences from UniProt, including Swiss-Prot?
- In Section 4.1 (cropping): at protein interfaces, residues are not contiguous; they are in contact in 3D space but may be far apart in sequence. This way of cropping might cut off interface residues, meaning the training data might not include complete protein interfaces in complexes.
Theoretical Claims
This paper is an application paper and has no theoretical claims.
Experimental Designs Or Analyses
There are also some problems regarding the experiments:
- The two downstream complex-design tasks involve only very short peptides and (if I understand correctly) only CDR-H3 in antibodies. That suggests the model may only be suitable for short protein complex design, which also aligns with the results on the single-chain and multi-chain folding tasks. Can the authors show some longer protein complex designs, like the binder design in [1, 2]?
[1] Improving de novo protein binder design with deep learning. 2023.
[2] De novo design of high-affinity protein binders with AlphaProteo. 2024.
- The pLDDT in peptide design is too low, even though peptides are short. The authors also use a threshold of 70 to compute the success rate, whereas the usual threshold is 80. Can the authors report the success rate with 80 as the threshold, and also the performance on PAE?
- In the multi-chain design task, all performance is reported after PyRosetta relaxation. Can the authors run an ablation without PyRosetta relaxation so we can see the real quality of the designed complexes?
- In line 343, on the multi-chain folding task, the authors state that "there are almost no other models that support multi-chain proteins." Actually, the authors could use AlphaFold3.
Supplementary Material
Yes, the methods parts; as noted above, I think they should have been introduced in the main text.
Relation To Broader Scientific Literature
Yes
Essential References Not Discussed
[1] Improving de novo protein binder design with deep learning. 2023.
[2] De novo design of high-affinity protein binders with AlphaProteo. 2024.
Other Strengths And Weaknesses
Please refer to the above sections to see the main weaknesses.
Other minor issues:
There are many typos and grammar errors:
- In the abstract, "it can be supervised fine-tuned (SFT)": "supervised fine-tuned" is awkward.
- Figure 1 caption: "pick" -> "pink".
- Last line of page 4: "All the details refer to Appendix B." -> "Refer to Appendix B for all the details."
- Page 6, line 325: "folding, inverse-folding, and inverse-folding".
Other Comments Or Suggestions
Please see above.
Thank you for your thorough review and insightful comments that have helped us enhance the clarity and quality of our manuscript. We have carefully addressed your concerns below.
Q1: Claims And Evidence on Complex Design
A1: Thank you for your questions. We address your concerns as follows:
- A1.1 & A1.2: We do condition on the target protein in tasks such as antibody design and peptide design. Additionally, the unconditional multimer generation in a "chain-by-chain" manner (in the appendix) is also part of protein complex design.
- A1.3: Thank you for your suggestion. We agree that explicitly defining notation for multi-chain proteins will enhance clarity. In the revision, we will extend the notation system to multi-chain proteins as follows: for a protein consisting of N chains, the i-th chain has length L_i, and each modality is represented as the combination of the individual chains.
- A1.4: RMSD is sensitive to local structural variations and can be affected by symmetry considerations during alignment, leading to high RMSD despite overall structural similarity and high TM-scores. The image (https://anonymous.4open.science/r/Rebuttal-1784/multimer.pdf) illustrates this scenario: while the RMSD is high, the backbone alignment is nearly perfect. We also calculated RMSD for each chain in the multimer folding task, as shown in the table. The results indicate that APM achieves accurate predictions for each chain, outperforming Boltz_woMSA.
| Method | TMscore | RMSD | RMSD_by_chains |
|---|---|---|---|
| Boltz | .87/.97 | 5.40/1.95 | 1.73/1.01 |
| Boltz_woMSA | .44/.45 | 17.86/18.43 | 10.14/10.91 |
| APM | .64/.62 | 12.60/13.67 | 5.19/3.12 |
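For readers who want to reproduce the per-chain metric, a minimal Kabsch-based sketch could look like the following. This is illustrative only; `kabsch_rmsd` and `per_chain_rmsd` are hypothetical helpers, not the authors' actual evaluation code.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition (Kabsch)."""
    P = P - P.mean(axis=0)                     # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                                # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # optimal rotation
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))

def per_chain_rmsd(pred, native, chain_ids):
    """Align and score each chain independently, then average.

    A complex whose chains are individually accurate but globally shifted
    gives a low per-chain RMSD despite a high global RMSD."""
    rmsds = []
    for cid in set(chain_ids):
        mask = np.asarray(chain_ids) == cid
        rmsds.append(kabsch_rmsd(pred[mask], native[mask]))
    return float(np.mean(rmsds))
```

The second function captures the rebuttal's point: aligning each chain separately removes inter-chain placement error, so per-chain RMSD can stay small even when the global RMSD is large.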
The aim of our work is to model protein complexes. To achieve this, we integrate single-chain data to learn general protein generation and multi-chain data for complex modeling. This ensures that APM can handle both single-chain and complex tasks.
Q2: Methods And Evaluation Criteria
Thank you for your comments. We address your concerns as follows:
- A2.1: Thank you for this suggestion. We will improve the clarity by reorganizing the figures in the main text.
- A2.2: We would like to clarify the effectiveness of the Sidechain Module. During the first stage of pretraining, we trained the Sidechain Module independently, allowing it to learn torsion angle information. In the second stage, we maintained a 50% probability of continuing to train the Sidechain Module, ensuring the network's parametrization captures sidechain information. The effectiveness of the Refine Module is demonstrated in Table 4, which serves as an ablation study. The results show that APM with the Refine Module significantly outperforms the version using only the Backbone Module in terms of both complex generation quality and binding affinity.
- A2.3: Thank you for your observation. We actually selected the high-quality samples rather than dropping them, and we will correct this in the revision. We checked and found only a small number of duplicate samples, which have been removed from training.
- A2.4: We used AlphaFold2's crop function, which ensures spatial continuity, preventing situations where residues are far apart in space.
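As an aside, the spatial-continuity idea can be illustrated with a toy crop function. This is an assumption-laden sketch, not AlphaFold2's actual implementation: it picks a random anchor residue and keeps its nearest neighbors in 3D, so the crop stays spatially contiguous even when the kept residues are far apart in sequence.

```python
import numpy as np

def spatial_crop(coords, crop_size, rng=None):
    """Illustrative spatial crop: keep the `crop_size` residues closest in 3D
    to a randomly chosen anchor residue."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(coords)
    if n <= crop_size:
        return np.arange(n)                     # nothing to crop
    anchor = rng.integers(n)                    # random anchor residue
    dists = np.linalg.norm(coords - coords[anchor], axis=-1)
    keep = np.argsort(dists)[:crop_size]        # k nearest in space
    return np.sort(keep)                        # restore sequence order
```

Because neighbors are selected by 3D distance rather than sequence position, an interface residue's spatial partners on another chain would be retained by such a crop.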
Q3: Experimental Designs Or Analyses
A3.1: Thanks for your suggestion! We use the two longest targets from [2]: 3di3 (binder length 193) and 6m0j (binder length 194). For comparison, we use RFDiffusion to design binders by first generating structures and then designing sequences with ProteinMPNN.
The table demonstrates that APM outperforms RFDiffusion in terms of dG, with a higher percentage of dG < 0. Additionally, APM shows better foldability, as indicated by higher ipTM, and successfully designs binders for both targets.
| Method | Target | dG | %dG<0 | pLDDT | ipTM | Success |
|---|---|---|---|---|---|---|
| GroundTruth | 3di3 | -23.79 | - | 95.26 | .85 | 100% |
| GroundTruth | 6m0j | -20.11 | - | 81.55 | .15 | 0% |
| APM_zero-shot | 3di3 | -80.10 | 95.00% | 78.91 | .38 | 12.5% |
| APM_zero-shot | 6m0j | -96.47 | 67.50% | 69.50 | .48 | 12.5% |
| RFDiffusion | 3di3 | -50.49 | 82.50% | 87.83 | .30 | 0% |
| RFDiffusion | 6m0j | -56.10 | 67.50% | 70.90 | .45 | 0% |
A3.2: Using 80 as the threshold, APM generates the largest number of sequences with pLDDT scores above 80. We regret that we are unable to report PAE at this time, as Boltz does not provide PAE metrics by default.
| Method | Success |
|---|---|
| PPFlow | 4.92% |
| DiffPP | 7.77% |
| PepGLAD | 5.30% |
| APM_SFT | 19.13% |
| APM_zero-shot | 20.83% |
A3.3: We calculated dG without relaxation. The results demonstrate that APM maintains better performance than the other methods and successfully generates samples with negative dG.
| Method | dG | %dG<0 |
|---|---|---|
| GroundTruth | 8.82 | 80.65% |
| PPFlow | 3785.05 | 0% |
| DiffPP | 1003.63 | 0% |
| PepGLAD | 663.29 | 0% |
| APM_SFT | 561.91 | 0% |
| APM_zero-shot | 428.08 | 0.05% |
A3.4: Thank you for pointing this out. We will revise our statement accordingly. Due to policy restrictions, we are currently unable to use AF3, which is why we opted for Boltz.
Q4: Other Issues
A4: Thanks for your careful review. We will correct the mentioned issues.
Thanks for the authors' responses! I still have a follow-up concern:
Q3: Experimental Designs Or Analyses
In A3.1, both long-protein targets show a pLDDT lower than 80 and an ipTM score much lower than 0.8, the threshold AF3 suggests for high-affinity complexes. This actually confirms my observation that the method performs badly on long proteins and only works when most of the protein is given, as in antibody design.
Thank you so much for your constructive feedback! We truly value your input and we are deeply grateful for the time and effort you have invested in reviewing our work. We have tried our best to address your concerns and questions below and kindly invite you to review our responses.
Clarification on metrics and comparison
Due to space limitations in our previous response A3.1, we only reported average metrics. Here, we present detailed results for all 8 folding confidence scores, displayed in the format pLDDT/ipTM, using the sequences with the lowest dG. Samples with pLDDT > 80 and ipTM > 0.8 are highlighted in bold in the table. We also include APM_PMPNN, which uses ProteinMPNN for sequence redesign based on APM's generated structures, as an alternative comparison with RFDiffusion.
| Target | Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 3di3 | APM | 75.09/0.26 | 79.54/0.16 | 76.69/0.55 | 84.31/0.70 | **81.35/0.82** | 80.46/0.13 | 77.29/0.19 | 74.66/0.20 |
| 3di3 | APM_PMPNN | 87.48/0.22 | **87.00/0.84** | 85.09/0.26 | 82.16/0.14 | 79.58/0.21 | 80.39/0.19 | 90.79/0.33 | 70.48/0.10 |
| 3di3 | RFDiffusion | 78.09/0.63 | 91.57/0.25 | 87.76/0.33 | 89.86/0.15 | 90.53/0.28 | 88.92/0.34 | 90.26/0.22 | 83.48/0.16 |
| 6m0j | APM | 53.35/0.21 | 73.35/0.28 | **81.23/0.88** | 71.90/0.83 | 74.47/0.52 | 76.91/0.22 | 57.17/0.56 | 63.13/0.37 |
| 6m0j | APM_PMPNN | **86.54/0.91** | 66.39/0.80 | 79.93/0.21 | 66.33/0.19 | 55.25/0.33 | 64.94/0.17 | 73.74/0.16 | 42.47/0.40 |
| 6m0j | RFDiffusion | 50.97/0.71 | 84.67/0.76 | 83.76/0.23 | 72.55/0.46 | 52.79/0.60 | 57.88/0.58 | 83.44/0.11 | 77.43/0.14 |
The results show that both APM and APM_PMPNN can generate sequences with pLDDT>80 and ipTM>0.8, with APM_PMPNN achieving pLDDT>90. Regarding ipTM, another important confidence metric, APM generally outperforms RFDiffusion+ProteinMPNN in most samples.
We agree that long binder design is a crucial and practical task. Here, we would like to discuss the challenges and potential advancements in this important area.
About low pLDDT confidence scores
We observed that both APM and RFdiffusion encounter cases where some samples exhibit lower pLDDT and ipTM scores. The pLDDT score can vary significantly along a protein chain. This means the folding model can be very confident in the structure of some regions of the protein, but less confident in other regions. We hypothesize that the low pLDDT scores stem from the complexity of long binders. Specifically, certain regions may be naturally highly flexible or intrinsically disordered, leading the folding model to assign low pLDDT scores to these residues (as indicated in [1]).
Regarding ipTM, we speculate that the lower scores may result from the larger binding interfaces typical of long binders, which often involve multiple contact points or complex features such as convex or polar epitopes, or hydrophobic regions[2]. These structural complexities and biological properties can contribute to lower ipTM scores.
About future directions
As suggested in [2,3], pLDDT and ipTM are predictive of binding success. We would like to discuss potential approaches to improve long binder design.
APM was originally developed as a general-purpose model for complex modeling rather than a task-specific one, which presents challenges in the context of long binder design. This can be reframed as a question of how to adapt a general model into a domain-specialized one. Recent work [3] provides valuable practical directions: the authors successfully transformed RFdiffusion into an antibody-specific model by fine-tuning it on antibody-antigen complex structures, demonstrating that domain-specific data can significantly enhance performance. Similarly, a feasible approach to enhancing APM for long binder design would be to fine-tune it on a curated dataset of long binder-target complexes, potentially sourced from the PDB or synthetic data.
In addition, post-training techniques offer another strategy to optimize the model for generating high-confidence designs. As demonstrated in [2] and [3], pLDDT and ipTM correlate with binding success. Building on this insight, we could implement preference optimization focused on these confidence metrics, applying DPO-like algorithms to train the model to favor high-confidence designs while avoiding low-confidence ones.
We will investigate all of these exciting potentials as our future work.
--
[1] AlphaFold2 models indicate that protein sequence determines both structure and dynamics
[2] De novo design of high-affinity protein binders with AlphaProteo
[3] Atomically accurate de novo design of antibodies with RFdiffusion
We introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it supports supervised fine-tuning (SFT) for enhanced performance as well as zero-shot sampling in certain tasks, achieving state-of-the-art results.
Questions For Authors
- How effective is the consistency loss for the training of refinement module?
- The scaling of structure generative models you mention in future work is an open question; how do you plan to approach it? I don't think a stack of AF2 structure modules could achieve scaling, because of the nature of 3D protein representations, i.e., a redundancy similar to that of pixels, where scaling has not yet succeeded. Open discussions are welcome!
- How do you achieve conditional generation in Figure 3? Which control method do you employ?
Claims And Evidence
Claims:
- APM natively supports the modeling of multi-chain proteins without the need to use pseudo sequence to connect different chains.
- APM generates proteins with all-atom structures efficiently by utilizing an innovative integrated model structure.
Evidence:
- Experiments related to general protein demonstrate that APM is capable of generating tightly binding protein complexes, as well as performing multi-chain protein folding and inverse folding tasks.
- Experiments in specific functional protein design tasks show that APM outperforms the SOTA baselines in antibody and peptide design with higher binding affinity.
Methods And Evaluation Criteria
For multi-chain protein modeling, APM uses a mixture of single- and multi-chain data in training.
For the all-atom representation, APM chooses to enhance residue-level information with sidechains, giving an all-atom protein representation that includes amino acid type, backbone structure, and sidechain conformation parameterized by four torsion angles.
For sequence-structure dependency, APM first decouples the noising processes for sequences and structures so that the noise levels of the two modalities do not have to align, minimizing disruption of their dependency. Second, with 50% probability a folding/inverse-folding task is performed, compelling the model to learn the dependencies in both directions.
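The two training tricks summarized above can be sketched as follows. This is a hypothetical illustration of the sampling logic, not APM's actual code; `sample_training_task` is an invented name, and the convention t = 1 meaning a clean (fully denoised) modality follows the rebuttal's description of conditioning.

```python
import numpy as np

def sample_training_task(rng):
    """Illustrative sketch: choose per-modality flow-matching time steps
    for sequence and structure during training."""
    if rng.random() < 0.5:
        # co-generation: independent noise levels, so the two modalities'
        # corruption levels need not align
        t_seq = rng.random()
        t_struct = rng.random()
    else:
        # the other 50%: a folding or inverse-folding task, where one
        # modality is kept fully clean (t = 1)
        if rng.random() < 0.5:
            t_seq, t_struct = 1.0, rng.random()   # folding: clean sequence
        else:
            t_seq, t_struct = rng.random(), 1.0   # inverse folding: clean structure
    return t_seq, t_struct
```

Sampling the two time steps independently is what forces the model to handle arbitrary combinations of sequence and structure noise, while the folding/inverse-folding branch supervises each direction of the dependency explicitly.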
APM has demonstrated its capability in modeling multi-chain proteins and generating bioactive complexes. It achieves state-of-the-art (SOTA) performance in antibody design and binding-peptide design, and also shows promise on conventional single-chain protein tasks.
Theoretical Claims
No significant theoretical claims.
Experimental Designs Or Analyses
APM is tested on multiple tasks, including folding, inverse folding, unconditional generation, and functional protein design, in both complex and single-chain settings.
Supplementary Material
The authors provide more experiments, visualizations and method details in the appendix.
Relation To Broader Scientific Literature
This is an application of flow-matching methods to multi-chain protein generation. The modeling scheme and architecture come from the famous AlphaFold2.
Essential References Not Discussed
It would be appreciated if more papers on sidechain flexibility in complexes (sidechain prediction or generation) were included. More discussion of motif-scaffolding, an important functional protein design task, would also be appreciated. Protein structure refinement should be mentioned as well.
Other Strengths And Weaknesses
- It would be better to provide a comprehensive ablation study for this complex system.
- The system is a combination of AlphaFold2 and ESM2 and is very complicated; more novel and efficient models would be appreciated.
- The generative framework is not novel; a specialized and efficient generative framework for this complicated task would be more appreciated.
Other Comments Or Suggestions
- This paper aims to build a foundation model for multi-chain protein complexes, according to the first part of the Introduction (and the number of tasks it supports), but it then focuses on protein design tasks. Will APM generalize well to protein tasks other than design?
- This paper shows great ambition in providing a foundation model for various protein design tasks, but it currently underperforms on several of them.
- Does APM support de novo antibody design?
We appreciate your helpful feedback. We have responded to your concerns below and look forward to any additional comments.
Q1: Essential References Not Discussed
A1: Thank you for highlighting the need to include additional related works. We will enhance our Related Work section by incorporating references on sidechain prediction, such as DiffPack and AttnPacker. For motif-scaffolding, we will discuss both structure-based and sequence-based methods, including RFDiffusion, Frameflow, EvoDiff, DPLM, and ESM3. Regarding protein structure refinement, we hypothesize this refers to methods like Rosetta relax and OpenMM minimization, as well as other works like DeepAccNet (NC2021, https://www.nature.com/articles/s41467-021-21511-x). We welcome your feedback to ensure comprehensive coverage of these important areas in our Related Work section.
Q2: Other Weaknesses
A2.1: Thank you for your question. The results presented in Table 4 are indeed an ablation study. We provide two types of dG scores: one with sidechain-only relaxation and another with both backbone and sidechain relaxation. The lower dG scores from sidechain-only relaxation indicate high-quality backbone generation without structural conflicts. The small RMSD between the generated and relaxed backbones demonstrates that the initial structures are already near optimal conformations, requiring minimal adjustment to reach energy minima. Notably, APM with the Refine Module significantly outperforms APM using only the Backbone Module on these metrics, validating the effectiveness of the Refine Module in complex design.
A2.2 & A2.3: Thank you for your valuable feedback. We acknowledge that our current system combines widely accepted architectures to model protein sequences and structures. Exploring more efficient frameworks is indeed an important direction for the field, and we will discuss potential approaches to this in our response A4.2.
Q3: Other Comments Or Suggestions
A3.1: This is a great question! We believe that APM has the potential to generalize to other protein-related tasks in the future, such as protein docking or directly utilizing the pre-trained encoder for affinity prediction tasks. However, this remains an open question, and we leave this for future exploration.
A3.2: Thank you for pointing this out! We acknowledge that APM currently does not achieve SOTA performance across all metrics in the task of unconditional single-chain generation. To address this, we have adjusted the temperature strategy for sequence sampling, resulting in significant improvements over previous methods for this task. Please refer to the table below for updated results.
| Method | scTM (L=100) | scRMSD (L=100) | scTM (L=200) | scRMSD (L=200) | scTM (L=300) | scRMSD (L=300) |
|---|---|---|---|---|---|---|
| NativePDBs | 0.91 | 2.98 | 0.88 | 3.24 | 0.92 | 3.94 |
| ESM3 | 0.72 | 13.80 | 0.63 | 21.18 | 0.59 | 25.5 |
| Multiflow (woSynthetic) | 0.86 | 4.73 | 0.86 | 4.98 | 0.86 | 6.01 |
| ProteinGenerator | 0.91 | 3.75 | 0.88 | 6.24 | 0.81 | 9.26 |
| ProtPardelle | 0.56 | 12.90 | 0.64 | 13.67 | 0.69 | 14.91 |
| APM (original) | 0.92 | 3.65 | 0.88 | 5.06 | 0.87 | 7.33 |
| APM (updated) | 0.96 | 1.63 | 0.90 | 3.43 | 0.90 | 4.90 |
A3.3: Thank you for this question. APM's current implementation and experiments focus on CDR-H3 design in SFT and zero-shot manner, leveraging the conserved nature of antibody framework regions.
Q4: Questions For Authors
A4.1: Thank you for this question. To assess the effectiveness of the consistency loss, we conducted an experiment by removing the consistency loss and retraining the model to the same number of steps. When evaluated on unconditional single-chain generation tasks, the version with consistency loss showed improvements in both scTM and scRMSD compared to the ablated version, as detailed in the table below.
| Method | scTM (L=100) | scRMSD (L=100) | scTM (L=200) | scRMSD (L=200) | scTM (L=300) | scRMSD (L=300) |
|---|---|---|---|---|---|---|
| APM (updated) | 0.96 | 1.63 | 0.90 | 3.43 | 0.90 | 4.90 |
| APM (woConsistency) | 0.93 | 2.66 | 0.89 | 3.72 | 0.88 | 5.18 |
A4.2: This is a great question, and we fully agree that it represents a crucial challenge in scaling protein models. A recent work, Proteina (https://openreview.net/forum?id=TVQLu34bdw), may serve as an important exploration of model and data scaling. Proteina employs a scalable, non-equivariant AF3-like diffusion transformer focusing on alpha-carbon coordinates without frames, which allows it to scale to many parameters and to long protein generation. In the future, we plan to explore a Proteina-like scalable framework to further enhance the scalability and efficiency of our protein generative models.
A4.3: We achieve conditional generation by employing masking so that the loss is not computed on the condition region. Additionally, we set the time step t = 1 to ensure the ground-truth condition is provided.
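A minimal sketch of this conditioning scheme might look like the following. The names (`prepare_conditional_step`) and array layout are hypothetical, not the authors' implementation; the sketch only encodes the two ideas stated above: t = 1 on the condition region (clean input) and a loss mask excluding it.

```python
import numpy as np

def prepare_conditional_step(x_clean, x_noisy, t_gen, is_condition):
    """Illustrative conditional-generation setup: the condition region keeps
    its ground-truth state at t = 1, while the region to be generated uses
    the sampled time step t_gen; loss is only computed on generated residues."""
    is_cond = np.asarray(is_condition)
    t = np.where(is_cond, 1.0, t_gen)                  # t = 1 => clean input
    x_in = np.where(is_cond[:, None], x_clean, x_noisy)  # keep ground truth on condition
    loss_mask = ~is_cond                               # no loss on the condition region
    return x_in, t, loss_mask
```

The same masking works for training (zeroing the loss) and sampling (never re-noising the condition region).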
The paper introduces a new all-atom protein generation model composed of a backbone structure model (equivalent to the discrete flow models of Campbell et al.), a sidechain module, and a refinement module. They are trained in two stages: first the backbone and sidechain modules separately, then the three modules jointly. The model is evaluated on peptide design, antibody-antigen complex design, folding, and inverse folding. The authors show competitive performance and the possibility of extending existing protein backbone models to the inverse folding task.
Questions For Authors
Did you try a binder design task on the RFdiffusion benchmark? I would be very curious to know how it performs.
Claims And Evidence
The paper claims to define an all-atom protein complex generation model, and this is true. They claim to be competitive on different downstream tasks, which is also true.
Methods And Evaluation Criteria
The method makes sense and the evaluation follows the standard practice in the literature.
Theoretical Claims
Not applicable
Experimental Designs Or Analyses
The analysis is complete and makes sense.
Supplementary Material
NA
Relation To Broader Scientific Literature
SE(3) flow matching for protein backbones was also introduced in [1], concurrently with FrameFlow. I think it should be added to highlight that the two works are concurrent.
[1] SE(3)-Stochastic Flow Matching for Protein Backbone Generation, Bose et al, ICLR 2024
Essential References Not Discussed
The integration of a protein language model into a protein backbone generation module was already done in [2]; it should therefore be discussed and compared against on folding tasks.
[2] Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation, Huguet et al. Neurips 2024.
Other Strengths And Weaknesses
The paper is well written.
Other Comments Or Suggestions
NA
Thank you for your valuable comments. We have addressed your concerns below and welcome any further feedback.
Q1: Relation To Broader Scientific Literature and Essential References Not Discussed.
A1: We appreciate the reviewer for pointing out these references. In our revision:
- We will include FoldFlow1 (Bose et al.) in the Related Work section to clarify the concurrent development of SE(3) flow matching approaches to FrameFlow.
- Although FoldFlow2 (Huguet et al.) showcases the integration of protein language models, its folding code is currently unavailable. We will add a discussion to the Related Work section noting that APM differs from FoldFlow2 in several aspects: our model generates all-atom structures and sequences, supports multi-chain generation, and is designed for complex design tasks.
Q2: Questions For Authors: Did you try a binder design task on the RFdiffusion benchmark? I would be very curious to know how it performs.
A2: We thank the reviewer for this valuable suggestion. Reviewer ZXig also raised the design of longer binders. Here, we design binders for the two targets showcased in RFDiffusion: the SARS-CoV spike protein RBD (PDB ID 6m0j) and IL-7RA (PDB ID 3di3). For comparison, we first generate structures using RFDiffusion, then design sequences with ProteinMPNN. As APM currently does not support hot-spot residues, we only evaluate functionality and foldability.
The table demonstrates that APM in zero-shot mode significantly outperforms RFDiffusion in terms of dG, with a higher percentage of designs achieving dG < 0. Additionally, APM shows better foldability, as indicated by higher ipTM scores, and successfully designs binders for both targets. This indicates that APM is able to design high-quality binders.
| Method | Target | dG | %dG<0 | pLDDT | ipTM | Success |
|---|---|---|---|---|---|---|
| GroundTruth | 3di3 | -23.79 | - | 95.26 | .85 | 100% |
| GroundTruth | 6m0j | -20.11 | - | 81.55 | .15 | 0% |
| APM_zero-shot | 3di3 | -80.10 | 95.00% | 78.91 | .38 | 12.5% |
| APM_zero-shot | 6m0j | -96.47 | 67.50% | 69.50 | .48 | 12.5% |
| RFDiffusion | 3di3 | -50.49 | 82.50% | 87.83 | .30 | 0% |
| RFDiffusion | 6m0j | -56.10 | 67.50% | 70.90 | .45 | 0% |
For the binder design task, could you try the same target as in RFDiffusion? I am thinking about MDM2, PD1, PDL1, CD3E.
Thank you for your valuable suggestion. We selected targets from PDB IDs: 1ycr (MDM2), 4zqk (human programmed death-1/PD1), 4z18 (ligand PD-L1), and 1xiw (CD3-epsilon/CD3E), using the binders that interact with these targets in the PDB as references (if our setting doesn't address your question appropriately, we sincerely apologize and welcome your clarification).
For APM, we performed zero-shot sampling. For RFDiffusion, we followed the binder-design guidelines from the official repository and designed sequences using ProteinMPNN. We also include APM_PMPNN, which uses ProteinMPNN for sequence redesign based on APM's generated structures, as an alternative comparison with RFDiffusion. The evaluation settings remain consistent with the description in Appendix D.4, with one exception: due to resource constraints during the rebuttal period, we folded only the top 8 sequences rather than the top 16 described in the appendix. We report average metrics, with 'Success' representing the proportion of samples (out of 8) that achieve both pLDDT > 80 and ipTM > 0.8.
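The 'Success' computation described here is straightforward; a small sketch (a hypothetical helper, using the thresholds stated above) makes the definition concrete:

```python
import numpy as np

def success_rate(plddt, iptm, plddt_thresh=80.0, iptm_thresh=0.8):
    """Fraction of designs whose folded prediction passes both
    confidence thresholds (pLDDT and ipTM)."""
    plddt = np.asarray(plddt, dtype=float)
    iptm = np.asarray(iptm, dtype=float)
    passed = (plddt > plddt_thresh) & (iptm > iptm_thresh)
    return float(passed.mean())
```

For example, applied to the 6m0j/APM row of the detailed table in the earlier response (pLDDT/ipTM pairs 53.35/0.21 through 63.13/0.37), only one of the eight samples clears both thresholds, giving the reported 12.5%.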
The experimental results are shown in the table below. Overall, our method is comparable to RFDiffusion. Although many studies have shown these metrics to be predictive of wet-lab results, actual effectiveness still requires validation through wet-lab experiments.
Additionally, we provide a detailed visualization at https://anonymous.4open.science/r/Rebuttal-1784/binder.pdf. Based on this visualization, for 1ycr, both APM and RFDiffusion generate high-quality binders. For 4zqk and 4z18, while the overall structures show good pLDDT confidence in regions distant from the binding interface, the lower pLDDT at the interface indicates a potentially inaccurate binding site or binding pose. For 1xiw, the folding models appear to struggle with accurately predicting beta-sheet interactions, despite both methods generating designs with good overall confidence.
| PDB ID | Method | dG | %dG < 0 | pLDDT | ipTM | Success |
|---|---|---|---|---|---|---|
| 1ycr | GroundTruth | -25.24 | - | 90.42 | 0.93 | - |
| 1ycr | APM | -37.94 | 90.00% | 66.28 | 0.67 | 25.0% |
| 1ycr | APM_PMPNN | -33.27 | 90.00% | 71.10 | 0.70 | 50% |
| 1ycr | RFDiffusion | -39.47 | 100% | 78.49 | 0.81 | 25.0% |
| 4zqk | GroundTruth | -39.36 | - | 94.03 | 0.87 | - |
| 4zqk | APM | -45.27 | 90.00% | 80.18 | 0.39 | 0% |
| 4zqk | APM_PMPNN | -43.33 | 77.50% | 79.10 | 0.36 | 0% |
| 4zqk | RFDiffusion | -29.35 | 87.50% | 75.79 | 0.39 | 0% |
| 4z18 | GroundTruth | -40.89 | - | 92.08 | 0.76 | - |
| 4z18 | APM | -54.24 | 55.00% | 69.28 | 0.35 | 0% |
| 4z18 | APM_PMPNN | -63.37 | 55.00% | 74.28 | 0.34 | 0% |
| 4z18 | RFDiffusion | -18.69 | 57.50% | 67.39 | 0.30 | 0% |
| 1xiw | GroundTruth | -71.69 | - | 92.64 | 0.95 | - |
| 1xiw | APM | -43.25 | 85.00% | 73.27 | 0.62 | 12.5% |
| 1xiw | APM_PMPNN | -46.96 | 82.50% | 72.08 | 0.70 | 12.5% |
| 1xiw | RFDiffusion | -56.99 | 95.00% | 77.22 | 0.76 | 62.5% |
We sincerely appreciate your suggestions, which have directed our attention to broader and important areas. In the future, we will focus on enhancing our model's capabilities, particularly for practically significant tasks like binder design.
This paper tackles the problem of designing multi-chain protein complexes at the atomic level. The authors propose APM (All-Atom Protein Generative Model), consisting of three modules:
- Seq&BB Module: A flow-matching based generative model that handles the co-generation of protein sequence and backbone structure.
- Sidechain Module: Predicts sidechain conformations (parameterized by torsion angles) to complete the all-atom structure.
- Refine Module: Adjusts the sequence and structure using all-atom information to increase naturalness and resolve structural clashes.
The model incorporates ESM2-650M (a protein language model) to enhance protein sequence understanding and uses a two-phase training approach: the first phase focuses on the Seq&BB and Sidechain Modules, and the second on joint training of all three modules. The model is trained on a mixture of single- and multi-chain data from PDB, Swiss-Prot, AlphaFoldDB, and PDB biological assemblies.
The authors perform extensive in silico evaluation of APM and demonstrate competitive performance compared to baseline models. For single-chain protein folding and inverse-folding, APM either outperforms or matches the performance of ESMFold, ESM3, MultiFlow, and ProteinMPNN. For multi-chain protein folding and inverse-folding, APM outperforms Boltz-1 (without MSA) and ProteinMPNN. Furthermore, the authors demonstrate APM's ability to generate tightly bound protein complexes and perform ablations to showcase the importance of sidechain conformation information. On downstream antibody design targeting specific antigens and receptor-targeted peptide design, both supervised fine-tuning and zero-shot sampling variants of APM outperform application-specific models.
Questions for Authors
N/A
Claims and Evidence
- Claim: APM achieved leading performance compared to other co-design methods in all three tasks related to single-chain proteins. This claim is misleading for protein folding, as ESM3 has a slight edge for structure prediction as evidenced by Table 1. The differences are very small and it would be helpful to have interval estimates to assess significance.
Methods and Evaluation Criteria
Strengths:
- Use of all-atom representation for protein structure is well motivated and addresses a critical problem in multi-chain protein modeling.
- Flow-matching based generative modeling is well suited for the task at hand.
- Chosen datasets for training and evaluation are appropriate, diverse, and standard making it easier to compare with existing works.
- Relevant metrics like RMSD and TM-score for structure prediction evaluation as well as scTM, AAR for inverse-folding evaluation are used. Similarly, for application-specific tasks, using specialized metrics like DockQ, binding affinity, etc. aligns well with real world use cases.
Weaknesses:
- Sequence recovery aggregates binary decisions for correct sequence identity without taking into account the precise confidence of the model for the correct residue. Perplexity addresses this shortcoming, however, it is not included in the evaluation.
Theoretical Claims
Did not check theoretical claims.
Experimental Designs or Analyses
Strengths:
- Careful curation of training dataset to prevent potential information leakage for antibody and peptide design tasks.
- Ablation study is performed to assess the impact of sidechain conformation information on (predicted) binding affinity.
- Both average and median performance metrics are reported.
- Downstream antibody and peptide design tasks which are relevant to real-world applications are evaluated.
- Performant models from literature are used for comparison.
Weaknesses:
- Inverse-folding and structure prediction evaluation for multi-chain complexes uses a test set of size 273, which makes it hard to assess generalizability of results.
- Structure prediction and inverse-folding evaluation for multi-chain complexes is only compared with Boltz-1 and ProteinMPNN, with the authors arguing that there are almost no other models that support multi-chain proteins. However, AlphaFold3 (https://www.nature.com/articles/s41586-024-07487-w) and Chroma (https://www.nature.com/articles/s41586-023-06728-8) could be appropriate models to compare with for structure prediction and inverse-folding, respectively.
- Ablation study is limited to a single evaluation with only one module ablated, making it difficult to assess the contribution of each module across different tasks.
- Lack of experimental validation of generated complexes. For instance, it's unclear how well pyRosetta's energy function would align with measurements obtained from binding assays in the wet lab.
- Only average/median performance metrics are reported as opposed to interval or standard deviation, which makes it hard to evaluate significance.
- While antibody and peptide design applications are presented, other important complex types (e.g., enzyme-substrate complexes) aren't evaluated.
Supplementary Material
Did not review supplementary material.
Relation to Prior Work
Most of the prior work has focused on factorizing protein representations into amino acid sequence and backbone structure. Furthermore, training of protein foundation models has typically been done on single-chain protein data. The authors' work addresses this gap by incorporating information about sidechain conformation into the generative model as well as directly training on multi-chain protein data. The authors' work is also novel in that it uses flow matching based generative models whereas past works have used diffusion based generative models.
Essential References Not Discussed
The following references are missing:
- Chroma (https://www.nature.com/articles/s41586-023-06728-8): Generative model for protein complexes that takes the sidechain conformation information into account in addition to backbone coordinates and sequence identity.
Other Strengths and Weaknesses
The paper is well written and comprehensive in describing the key details for reproducibility of both the model and the evaluation. Use of all-atom representation for protein structure is well motivated and addresses a critical problem in multi-chain protein modeling.
Other Comments or Suggestions
N/A
We appreciate your thorough review and helpful comments. We have carefully addressed your concerns below and welcome any additional feedback.
Q1: Claims And Evidence
A1: Thank you for highlighting the need for statistical validation. We conducted folding with APM and ESM3 using 20 seeds. For RMSD, ESM3 shows a marginally better mean (4.708±0.094 vs 4.828±0.077) with statistical significance (p < 0.05). For TM-score, APM achieves better performance (0.856±0.002 vs 0.828±0.002) with statistical significance (p < 0.05). We appreciate your reminder and will revise our statement to reflect that APM demonstrates "competitive" or "comparable" performance to ESM3.
| Method | RMSD (mean±std) | TMscore (mean±std) |
|---|---|---|
| ESM3 (1.4B) | 4.708±0.094 | 0.828±0.002 |
| APM | 4.828±0.077 | 0.856±0.002 |
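The significance claim above (p < 0.05 across 20 seeds) is the kind of comparison commonly made with a two-sample test such as Welch's t-test. A minimal sketch of the test statistic; the samples shown are toy values, not the paper's per-seed metrics:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Toy samples for illustration: a negative t means the first mean is lower.
print(welch_t([1, 2, 3], [4, 5, 6]))  # negative
```

In practice one would typically call `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns the p-value directly.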
Q2: Weaknesses on Perplexity
A2: Thank you for your valuable suggestion. To evaluate sequence quality, we compute perplexity using ProGen2-base across various inverse folding methods, including average, median, and std of perplexity. The results indicate that the sequences generated by APM achieve comparable perplexity to ground truth sequences.
| Method | ppl_avg | ppl_median | ppl_std |
|---|---|---|---|
| GroundTruth | 8.83 | 7.13 | 5.87 |
| APM | 8.74 | 8.10 | 4.01 |
| Multiflow | 10.86 | 10.94 | 2.66 |
| ESM3 | 8.64 | 7.90 | 4.23 |
| ProteinMPNN | 11.44 | 11.48 | 3.25 |
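Since perplexity is the exponentiated average negative log-likelihood per residue, numbers like those above can be computed from a scoring model's per-residue log-probabilities along these lines (a generic sketch, not ProGen2's actual API):

```python
import math

def perplexity(log_probs):
    """Sequence perplexity from per-residue natural-log probabilities,
    e.g. log p(residue_i | context) from an autoregressive model."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a uniform distribution over the 20 standard amino acids
# assigns log(1/20) to every residue, giving perplexity exactly 20.
uniform = [math.log(1 / 20)] * 50
print(perplexity(uniform))  # -> 20.0
```

Lower perplexity means the scoring model finds the sequence more natural, which is why values near the ground-truth sequences' perplexity are taken as a good sign.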
Q3: Experimental Designs Or Analyses and Weaknesses
A3.1: Thanks for your suggestion. We agree that a larger test set could enhance evaluation. Here, we focus on maximizing training data for design tasks, using only "samples missing cluster IDs" as the test set. This allowed us to retain more training data while ensuring reliable performance evaluation.
A3.2: Thanks for this suggestion. Due to policy restrictions, we are unable to access AF3's weights, which is why we opted for Boltz-1. Regarding Chroma, we have added results on unconditional multimer generation. The results demonstrate that APM can generate reasonable multi-chain structures with high binding affinity, significantly outperforming Chroma.
| Length | Model | dG_sc | dG_relax_bb+sc | RMSD |
|---|---|---|---|---|
| 50-100 | APM_all-atom | -72.44/-71.91 | -112.65/-116.98 | 1.05/0.95 |
| 50-100 | APM_bb | -64.30/-67.30 | -114.94/-114.45 | 1.06/1.03 |
| 50-100 | Chroma | 113.64/46.51 | -83.96/-86.66 | 1.33/1.22 |
| 100-100 | APM_all-atom | -91.61/-94.54 | -130.31/-134.57 | 1.04/0.94 |
| 100-100 | APM_bb | -36.74/-69.30 | -117.53/-118.13 | 1.17/1.12 |
| 100-100 | Chroma | 89.47/22.97 | -60.53/-52.64 | 1.45/1.35 |
| 100-200 | APM_all-atom | -44.02/-39.42 | -93.21/-73.09 | 1.35/1.21 |
| 100-200 | APM_bb | -3.42/-33.71 | -85.79/-69.12 | 1.58/1.42 |
| 100-200 | Chroma | 79.97/35.86 | -59.32/-54.30 | 1.58/1.48 |
A3.3: Thank you for your question. The results presented in Table 4 are indeed an ablation study. We provide two types of dG scores: one with sidechain-only relax and another with both backbone and sidechain relax. The lower dG scores from sidechain-only relax indicate high-quality backbone generation without structural conflicts. The lower RMSD between the generated and relaxed backbones demonstrates that the initial structures are already near optimal conformations, requiring minimal adjustment to reach energy minima. Notably, APM with the Refine Module significantly outperforms APM using only the Backbone Module on these metrics, validating the effectiveness of the Refine Module in complex design.
A3.4: Thank you for this suggestion. While we acknowledge the importance of wet lab experimental validation, we used pyRosetta's energy function as it is a widely accepted metric for evaluating binding affinity.
A3.5: For the folding task, we have already reported the std in A1. For the antibody design and peptide design tasks, the metrics are based on multiple sampling. By adjusting the temperature strategy for sequence decoding in the unconditional single-chain task, we achieve improved results and additionally provide the std for RMSD and TM-score to demonstrate the stability of the results (see the table below).
| Method | scTM (len 100) | scRMSD (len 100) | scTM (len 200) | scRMSD (len 200) | scTM (len 300) | scRMSD (len 300) |
|---|---|---|---|---|---|---|
| APM (original) | 0.92±0.11 | 3.65±5.37 | 0.88±0.12 | 5.06±6.57 | 0.87±0.14 | 7.33±6.76 |
| APM (updated) | 0.96±0.06 | 1.63±2.12 | 0.90±0.11 | 3.43±3.14 | 0.90±0.11 | 4.90±4.49 |
A3.6: Thanks for your suggestion. APM currently supports the 20 standard amino acids. This limits our ability to evaluate enzyme-substrate complexes: while enzymes are typically proteins, substrates can be diverse molecules beyond proteins, including carbohydrates and lipids.
Within the current scope of the submission, we focus on protein-protein interaction tasks. We are working on extending our framework to support a broader range of biomolecules.
Q4: Essential References Not Discussed
A4: Thank you for your kind reminder. We have mentioned AF3 in the Related Work (line 77) and will revise the manuscript to make this clearer. Regarding Chroma, we will include it in the revision.
The paper proposes a novel approach to model multi-chain proteins for designing protein complexes and performing folding and inverse-folding tasks. The methodology achieves strong empirical results on pertinent datasets.
The AC and reviewers all appreciate the contributions of this paper. The authors have done a great job of addressing the reviewers' comments, and the materials provided during the rebuttal significantly strengthen the results. We strongly encourage the authors to incorporate them into the revised manuscript. In particular, it would be important to include the statistical significance results, the ProGen perplexity results, the binder task results, and the updated results on unconditional single-chain generation.