AlignAb: Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies
Abstract
Reviews and Discussion
AlignAb targets the conflicting energy preferences in antibody design and guides the diffusion model towards Pareto optimality under multiple energy-based alignment objectives. Experimental results show that it achieves high stability and efficiency, producing a better Pareto front of antibody designs than the baselines.
Strengths
(1) The motivation for multi-stage training is convincing, as 3D complex data are limited compared to sequence data. However, more related literature should be cited in the main text.
(2) The energy-based alignment of diffusion models is novel and interesting. Via iterative alignment, the models can benefit from online exploration. In addition, a decaying temperature schedule is proposed to balance exploration and exploitation during sampling.
(3) The empirical results seem promising, which verifies the effectiveness of AlignAb.
Weaknesses
(1) None of the baselines is energy-based; that is, they do not rely on any energy function such as Rosetta to adjust the output. However, even with Pareto-optimal energy alignment, AlignAb did not achieve the best energy scores across all metrics. In particular, AlignAb builds on DiffAb, yet its CDR-Ag E_att is worse than DiffAb's. This may weaken the soundness of the proposed algorithm.
(2) There is no discussion of training efficiency. The iterative procedure may become a bottleneck if convergence is slow and energy computation is expensive.
(3) Code is currently not available.
Questions
(1) Have you conducted ablation studies over the three-stage training? I am curious about the benefits of the different training phases. In particular, none of the baselines uses a pretrained protein language model (PLM), which may raise the issue of an unfair comparison.
(2) The multi-stage training is not new. I believe it is necessary to add relevant references such as [i]. Besides, prior studies have also explored the combination of PLMs and geometric neural networks [ii][iii].
[i] A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design. NeurIPS 2023.
[ii] Enhancing Protein Language Models with Structure-based Encoder and Pre-training. ICLR 2023.
[iii] Integration of pre-trained protein language models into geometric deep learning networks. Communications Biology.
W1. None of the baselines is energy-based; that is, they do not rely on any energy function such as Rosetta to adjust the output. However, even with Pareto-optimal energy alignment, AlignAb did not achieve the best energy scores across all metrics. In particular, AlignAb builds on DiffAb, yet its CDR-Ag E_att is worse than DiffAb's. This may weaken the soundness of the proposed algorithm.
Response: Thank you for highlighting this point. While AlignAb slightly underperforms DiffAb on the average CDR-Ag E_att, it achieves better results on top-ranked samples under all evaluation metrics, which is the primary focus of this study. Our goal is to design high-quality antibody samples that are closer to natural antibodies in terms of energy-based metrics; therefore, top-ranked samples are more relevant for practical applications. This demonstrates that AlignAb successfully enhances sample quality, in line with our claim of designing better antibodies than the baseline methods.
W2. There is no discussion of training efficiency. The iterative procedure may become a bottleneck if convergence is slow and energy computation is expensive.
Response: AlignAb demonstrates competitive efficiency, typically achieving optimal results in fewer than three iterations with fewer samples. While energy computation can be resource-intensive, we ensure consistency across methods and are exploring ways to further optimize computational efficiency in future work.
W3. Code is currently not available.
Response: We plan to release the code and detailed documentation upon publication to ensure reproducibility and facilitate understanding of our method.
Q1. Have you conducted ablation studies over the three-stage training? I am curious about the benefits of the different training phases. In particular, none of the baselines uses a pretrained protein language model (PLM), which may raise the issue of an unfair comparison.
Response: We have conducted an ablation study to evaluate the impact of incorporating pre-trained models into DiffAb, comparing its performance without alignment on the AAR and RMSD metrics. The results are as follows:
| Metrics | HERN | MEAN | dyMEAN | ABGNN | DiffAb | AlignAb (w/o alignment) | AlignAb |
|---|---|---|---|---|---|---|---|
| AAR (%) ↑ | 33.17 | 33.47 | 40.95 | 38.36 | 36.42 | 37.65 | 35.34 |
| RMSD (Å) ↓ | 9.86 | 1.82 | 7.24 | 2.02 | 2.48 | 2.25 | 1.51 |
While we were unable to directly assess the contribution of the pre-trained model to energy metrics due to computational constraints, the observed improvements suggest that the integration of pre-trained models enhances the quality of generated antibodies. Future work will further explore the role of pretraining in energy-based optimization to provide a more comprehensive analysis.
Q2. The multi-stage training is not new. I believe it is necessary to add relevant references such as [i]. Besides, prior studies have also explored the combination of PLMs and geometric neural networks [ii][iii].
Response: Thank you for the suggestion. While prior studies have explored multi-stage training and combining PLMs with geometric neural networks, we are the first to integrate pre-training, fine-tuning, and alignment into a unified framework for the specific task of antibody design. We will add relevant references to clarify how our approach builds upon and extends existing work in our final version.
Thanks for your response. However, there remain several issues.
AlignAb demonstrates competitive efficiency, typically achieving optimal results in fewer than three iterations with fewer samples. While energy computation can be resource-intensive, we ensure consistency across methods and are exploring ways to further optimize computational efficiency in future work.
It would be better to quantitatively analyze the computational efficiency instead of giving a very rough explanation. What is the exact time cost of the iterations with fewer samples? What is the time expense of the energy computation? Otherwise, readers like me can be quite confused about "optimal results in fewer than three iterations with fewer samples". How many counts as "fewer samples"? Can we achieve optimal results in one iteration or not?
While we were unable to directly assess the contribution of the pre-trained model to energy metrics due to computational constraints, the observed improvements suggest that the integration of pre-trained models enhances the quality of generated antibodies.
It is good to see the improvement brought by alignment. However, I had suggested that the authors conduct an ablation study over the pretrained language model rather than over alignment, as their method relies on a pretrained BERT trained on online antibody sequences.
Overall, I am not satisfied with these answers to my questions. It has already been nearly a week since the rebuttal began, yet little effort has been put into addressing my concerns. It is inappropriate to keep deferring to future work, e.g., "exploring ways to further optimize computational efficiency in future work" or "Future work will further explore the role of pretraining in energy-based optimization to provide a more comprehensive analysis." Some studies can be deferred to future work, but several of my concerns bear directly on the soundness and robustness of AlignAb. For this reason, I will maintain my score.
Response:
Thank you for raising these important questions.
It would be better to quantitatively analyze the computational efficiency instead of giving a very rough explanation. What is the exact time cost of the iterations with fewer samples? What is the time expense of the energy computation? Otherwise, readers like me can be quite confused about "optimal results in fewer than three iterations with fewer samples". How many counts as "fewer samples"? Can we achieve optimal results in one iteration or not?
The time cost per training iteration is approximately 40 minutes for 4,000 steps. The time cost for energy computation, including relaxation, takes around 25 minutes for 1,280 samples using parallel processing on a 32-core CPU. The observed speed for training, sampling, and evaluation is consistent with the original implementation of DiffAb.
In our ablation studies, we observe that optimization consistently converges within three iterations, with no significant improvements beyond this point, achieving optimal results with fewer than 3,840 preference samples. These studies highlight the benefits of AlignAb's iterative optimization, which consistently outperforms its offline version in all 5 ablation case studies. Additionally, we consider AbDPO a naive version of our offline alignment method, as it can be approximated by removing the reward-margin component from AlignAb. While AbDPO requires 20,000 alignment steps and 10,112 samples, AlignAb achieves optimal performance with only 12,000 alignment steps and 3,840 samples. This demonstrates that AlignAb improves computational efficiency, with fewer training steps and fewer training samples, without compromising performance.
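To make the role of the reward margin concrete, here is a minimal scalar sketch of a DPO-style preference loss with an additive margin. The function name, the example log-likelihoods, and the values of `beta` and `margin` are illustrative placeholders, not AlignAb's exact parameterization:

```python
import math

def dpo_margin_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, margin=1.0):
    """DPO-style preference loss with an additive reward margin.

    logp_w / logp_l: model log-likelihoods of the preferred ("winner")
    and dispreferred ("loser") samples; ref_* are the same quantities
    under the frozen reference model. The margin pushes the implicit
    reward gap to exceed a fixed offset rather than merely be positive.
    """
    gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(gap - margin), written as log1p(exp(-(gap - margin)))
    return math.log1p(math.exp(-(gap - margin)))

# Setting margin=0 recovers a plain DPO objective of the kind
# described for AbDPO; the margin variant penalizes small gaps harder.
loss_with_margin = dpo_margin_loss(-5.0, -9.0, -6.0, -8.0)
loss_plain = dpo_margin_loss(-5.0, -9.0, -6.0, -8.0, margin=0.0)
```

The design point is simply that, for the same preference pair, the margin term keeps a gradient signal alive until the implicit reward gap clears the offset, which is the component ablated when approximating AbDPO.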
It is good to see the improvement brought by alignment. However, I had suggested that the authors conduct an ablation study over the pretrained language model rather than over alignment, as their method relies on a pretrained BERT trained on online antibody sequences.
Regarding the effects of the pre-trained protein language model, we apologize for any earlier confusion caused by our naming and for not fully addressing your concerns. In our experiments, AlignAb (w/o alignment) denotes the model using only the pre-training and fine-tuning stages, obtained by retraining DiffAb with the pretrained BERT model from ABGNN. Incorporating the pretrained PLM improves the generative diffusion model's performance, as evidenced by improved AAR and RMSD metrics compared to the original DiffAb. This highlights the value of pretrained PLMs in enhancing baseline models. Importantly, as with any other alignment method, the quality of the alignment samples is critical, as suggested by recent works such as DEITA [1]. Therefore, we believe that our alignment method would only benefit from more performant models and from more realistic, higher-quality samples for optimization. We hope this clarification addresses your concerns about the role of pretrained PLMs in AlignAb's robustness.
Again, we apologize for the delay in addressing your concerns and for any dissatisfaction caused. We will include the above-mentioned points regarding computational efficiency and the effects of the pre-trained PLM in the appendix to provide greater clarity and transparency. Thank you for your feedback, and we appreciate your understanding.
[1] Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, 2024.
This paper presents a three-stage framework for training deep-learning models specializing in antibody sequence-structure co-design. The framework first pre-trains a language model on millions of antibody sequences. Experiments demonstrate the advantages of the proposed framework in generating nature-like antibodies with high binding affinity.
Strengths
- The study employs advanced methods such as DPO/Pareto-DPO to perform alignment for antibody optimization.
- The work adopts a three-step approach of pre-training, fine-tuning, and alignment to develop an antibody optimization strategy, achieving state-of-the-art optimization results.
Weaknesses
W1. Many recent works on antibody design and optimization are missing from the comparison, including RefineGNN [1], ADesigner [2], GeoAB [3], HTP [4], AbDiffuser [5], etc. The lack of discussion of and comparison with related work limits how well the article's validity can be assessed. Although the article emphasizes antibody optimization through alignment methods based on generative models, and it is challenging to integrate these alignment methods with refinement approaches like ADesigner, could these alignment methods be utilized to perform antibody optimization with generative models such as AbDiffuser and GeoAB? Additionally, all new antibody design methods introduced this year should be included for comparison. If their code is not open-sourced, should these methods still be cited? The lack of discussion of and comparison with so much related work is a major weakness.
W2. The article employs Rosetta energy as an alignment metric, which raises considerable concerns. My primary issue is that the premise of the paper relies on the assumption that minimizing the Rosetta energy of the complex leads to better antibodies. However, it is well known that force-field energies have a weak correlation with measured binding affinity, typically around 0.3 [6,7]. This correlation can be improved to approximately 0.7 by incorporating various additional techniques, such as machine learning classifiers, feature ensembling, and molecular dynamics ensembles. However, these improvements are generally limited to single mutations of a known co-crystal structure.
In contrast, this study aims to use basic energy formats (with an expected correlation of ~0.3 for single mutations) to guide alignment and optimization across the CDR region. With each additional mutation, one would expect the correlation between energy and real binding affinity to drop exponentially. Thus, it is unclear how the method’s premise of optimizing for binding affinity holds any validity.
Additionally, force fields naturally favor certain amino acid types at the interface (e.g., hydrophobic residues) because they form strong interactions. However, real antibodies—especially therapeutic ones—are typically not hydrophobic in the CDRs, as hydrophobicity often results in non-specific binding, which is undesirable. This bias toward certain amino acids and interactions can also make antibody sequences less human-like, potentially triggering autoimmune responses. This study does not address or evaluate potential issues caused by force-field guidance, which could be mitigated by incorporating additional metrics, such as TAP scores [8] to ensure that the CDRs are not more hydrophobic or exhibit a worse charge distribution than the baseline model. Alternatively, language models like AntiBERTy could be used to assess whether the generated antibody sequence is plausible (e.g., similar to those in the test set) [9].
W3. Third, I believe a critical issue is the lack of fairness in the comparison experiments. The baseline approaches are generally trained on raw co-crystal structures that are not optimized by Rosetta and do not represent minimal energy states. As a result, it is natural that these models do not produce structures with minimal energies, as they are not fitting a distribution based on minimized energy conformations. In contrast, the proposed method aims to generate minimized structures, leading to results that are not directly comparable with those of the baseline models.
I recommend including models trained on minimized structures in the experimental tables to ensure a fair comparison. Additionally, results from models with post hoc optimization should be included, as this is a standard protocol. This approach would provide readers with a more robust and consistent basis for comparison, ensuring the rigor and accuracy of the experimental results.
W4. Finally, I did not observe a detailed description of the main model architecture in the article, nor the inclusion of commonly used comparison metrics, such as co-design amino acid recovery (AAR), root mean square deviation (RMSD), and the distribution of bond lengths and bond angles. These metrics are crucial because, if a model fails to generate realistic structures and sequences that conform to the distribution of the training samples, it raises a critical question: would such a structure, which deviates significantly from real antibodies, still retain immune functionality? If the generated protein deviates substantially from authentic antibody structures, it is questionable whether merely optimizing for energy is sufficient or even necessary in the context of antibody optimization. A rigorous assessment of the generated structures, in terms of their fidelity to real antibodies, is essential to determine whether they can function as effective immunological agents.
[1] Jin W, Wohlwend J, Barzilay R, et al. Iterative refinement graph neural network for antibody sequence-structure co-design. arXiv preprint arXiv:2110.04624, 2021.
[2] Tan C, Gao Z, Wu L, et al. Cross-gate MLP with protein complex invariant embedding is a one-shot antibody designer. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(14): 15222-15230.
[3] Lin H, Wu L, Yufei H, et al. GeoAB: Towards realistic antibody design and reliable affinity maturation. ICML2024.
[4] Wu F, Li S Z. A hierarchical training paradigm for antibody structure-sequence co-design. In: Advances in Neural Information Processing Systems, 2024, 36.
[5] Martinkus K, Ludwiczak J, Liang W C, et al. AbDiffuser: Full-atom generation of in-vitro functioning antibodies. In: Advances in Neural Information Processing Systems, 2024, 36.
[6] Miettinen, M. S., Rajendiran, A., & Muheim, C. Forcefield Energies and Binding Affinity: A Weak Correlation Study. RSC Publishing, 2020.
[7] Ambrosetti, F., Piallini, G., & Zhou, C. Evaluating Forcefield Energies in Protein Binding Studies. National Center for Biotechnology Information, 2020.
[8] Dunbar, J., & Deane, C. M. TAP Scores for Antibody Hydrophobicity and Charge Distribution. Oxford Protein Informatics Group, OPIG.
[9] Akbar, R., & Krawczyk, K. Evaluating Antibody Sequence Probabilities with AntiBERTy. arXiv preprint arXiv:2308.05027, 2023.
Questions
All the questions are given in the weakness.
W3. Third, I believe a critical issue is the lack of fairness in the comparison experiments. The baseline approaches are generally trained on raw co-crystal structures that are not optimized by Rosetta and do not represent minimal energy states. As a result, it is natural that these models do not produce structures with minimal energies, as they are not fitting a distribution based on minimized energy conformations. In contrast, the proposed method aims to generate minimized structures, leading to results that are not directly comparable with those of the baseline models. I recommend including models trained on minimized structures in the experimental tables to ensure a fair comparison. Additionally, results from models with post hoc optimization should be included, as this is a standard protocol. This approach would provide readers with a more robust and consistent basis for comparison, ensuring the rigor and accuracy of the experimental results.
Response: We would like to clarify that all methods used in our experiments were performed under the same setup to ensure fairness. Specifically, each model generates 1,280 distinct samples, which are subsequently refined using Rosetta’s relax protocol to obtain minimized structures. This consistent approach ensures that all methods, including the baseline models, are evaluated on an equal footing with optimized conformations. We will emphasize this in the experiments section for clarity and rigor in the comparison.
W4. Finally, I did not observe a detailed description of the main model architecture in the article, nor the inclusion of commonly used comparison metrics, such as co-design amino acid recovery (AAR), root mean square deviation (RMSD), and the distribution of bond lengths and bond angles. These metrics are crucial because, if a model fails to generate realistic structures and sequences that conform to the distribution of the training samples, it raises a critical question: would such a structure, which deviates significantly from real antibodies, still retain immune functionality? If the generated protein deviates substantially from authentic antibody structures, it is questionable whether merely optimizing for energy is sufficient or even necessary in the context of antibody optimization. A rigorous assessment of the generated structures, in terms of their fidelity to real antibodies, is essential to determine whether they can function as effective immunological agents.
Response: We acknowledge the importance of metrics such as AAR and RMSD for assessing the realism and functionality of generated structures. These metrics, along with detailed descriptions of the main model architecture, have been included in the appendix of our paper. We will further highlight this in the main text for better accessibility. We also include the metrics results here for your information.
| Metrics | HERN | MEAN | dyMEAN | ABGNN | DiffAb | AlignAb (w/o alignment) | AlignAb |
|---|---|---|---|---|---|---|---|
| AAR (%) ↑ | 33.17 | 33.47 | 40.95 | 38.36 | 36.42 | 37.65 | 35.34 |
| RMSD (Å) ↓ | 9.86 | 1.82 | 7.24 | 2.02 | 2.48 | 2.25 | 1.51 |
Weaknesses:
W1. Many recent works on antibody design and optimization are missing from the comparison, including RefineGNN [1], ADesigner [2], GeoAB [3], HTP [4], AbDiffuser [5], etc. The lack of discussion of and comparison with related work limits how well the article's validity can be assessed. Although the article emphasizes antibody optimization through alignment methods based on generative models, and it is challenging to integrate these alignment methods with refinement approaches like ADesigner, could these alignment methods be utilized to perform antibody optimization with generative models such as AbDiffuser and GeoAB? Additionally, all new antibody design methods introduced this year should be included for comparison. If their code is not open-sourced, should these methods still be cited? The lack of discussion of and comparison with so much related work is a major weakness.
Response: Thank you for highlighting this important point. We acknowledge the need to compare with a broader range of recent works and will expand the related-work discussion to include RefineGNN, ADesigner, GeoAB, HTP, and AbDiffuser. While some of these methods focus on refinement rather than generative optimization, our alignment methods are designed to integrate seamlessly with generative models like AbDiffuser and GeoAB, providing a potential avenue for future exploration. For methods introduced this year, we will cite them where relevant, even if their code is not open-sourced, to acknowledge their contributions. Additionally, we believe that comparing these methods to AlignAb is unnecessary for the following two reasons:
- AlignAb is an energy-focused optimization method. All of the above methods aim at improving metrics such as AAR and RMSD, which is not the objective of AlignAb. Additionally, most of the mentioned methods perform poorly on energy measurements based on the samples we tested. Since none of them uses energy measurements in its training objective, we think it is sufficient to select only the most competitive baselines for comparison.
- AlignAb is an alignment method. We believe the most important contribution of our work is the proposed alignment method. Therefore, the focus of this study is to show how the different alignment techniques proposed in the paper (iterative optimization, reward margin, and temperature scaling) improve the DiffAb baseline. As an alignment method, AlignAb can be easily adapted to other deep-learning-based methods, including both diffusion- and graph-based models. How AlignAb can benefit other state-of-the-art models is an interesting direction for future research.
W2. The article employs Rosetta energy as an alignment metric, which raises considerable concerns. My primary issue is that the premise of the paper relies on the assumption that minimizing the Rosetta energy of the complex leads to better antibodies. However, it is well known that force-field energies have a weak correlation with measured binding affinity, typically around 0.3 [6,7]. This correlation can be improved to approximately 0.7 by incorporating various additional techniques, such as machine learning classifiers, feature ensembling, and molecular dynamics ensembles. However, these improvements are generally limited to single mutations of a known co-crystal structure...
Response: Thank you for raising these important concerns. We acknowledge that using Rosetta energy as an alignment metric has limitations, particularly regarding its weak correlation (~0.3) with measured binding affinity for single mutations, which may degrade further with multiple mutations. However, we emphasize that our primary objective is to demonstrate the utility of our alignment method rather than to claim absolute predictive power of Rosetta energy for binding affinity. AlignAb is agnostic to the specific reward function used and can be extended to incorporate more robust metrics, such as learned classifiers, feature ensembling, molecular dynamics, or TAP scores, to mitigate the biases and limitations associated with force-field-based metrics. Additionally, we agree that biases in amino acid preferences at the interface, such as favoring hydrophobic residues, could lead to undesirable properties in therapeutic antibodies. Future work will explore combining Rosetta energy with complementary metrics, such as TAP scores, AntiBERTy, or other language model-based plausibility assessments, to ensure human-likeness, appropriate charge distribution, and reduced hydrophobicity in the designed antibodies.
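As a concrete illustration of what Pareto-optimal alignment over several energy terms involves, the sketch below filters candidate designs to their non-dominated set, from which preference pairs could then be drawn. The toy objective vectors and the lower-is-better convention are illustrative assumptions, not the paper's exact objectives:

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates b (all objectives
    lower-is-better, e.g. per-term energies): a is no worse on every
    objective and strictly better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep only the non-dominated candidates; these form the Pareto
    front that a multi-objective alignment method tries to improve."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# Toy objective vectors, e.g. (repulsion energy, negated attraction energy):
samples = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (0.5, 3.0)]
front = pareto_front(samples)  # (2.0, 2.0) is dominated by (1.0, 2.0)
```

The point of Pareto-based alignment, as opposed to a fixed weighted sum of energies, is exactly that candidates like `(1.0, 2.0)` and `(2.0, 1.0)` are incomparable and both worth keeping when the objectives conflict.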
This paper introduces a three-stage framework aimed at training deep learning models for the co-design of antibody sequences and structures. The primary contributions in this paper include the development of multi-objective alignment algorithm for generating a superior Pareto front of models in terms of energy without additional data. This model can generate more natural-like antibodies with better rationality and functionality.
Strengths
This paper is well-written and has a clear motivation, and the methodology section is detailed and easy to follow. The proposed framework achieves state-of-the-art performance in generating more nature-like antibodies by aligning the model to favor configurations with low repulsion and high attraction energy at the binding site. It presents a three-stage framework (pre-training, transferring, and alignment) for antibody sequence-structure co-design that extends DPO to achieve Pareto optimality under multiple energy-based alignment objectives.
Weaknesses
Given that the proposed method is an extension of AbDPO, it would be beneficial to include AbDPO as a benchmark for comparison. Although there are reasonable metrics used in this paper, it is recommended that the authors include RMSD and AAR in their evaluation metrics to provide a more comprehensive understanding of the generated antibodies' quality and to align with established benchmarks.
Questions
This paper mentioned the use of a decaying temperature schedule, inspired by epsilon-greedy learning from RL, to balance exploration and exploitation. What are your thoughts on employing more advanced strategies instead, such as UCB? Given that your work builds upon AbDPO, could you provide a more detailed comparison between this approach and AbDPO?
W1. Given that the proposed method is an extension of AbDPO, it would be beneficial to include AbDPO as a benchmark for comparison. Although there are reasonable metrics used in this paper, it is recommended that the authors include RMSD and AAR in their evaluation metrics to provide a more comprehensive understanding of the generated antibodies' quality and to align with established benchmarks.
Response: Thank you for pointing that out. Since AbDPO is not yet open-sourced, we found it difficult to implement their method and fully replicate their experiments due to missing technical details in their paper. We have included AAR and RMSD in Table 3 in Appendix B.1 and also include them here for your reference.
| Metrics | HERN | MEAN | dyMEAN | ABGNN | DiffAb | AlignAb (w/o alignment) | AlignAb |
|---|---|---|---|---|---|---|---|
| AAR (%) ↑ | 33.17 | 33.47 | 40.95 | 38.36 | 36.42 | 37.65 | 35.34 |
| RMSD (Å) ↓ | 9.86 | 1.82 | 7.24 | 2.02 | 2.48 | 2.25 | 1.51 |
Q1. This paper mentioned the use of a decaying temperature schedule, inspired by epsilon-greedy learning from RL, to balance exploration and exploitation. What are your thoughts on employing more advanced strategies instead, such as UCB? Given that your work builds upon AbDPO, could you provide a more detailed comparison between this approach and AbDPO?
Response: While we employed a decaying temperature schedule for its simplicity, exploring advanced strategies like UCB is a promising direction for future work. Compared to AbDPO, which we consider a naive version of our method, AlignAb introduces iterative multi-objective optimization, reward margin, and temperature scaling for enhanced performance. AbDPO is not open-sourced, and we were unable to replicate their results, but our ablation studies demonstrate significant improvements over AbDPO-like approaches.
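For readers unfamiliar with the exploration-exploitation trade-off discussed above, a decaying temperature schedule can be sketched as follows. The exponential shape, the endpoint values, and the softmax sampler are illustrative assumptions rather than AlignAb's exact schedule:

```python
import math
import random

def temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Exponential decay from t_start to t_end: a high temperature early
    encourages exploration (diverse samples), while a low temperature
    late favors exploitation of high-reward regions. The schedule shape
    and endpoints here are illustrative, not the paper's exact values."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac

def softmax_sample(rewards, temp, rng=random):
    """Pick a candidate index with probability proportional to
    exp(reward / temp); as temp -> 0 this approaches greedy argmax."""
    m = max(r / temp for r in rewards)
    weights = [math.exp(r / temp - m) for r in rewards]
    return rng.choices(range(len(rewards)), weights=weights)[0]
```

An epsilon-greedy schedule plays the same role with a hard random/greedy switch; a UCB-style rule, as the reviewer suggests, would instead add an explicit uncertainty bonus per candidate, which requires tracking visit counts rather than only a global temperature.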
Thank you for the detailed clarification on the implementation challenges regarding AbDPO and the enhancements made in AlignAb. The comparison provided between the approaches gives a good understanding of the advancements in AlignAb. Accordingly, I maintain my original score.
This paper presents AlignAb, a three-stage framework for designing nature-like antibodies using deep learning. It involves pre-training on antibody sequences, guiding a diffusion model for joint sequence-structure optimization, and aligning the model to favor antibodies with desired antigen interactions. The approach aims to address energy preference conflicts and leverages online data without extra resources, generating high-affinity antibodies.
Strengths
This paper focuses on in silico antibody design, which is a crucial aspect of developing effective therapeutics. To address the issue of generating antibodies with low affinity due to the neglect of energy functions in existing literature, a three-stage framework is proposed to tackle the antibody design challenge. This framework employs AbDPO to guide the model in generating samples with lower energy.
Weaknesses
- The authors place excessive emphasis on the performance of their proposed method in terms of energy metrics to demonstrate the ability to design higher-affinity antibodies. However, the energy calculations rely on Rosetta, and thus the computed energy does not accurately represent the binding affinity between the antibody and the antigen. The authors do not clarify whether this evaluation metric is practically useful.
- The model presented in this paper specifically optimizes for energy, making comparisons with other methods that do not focus on energy optimization somewhat unfair. Notably, the model utilizes known regions of antibodies and only designs the CDR3, effectively assuming that the binding will be consistent with the original antibody. In this context, metrics such as AAR and RMSD can reflect the positional and sequence differences relative to the original antibody.
- While this paper proposes a three-stage framework, each stage employs existing methods. For instance, the pre-training phase is based on ABGNN, the transferring phase draws from DiffAb, and the alignment phase utilizes AbDPO and Diffusion-DPO to enhance energy. Overall, there appears to be limited technical innovation in this approach.
Questions
- Did the authors perform OAS deduplication, and does OAS include samples from the test set?
- Regarding the RAbD benchmark, which contains 60 antibody-antigen complexes, the authors only utilized 42, which is even fewer than the 55 used by AbDPO. What is the reason for this discrepancy?
- Currently, there is no research demonstrating that Rosetta's energy values are effective for real-world antibody design. Did the authors test other energy prediction methods for performance evaluation, such as those used in dyMEAN?
- The authors did not compare their method with AbDPO, which is also mentioned in the article as focusing on energy optimization. Furthermore, the reported values for relevant metrics differ significantly from those reported for AbDPO, particularly for the MEA. Given that both methods use Rosetta, what accounts for this substantial difference?
- The article does not provide code for easy reproduction and understanding.
- In Figure 4, MEAN is also designed specifically for CDR3. Why does the framework structure of one of the chains differ from that of other methods? Additionally, what is the purpose of this figure? Does it indicate a high similarity to the original antibody?
Q1. Did the authors perform OAS deduplication, and does OAS include samples from the test set?
Response: We used the pre-trained model from ABGNN, which follows the same preprocessed OAS dataset. For testing, both AlignAb and ABGNN utilize a subset of the SAbDab dataset. However, details about potential data leakage, such as deduplication within OAS or overlap between OAS and the test set, were not addressed in the ABGNN paper. We have updated the draft to clarify this limitation and explicitly state it in the final version.
Q2. Regarding the RAbD benchmark, which contains 60 antibody-antigen complexes, the authors only utilized 42, which is even fewer than the 55 used by AbDPO. What is the reason for this discrepancy?
Response: The discrepancy in the number of complexes used arises from our stringent filtering criteria to ensure data quality and compatibility with the objectives of our study. Specifically, we excluded complexes that lacked complete structural data or failed to meet the requirements for accurate energy calculations and alignment during preprocessing. While this resulted in a smaller subset compared to AbDPO, we prioritized high-quality data to maintain the reliability of our results. We will clarify this in the final version of the paper and provide the specific filtering steps used for reproducibility.
Q3. Currently, there is no research demonstrating that Rosetta's energy values are effective for real-world antibody design. Did the authors test other energy prediction methods for performance evaluation, such as those used in dyMEAN?
Response: While we acknowledge the limitations of Rosetta energy values in directly reflecting real-world binding performance, we selected Rosetta due to its widespread use as a standard benchmarking tool in computational protein design. In this work, we focused on demonstrating improvements over baseline models within this framework. We did not test other energy prediction methods, such as those used in dyMEAN, due to the lack of open access to their energy evaluation pipelines and our emphasis on consistency across comparative benchmarks. However, we agree that exploring alternative or complementary energy evaluation methods, particularly those validated for real-world antibody design, is an important direction for future work.
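To make the evaluation setup concrete: interface-energy metrics of the kind Rosetta reports decompose binding energy as E(complex) minus the energies of the separated partners. The sketch below is a toy illustration of that decomposition using a hypothetical Lennard-Jones-style pairwise term, not Rosetta's actual score function or the authors' pipeline:

```python
import numpy as np

def lj_energy(d, sigma=4.0, eps=1.0):
    """Toy Lennard-Jones term; minimum of -eps at distance sigma (angstroms)."""
    r = sigma / np.clip(d, 1e-6, None)
    return eps * (r**12 - 2 * r**6)

def total_energy(coords):
    """Sum of the toy pairwise term over all unordered atom pairs."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return float(np.sum(lj_energy(d[iu])))

def toy_binding_energy(antibody, antigen):
    """dG_bind-style decomposition: E(complex) - E(antibody) - E(antigen).
    For a purely pairwise energy, the intra-partner terms cancel and only
    the antibody-antigen (interface) cross term survives."""
    complex_coords = np.concatenate([antibody, antigen], axis=0)
    return total_energy(complex_coords) - total_energy(antibody) - total_energy(antigen)
```

Because only the cross term survives the subtraction, this decomposition is what makes interface metrics sensitive to the designed CDR's contacts rather than to the rest of the structure.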
Q4. The authors did not compare their method with AbDPO, which is also mentioned in the article as focusing on energy optimization. Furthermore, the reported values for relevant metrics differ significantly from those reported for AbDPO, particularly for the MEA. Given that both methods use Rosetta, what accounts for this substantial difference?
Response: The discrepancies are due to differences in implementation, preprocessing, and experimental setups, as AbDPO is not open-sourced and lacks detailed reproducibility information. While we attempted to replicate their method, our results differed, likely due to these factors. Additionally, AlignAb introduces enhancements like iterative optimization and reward margin, leading to different outcomes. To address this, we included an ablation study resembling AbDPO, showcasing the improvements from our approach.
Q5. The article does not provide code for easy reproduction and understanding.
Response: We plan to release the code and detailed documentation upon publication to ensure reproducibility and facilitate understanding of our method.
Q6. In Figure 4, MEAN is also designed specifically for CDR3. Why does the framework structure of one of the chains differ from that of other methods? Additionally, what is the purpose of this figure? Does it indicate a high similarity to the original antibody?
Response: We acknowledge the mistake in the visualization of MEAN and have corrected the figure in the updated version. This illustration provides a clearer and more intuitive way for qualitative comparison, showcasing that our design closely resembles the reference structure.
W1. The authors place excessive emphasis on the performance of their proposed method in terms of energy metrics to demonstrate the ability to design higher-affinity antibodies. However, the energy calculations rely on Rosetta, and thus the computed energy does not accurately represent the binding affinity between the antibody and the antigen. The authors do not clarify whether this evaluation metric is practically useful.
Response: Thank you for raising this important point. While Rosetta energy functions are an imperfect proxy for binding affinity due to their reliance on simplified models, we believe they provide a standard framework for benchmarking computational methods and demonstrating improvements introduced by AlignAb over baseline models. Importantly, AlignAb is agnostic to the specific energy function used and can seamlessly incorporate experimentally validated metrics or wet-lab-derived data, ensuring practical relevance in drug discovery. Future work will focus on integrating higher-fidelity metrics and conducting wet-lab experiments to directly evaluate the designed antibodies' real-world binding performance.
W2. The model presented in this paper specifically optimizes for energy, making comparisons with other methods that do not focus on energy optimization somewhat unfair. Notably, the model utilizes known regions of antibodies and only designs the CDR3, effectively assuming that the binding will be consistent with the original antibody. In this context, metrics such as AAR and RMSD can reflect the positional and sequence differences relative to the original antibody.
Response: We have included the comparison of AAR and RMSD in Table 3 in Appendix B.1. We also include them here for your reference.
| Metrics | HERN | MEAN | dyMEAN | ABGNN | DiffAb | AlignAb (w/o alignment) | AlignAb |
|---|---|---|---|---|---|---|---|
| AAR (%) | 33.17 | 33.47 | 40.95 | 38.36 | 36.42 | 37.65 | 35.34 |
| RMSD (Å) | 9.86 | 1.82 | 7.24 | 2.02 | 2.48 | 2.25 | 1.51 |
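For reference, the two metrics in the table are standard and can be computed as follows; this is a minimal sketch (AAR as percent sequence identity over the designed CDR, RMSD as the C-alpha deviation after Kabsch superposition), assuming equal-length sequences and paired coordinate arrays:

```python
import numpy as np

def aar(designed_seq, reference_seq):
    """Amino Acid Recovery: percent of positions where the designed CDR
    matches the reference sequence."""
    assert len(designed_seq) == len(reference_seq)
    matches = sum(a == b for a, b in zip(designed_seq, reference_seq))
    return 100.0 * matches / len(reference_seq)

def ca_rmsd(designed_ca, reference_ca):
    """C-alpha RMSD after optimal (Kabsch) superposition of the designed
    CDR coordinates onto the reference coordinates."""
    p = designed_ca - designed_ca.mean(axis=0)   # center both point sets
    q = reference_ca - reference_ca.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)            # covariance SVD
    d = np.sign(np.linalg.det(vt.T @ u.T))       # guard against reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = p @ rot.T
    return float(np.sqrt(np.mean(np.sum((aligned - q) ** 2, axis=1))))
```

Note that AAR is computed on the sequence alone, while RMSD is invariant to rigid-body motion, so the two capture complementary aspects of similarity to the reference antibody.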
W3. While this paper proposes a three-stage framework, each stage employs existing methods. For instance, the pre-training phase is based on ABGNN, the transferring phase draws from DiffAb, and the alignment phase utilizes AbDPO and Diffusion-DPO to enhance energy. Overall, there appears to be limited technical innovation in this approach.
Response: While our framework builds upon existing methods in each stage, the key innovation lies in the integration of these components with our novel alignment strategy, which introduces iterative optimization, reward margin, and temperature scaling to enhance energy-based optimization. This alignment approach not only improves upon existing methods like AbDPO and Diffusion-DPO but also demonstrates significant advancements in antibody design by achieving superior energy metrics and competitive results on secondary benchmarks. By adapting these methods into a cohesive framework, we provide a robust and scalable solution for antibody sequence-structure co-design, which we believe is a meaningful contribution to the field.
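As a schematic of the alignment strategy this response describes (the exact objective is the paper's and is not reproduced in this thread), a DPO-style preference loss augmented with a reward margin and a temperature can be written as below; the names `beta` (temperature) and `gamma` (margin) are our assumptions, not the paper's notation:

```python
import numpy as np

def dpo_margin_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, gamma=0.0):
    """DPO-style loss with a reward margin: the preferred (lower-energy) sample
    must beat the dispreferred one by at least gamma in implicit reward.
    beta scales how sharply the loss discriminates between the two samples."""
    delta_w = logp_w - ref_logp_w   # implicit reward of preferred sample
    delta_l = logp_l - ref_logp_l   # implicit reward of dispreferred sample
    logits = beta * (delta_w - delta_l) - gamma
    return float(np.logaddexp(0.0, -logits))  # numerically stable -log(sigmoid)
```

In the paper's framing, temperature scaling is applied during sampling to trade exploration against exploitation; here `beta` plays the analogous role in the loss, with larger values penalizing reward-margin violations more sharply.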
We thank the reviewers for their insightful questions and reviews. We have answered all questions and addressed all concerns in detail in the rebuttal and the attached PDF.
In response to the reviewers' suggestions, these revisions include additional explanations, paragraphs, and tables to help readers understand the proposed method. Most importantly, we highlight the updated AAR/RMSD metrics with additional ablation studies which further demonstrate the efficacy of our proposed method. We hope these revisions address the reviewers' concerns and improve the overall quality of our paper.
Thank you again for your review!
Best regards,
Authors
The paper introduces AlignAb, a three-stage framework for designing nature-like antibodies using deep learning, focusing on sequence-structure co-design and energy-based alignment for optimizing antibody-antigen interactions.
Strengths include the novel approach to energy-based alignment, the multi-stage training methodology, and promising empirical results.
Weaknesses include reliance on Rosetta energy for alignment, which has a weak correlation with real binding affinity, limited technical innovation, and lack of comparison with energy-based methods.
The decision to reject is primarily based on the fundamental flaw in the core premise of relying on Rosetta energy, the limited technical innovation, and the insufficient comparison with existing energy-based methods.
Additional Comments from Reviewer Discussion
Reviewer yybV and Reviewer E8F6 consider the article's fundamental assumption, its reliance on Rosetta energy, to be highly unreasonable. The authors did not address this key question.
Reject