DualMPNN: Harnessing Structural Alignments for High-Recovery Inverse Protein Folding
DualMPNN is a template-guided dual-stream message passing network that leverages structural alignments to enhance inverse protein folding accuracy through alignment-aware attention.
Abstract
Reviews and Discussion
This paper focuses on the task of inverse protein folding, where a protein backbone structure is given and a plausible amino acid sequence needs to be generated that folds into the given structure. A key contribution of the paper is the use of templates selected from a database using structural alignments, inspired by the common use of multiple sequence alignments (MSA) in forward folding. Specifically, the paper leverages Foldseek to find similar structures to the query structure that needs to be inverse-folded, and then uses these similar structures and the associated sequences to inform the sequence prediction of the new, given structure. This is implemented with a model called DualMPNN, a dual-stream message passing network that in parallel processes the query and the template structures. The method is validated on sequence recovery, perplexity and foldability metrics, and favourable results compared to baselines are reported.
Strengths and Weaknesses
Strengths:
- The use of structural alignments to find templates that can be leveraged in the inverse-folding task is an interesting and original idea and promising research direction.
- When well-aligned templates are available (TM-Score > approx. 0.5), the proposed method outperforms previous methods in terms of sequence recovery rate and perplexity.
- The proposed DualMPNN architecture is novel and appropriately designed for the task at hand.
- Generally, inverse folding is a highly relevant task in protein design, an important scientific field with impactful applications. Hence, the paper addresses a problem of significance.
Weaknesses:
- The method is inspired by MSAs in forward folding. MSA databases used in folding models are usually huge, consisting of many millions of sequences. In contrast, the paper leverages only the Protein Data Bank (PDB) to perform the structural alignments. However, the PDB is orders of magnitude smaller. I wonder whether this is a reason for concern; this aspect is not discussed. For instance, could this lead to overfitting to the small PDB? What is the effect of the size of the database used to find structural alignments? Relatedly, the authors could extend this beyond the PDB to the AlphaFold Database (AFDB), which would offer a much larger database for structural alignments (and may actually boost performance). I think the size of the database from which the templates are identified deserves more discussion and quantitative analysis.
- The paper uses perplexity and sequence recovery as its two main metrics. However, in inverse folding there can be multiple sequences that fold into the same structure. Hence, if the model outputs a different sequence than the ground-truth one, that may be perfectly fine so long as this sequence still folds into the query structure. Therefore, I believe that sequence recovery and perplexity with respect to the ground-truth sequence should be interpreted very carefully. Instead, I believe that foldability is the critical metric -- ultimately, we are free to predict whatever sequence we want, as long as the correct structure is recovered. However, on foldability, the proposed method is actually worse than ProteinMPNN (see avg RMSD in Table 3), the most popular inverse folding model in the literature (which does not rely on structure templates) -- this is concerning and needs more discussion or should be addressed.
- In general, one would expect the model to be as good as ProteinMPNN when no templates are used or the templates are not well aligned, and better when useful templates can be found (high TM scores). However, as seen in Figure 2, the method performs worse than ProteinMPNN when the templates are not well aligned (TM score < 0.5). This is concerning.
- Related, the method outperforms ProteinMPNN on sequence recovery when TM score>0.5. In general, this further supports the concern that the model is essentially overfitting to the templates, i.e. it only achieves good performance when the template TM score is high, i.e. when structures in the template database can be found that are highly similar to the query structure. In those cases, possibly the model largely copies the template sequence? The relation and similarity of the template and predicted sequences is not appropriately discussed or analyzed, unfortunately.
- It would be interesting to analyze the types of sequences that DualMPNN predicts with the types of sequences that ProteinMPNN predicts. Which of them are more diverse? Does some have certain dominant amino acid types? It is known that ProteinMPNN-predicted sequences follow a distribution different from native sequences. However, I would expect DualMPNN's sequences to follow a more native-like sequence distribution, considering that native sequences inform the sequence prediction through the templates. This deserves a more detailed analysis.
- The clarity, writing and presentation of the paper could be improved: For instance, the paper discusses many very low-level implementation details in the main text, some of which are difficult to follow (some related questions below). But at the end, the method uses relatively standard message passing neural networks, just with two streams connected via attention (which is an appropriate design, as pointed out under strengths). It would be better to present the non-novel details in the appendix, and instead focus in the main paper on the key novelties and save space for more discussions, analyses and experiments.
Conclusion: In summary, I think the direction of using structural alignments to inform sequence prediction in inverse folding is an interesting idea and promising research direction. However, there are many questions and concerns and I don't think the paper is ready for publication in its current form. Hence, I am leaning towards rejection.
Questions
I have a few further questions and comments:
- Please introduce the notation used in equation (1).
- Discussion around equation (3): When exactly is an aligned i-j pair between query and template protein matched? The text does not describe the exact criteria for matching residue pairs -- and most likely those come with hyperparameters that could be worth exploring.
- Line 132: I assume in , it should be and not ?
- Line 205 and following: Why is the number of neighbors for each aggregated node so different in the query MPNN and the template MPNN? It's 48 and 4, respectively.
- Line 213: The authors say they use "easy-search" mode. What does this mean? The author should explain the search algorithm and not just list Foldseek commands.
Limitations
Yes.
Final Justification
The authors provided extensive additional experiments and explanations in the rebuttal addressing my questions and concerns in a satisfactory manner. Therefore, I increased my score and am now happy to suggest acceptance.
Formatting Concerns
No concerns.
Responses to Weaknesses
Thank you for your careful review with regard to our manuscript. Your constructive comments are very helpful for further strengthening the presentation of our paper.
-
There may be some misunderstanding here. Our method takes a single protein structure as the query, and only the structure with the highest similarity is selected as the template protein. The query and template protein are processed via two interactive branches, coupled through alignment-aware cross-stream attention mechanisms that enable the exchange of geometric and co-evolutionary signals. In this study, we utilized the PDB solely for identifying templates, not for large-scale representation learning as the MSA Transformer does. Therefore, concerns regarding overfitting caused by a smaller database are not applicable to our method. Additionally, the PDB is the largest repository of experimentally determined protein structures, ensuring high-quality and reliable templates. While the AFDB is indeed much larger, it is computationally predicted and may lack the same level of experimental validation. This is why we prefer the PDB as the source for template queries.
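To make the selection step concrete, here is a minimal, hypothetical sketch of picking a single template from Foldseek-style hits: self-hits and near-identical structures (sequence identity ≥ 0.99, the exclusion threshold mentioned elsewhere in this discussion) are dropped, and the remaining hit with the highest TM-score is kept. The `Hit` fields and the example data are illustrative, not the actual pipeline code.

```python
# Illustrative template selection: drop self-hits and near-identical
# matches, then keep the single best remaining hit by TM-score.
from typing import List, NamedTuple, Optional

class Hit(NamedTuple):
    query: str
    target: str
    seq_identity: float  # fraction in [0, 1]
    tm_score: float

def select_template(hits: List[Hit], max_identity: float = 0.99) -> Optional[Hit]:
    """Return the highest-TM hit that is neither the query itself nor a
    near-identical structure (identity >= max_identity)."""
    candidates = [
        h for h in hits
        if h.target != h.query and h.seq_identity < max_identity
    ]
    if not candidates:
        return None  # no usable template; fall back to query-only prediction
    return max(candidates, key=lambda h: h.tm_score)

hits = [
    Hit("1abc_A", "1abc_A", 1.000, 1.00),  # self-hit, excluded
    Hit("1abc_A", "2xyz_B", 0.995, 0.98),  # near-identical, excluded
    Hit("1abc_A", "3def_C", 0.420, 0.81),  # valid template
    Hit("1abc_A", "4ghi_D", 0.300, 0.55),
]
best = select_template(hits)
print(best.target, best.tm_score)  # 3def_C 0.81
```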
-
The foldability was evaluated on three key metrics: TM score, pLDDT, and RMSD. Our method, DualMPNN, achieves the best performance on TM score and pLDDT, demonstrating superior structural similarity and confidence in predictions. While the RMSD of DualMPNN is slightly higher than that of ProteinMPNN, the values for both methods are below 2 Å, indicating that the predicted structures remain highly similar to the original structures. To further compare the foldability of sequences generated by DualMPNN and ProteinMPNN, we conducted experiments using the latest AlphaFold3. The results are summarized as follows. DualMPNN achieves the best performance across all three metrics. These results further validate the robustness of DualMPNN in generating foldable sequences.
| Models | Success | TM-score | pLDDT | RMSD |
|---|---|---|---|---|
| ProteinMPNN (AF2) | 94 | 0.860±0.16 | 0.89±0.10 | 1.36±0.81 |
| ProteinMPNN (AF3) | 94 | 0.858±0.18 | 0.88±0.12 | 1.41±0.76 |
| DualMPNN (AF2) | 94 | 0.862±0.16 | 0.91±0.10 | 1.47±0.86 |
| DualMPNN (AF3) | 95 | 0.871±0.16 | 0.92±0.11 | 1.39±0.80 |

-
We acknowledge that DualMPNN does not improve sequence recovery over ProteinMPNN when the TM score of the template is below 0.5. This behavior is expected, as low TM scores typically indicate poor structural alignment, which cannot provide useful evolutionary information for the predictions. Nevertheless, as evidenced by the distribution of results (Fig. 2a in the main paper), DualMPNN remains comparable to ProteinMPNN in this range. Besides, ProteinMPNN itself achieves only ~0.4 sequence recovery in these cases. Notably, 78.57% of the test-template pairings have TM scores greater than 0.5, where DualMPNN consistently outperforms ProteinMPNN.
Furthermore, cases with TM scores below 0.5 often correspond to orphan sequences, which are widely recognized as more challenging for sequence recovery. While this remains a limitation, our method demonstrates practical value in the majority of real-world scenarios where usable templates are available.
-
We appreciate the concern regarding potential overfitting to templates. However, our model does not simply copy template sequences; instead, it effectively learns the relationships and evolutionary correlations between the query and template. To rigorously test for data leakage, we stratified test sequences by their sequence identity to the templates and compared the sequence recovery across these bins. The results in the table below show that DualMPNN demonstrates robust performance across all sequence identity bins, achieving a recovery rate of 60.6% even when sequence identity is below 0.3. This result underscores that our model's effectiveness stems from learning homology features from templates, not from sequence leakage. Furthermore, as template quality improves (higher sequence identity), recovery rates increase significantly, which aligns with expectations and further validates DualMPNN's ability to effectively leverage high-quality templates for improved sequence recovery.
| Seq Identity (Test vs Templates) | <0.3 | 0.3-0.5 | 0.5-0.7 | 0.7-0.9 | 0.9-0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 803 | 153 | 79 | 53 | 32 |
| ProteinMPNN Recovery | 48.5% | 52.4% | 48.0% | 50.7% | 51.8% |
| DualMPNN Recovery | 60.6% | 68.4% | 83.8% | 87.1% | 95.5% |

Additionally, to validate the generalization ability of DualMPNN, we partitioned the test proteins based on their structural similarity (TM-score) against training proteins. As shown below, DualMPNN consistently outperforms ProteinMPNN across different similarity ranges, demonstrating its ability to generalize even in scenarios with low similarity between test and training proteins.
| TM-score (Test vs Train) | <0.3 | 0.3-0.5 | 0.5-0.7 | 0.7-0.9 | 0.9-0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 706 | 307 | 73 | 29 | 5 |
| ProteinMPNN Recovery | 50.9% | 45.3% | 40.9% | 45.1% | 48.1% |
| DualMPNN Recovery | 66.9% | 60.5% | 58.6% | 74.5% | 71.6% |

-
Thank you for the insightful suggestion. We agree that analyzing the types of sequences predicted by DualMPNN and ProteinMPNN, as well as their respective amino acid distributions, is an important aspect for understanding model performance. To investigate this further, we analyzed the amino acid distributions of sequences generated by both DualMPNN and ProteinMPNN on the CATH test set, comparing them to the ground-truth (natural) sequences. The results are summarized in the table below. Notably, the bold values indicate the maximum distribution offsets relative to the natural residue distribution. This analysis reinforces the idea that DualMPNN produces more biologically realistic sequences by incorporating the structural and sequence information from templates.
Besides, perplexity is also a metric of plausibility in generative protein models. On the CATH dataset, DualMPNN achieves a perplexity of 3.18 (Table 1 in the main paper), which indicates that the model generates highly probable sequences under its learned distribution.
| Amino Acid | Natural (%) | DualMPNN (%) | ProteinMPNN (%) |
|---|---|---|---|
| A | 7.8 | 9.0 (+1.2) | 6.6 (-1.2) |
| C | 1.3 | 1.1 (-0.2) | 1.0 (-0.3) |
| D | 6.1 | 6.1 (+0.0) | 6.4 (+0.3) |
| E | 7.3 | 9.3 (+2.0) | 14.6 (+7.3) |
| F | 4.5 | 4.4 (-0.1) | 3.7 (-0.8) |
| G | 6.7 | 6.8 (+0.1) | 6.4 (-0.3) |
| H | 2.4 | 1.9 (-0.5) | 1.2 (-1.2) |
| I | 5.9 | 5.6 (-0.3) | 4.9 (-1.0) |
| K | 6.1 | 6.2 (+0.1) | 9.1 (+3.0) |
| L | 9.2 | 9.8 (+0.6) | 10.8 (+1.6) |
| M | 1.5 | 1.1 (-0.4) | 0.63 (-0.87) |
| N | 4.2 | 3.8 (-0.4) | 3.6 (-0.6) |
| P | 4.4 | 4.7 (+0.3) | 5.2 (+0.8) |
| Q | 3.9 | 2.5 (-1.4) | 1.0 (-2.9) |
| R | 5.1 | 4.7 (-0.4) | 3.5 (-1.6) |
| S | 5.8 | 5.5 (-0.3) | 4.9 (-0.9) |
| T | 5.5 | 5.2 (-0.3) | 4.5 (-1.0) |
| V | 7.3 | 7.8 (+0.5) | 7.5 (+0.2) |
| W | 1.3 | 1.1 (-0.2) | 1.0 (-0.3) |
| Y | 3.6 | 3.4 (-0.2) | 3.7 (+0.1) |

-
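The distribution comparison above boils down to counting residue frequencies and taking signed offsets against the natural distribution. A minimal sketch, using toy sequences rather than actual model outputs:

```python
# Count amino-acid frequencies over a set of sequences and report the
# per-residue offset (in percentage points) from a reference
# distribution, as in the table above. Sequences here are illustrative.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(sequences):
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}

def offsets(pred_freqs, natural_freqs):
    """Signed percentage-point offset per amino acid."""
    return {a: 100 * (pred_freqs[a] - natural_freqs[a]) for a in AMINO_ACIDS}

natural = aa_frequencies(["ACDEFGHIKLMNPQRSTVWY"])     # uniform toy reference
predicted = aa_frequencies(["AACDEFGHIKLMNPQRSTVWY"])  # one extra Ala
diff = offsets(predicted, natural)
print(f"A offset: {diff['A']:+.1f} pp")  # A offset: +4.5 pp
```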
Thank you for your constructive suggestions regarding the clarity and presentation of the paper. We will move non-novel and technical details to the appendix in the revised version, allowing the main paper to focus on our key contributions, novelties, and more in-depth discussions, analyses, and experiments.
Responses to Questions
-
The symbol refers to the length of the query proteins.
-
Thank you for your question regarding the criteria for matching residue pairs in Eq. (3). The aligned residue pairs between the query and template protein are pre-defined during the query-template alignment process. Specifically, residue pairs are considered aligned if the atomic distance between them is less than 5 Å, which is the default threshold used by Foldseek. This threshold is passed as a parameter to the model and defines the aligned structural domains. We used this setting without modification, but we agree that exploring the impact of varying this hyperparameter could be an interesting direction for future work.
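The matching criterion described above can be sketched in a few lines: after superposing the two structures, a candidate residue pair (i, j) from the alignment is kept only if the distance between its C-alpha atoms is below the 5 Å threshold. The coordinates and alignment mapping below are toy values, not real PDB data.

```python
# Keep only alignment pairs whose C-alpha distance is below a threshold
# (5 Angstrom, Foldseek's default per the text above).
import numpy as np

def aligned_pairs(query_ca, template_ca, mapping, threshold=5.0):
    """mapping: list of (i, j) residue-index pairs from the structural
    alignment; returns the pairs with C-alpha distance < threshold."""
    kept = []
    for i, j in mapping:
        d = np.linalg.norm(query_ca[i] - template_ca[j])
        if d < threshold:
            kept.append((i, j))
    return kept

# Toy superposed coordinates: the first pair is 1.0 A apart (aligned),
# the second is 16.2 A apart (rejected).
query_ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])
template_ca = np.array([[1.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
pairs = aligned_pairs(query_ca, template_ca, [(0, 0), (1, 1)])
print(pairs)  # [(0, 0)]
```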
-
Your observation is correct: the symbol in Line 132 should indeed be corrected. Thank you for pointing out this issue. We will address it in the revised version to ensure clarity and avoid any potential confusion for readers.
-
Initially, we used 48 neighbors in the Template Branch, consistent with the Query Branch. However, we later experimented with reducing the number of neighbors in the Template Branch to 4. This change did not degrade sequence recovery performance, while also reducing memory usage. Therefore, we adopted 4 neighbors in the Template Branch for the final model to strike a balance between performance and computational efficiency.
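For readers unfamiliar with the graph construction, the neighbor counts above refer to k-nearest-neighbor graphs over residues: each residue is connected to its k closest residues by C-alpha distance (k = 48 in the query branch, k = 4 in the template branch). A small illustrative sketch with random coordinates:

```python
# Build a k-nearest-neighbor residue graph by C-alpha distance,
# as used (with different k) in the query and template branches.
import numpy as np

def knn_edges(coords, k):
    """Return an (N, k) array of each residue's k nearest neighbors
    (self-edges excluded) by Euclidean distance."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-edges
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))      # toy C-alpha coordinates
query_edges = knn_edges(coords, k=48)   # query-branch graph
template_edges = knn_edges(coords, k=4) # template-branch graph
print(query_edges.shape, template_edges.shape)  # (100, 48) (100, 4)
```

Reducing k shrinks both the memory footprint of the edge features and the amount of geometric context each template node aggregates, which is consistent with the observation above that the template branch mainly contributes sequence-level (node) information.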
-
The term "easy-search" refers to the default search mode in Foldseek. This mode is designed to efficiently identify structurally similar protein candidates without requiring extensive parameter tuning or alignment strategy adjustments. Since our primary goal was to find structurally similar template candidates for each query protein, we opted for this default mode to simplify the search process. In the revised manuscript, we will improve the description by explaining the search algorithm more clearly and avoid including tool-specific command details.
I would like to thank the authors for their detailed reply, addressing all my questions and providing extensive additional experiments to support the arguments.
The experiments are indeed valuable and address concerns and questions around overfitting, the sequence distribution of DualMPNN, and model performance compared to ProteinMPNN. The additional results are strong.
I would like to ask the authors to include all new experiments and explanations in the final version of their paper. I am happy to raise my score and suggest acceptance.
The paper proposes a dual stream network architecture for the inverse folding problem which incorporates information from structural alignments by cross-attention between two instances of a base inverse folding model such as ProteinMPNN. The authors demonstrate improved performance compared to baselines and highlight the importance of good templates.
Strengths and Weaknesses
The proposed idea is simple and clean, and the text is well written and easy to follow (apart from a few typos, see below). The inverse folding problem is currently very relevant in the context of protein design, and the experiments are well conducted and focus on a few key aspects such as ablations and the dependence on template quality.
I am only concerned about data leakage from the alignment step, especially when provided during evaluation. One might argue that DualMPNN gains unfair advantage through the additional information which was not available to ProteinMPNN. As the similarity threshold is quite high (99%), a near-identical template could potentially be selected for a test protein, in which case the low-hanging fruit for the model is simply to map the template sequence to the query sequence. It would be good if the authors take great care to discuss this point in the paper, as it is a little sensitive topic.
I, in principle, would recommend to accept the paper as it is well executed, but I reserve to change my assessment based on the author discussion of this critical point.
Questions
- Leakage, see above. Are templates at test time limited to training samples or is data leakage otherwise prevented?
- Is the template strictly required at inference or can the model be run without? Is the performance then comparable to a bad template (with low TM) or worse?
- How were the hyperparameters for the similarity threshold (99%) and the neighbors in query and template model (which also differ strongly) chosen?
- How long did it take to run the alignment for all training data? This might also be relevant in the limitations. Is the dataset available?
The following questions are more for my own interest, I don’t require them to be answered:
- Does the template branch focus more on the node label or the geometric edge label information? Do you have any indications how this is weighted, eg through an ablation of the geometric edge features?
- It would be interesting to see how DualMPNN performs on structural holdouts (eg from FoldSeek clustering) compared to ProteinMPNN, which would point towards generalization to novel folds, although I understand that this is out of scope for this rebuttal.
Limitations
- As mentioned above, if the runtime is significantly increased through the alignment, that should be mentioned.
- Also, while the availability of templates is discussed, the authors discard this with the argument that in a real-world scenario there will likely be some template available. I would argue here that one of the primary applications of inverse folding models is the sequence design for novel protein folds, where such templates might not be available. (I would just recommend here to let the limitation stand as it is without trying to remedy it.)
Final Justification
The authors have addressed my concerns on data leakage, in fact, the performance gains on low sequence similarity are rather impressive. I would recommend to also include these additional results in the final revision.
I am sure the tool will prove useful to the protein design community (if released accessibly with code and data ;) and I hence recommend to accept the submission.
Formatting Concerns
Minor comments:
- Some typos: 41 “blueprint for”? 42 “with randomly initialized”? 48 “nodes in the”, 197 “on”, 209 “on”, 255 “a recovery”
- Give references for statements made in line 234-239
- L250: TM score to what? Most similar template?
- L291: mean TM/pLDDT?
- It would be good to harmonize query/target notations. Sometimes these are just single-letter subscripts, sometimes full words, and sometimes only the template has a subscript which is omitted in the query.
Responses to Questions
-
We appreciate the reviewer for raising concerns about potential data leakage during evaluation. To rigorously address this and demonstrate DualMPNN's generalization capability, we conducted two targeted analyses:
(1) Impact of Sequence Identity with Templates. We stratified test sequences into five bins based on their sequence identity to templates, and then compared the sequence recovery of ProteinMPNN and DualMPNN across these bins. The results in the table below show that most test-template pairs exhibit low sequence identity (<0.5). Crucially, even at very low sequence identity (<0.3), DualMPNN achieves 60.6% sequence recovery, significantly outperforming the 48.5% of ProteinMPNN. This strongly indicates that the performance improvement of our model does not stem from data leakage, but rather results from effectively leveraging the evolutionary information in homologous structural templates. Additionally, as sequence identity improves, our model can further exploit this information to better assist sequence recovery for query proteins, which is both reasonable and in line with expectations.
| Sequence Identity (Test vs Templates) | <0.3 | 0.3-0.5 | 0.5-0.7 | 0.7-0.9 | 0.9-0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 803 | 153 | 79 | 53 | 32 |
| ProteinMPNN Recovery | 48.5% | 52.4% | 48.0% | 50.7% | 51.8% |
| DualMPNN Recovery | 60.6% | 68.4% | 83.8% | 87.1% | 95.5% |

(2) Impact of Structural Similarity to Training Data. To further validate generalization, we partitioned the test proteins based on their structural similarity (TM-score) relative to the closest training proteins. The results in the table below show that DualMPNN consistently outperforms ProteinMPNN across all structural similarity bins, including the lowest region (<0.3 TM-score). This confirms DualMPNN's robustness to novel protein folds absent from the training set.
| TM-score (Test vs Train Set) | <0.3 | 0.3-0.5 | 0.5-0.7 | 0.7-0.9 | 0.9-0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 706 | 307 | 73 | 29 | 5 |
| ProteinMPNN Recovery | 50.9% | 45.3% | 40.9% | 45.1% | 48.1% |
| DualMPNN Recovery | 66.9% | 60.5% | 58.6% | 74.5% | 71.6% |

Together, these experiments demonstrate that DualMPNN's performance improvements arise from its capacity to utilize structural templates and generalize to novel folds, not data leakage.
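The stratified analysis above is straightforward to reproduce given per-protein recovery values and identity (or TM-score) values. A minimal sketch with synthetic data, assuming real test-template pairs would be substituted in practice:

```python
# Bin test proteins by identity/similarity and report mean sequence
# recovery per bin, as in the stratified tables above. Data is synthetic.
import numpy as np

def recovery(pred, true):
    """Fraction of positions where predicted and native residues agree."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def stratify(values, recoveries, edges=(0.0, 0.3, 0.5, 0.7, 0.9, 0.99)):
    """Mean recovery per bin [edges[i], edges[i+1]); None for empty bins."""
    values = np.asarray(values)
    recoveries = np.asarray(recoveries)
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (values >= lo) & (values < hi)
        out[(lo, hi)] = float(recoveries[mask].mean()) if mask.any() else None
    return out

ids = [0.1, 0.2, 0.4, 0.6, 0.95]                       # toy identities
recs = [recovery("ACDE", "ACDF"), 0.55, 0.70, 0.85, 0.96]
table = stratify(ids, recs)
print(table[(0.0, 0.3)])  # 0.65  (mean of 0.75 and 0.55)
```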
-
If no template information is available at inference, the model degrades to the base model, ProteinMPNN. However, it remains operational in this scenario.
-
The similarity threshold of 99% was selected specifically to exclude cases where the query and template were identical. For the number of neighbors in the template branch, we initially employed 48 neighbors. However, we reduced this to 4 neighbors after observing that recovery performance remained uncompromised while memory usage decreased dramatically. Consequently, the final configuration for the template branch utilizes 4 neighbors.
-
The alignment was performed using Foldseek. For the CATH4.2 dataset, the multi-structure pairwise search was completed in about 1 hour, which is approximately 5.6 proteins per second on 64 CPU cores. The template dataset is Protein Data Bank (PDB), a publicly available resource already integrated into the Foldseek server for direct use.
-
In the template branch, the similar performance observed when reducing the number of neighbors from 48 to 4 indicates that geometric edge features are not particularly critical for the model’s performance. Instead, the model primarily learns sequence-related features from the template protein, which are embedded in the node representations. This highlights that the node features, rather than edge features, play a more significant role in the template branch.
-
Following your suggestion, to evaluate the generalization of our model on novel proteins, we clustered the structural similarities between the test and train sets and extracted test proteins with low similarity to the training set. This yielded 523 of the original 1120 test proteins, forming a new "structural holdouts set". We evaluated the performance of both ProteinMPNN and DualMPNN on this filtered test set and compared it to their performance on the full test set. The results in the table below show that DualMPNN maintains strong recovery accuracy even for structurally novel proteins, achieving a recovery rate of 61.96% compared to 45.50% for ProteinMPNN. In addition, the results on "Impact of Structural Similarity to Training Data" in the response to Q1 also underscore the generalization of DualMPNN. Our model leverages existing information to assist in sequence recovery, and as long as the templates are of sufficiently high quality, it can further enhance recovery rates.
| Model | All Rec. % | All Perplexity | Short Chain Rec. % | Short Chain Perplexity | Single Chain Rec. % | Single Chain Perplexity |
|---|---|---|---|---|---|---|
| ProteinMPNN (test set) | 49.87 | 4.57 | 36.35 | 6.21 | 34.43 | 6.68 |
| DualMPNN (test set) | 65.51 | 3.18 | 55.97 | 4.42 | 52.41 | 5.04 |
| ProteinMPNN (structural holdouts set) | 45.50 | 5.67 | 34.36 | 8.13 | 29.14 | 9.99 |
| DualMPNN (structural holdouts set) | 61.96 | 3.59 | 56.34 | 4.46 | 47.43 | 5.73 |
Responses to Limitations
-
As previously mentioned, we employed Foldseek to conduct the structural search task for the CATH dataset. The search completed in approximately 1 hour, processing ~5.6 proteins per second, executed on 64 CPU cores. For each query protein, Foldseek identified multiple candidate template structures. Crucially, these search results were computed once and can be stored locally for reuse in subsequent analyses.
-
We acknowledge the limitation that our method relies on the availability of template structures, which may not always exist for orphan protein folds. However, in practical scenarios, many protein design tasks are based on modifying existing protein structures. In such cases, a large number of homologous proteins or structural fragments can serve as templates. Our method is particularly advantageous in this context, as it does not require a perfect structural match for sequence recovery. Instead, it can match and utilize fragments of templates to guide the design process.
Nevertheless, we recognize this as a limitation when applying our approach to proteins with entirely novel folds and will leave this point as a clear area for future improvement.
Responses to Paper Formatting Concerns
Thank you for your valuable feedback. We will address the following points in the revised manuscript:
- Typos: The noted issues (lines 41, 42, 48, 197, 209, 255) will be corrected before publication.
- References: For lines 234-239, we will include the suggested reference: Sikosek, T., & Chan, H. S. (2014). Biophysics of protein evolution and evolutionary protein biophysics.
- L250: We will clarify that the “TM score” refers to the similarity between test proteins and their closest template proteins.
- L291: It means TM-score.
- Notations: We will harmonize the query/target notations for consistency throughout the manuscript.
I thank the authors for exhaustively answering my questions and appreciate the additional results, which addressed my concerns about data leakage. I am hence keeping my recommendation to accept.
This paper presents DualMPNN, a dual-stream message passing neural network that improves inverse protein folding by leveraging structurally aligned templates through TM-align and attention-based feature fusion. The model shows significant gains in sequence recovery and perplexity over existing methods (e.g. ProteinMPNN), particularly when high-quality templates are available. Extensive ablation and benchmark evaluations demonstrate the model’s robustness and architectural soundness.
Strengths and Weaknesses
Strengths
-
Comprehensive ablation studies demonstrate the contribution of each architectural component: template node initialization improves recovery from 49.9% to 61.3%, cross-stream attention further boosts it to 64.8%, and TM-score modulation leads to a final recovery rate of 65.4%, highlighting the additive benefit of each mechanism.
-
Meaningful improvement over baselines: DualMPNN outperforms ProteinMPNN and GraDe-IF across standard benchmarks, consistently achieving higher recovery rates and lower perplexity across CATH, TS50, and T500 datasets.
-
The model avoids overfitting in low-TM-score regions due to the masking of unaligned residues and the use of TM-score–weighted attention, which downregulates unreliable template signals. This is supported by empirical evidence: in Figure 2(a), proteins with TM-score < 0.3 show no significant drop in recovery rate compared to ProteinMPNN; and in Figure 2(b), low TM-score cases (blue and orange points) cluster near the diagonal, indicating that DualMPNN neither helps nor harms when template similarity is poor.
-
Attention score modulation by the global TM-score allows the model to selectively leverage high-quality templates, leading to recovery gains over 30% when TM-score > 0.9.
Limitations
-
Triviality of the core insight: While the paper quantifies the relationship between TM-score and sequence recovery, the central finding, i.e. that higher structural similarity leads to better recovery, is intuitive.
-
Potential data leakage: The authors only exclude near-identical templates (TM-score > 0.99, same PDB ID), and hence, residual homology may still bias results. The absence of explicit sequence identity filtering raises concerns about the true independence of training and test pairs.
-
Lack of functional assessment: The study evaluates sequence recovery but does not test whether the generated sequences are functional, foldable, or biologically plausible beyond AlphaFold confidence metrics.
-
No direct test for hallucination: The paper claims robustness to low-quality templates, but does not evaluate whether it generates unrealistic or misleading sequences in such cases. There is no analysis of output diversity, entropy, or structural viability in weak-alignment settings.
Questions
Suggestions for improvement
-
Stricter homology filtering: To ensure generalization and avoid template leakage, consider filtering training/test pairs using sequence identity thresholds (e.g., <30%), in addition to TM-score and PDB-based exclusions.
-
Include SCOP-based evaluations: Evaluating the model on SCOP-classified domains would help validate whether DualMPNN generalizes across different evolutionary and topological classifications, complementing the current CATH-based benchmarks.
-
Why not use AlphaFold3? Given the availability of AlphaFold3, the authors should either validate foldability using the updated model or strongly justify the continued use of AlphaFold2. Additionally, they could leverage structures excluded from AlphaFold2 training but included in AlphaFold3 as a valuable test set for assessing generalization.
-
Incorporate functional or structural plausibility metrics: Beyond recovery and perplexity, authors could evaluate conservation of functional motifs, residue-level confidence, or validate structure-function relationships via experimental assays or in silico function prediction.
-
Analyze diversity and entropy under poor templates: To support the claim of robustness under low-quality alignments, the model’s output diversity and per-residue entropy could be measured and compared to baselines, particularly in low TM-score regimes.
-
Several complementary metrics could provide deeper insight into the biological plausibility, structural consistency, and functional relevance of the inferred sequences. For instance, structural similarity metrics such as RMSD after in silico folding, confidence scores from AlphaFold (e.g., pLDDT), and physicochemical assessments such as packing quality or solvation energy can help evaluate whether the designed sequences are not only recoverable but also foldable and functionally viable.
Limitations
The authors could strengthen the discussion by acknowledging key limitations, such as:
- A lack of functional validation for generated sequences.
- Potential data leakage due to residual homology between training and test sets.
- Reliance on AlphaFold2 for foldability without discussing more recent tools like AlphaFold3.
Justification for Final Rating
I am increasing my score to accept following a thorough and thoughtful rebuttal.
Resolved Issues:
- Homology concerns were addressed with sequence identity–stratified analysis.
- Foldability validated using AlphaFold3.
- Additional biophysical and diversity metrics support plausibility and robustness under poor templates.
Remaining Issues:
- SCOP-based evaluation is deferred but acknowledged.
- Functional validation is still lacking but clearly stated as a limitation.
Given the new empirical results, I now recommend acceptance.
Formatting Issues
Nothing to report
Thank you for your careful review and constructive suggestions with regard to our manuscript. While the correlation between structural similarity and sequence recovery is intuitive, our work is the first to effectively incorporate structural alignments into inverse folding via TM-score–modulated attention and dual-stream message passing. By turning this simple idea into a practical and robust framework with strong empirical gains, our method not only advances performance but also provides a useful paradigm for leveraging structural prior knowledge in protein design.
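As a loose illustration of the TM-score–modulated attention idea mentioned above, the sketch below gates attention logits by a scalar query-template TM-score, so that poorly aligned templates contribute less to the output. This is a hypothetical simplification under our own assumptions (function name, scalar gating scheme), not the actual DualMPNN implementation.

```python
import numpy as np

def tm_modulated_attention(q, k, v, tm_score):
    """Scaled dot-product attention whose logits are scaled by the
    query-template TM-score in [0, 1]: at tm_score = 0 the attention
    collapses to a uniform average over template positions, so weak
    templates exert little position-specific influence.
    q, k, v: (n, d) arrays; returns an (n, d) array."""
    d = q.shape[-1]
    logits = tm_score * (q @ k.T) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `tm_score = 0` every query position receives the plain average of the template values; as `tm_score` grows, the attention sharpens toward structurally matching positions.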
Responses to Questions:
To address the homology filtering and data leakage concerns, we evaluated the effect of homology by partitioning test samples according to their sequence identity with templates. All test samples were binned into five sequence identity ranges, and we then compared sequence recovery between ProteinMPNN and DualMPNN across these bins.
As shown in the table below, DualMPNN demonstrates robust performance across all sequence identity bins, achieving a recovery rate of 60.6% even when sequence identity is below 0.3. These results underscore that our model’s effectiveness does not stem from sequence leakage, but rather from its ability to learn homology features from templates of varying quality. Furthermore, as template quality improves (higher sequence identity), recovery rates increase significantly, which aligns with expectations and further validates our model’s capacity to effectively leverage high-quality templates for improved sequence recovery. We will include these analyses in the revised version.
| Sequence Identity between Test Samples and Templates | < 0.3 | 0.3–0.5 | 0.5–0.7 | 0.7–0.9 | 0.9–0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 803 | 153 | 79 | 53 | 32 |
| ProteinMPNN Recovery | 48.5% | 52.4% | 48.0% | 50.7% | 51.8% |
| DualMPNN Recovery | 60.6% | 68.4% | 83.8% | 87.1% | 95.5% |
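The binned comparison above can be reproduced with a few lines of stratification code. The bin edges match the table; the (identity, recovery) pairs in the demo are illustrative placeholders, not the paper's data.

```python
from collections import defaultdict

def bin_recovery(samples, edges=(0.3, 0.5, 0.7, 0.9, 0.99)):
    """Group (identity, recovery) pairs into identity bins and
    return the mean recovery per bin index (0 = below first edge)."""
    bins = defaultdict(list)
    for identity, recovery in samples:
        # index of the first edge the identity falls below
        idx = next((i for i, e in enumerate(edges) if identity < e), len(edges))
        bins[idx].append(recovery)
    return {i: sum(v) / len(v) for i, v in sorted(bins.items())}

# toy data: two samples land in the < 0.3 bin, one in the 0.3-0.5 bin
demo = [(0.10, 0.60), (0.25, 0.62), (0.40, 0.68)]
```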
We agree that SCOP provides a rigorous framework for evaluating generalization across evolutionary and topological classes. While time constraints prevent us from completing SCOP-based experiments during the rebuttal period, we will include them in the revised version of the paper.
To illustrate model generalization in the meantime, we identified, for each test sample, its maximum TM-score relative to the training set and grouped the test set into five TM-score ranges. Both ProteinMPNN and DualMPNN were then evaluated across these bins to analyze generalization under varying levels of structural similarity between the training and test sets. As shown below, DualMPNN maintains strong performance even in low structural similarity regimes, demonstrating its ability to generalize beyond closely related folds. These results provide solid evidence of the model's robustness across diverse structural families and its strong generalization.
| TM-score of Test Samples against Train Set | < 0.3 | 0.3–0.5 | 0.5–0.7 | 0.7–0.9 | 0.9–0.99 |
|---|---|---|---|---|---|
| Number of Test Samples | 706 | 307 | 73 | 29 | 5 |
| ProteinMPNN Recovery | 50.9% | 45.3% | 40.9% | 45.1% | 48.1% |
| DualMPNN Recovery | 66.9% | 60.5% | 58.6% | 74.5% | 71.6% |
To further validate the foldability of sequences generated by DualMPNN, we have predicted their structures using AlphaFold3 and compared the results with AlphaFold2 predictions. These results demonstrate the excellent foldability of sequences generated by DualMPNN, achieving strong folding performance on both AF2 and AF3. This further highlights the reliability and robustness of DualMPNN. The outcomes are summarized below.
| Models | Success ↑ | TM-score ↑ | pLDDT ↑ | RMSD ↓ |
|---|---|---|---|---|
| ProteinMPNN (AF2) | 94 | 0.860 ± 0.16 | 0.89 ± 0.10 | 1.36 ± 0.81 |
| ProteinMPNN (AF3) | 94 | 0.858 ± 0.18 | 0.88 ± 0.12 | 1.41 ± 0.76 |
| DualMPNN (AF2) | 94 | 0.862 ± 0.16 | 0.91 ± 0.10 | 1.47 ± 0.86 |
| DualMPNN (AF3) | 95 | 0.871 ± 0.16 | 0.92 ± 0.11 | 1.39 ± 0.80 |
To further evaluate the plausibility of the generated sequences, we assessed DualMPNN on the following metrics using the CATH test set:
(1) Perplexity: Perplexity is a widely used proxy for plausibility in generative protein models. On the CATH dataset, DualMPNN achieves a PPL of 3.18 (Table 1 in main paper), which indicates that the model generates highly probable sequences under its learned distribution.
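For reference, sequence-level perplexity is the exponential of the mean per-residue negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities the model assigned to each true residue.
    A perfectly confident model (all probs 1.0) has perplexity 1."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```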
(2) Physicochemical Properties: We evaluated two key sequence-level biophysical indicators: GRAVY score (hydropathy) and polar residue composition. DualMPNN shows distributions that closely match natural sequences and significantly outperforms ProteinMPNN in mimicking these properties (Average denotes the mean score over all samples).
| GRAVY | [-0.8, -0.6) | [-0.6, -0.3) | [-0.3, 0) | Average |
|---|---|---|---|---|
| Natural sequences | 10.45% | 34.46% | 35.09% | -0.3233 |
| DualMPNN | 7.95% | 32.14% | 36.52% | -0.2586 |
| ProteinMPNN | 16.79% | 39.29% | 19.82% | -0.5404 |

| Percentage of polar residues | 35%–45% | 45%–55% | 55%–70% | 35%–70% | Average |
|---|---|---|---|---|---|
| Natural sequences | 9.73% | 57.14% | 31.07% | 97.94% | 52.21% |
| DualMPNN | 12.77% | 58.09% | 26.29% | 97.15% | 51.30% |
| ProteinMPNN | 5.98% | 50.71% | 40.27% | 96.96% | 54.39% |

(3) Amino Acid Distribution: We compared the residue distribution of sequences generated by DualMPNN and ProteinMPNN against that of natural sequences; the bold values mark the maximum offsets relative to the natural residue distribution. The results below demonstrate that DualMPNN effectively mimics natural amino acid distributions. By leveraging template priors, DualMPNN better captures the nuanced residue patterns of natural sequences than ProteinMPNN.
| Amino Acid (%) | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Natural sequences | 7.8 | 1.3 | 6.1 | 7.3 | 4.5 | 6.7 | 2.4 | 5.9 | 6.1 | 9.2 | 1.5 | 4.2 | 4.4 | 3.9 | 5.1 | 5.8 | 5.5 | 7.3 | 1.3 | 3.6 |
| DualMPNN | 9.0 (+1.2) | 1.1 (-0.2) | 6.1 (+0.0) | **9.3 (+2.0)** | 4.4 (-0.1) | 6.8 (+0.1) | 1.9 (-0.5) | 5.6 (-0.3) | 6.2 (+0.1) | 9.8 (+0.6) | 1.1 (-0.4) | 3.8 (-0.4) | 4.7 (+0.3) | 2.5 (-1.4) | 4.7 (-0.4) | 5.5 (-0.3) | 5.2 (-0.3) | 7.8 (+0.5) | 1.1 (-0.2) | 3.4 (-0.2) |
| ProteinMPNN | 6.6 (-1.2) | 1.0 (-0.3) | 6.4 (+0.3) | **14.6 (+7.3)** | 3.7 (-0.8) | 6.4 (-0.3) | 1.2 (-1.2) | 4.9 (-1.0) | 9.1 (+3.0) | 10.8 (+1.6) | 0.63 (-0.87) | 3.6 (-0.6) | 5.2 (+0.8) | 1.0 (-2.9) | 3.5 (-1.6) | 4.9 (-0.9) | 4.5 (-1.0) | 7.5 (+0.2) | 1.0 (-0.3) | 3.7 (+0.1) |

These results suggest that DualMPNN generates low-perplexity sequences while capturing physicochemical patterns and foldability signals, resembling natural proteins and outperforming ProteinMPNN in producing biologically plausible sequences, even without explicit functional training.
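For readers unfamiliar with the GRAVY metric discussed above: it is the grand average of hydropathy, i.e., the mean Kyte–Doolittle hydropathy value over all residues in a sequence. A minimal sketch using the standard Kyte–Doolittle scale:

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982);
# positive = hydrophobic, negative = hydrophilic
KD = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def gravy(seq):
    """Grand average of hydropathy: mean KD value over the residues."""
    return sum(KD[aa] for aa in seq) / len(seq)
```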
To assess robustness under low-quality templates, we selected the test samples with template TM-score < 0.5 (240 of 1,120) and evaluated output diversity and sequence entropy. The results below suggest that DualMPNN does not introduce excessive randomness under weak structural alignment, performing comparably to natural sequences in both diversity and entropy. This demonstrates its robustness to low-quality templates and resistance to hallucination.
| Metrics | Diversity | Entropy |
|---|---|---|
| Natural sequences | 0.5108 ± 0.2733 | 2.801 ± 1.040 |
| ProteinMPNN | 0.4935 ± 0.2664 | 2.659 ± 1.054 |
| DualMPNN | 0.5204 ± 0.2456 | 2.835 ± 0.926 |
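One common way to compute a per-sequence entropy of this kind is the Shannon entropy of the amino-acid composition; since the exact definitions of diversity and entropy are not spelled out in the rebuttal, the sketch below is our assumption of a plausible formulation, not the authors' implementation.

```python
import math
from collections import Counter

def composition_entropy(seq):
    """Shannon entropy (bits) of a sequence's amino-acid composition.
    Higher values indicate a more even usage of residue types;
    a single-residue-type sequence scores 0."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```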
We have added experiments on packing quality and solvation energy. Solvation energy is split into two components, solvation polar energy and solvation hydrophobic energy, for a detailed presentation. Because packing quality is not a single directly quantifiable metric, we evaluated van der Waals interaction energy (VDW Energy) and van der Waals clash energy (VDW Clash Energy) as proxies. The results in the table below show that DualMPNN achieves balanced performance across solvation and packing-related metrics, highlighting its ability to generate sequences with high structural quality and stability that closely resemble natural proteins.
| Metrics (kcal/mol) | Solvation Polar | Solvation Hydrophobic | VDW Energy | VDW Clash Energy |
|---|---|---|---|---|
| Natural sequences | 239.47 ± 149.92 | -232.41 ± 149.97 | -176.66 ± 112.95 | 10.87 ± 10.44 |
| ProteinMPNN | 233.46 ± 147.80 | -237.21 ± 155.32 | -176.80 ± 114.11 | 8.87 ± 7.08 |
| DualMPNN | 236.57 ± 150.70 | -236.21 ± 153.66 | -176.27 ± 112.69 | 8.60 ± 7.50 |
Thank you for the careful response and the additional analyses. The rebuttal has reasonably addressed my comments, and I will therefore increase my score to accept.
(5,5,5) This paper introduces DualMPNN, a message passing network for inverse protein folding / protein design, which builds upon the ProteinMPNN model. The key innovation is using FoldSeek to find similar protein structures, akin to how sequence-to-structure prediction models leverage related sequences in the forward prediction task. Reviewers found the use of structural alignments intuitive and an original and promising research direction. Reviewers raised concerns about data leakage and generalization, but the authors provided extensive analyses and additional experiments during the rebuttal, satisfactorily addressing all points and strengthening the case for acceptance.