PaperHub
Rating: 6.6 / 10
Poster · 4 reviewers
Scores: 4, 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

OpenReview · PDF
Submitted: 2025-01-15 · Updated: 2025-07-24
TL;DR

A diffusion model capable of designing functional proteins comparable to natural proteins.

Abstract

Keywords
function · protein design · diffusion model · multi-objective

Reviews and Discussion

Official Review
Rating: 4

The paper proposes introducing multiple conditions into the protein sequence design process based on DPLM using a method similar to ControlNet. It achieves the integration of various conditions through the designed RCFE and AGFM modules. The performance on protein design tasks with multiple conditions shows a significant improvement compared to baseline models.

Questions for Authors

I don't have other questions.

Claims and Evidence

In my view, this paper is more focused on algorithmic improvements for a specific application, and therefore does not include many new claims. Instead, it represents a natural extension of existing models and methods. Overall, the claims in this paper are reasonable and supported.

Methods and Evaluation Criteria

The proposed methods in this paper and the Evaluation Criteria used during the assessment are reasonable. The core method of the paper is to integrate different condition information into DPLM using a ControlNet-like approach. Since ControlNet has already been thoroughly validated as an effective method for introducing generation conditions into diffusion models, this approach is reasonable. The Evaluation Criteria in this paper also follow the evaluation methods used in previous work.

Theoretical Claims

This paper is more "application-oriented," and therefore does not include many theoretical claims.

Experimental Design and Analyses

The structure of this paper is clear and easy to follow. However, the experimental section needs further improvement, including additional analysis and more thorough discussion. Overall, the experimental design and analysis in this paper are reasonable, as the paper attempts to validate the impact of different conditions on model performance and demonstrates that the introduction of multiple conditions improves the generation performance of Diffusion Language Models. However, there are a few issues that need further validation:

  1. The DPLM model may encounter some mode collapse issues, such as generating sequences with many repeated segments. However, this paper does not discuss the impact of this issue, such as whether mode collapse is mitigated after introducing additional conditions. I believe this discussion is necessary.
  2. More case studies are needed. For example, visualizations of the model’s generated results when a motif is given as a constraint could further validate the model's ability to adhere to different types of conditions.
  3. The results in Table 2 need more explanation. CFP-GEN achieves excellent performance on three of the metrics, but on scTM and pLDDT there is a significant gap compared to the best baseline model. What is the reason for this phenomenon? Could it be due to overfitting to the data?

Supplementary Material

This paper does not provide supplementary materials.

Relation to Prior Literature

This paper is an extension and elaboration of the existing DPLM (Diffusion Protein Language Model), incorporating methods such as condition integration. Compared to ESM3, this model supports more types of conditional inputs.

Missing Important References

The related work discussed in the paper is reasonable.

Other Strengths and Weaknesses

The structure of this paper is clear and easy to follow. However, the experimental section needs further improvement, including additional analysis and more thorough discussion.

Other Comments or Suggestions

I don't have other comments.

Author Response

Reviewer s6DK

We appreciate your recognition of the novelty and strong performance of our method. Your questions raise important points, and we provide detailed clarifications and new quantitative results below. We would be happy to receive any additional constructive feedback.


Q1. Mitigation of sequence pattern collapse over DPLM.

A1. Thank you for the careful review. To assess whether CFP-GEN mitigates the mode collapse issue observed in DPLM, we analyzed the frequency of repeated n-gram patterns (n = 2, 3, 4, 5, 6) in the generated sequences from the GO-conditioned generation results in original Table 1. Real protein sequences from the validation set were used as a positive control.

| Methods | 2-gram | 3-gram | 4-gram | 5-gram | 6-gram |
|---|---|---|---|---|---|
| Positive Control | 404 | 164 | 0 | 0 | 0 |
| DPLM | 315 | 462 | 104 | 46 | 26 |
| CFP-GEN (w/ GO) | 363 | 351 | 15 | 9 | 8 |
| CFP-GEN (w/ GO and IPR) | 365 | 332 | 9 | 5 | 4 |
| CFP-GEN (w/ GO, IPR and Motif) | 377 | 336 | 4 | 1 | 1 |

As shown, CFP-GEN produces a similar number of 2-grams to real proteins, while significantly reducing the number of longer repetitive n-grams, especially 4-gram to 6-gram patterns. Notably, the more functional conditions (e.g., GO, IPR, Motif) are provided, the fewer repetitive patterns appear in the output, indicating better sequence quality and reduced mode collapse. These results provide strong evidence that CFP-GEN effectively alleviates the mode collapse issue observed in DPLM. These discussions will be included in the revised supplementary material. Thanks!
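The repetition statistic used above can be sketched in a few lines (a minimal illustration; the counting convention, n-gram types occurring more than once per sequence, is our assumption rather than the authors' exact script):

```python
from collections import Counter

def repeated_ngrams(seq: str, n: int) -> int:
    """Count distinct n-grams that occur more than once in a sequence."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return sum(1 for c in counts.values() if c > 1)

# A collapsed (repetitive) segment shows far more repeated n-grams
# than a varied segment of the same length.
collapsed = "LLLLLLLLPPPPPPPP"
varied = "MKTAYIAKQRQISFVK"
assert repeated_ngrams(collapsed, 3) > repeated_ngrams(varied, 3)
```

In this convention, a positive control built from real sequences naturally scores near zero for long n-grams, matching the trend in the table.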


Q2. More visualization results given motif as constraints.

A2. Thank you for the helpful suggestion. We agree that visualizing protein structures designed under specific motif constraints can provide valuable insights into the model's design choices under different functional conditions. However, due to the limitations of the rebuttal format, we are unable to include full visualizations here. We will provide these results in the revised supplementary material. We appreciate your understanding.


Q3. Discussion on scTM and pLDDT in Table 2.

A3. Thank you for your careful and insightful observation. We have conducted an in-depth analysis to better understand the structural performance gap between CFP-GEN and the DPLM baseline model.

We found that the slight drop in scTM and pLDDT scores is mainly due to CFP-GEN's tendency to generate more novel structural segments. Specifically, we analyzed the distribution and transformation of secondary structure elements, namely alpha helices (H), beta strands (E), and coils (C), within the designed proteins in original Table 2.

Using structural alignments between the designed and real target proteins, we categorized the transitions of secondary structure elements. For example, H→H represents a correctly preserved alpha helix, while C→H indicates a region originally a coil being redesigned as a helix. The results are shown below (format: local average pLDDT / number of secondary-structure elements):

| Method | H→H | E→E | C→C | C→H | C→E |
|---|---|---|---|---|---|
| DPLM | 91.34 / 758,185 | 94.10 / 307,397 | 86.13 / 584,587 | 84.03 / 85,871 | 92.42 / 61,861 |
| CFP-GEN | 90.02 / 763,497 | 92.50 / 309,725 | 83.46 / 581,072 | 82.23 / 88,565 | 90.56 / 62,409 |
| Δ Difference | –1.32 / +5,312 | –1.60 / +2,328 | –2.67 / –3,515 | –1.81 / +2,694 | –1.86 / +548 |

We observe that CFP-GEN produces more H (helix) and E (strand) segments, while the number of C (coil) regions is reduced. Many coil regions are transformed into more structured elements (C→H or C→E). This behavior reflects CFP-GEN’s design preference: to replace non-functional, flexible regions (coils) with more functionally relevant secondary structures (helices and strands).
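The transition tally described above can be sketched as follows (a hedged illustration assuming aligned, equal-length secondary-structure strings in H/E/C notation; the real analysis relies on structural alignments and is more involved):

```python
from collections import Counter

def ss_transitions(real_ss: str, designed_ss: str) -> Counter:
    """Tally per-residue secondary-structure transitions between an
    aligned real/designed pair (H = helix, E = strand, C = coil)."""
    assert len(real_ss) == len(designed_ss), "strings must be aligned"
    return Counter(f"{r}->{d}" for r, d in zip(real_ss, designed_ss))

# Two coil positions redesigned as helix; helices and strands preserved.
t = ss_transitions("CCCHHHHEEEC", "CHHHHHHEEEC")
assert t["C->H"] == 2 and t["H->H"] == 4 and t["E->E"] == 3
```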

While this results in a slight drop in local confidence scores (e.g., pLDDT), likely due to the absence of global conformational energy optimization, it reflects a function-oriented design strategy. We consider this a reasonable trade-off in the context of novel functional protein generation, and view it as a promising direction for future improvement. In particular, we plan to explore energy-based reinforcement learning frameworks to further optimize this aspect. The above analysis and discussion will be included in the revised supplementary material. We appreciate the reviewer’s valuable observation.

Reviewer Comment

Thanks for your rebuttal! It addressed all of my concerns, so I've raised my score to 4. I think it's a solid paper — best of luck with the final decision!

Author Comment

Thank you very much for your positive feedback and for raising your score.

Your suggestions have been invaluable in improving the clarity and completeness of our work, and we will incorporate them into the revised version.

We sincerely thank you again for your support and wish you all the best!

Official Review
Rating: 3

This paper presents CFP-GEN, a large-scale diffusion language model developed for Combinatorial Functional Protein Generation under multiple constraints from diverse modalities. CFP-GEN facilitates de novo protein design by jointly incorporating functional, sequence, and structural constraints. It employs an iterative denoising process to refine protein sequences while conditioning on various functional annotations (such as GO terms, IPR domains, and EC numbers), sequence motifs, and 3D structural features.

To achieve this, the model introduces two key modules: (1) Annotation-Guided Feature Modulation (AGFM), which dynamically adjusts sequence representations based on composable functional annotations, and (2) Residue-Controlled Functional Encoding (RCFE), which explicitly encodes critical residues and captures residue interactions and evolutionary relationships.

Additionally, CFP-GEN supports the integration of 3D structural constraints through off-the-shelf structure encoders. Experimental results show that CFP-GEN can generate novel proteins with functionality comparable to natural proteins and achieves a high success rate in designing multifunctional proteins.

Questions for Authors

No.

Claims and Evidence

See Experimental Designs Or Analyses

Methods and Evaluation Criteria

Yes, the proposed method is well designed for the task of protein generation.

Theoretical Claims

Not applicable, as there is no proof or theoretical claims.

Experimental Design and Analyses

The paper claims that CFP-GEN enables combinatorial functional protein generation under multiple constraints from diverse modalities through the introduction of the AGFM and RCFE modules. However, the experimental evidence provided does not fully support these claims.

In the Benchmarking Protein Functional Performance experiments, the influence of training data is not sufficiently controlled, leaving open the possibility that performance gains may stem from data memorization rather than the effective use of composable functional annotations. Furthermore, existing models such as ProGen2 and ZymCTRL could, in principle, incorporate multiple annotation types via prompt engineering by extending their vocabularies. The absence of comparative experiments with such baselines raises concerns about whether AGFM provides a meaningful advantage for handling multimodal constraints.

Similarly, in the Functional Protein Inverse Folding experiments, CFP-GEN utilizes additional functional labels during generation. Since these labels are not equally available to baseline methods, this experimental setup does not provide clear evidence isolating the effectiveness of RCFE in controlling functional sites or capturing residue-level interactions.

Furthermore, the paper lacks ablation studies on the model architecture, which are necessary to isolate the contributions of AGFM and RCFE. Without these analyses, it remains unclear how much each component contributes to the overall performance.

Supplementary Material

Yes, all of the supplementary material was reviewed.

Relation to Prior Literature

The proposed CFP-GEN method is built upon prior diffusion protein language models, and extends these models by dynamically adjusting representations.

Missing Important References

N/A

Other Strengths and Weaknesses

Please refer to the Summary

Other Comments or Suggestions

I strongly advise the authors to conduct additional experiments to make the paper stronger.

Author Response

Reviewer p2fD

We sincerely thank the reviewer for the positive feedback. We have carefully addressed the concerns below with new analyses and additional experiments, which will be incorporated into the final version. We welcome any further suggestions.


Q1. In-depth analysis of performance gain.

A1. We appreciate the reviewer’s insightful concern. Here, we provide evidence that the performance gains are not a result of memorizing known sequences:

  • Novelty and Diversity (please see Q1 in our response to Reviewer 1Xgk):

    Our generated sequences exhibit much higher novelty and diversity compared to real proteins. This suggests that CFP-GEN does not simply replicate patterns of real proteins, but instead learns to generate truly novel sequences.

  • Mutation Analysis (please see Q2 in our response to Reviewer 1Xgk):

    We observe plausible mutations even within conserved regions. These mutations tend to preserve functionally critical motifs while introducing reasonable variation, suggesting that the model has learned generalizable design principles.

  • Secondary Structure Distribution (please see Q3 in our response to Reviewer s6DK):

    CFP-GEN tends to transform non-functional and flexible coil regions into functional alpha helices and beta strands. It reflects a function-oriented redesign strategy.

Taken together, these findings indicate that CFP-GEN’s improvements stem from its ability to learn function-guided design principles, rather than overfitting to the training data.


Q2. Discussion on annotation-guided generation with ProGen2 and ZymCTRL.

A2. We agree that existing autoregressive (AR) PLMs such as ProGen2 and ZymCTRL can potentially support more annotations. However, we found that diffusion models offer several advantages:

  1. AR models generate strictly left-to-right, limiting their flexibility for tasks like motif scaffolding, which require conditioning at specific positions. In contrast, diffusion models allow arbitrary conditioning and precise position control, essential for realistic protein design.
  2. Diffusion models transfer their sequence representations more readily to discriminative tasks, while AR-based models are primarily designed for generation-only tasks.
  3. Aligning multimodal prompts into a unified token space is challenging for AR models. Our diffusion framework supports flexible multimodal fusion, enabling separate encoding and seamless integration of each modality.
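The contrast between left-to-right decoding and arbitrary-position conditioning can be illustrated with a toy discrete-diffusion-style sampler (a minimal sketch, not DPLM's or CFP-GEN's actual implementation; the alphabet, random reveal schedule, and `fill_masked` helper are illustrative assumptions): fixed motif residues stay pinned while masked positions are filled over several denoising steps, something strictly left-to-right generation cannot express directly.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def fill_masked(seq: list, n_steps: int = 4, seed: int = 0) -> str:
    """Iteratively unmask '_' positions; fixed (motif) residues are
    never touched, so the motif can sit anywhere in the sequence."""
    rng = random.Random(seed)
    masked = [i for i, c in enumerate(seq) if c == "_"]
    for _ in range(n_steps):
        if not masked:
            break
        # Reveal a random subset per step; a trained denoiser would
        # instead pick high-confidence positions and residues.
        for i in rng.sample(masked, max(1, len(masked) // 2)):
            seq[i] = rng.choice(AA)
        masked = [i for i in masked if seq[i] == "_"]
    return "".join(seq)

template = list("__GPFANI____")  # motif GPFANI pinned mid-sequence
out = fill_masked(template)
assert out[2:8] == "GPFANI" and "_" not in out
```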

Since ProGen2 lacks public training code, we extended ZymCTRL by enlarging its vocabulary to include GO and IPR classes, but this did not improve performance. We attribute this to inconsistent annotation formats: for instance, EC:xxxx is split into multiple tokens in ZymCTRL instead of a single semantic unit, making it hard to integrate with GO:xxxx or IPR:xxxx, which are typically treated as discrete categories.

In contrast, CFP-GEN treats each EC/GO/IPR label as an independent class, and integrates them through learned embeddings with additive fusion. We believe more work is needed to enable effective prompt engineering for AR models across diverse annotations, and we welcome future comparisons as such approaches become available. We appreciate the reviewer’s understanding.


Q3. Inverse folding with additional functional labels.

A3. Since none of the existing inverse folding methods support functional labels (e.g., GO terms) as input, we implemented a heuristic baseline (DPLM+DeepGO). Specifically, we used DPLM to generate 20 candidate sequences per backbone, then applied DeepGO to predict GO terms and selected the sequence with the highest overlap with the labeled GO terms:

| Methods | AAR | MRR | Fmax | scTM | pLDDT |
|---|---|---|---|---|---|
| DPLM | 66.94 | 0.721 | 0.552 | 0.883 | 85.33 |
| DPLM+DeepGO | 67.29 | 0.726 | 0.559 | 0.886 | 85.49 |
| CFP-GEN (w/ GO) | 72.05 | 0.866 | 0.571 | 0.887 | 83.28 |

The marginal gains of this pipeline suggest that functional labels are hard to integrate into existing inverse folding models, motivating our development of CFP-GEN, an end-to-end solution that jointly reasons over structure and function. We will include this discussion in the final version. Thank you!
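The selection step of this heuristic baseline can be sketched as follows (a simplified illustration; `select_by_go_overlap` and the plain set-overlap criterion are our assumptions about the described procedure, not the authors' code):

```python
def select_by_go_overlap(candidates, predicted_go, target_go):
    """Pick the candidate whose predicted GO-term set overlaps most
    with the target annotation set (ties resolved by max())."""
    return max(candidates, key=lambda s: len(predicted_go[s] & target_go))

# Hypothetical GO-term predictions for two candidate sequences.
cands = ["seqA", "seqB"]
preds = {"seqA": {"GO:0001", "GO:0002"},
         "seqB": {"GO:0002", "GO:0003", "GO:0004"}}
target = {"GO:0002", "GO:0003"}
assert select_by_go_overlap(cands, preds, target) == "seqB"
```

Because the candidates themselves are generated without functional conditioning, post-hoc filtering of this kind can only pick among them, which is consistent with the marginal gains reported above.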


Q4. The contributions of AGFM and RCFE.

A4. We apologize for the confusion. The ablation studies are in fact presented in the GO-conditioned generation results in Table 1 of the manuscript:

  • CFP-GEN (w/ GO and IPR) corresponds to using AGFM only, with an MRR of 0.779.
  • CFP-GEN (w/ Motif) corresponds to using RCFE only, with an MRR of 0.839.
  • CFP-GEN (w/ GO, IPR and Motif) combines AGFM and RCFE, achieving the best performance with an MRR of 0.870.

These results indicate that both modules independently contribute to performance, and their combination yields additive benefits. We will make this more clear in the final version. Thanks!

Official Review
Rating: 3

This paper introduces CFP-GEN, a diffusion-based language model for combinatorial functional protein generation that integrates multimodal constraints. The proposed Annotation-Guided Feature Modulation and Residue-Controlled Functional Encoding modules enable flexible conditioning across diverse modalities. The model demonstrates superior performance in functional sequence generation, inverse folding, and multi-objective protein design.

Questions for Authors

Can you provide examples of generated sequences that failed to achieve desired functional properties to better understand the model’s limitations?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes. While the paper’s empirical focus is reasonable, a brief theoretical discussion on why AGFM and RCFE improve convergence or functional control would add value.

Experimental Design and Analyses

The authors should conduct additional analysis of failure cases to provide insights into the model's limitations.

Supplementary Material

Yes, I have read all the sections.

Relation to Prior Literature

Expanding the discussion of CFP-GEN's relation to inverse folding techniques and highlighting distinctions from concurrent multimodal PLM designs would improve the paper's clarity.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

  1. The integration of multimodal constraints into a unified framework addresses a critical gap in controllable protein generation. The composable conditioning mechanism (AGFM/RCFE) is a meaningful advance over single-modality approaches.
  2. Comprehensive experiments across three tasks (functional generation, inverse folding, multi-functional design) validate the method’s superiority. The use of state-of-the-art function predictors and structural metrics strengthens credibility.
  3. High success rates in designing multi-functional enzymes and improved sequence recovery in inverse folding suggest tangible applications in enzyme engineering and drug discovery.

Weaknesses:

  1. The dataset filters out low-frequency functional annotations (e.g., GO/IPR terms with <100 sequences), potentially limiting generalizability to rare functions. While results on held-out validation sets are strong, long-tail performance remains unverified.
  2. The use of a frozen, off-the-shelf structure encoder (GVP-Transformer) without fine-tuning may restrict structural optimization. While the authors claim that the pretrained cross-attention layer from DPLM can be used directly without fine-tuning, this assertion lacks experimental validation. Ablation studies on structure-conditioned generation are lacking.

Other Comments or Suggestions

N/A.

Author Response

Reviewer JkEt

We appreciate your thoughtful comments and have addressed your concerns in detail below. The corresponding clarifications and expanded discussion will be reflected in the final version. We welcome any valuable suggestions you may have.


Q1. Generalizability on rare functional annotations.

A1. To evaluate the generalizability of CFP-GEN on long-tail functions, we constructed an expanded dataset by relaxing the filtering threshold: we included all GO terms with ≥30 sequences in SwissProt, resulting in a total of 726 GO terms (a significant increase from 375 in the original dataset).

This updated dataset exhibits a typical long-tail distribution:

  • The head 20% of GO terms (146 classes) cover 72.7% of the sequences,
  • The tail 20% (145 classes) cover only 2.4% of the sequences.

We report generation results conditioned on GO/IPR labels, stratified by class frequency:

| Frequency Segment | #GO | % Seq. | MRR↑ | MMD↓ | MMD-G↓ | mic. F1↑ | mac. F1↑ | AUPR↑ | AUC↑ |
|---|---|---|---|---|---|---|---|---|---|
| Head (Top 20%) | 146 | 72.7% | 0.725 | 0.072 | 0.043 | 0.598 | 0.529 | 0.409 | 0.763 |
| Medium (60%) | 435 | 24.9% | 0.702 | 0.074 | 0.044 | 0.601 | 0.497 | 0.370 | 0.745 |
| Tail (Bottom 20%) | 145 | 2.4% | 0.687 | 0.114 | 0.072 | 0.565 | 0.496 | 0.362 | 0.744 |

These results demonstrate that CFP-GEN maintains robust performance even in tail categories. While a slight performance drop is observed in the tail segment (e.g., MRR of 0.687 vs. 0.725 in head), the overall scores remain strong. This suggests that CFP-GEN has learned generalizable design principles that extend beyond well-represented functions, enabling it to generate proteins even for underrepresented functional categories. These discussions will be added to the supplementary material. Thanks!


Q2. Structure-conditioned generation with fine-tuned weight.

A2. Thank you for the insightful comment. To address this, we conducted additional experiments to compare frozen vs. fine-tuned structure encoder and cross-attention layers within CFP-GEN.

| Setting | AAR ↑ | MRR ↑ | Fmax ↑ | scTM ↑ | pLDDT ↑ |
|---|---|---|---|---|---|
| CFP-GEN (Frozen) | 73.53 | 0.875 | 0.575 | 0.888 | 83.48 |
| CFP-GEN (Fine-tuned) | 76.39 | 0.882 | 0.581 | 0.889 | 83.53 |

As shown, fine-tuning further improves performance, while the frozen variant already performs strongly—especially in low-resource settings with limited structural labels. These results highlight CFP-GEN’s flexibility, allowing full fine-tuning when data is sufficient. We will include these results in Table 2 of the manuscript for clarification. Thank you!


Q3. CFP-GEN's relation to other inverse folding and multimodal PLM works.

A3.

  • Multimodal PLMs: As discussed in the related work section, concurrent efforts such as DPLM support both sequence and structure modalities. However, in practice, inference is performed using only one modality at a time, without effective multimodal fusion. Furthermore, DPLM does not support function labels. While ESM-3 enables multimodal inputs, its function relies on a limited set of free-text keywords, which cannot adequately capture complex functions (GO/IPR/EC descriptions). In contrast, CFP-GEN supports true multimodal fusion of sequence, structure, and functional labels, and demonstrates superior performance over both DPLM and ESM-3 in original Table 1.
  • Inverse Folding: Existing works, such as ProteinMPNN and ESM-IF, generate sequences based only on protein backbone structures. By comparison, CFP-GEN introduces a new design paradigm that incorporates functional labels as additional input. This enables function-aware inverse folding, improving the AAR, as shown in Table 2. We believe this is a practical and meaningful extension to traditional inverse folding.

These discussions will be added to the revised related work section.


Q4. Failed examples of the generated sequences.

A4. Since CFP-GEN partially builds upon DPLM, it occasionally inherits the mode collapse issues observed in DPLM, leading to highly repetitive segments. As discussed in Q1 of our response to Reviewer s6DK, we demonstrated that CFP-GEN largely mitigates this issue. However, we acknowledge that mode collapse can still occasionally occur:

UniprotID: Q56217
...GLALLLLLLLLLLLLPLPPPPPPPPPPPPPPPPPP...

UniprotID: D4GWC8
...KEMKEAEEAEAEAEKKAEAEAEKKAEAEAEKKAEEKKEE...

This type of failure can be mitigated by introducing more diverse conditions and adjusting diffusion hyperparameters. We also plan to incorporate reinforcement learning feedback to further reduce such failure modes in future work. Thanks!

Official Review
Rating: 4

This paper proposes a novel protein language model, CFP-GEN, which leverages discrete diffusion generation to design functional proteins. The key innovation lies in incorporating annotated protein labels, such as Gene Ontology (GO) terms, InterPro (IPR) domains, and Enzyme Commission (EC) numbers, during diffusion training, similar to classifier-free guidance in diffusion models. Additionally, CFP-GEN allows conditioning on protein structure. Comprehensive experiments demonstrate that CFP-GEN outperforms previous protein sequence generative models in generating proteins with accurate GO terms, IPR domains, and EC numbers.

Questions for Authors

As mentioned above, one of the most exciting aspects of protein design is creating novel proteins that do not exist in nature but can perform specific functions. In this paper, the proposed method demonstrates that the generated sequences closely match the properties of the given condition labels. However, have you tested CFP-GEN’s ability to generate truly de novo proteins? For instance, if conditioned only on a specific EC number, can CFP-GEN generate sequences with novel structural motifs rather than sequences closely resembling known proteins? Additionally, have you analyzed the diversity of the generated sequences and structures? Do they all contain conserved regions, or does CFP-GEN exhibit variation in its outputs?

Claims and Evidence

The authors state that previous PLMs typically generate protein candidates based on a single-condition input from a specific modality. However, providing more references here would strengthen the argument by situating CFP-GEN more clearly within the broader landscape of protein generation models.

The remaining claims, such as CFP-GEN’s ability to design multi-functional proteins and its improved performance in inverse folding, are well-supported by experimental evidence. The reported results in Tables 1 and 2 demonstrate that CFP-GEN outperforms previous models in functional protein generation, structural fidelity, and multi-objective optimization, validating the effectiveness of its multimodal conditioning approach.

Methods and Evaluation Criteria

The method is based on the diffusion protein language model DPLM, embedding all annotations into the diffusion conditioning module. The authors also propose the Annotation-Guided Feature Modulation (AGFM) module, which effectively adjusts the intermediate representations, and the Residue-Controlled Functional Encoder (RCFE), which enhances controllability over the generated sequences compared to previous approaches.

The authors evaluate the model on a protein sequence generation task using different annotations as prompts. For evaluation, they use DeepGO-SE for predicting Gene Ontology (GO) terms, InterProScan for homology-based annotation, and CLEAN for catalytic function prediction. Comprehensive experiments demonstrate that CFP-GEN, by leveraging annotation-based conditioning, outperforms other protein language models in generating functionally relevant sequences.

Theoretical Claims

The paper does not focus on formal theoretical claims.

Experimental Design and Analyses

As mentioned above, the authors evaluate CFP-GEN on a protein sequence generation task using different annotations as prompts. For evaluation, they utilize DeepGO-SE for predicting Gene Ontology (GO) terms, InterProScan for homology-based annotation, and CLEAN for catalytic function prediction. The experimental results demonstrate that CFP-GEN, by incorporating annotation-based conditioning, outperforms other protein language models in generating functionally relevant sequences.

Additionally, the authors assess the model on the inverse folding task, showing that incorporating more conditioning information reduces the uncertainty in sequence generation and leads to a higher sequence recovery rate. This highlights CFP-GEN’s ability to generate sequences that are both structurally and functionally consistent.

Supplementary Material

The authors provide comprehensive supplementary material, including detailed descriptions of the datasets, evaluation metrics, hyperparameter settings, implementation details of existing PLMs, and an introduction to multi-catalytic enzymes.

Relation to Prior Literature

This paper presents a novel and creative approach to protein sequence generation by incorporating protein label annotations into a diffusion-based language model. Given the increasing importance of functional protein design in biotechnology and drug discovery, this approach has the potential to influence future research and applications in protein engineering.

Missing Important References

NA

Other Strengths and Weaknesses

Strengths: The authors successfully integrate the three most common annotation labels—GO terms, IPR domains, and EC numbers—into a single protein language model. This multimodal conditioning approach makes protein design more controllable, allowing for more precise functional protein generation compared to previous single-condition models.

Weaknesses: Although the authors conduct comprehensive experiments demonstrating that conditioning on annotations leads to the generation of functionally relevant proteins, it is unclear whether the model truly generates de novo proteins or primarily memorizes sequences from the training database. Instead of relying on extensive conditioning to generate highly specific proteins that closely resemble known sequences, the authors could explore using fewer conditions to assess whether the model can still generate novel yet functional proteins. This would provide stronger evidence of the model’s ability to innovate beyond known protein sequences.

Other Comments or Suggestions

line 355 natural sequence -> protein sequence

Author Response

Reviewer 1Xgk

Thanks so much for acknowledging the novelty of our work and for providing thoughtful and constructive comments. We provide clarifications to your concerns below, which we will incorporate into the final version. Please let us know if you have any further valuable comments or suggestions.


Q1. The novelty and diversity of the generated proteins.

A1. Thank you for your insightful advice. To examine whether our model generates truly de novo proteins, we selected 7 diverse EC numbers from different top-level categories: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, Ligases, and Translocases. For each EC number, CFP-GEN generated 30 sequences conditioned only on the EC number. We then compared these generated sequences with 30 real proteins from the enzyme validation set with the corresponding EC number. Novelty is computed by measuring how different each generated sequence is from its most similar real protein in the training set, while diversity captures how different the generated sequences are from the overall training set. To ensure both metrics are interpretable in the same direction, we subtract the scores from 1 (i.e., higher is better).

R1-Table 1. Novelty comparison between real and designed proteins.

| Method | EC:1.5.1.5 | EC:2.7.11.1 | EC:3.6.4.13 | EC:4.2.1.33 | EC:5.2.1.8 | EC:6.1.1.20 | EC:7.1.2.2 |
|---|---|---|---|---|---|---|---|
| Real Proteins | 0.254 | 0.234 | 0.334 | 0.252 | 0.296 | 0.215 | 0.221 |
| CFP-GEN | 0.379 | 0.390 | 0.412 | 0.303 | 0.302 | 0.265 | 0.449 |

R1-Table 2. Diversity comparison between real and designed proteins.

| Method | EC:1.5.1.5 | EC:2.7.11.1 | EC:3.6.4.13 | EC:4.2.1.33 | EC:5.2.1.8 | EC:6.1.1.20 | EC:7.1.2.2 |
|---|---|---|---|---|---|---|---|
| Real Proteins | 0.698 | 0.764 | 0.676 | 0.654 | 0.745 | 0.612 | 0.589 |
| CFP-GEN | 0.748 | 0.766 | 0.725 | 0.565 | 0.714 | 0.677 | 0.760 |

We observe that CFP-GEN consistently achieves higher sequence novelty across all 7 EC numbers, demonstrating its strong potential for de novo protein design beyond simply replicating known sequences. Moreover, the generated sequences exhibit high intra-class diversity in 5 out of the 7 EC categories. These results suggest that the model has learned a more generalized representation of functional proteins, rather than overfitting to training examples, and effectively avoids mode collapse. We will update the final version to reflect these discussions.
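The novelty metric described above can be sketched as follows (a toy illustration; the per-position identity on equal-length strings is a stand-in assumption, whereas actual sequence identity would come from an alignment tool):

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy metric for equal lengths)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def novelty(generated: str, training: list) -> float:
    """1 minus similarity to the closest training-set sequence,
    so higher means more novel."""
    return 1.0 - max(identity(generated, t) for t in training)

train = ["MKTAYIAK", "MKTAYLAK", "GGGGGGGG"]
# The closest training sequence matches 7 of 8 positions.
assert abs(novelty("MKTAYIAQ", train) - 0.125) < 1e-9
```

Diversity follows the same pattern but averages over the training set instead of taking the closest match.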


Q2. Examples of conserved regions with mutations.

A2. Thank you for raising this important point. We selected two representative case studies (EC 1.5.1.5 and EC 4.2.1.33) and performed sequence alignments between CFP-GEN-generated sequences and known proteins from the same EC class. We observed that CFP-GEN introduces mutations at specific positions within conserved regions, rather than simply copying them. The alignment results are presented below:


CFP-GEN(EC:1.5.1.5): ...GTPVFVHAGPFANINHGANS...   
Real Proteins:       ...GTPAFVHGGPFANIAHGNSS...    
				     ...GTPLVVHAGPFANIAHGNSS...    
				     ...GTPVFVHAGPFANIAHGNSS...  
				     ...GTPVLVHAGPFANIAHGNSS...
				           ↑↑  ↑      ↑  ↑↑
CFP-GEN(EC:4.2.1.33): ...MTIVCGDSHTSTHGAFGALA...  
Real Proteins:        ...MTVVCGDSHTSTHGAFGCLA...
				      ...MTIACGDSHTSTHGAFGAIA...
				      ...TTIVCGDSHTSTHGAFGALA...
				      ...MTIACGDSHTSTHGAFGNIA...
				         ↑ ↑↑             ↑↑      

The mutated positions are indicated by arrows. In the first case, the CFP-GEN-designed sequence preserves the core motif (e.g., GTP…GPFANI), while introducing mutations such as A→V, L→F, or A→N at semi-conserved sites. In the second case, mutations (e.g., T→M, V→I, C→A) occur at non-critical positions, while maintaining key motifs (CGDSHTSTHGAFG), supporting the model’s ability to retain functional cores while exploring novel sequence variations. This discussion and the corresponding examples will be included in the revised supplementary material. Thank you!

Reviewer Comment

Thank you for the additional experiments. Regarding the generated proteins, did you identify any novel motifs? For instance, in the case of the EC:1.5.1.5 enzyme, the GTP…GPFANI region is expected to contain a conserved structural motif. Did you observe any newly generated proteins that lack this specific motif? I understand that structural comparison can be quite time-consuming, but it would be valuable to include some structural comparisons in the final version of the paper.

Overall, I believe this is a solid paper, and I will maintain my current score.

Author Comment

Thank you for your insightful suggestion and support for our work. Following your recommendation, we conducted a detailed structural comparison between the generated proteins and known structures. Below, we provide representative examples, and additional results will be included in the revised supplementary material.

To better characterize these novel motifs, we also provide their secondary structure (SS) annotations: helix (H), strand (E), and coil (C). These results demonstrate that CFP-GEN is capable of generating entirely new structural motifs while still maintaining functional viability.

Uniprot ID: A3M4Z0 
CFP-Gen:           …VSLLQEYVTWEMGKLEKLES…
SS Annotation:     …HHHHHHHHCCCCHHHHHHHH…
Real Protein:      …AGFIRRYVSWQPSPLEHIE…
SS Annotation:     …HHHHHHHHHCCCCHHHHHH…
Uniprot ID: A3M9Y1 
CFP-Gen:           …RPLNQTMPQALALLAPEQRPTVWHQ…
SS Annotation:     …HHHHHHHHHHHHHCCHHHCCEEEEE…
Real Protein:      …AKALNERLPPALKQLEVPLNIFHQ…
SS Annotation:     …HHHHHHHHHHHHHCCCCCEEEEEE…
Uniprot ID: A6NJ78
CFP-Gen:           …IRIYVNSELEEIEQALKSAERVLAPGGRLSIIS…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHHCCCCCEEEEE…
Real Protein:      …LRIFVNNELNELYTGLKTAQKFLRPGGRLVALS…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHEEEEEEEEEEE…
Uniprot ID: Q8T9Z7 
CFP-Gen:           …AAERQTTFNDMIKIALESVLLGDASGPEGQ…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHHHCCHHHC…
Real Protein:      …VPHQLENMIKIALGACAKLATKYA…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHCC…
Uniprot ID: Q9HW26
CFP-Gen:           …PVAQALDALESKLVDFSALT…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHH…
Real Protein:      …TVEQARERLQEKFDWLRREASAEELAGF…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHHHHHHH…
Uniprot ID: Q9KLJ3 
CFP-Gen:           …LRIISATAKKLGMSMDN…
SS Annotation:     …HHHHHHHHHHHHHHHHH…
Real Protein:      …NIRIIQTLCDLAGIAQDKA…
SS Annotation:     …HHHHHHHHHHHHHHHHHHH…
Uniprot ID: Q9R4E4
CFP-Gen:           …GTTMRLMAGVLAGQPFFSVL…
SS Annotation:     …HHHHHHHHHHHHHCCCCEEE…
Real Protein:      …AATGCRLTMGLVGVYDFDSTFI…
SS Annotation:     …HHHHHHHHHHHHHHCCCEEEEE…
Uniprot ID: Q9A874 
CFP-Gen:           …YTRHEYFRRILCQMIGRWVEAGEAPPADIPLLGEMVKNICFNNARDYF…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHH…
Real Protein:      …IPARHDVARRVDSAFLARMVAEHRMDLVEAEELIVDLTYNLPKKAY…
SS Annotation:     …HHHHHHHHHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHHHHHHHHH…
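One quick way to compare generated and real motifs from SS strings like those above is by their state composition. The sketch below is a hedged illustration, not the analysis pipeline used here; the function name is made up for this example.

```python
from collections import Counter

# Hedged sketch: summarize a secondary structure string (H/E/C states, as
# in the annotations above) by the fraction of each state, giving a crude
# comparison between a generated motif's SS and the real protein's.

def ss_composition(ss: str) -> dict[str, float]:
    """Fraction of helix (H), strand (E), and coil (C) positions."""
    counts = Counter(ss)
    return {state: counts.get(state, 0) / len(ss) for state in "HEC"}
```

For the Q9KLJ3 pair above, for example, both strings are all-helix, so their compositions match exactly even though the sequences differ.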
Final Decision

This paper introduces CFP-GEN, a novel protein language model that employs discrete diffusion generation for functional protein design. Its key innovation is the integration of annotated protein labels—such as Gene Ontology (GO) terms, InterPro (IPR) domains, and Enzyme Commission (EC) numbers—during diffusion training, analogous to classifier-free guidance in diffusion models. Additionally, CFP-GEN supports structure-conditioned generation. Extensive experiments demonstrate that CFP-GEN surpasses existing protein sequence generation models in producing proteins with accurate functional annotations.
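The classifier-free-guidance analogy drawn above can be sketched as a simple logit interpolation at each denoising step: logits from an annotation-conditioned pass and an unconditional pass are mixed with a guidance weight. This is a generic CFG sketch, not CFP-GEN's actual conditioning mechanism; all names and the list-based representation are illustrative.

```python
# Generic classifier-free-guidance-style mixing of per-token logits.
# `cond` / `uncond` would be logits from a denoising pass with and without
# functional annotations (GO/IPR/EC); `w` is the guidance weight
# (w = 1 recovers the conditional model, w > 1 amplifies the conditioning).

def guided_logits(cond: list[float], uncond: list[float], w: float) -> list[float]:
    return [u + w * (c - u) for c, u in zip(cond, uncond)]
```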

The reviewers commend the paper for its clear writing and highlight that the unified multimodal constraint framework addresses a critical gap in controllable protein generation. Results in Tables 1 and 2 show that CFP-GEN outperforms prior methods in functional protein generation, structural fidelity, and multi-objective optimization, validating its multimodal conditioning approach. While some concerns were raised regarding additional experimental analyses, connections to inverse folding and multimodal PLM works, and the contributions of AGFM and RCFE, the reviewers were largely satisfied with the authors' clarifications during the discussion period.

Thus, I recommend acceptance.