Diffusion on Language Model Encodings for Protein Sequence Generation
A continuous latent diffusion framework for protein sequence generation with strong performance and versatile conditional generation capabilities.
Abstract
Reviews and Discussion
The paper introduces DiMA, a latent diffusion framework that works on protein language model representations. While protein sequence design has advanced with discrete and autoregressive methods, the potential of continuous diffusion has been under-explored. DiMA is developed through a systematic exploration of architectural choices and diffusion components, enabling it to generalize across multiple protein encoders with 8M to 3B parameters. It achieves high performance across different types of protein representations like sequence-only, dual-decodable, and multimodal ones using the same architecture and training approach. DiMA is extensively evaluated against existing methods using multiple metrics across two protein modalities. It generates novel, high-quality, and diverse protein sequences, outperforming baselines in various tasks such as unconditional generation, family-specific design, motif scaffolding, and fold-conditioned generation. With only 35M parameters, it shows strong performance on multiple benchmarks. DiMA also demonstrates versatile functionality in conditional generation tasks. This work offers a universal continuous diffusion framework for protein sequence generation, providing both architectural insights and practical applications across diverse protein design scenarios.
Questions for Authors
- Line 98: I'm wondering whether the normalization is applied over the length dimension s or the hidden dimension d. From the last sentence, it seems that the normalization is over the length dimension, which seems weird for generation.
- Do the authors also use ESM-2 decoder for diffusion on CHEAP and SaProt representations?
- What's the rationale behind the heuristic approach that the reconstruction loss should exhibit an approximately linear increase over diffusion time?
- Typically, the quality of generated samples from diffusion models can be controlled by adjusting the temperature. I'm wondering how the authors set the temperature values in different experiments and for different baselines.
Claims and Evidence
All claims are well supported by convincing evidence and experiments.
Methods and Evaluation Criteria
Although the method is built directly on existing latent diffusion methods and protein language models, which may limit its novelty, I find it conceptually straightforward and useful in practice, and even generalizable to sequence-structure co-design. I especially like the authors' comprehensive ablation study and their exploration of training and inference strategies for diffusion models, instead of directly copying the parameterization from the image domain, as was commonly done in previous works.
Theoretical Claims
N/A
Experimental Design and Analysis
The evaluation of the method is thorough and convincing, covering different aspects of the generated samples and extending to more interesting conditional tasks. I believe the experiments are convincing enough to demonstrate the advantages of the proposed method.
Supplementary Material
I only quickly went through the supplementary material.
Relation to Prior Literature
See Summary.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
See above.
Other Comments or Suggestions
- The formatting of the paper requires improvement. For instance, the titles of the paper and its subsections should start with uppercase letters. An example would be "Diffusion on Language Model Encodings for Protein Sequence Generation."
- Line 51: "We demonstrate that continuous diffusion on protein embeddings enables effective sequence redand structure generation across multiple tasks and encoder architectures." typo: "redand" -> "and"
Thank you for your review and positive assessment of our work. We appreciate your recognition of our systematic ablation studies and approach to developing protein-specific diffusion parameterizations rather than simply adopting techniques from the image domain. Below, we address each point raised.
C1, C2. Formatting and typos.
Thank you for pointing out the formatting issues and typos. We will fix the formatting inconsistencies, including proper capitalization of titles and section headings.
Q1. Line 98: I'm wondering whether the normalization is applied over the length dimension s or the hidden dimension d. From the last sentence, it seems that the normalization is over the length dimension, which seems weird for generation.
We apologize for the unclear description. The normalization is applied over the hidden dimension d rather than the sequence length dimension s. Specifically, for each component of the hidden dimension, we precompute the mean and variance over the training data and then apply normalization using these statistics to achieve zero mean and unit variance. We will clarify this in the revised manuscript.
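For concreteness, here is a minimal sketch of this per-dimension normalization; the tensor shapes and function names are illustrative assumptions, not our actual implementation:

```python
import torch

# Sketch: normalize each hidden dimension of the encoder latents using
# statistics precomputed over the training set. Assumed shape: [N, s, d].

def compute_stats(latents: torch.Tensor):
    flat = latents.reshape(-1, latents.shape[-1])   # pool sequences and positions
    return flat.mean(dim=0), flat.std(dim=0)        # one value per hidden dimension

def normalize(z, mean, std, eps=1e-8):
    # Zero mean, unit variance per hidden dimension; positions are untouched.
    return (z - mean) / (std + eps)

def denormalize(z_hat, mean, std, eps=1e-8):
    # Inverse transform applied before decoding generated latents.
    return z_hat * (std + eps) + mean
```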
Q2. Do the authors also use ESM-2 decoder for diffusion on CHEAP and SaProt representations?
We use the same approach across all encoder architectures. For CHEAP and SaProt representations, we fine-tune their corresponding pretrained decoders alongside the diffusion model, similar to our approach with ESM-2. We found that low-dimensional embeddings (e.g., ESM-2 8M, d=320) are less robust to small perturbations than higher-dimensional ones (ESM-2/SaProt 650M, d=1280; CHEAP, d=1024). Fine-tuning the decoders helps minimize these effects during diffusion generation.
Q3. What's the rationale behind the heuristic approach that the reconstruction loss should exhibit an approximately linear increase over diffusion time?
The key issue with standard noise schedules (linear, cosine) is that they corrupt data very gradually at small timesteps. This means the model spends significant training resources on nearly trivial denoising tasks where input and target are almost identical. Our approach ensures the difficulty of the denoising task increases steadily with each timestep. By designing a noise schedule where reconstruction loss grows linearly with diffusion time, we create learning problems of incrementally increasing difficulty, allowing the model to make consistent progress throughout training rather than facing an abrupt jump in task complexity.
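In standard variance-preserving notation (used here purely for illustration; the exact parameterization in the paper may differ), the forward process is $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and the criterion amounts to choosing $\bar{\alpha}_t$ so that $\mathbb{E}\,\|\hat{z}_\theta(z_t, t) - z_0\|^2 \approx c\, t$, i.e., the expected reconstruction error grows roughly linearly in $t$ instead of remaining near zero over a wide range of small timesteps, as it does under linear or cosine schedules.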
Q4. Typically, the quality of generated samples from diffusion models can be controlled by adjusting the temperature. I'm wondering how the authors set the temperature values in different experiments and for different baselines.
For baseline comparisons, we use the sampling parameters recommended by the original authors of each method. For autoregressive models like ProGen2 and ProLLAMA, which show suboptimal quality and collapse to highly repetitive sequences with default settings, we performed grid searches to identify optimal temperature and top-p values. DiMA has two analogous parameters for navigating the quality-diversity trade-off: the number of generation steps and the self-conditioning rate. The dependencies of quality, diversity, and novelty on these parameters are shown in Figures 2 and 3 at https://tinyurl.com/icml25re.
We look forward to incorporating your feedback in our final version and thank you again for your time and insightful comments.
Thanks for the authors' response, which addresses most of my concerns. For Q3, I'm a little concerned about the claim that denoising at small noise levels is nearly trivial, since the model still needs to learn fine-grained denoising capability.
When working with protein encodings, reconstruction at small noise levels turns out to be quite robust. For example, when testing DiMA with CHEAP representations (which enable dual decoding into both sequence and structure), we observe that at t=0.05, sequence reconstruction accuracy remains 100% and structural RMSD stays below 0.2Å (Figures 5 and 6 at https://tinyurl.com/icml25re). Our schedule leverages this robustness by allocating more training to challenging noise levels rather than to the nearly lossless stages as t→0.
This paper introduces DiMA, a continuous latent diffusion model that creates (novel) protein sequences using protein language model (PLM) hidden representations. Unlike other approaches that use discrete diffusion or step-by-step generation, DiMA explores continuous diffusion to make better sequences. It works well with different protein encoders (ESM-2, CHEAP, and SaProt to name a few) and performs strongly in tasks such as adding motifs, creating fold-specific sequences, and designing protein families. Experimental (in silico) results show that DiMA can make diverse, new, and structurally sound protein sequences that often beat existing methods.
Questions for Authors
- How does DiMA compare in speed (training/inference) to discrete diffusion models such as ESM3, DPLM?
- Since the sequence representation can also share some merits with structure prediction models (such as AlphaFold2), to the extent that people repurpose PLMs for folding (e.g., ESMFold), can you train or fine-tune DiMA on structure-based tasks and data, i.e., explicitly taking structure information into account?
Claims and Evidence
The paper explores the proposed paradigm (DiMA) for generating protein sequences comparable to those of larger models. The major claim is that DiMA demonstrates strong performance on multiple sequence-design benchmarks, which I find is, however, not fully supported by the experiments (in Table 2, it does not yet outperform the other baseline models).
Methods and Evaluation Criteria
The proposed method, as an application of well-established latent diffusion models, is straightforward and makes sense. In terms of methodology, no novel theory or method is proposed. The evaluation criteria basically make sense given the authors' justification, though they do not follow previous practice.
Theoretical Claims
No theoretical claim is found in this application-oriented paper.
Experimental Design and Analysis
The experiments are comprehensive by design, covering several useful ablation studies across different tasks for sequence generation. The extended results in the appendix also demonstrate the solidness of the experiments. Personally speaking, the use of SwissProt and AFDBv4-90 is not very typical for sequence-based protein language models.
Supplementary Material
I carefully checked the model details and the definitions of the metrics. I also roughly went through the extended experimental results (figures and tables) in the appendix.
Relation to Prior Literature
This work is situated at the intersection of (latent) diffusion models, protein language models, and protein design. It involves engineering foundational protein language models (related: ESM, DPLM, etc.) and uses denoising diffusion models over the latent space.
Essential References Not Discussed
No, but I recommend that the authors include an explicit "Related Work" section in the main text rather than in the appendix, which is important for general readers to grasp the proper context of this paper.
Other Strengths and Weaknesses
Strengths:
- Clear problem and methodology
- Comprehensive testing/benchmarking experiments and comparison
Weaknesses:
- Limited theory behind the proposed method, making it a straightforward application of well-established methods
- No comprehensive discussion of potential limitations or failure cases
- Performance that is not clearly stronger than existing methods, which may weaken the contribution of this work
Other Comments or Suggestions
- Lines 187-202: the long list of bullet points is not reader-friendly. Please consider reorganizing it.
- A discussion on how to scale up DiMA (maybe by combining it with structural models like AlphaFold) would be valuable.
Thank you for your thorough review and constructive comments. We appreciate your recognition of our work's comprehensive experimental design and clear methodology.
W1. Limited theory behind the proposed method, making it a straightforward application of well-established methods
While latent diffusion models have been established in other domains, our work demonstrates that naive application of continuous diffusion to protein sequences yields poor results. Our ablation studies in Table 1 show that standard diffusion implementations struggle with protein generation, achieving only ~60 pLDDT compared to DiMA's 83.3 pLDDT. We demonstrate that domain-specific engineering is essential for effective latent diffusion models.
The challenge of applying diffusion to proteins is further evidenced by existing approaches requiring custom modifications. For example, discrete model DPLM [1] requires specialized initialization and sampling strategies [1, Appendix D.1], MultiFlow requires distillation steps, and so on.
Our work contributes by establishing a principled framework for latent diffusion that generalizes across different representation spaces (from 8M to 3B parameter sequence-based and multimodal encoders) and tasks with a single architecture. As shown in our fold-conditioning experiments, this enables DiMA to achieve stronger performance than specialized structure-generation models like RFDiffusion (TM-score 0.93 vs 0.48).
W2. Not comprehensive discussion of potential limitations or failure cases
Thank you for this suggestion. In our revision, we will add a dedicated discussion section addressing current challenges and future opportunities. Structure generation in our approach relies on heavyweight folding models like ESMFold. We see significant potential for enhancing performance through joint training of the diffusion model with lightweight structure prediction components. This approach could both improve computational efficiency and structural awareness of the model.
W3. Not strong performance compared to existing methods which may weaken the contribution of this work
As shown in Figure 2, DiMA, MultiFlow, RFDiffusion, and DPLM each demonstrate Pareto-optimal performance on the quality-diversity tradeoff, with each dominating in different aspects. Considering model sizes, DiMA shows remarkable performance, using two orders of magnitude fewer parameters than e.g. DPLM. Table 8 demonstrates that only DiMA and MultiFlow achieve balanced metrics on this tradeoff, with the notable advantage that DiMA trains exclusively on sequences while MultiFlow requires both sequence and structure data, making DiMA more scalable due to its independence from limited 3D structural data.
Q1. How does DiMA compare in speed (training/inference) to discrete diffusion models such as ESM3, DPLM?
We analyzed inference speed by generating 100 sequences of length 250 with 5 repetitions (for details see Figure 4 at https://tinyurl.com/icml25re). DiMA-35M is ~10x faster than DPLM-150M, which is slowed by using a larger model at each generation step. Structure models are notoriously computationally demanding. We have not measured training speed, but we expect DiMA to be faster as it requires only forward encoder passes, whereas discrete models need backward encoder passes as well.
Q2. Can you train or fine-tune DIMA with structure-based tasks and data?
Yes, DiMA incorporates structural information in two ways:
- We use local structural tokens for each amino acid, passed together with the sequence into the SaProt encoder. This allows explicit use of 3D structure in the scaffolding task, achieving strong results (details in Section 3.6.1, Appendix E.8, and Figure 3).
- We employ DiMA with the CHEAP encoder based on ESMFold, enabling dual decoding (sequence and structure) from sequence alone. This approach outperforms RFDiffusion in fold-specific generation using explicit structure guidance (details in Section 3.6.3, Appendix E.9, and Figure 4).
C1. Lines 187-202: the long list of bullet points is not reader-friendly. Please consider reorganizing it.
Thank you for this feedback. We will reorganize this section in the revision to improve readability.
C2. A discussion on how to scale up DiMA (maybe by combining it with structural models like AlphaFold) would be valuable.
Thank you for this suggestion. We are currently exploring the integration of structural generation models with our latent diffusion approach for co-generation. In the revision, we will add a discussion section highlighting future directions and the potential benefits of combining DiMA with structural models.
We thank you for your thoughtful feedback and suggestions. We would be grateful if you would consider raising your score, in case we have addressed your concerns. Please let us know if any aspects still need clarification.
I thank the authors for their response, which has basically addressed my concerns.
I look forward to seeing DiMA become the better-designed "principled framework" you describe, one that can be readily and easily applied to any PLM, which would be very valuable for the whole community. I hope the authors can benefit from my review comments in improving the paper's quality. I have decided to raise my score to 3.
The paper introduces DiMA, a latent diffusion approach for protein sequence generation leveraging pre-trained embeddings. The authors consider sequence-only, structural, and sequence-structure joint embeddings. DiMA produces novel and high pLDDT samples. Conditional generation tasks based on protein family, motif scaffolding, and infilling are shown.
Questions for Authors
See above
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
N/A
Experimental Design and Analysis
Yes
Supplementary Material
No
Relation to Prior Literature
The paper is contextualized well within the literature, both with respect to different pre-trained embeddings and generative sequence models.
Essential References Not Discussed
Concurrent work that the authors will be interested in:
Lu, A. X., Yan, W., Robinson, S. A., Yang, K. K., Gligorijevic, V., Cho, K., ... & Frey, N. (2024). Generating All-Atom Protein Structure from Sequence-Only Training Data. bioRxiv, 2024-12.
Other Strengths and Weaknesses
Latent diffusion over pre-trained embeddings is an interesting and timely idea. The DiMA evaluation is thorough, as are the benchmarks and baselines.
Can the authors comment on the 4JHW and 5YUI PDB IDs as failure modes? What causes DiMA and other methods to struggle with these particular cases?
It would be more informative to include a graphic overview of the model and approach as Fig 1, rather than the noise schedule.
Distance metrics like FD-seq and other metrics should be introduced and defined in the main text.
Other Comments or Suggestions
See above
Thank you for your review and the valuable suggestions. We appreciate your recognition of DiMA's thorough evaluation and benchmarking approach. Below, we address each point raised.
Concurrent work that the authors will be interested in.
Regarding the work by Lu et al. (2024), we became aware of PLAID and evaluated it after our submission, just before receiving the reviews. We evaluated PLAID-100M (https://github.com/amyxlu/plaid) using the same protocol we apply to all models in our work. Our analysis shows that DiMA substantially outperforms PLAID, especially in terms of protein quality:
| Model | pLDDT ↑ | ProGen-ppl ↓ | ESM2-pppl ↓ | scPerplexity ↓ | Rep | CD₀.₅ | CD₀.₉₅ |
|---|---|---|---|---|---|---|---|
| PLAID-100M | 53.48 | 14.982 | 13.46 | 2.294 | 0.0007 | 1.0 | 1.0 |
| DiMA-35M | 83.4 | 9.00 | 5.80 | 1.78 | 0.010 | 0.969 | 1.000 |
We will include these results in the final manuscript, specifically we will update Section 3.5 ('Comparison with Large Pretrained Models') and Figure 2.
Can the authors comment on the 4JHW and 5YUI PDB IDs as failure modes? What causes DiMA and other methods to struggle with these particular cases?
These cases highlight limitations in the current benchmark rather than specific model shortcomings. Our analysis reveals that even when using reference sequences, the resulting structures after ESMFold prediction fail to meet the benchmark success criteria. For 4JHW, the reference sequence yields motif RMSD exceeding 6.0Å with pLDDT around 30, far below the thresholds for success (RMSD ≤ 1.0Å, pLDDT ≥ 70). Similarly, 5YUI produces motif RMSD above 3.0Å. The detailed results of our analysis on these challenging cases are presented in Table 1 at [https://tinyurl.com/icml25re].
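For reference, a minimal sketch of how we interpret the benchmark's per-case success criterion follows; the alignment routine and array layout are generic illustrations, not the benchmark's own implementation:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    return float(np.sqrt(((P @ (U @ D @ Vt) - Q) ** 2).sum() / len(P)))

def motif_success(pred_motif_xyz, ref_motif_xyz, mean_plddt,
                  rmsd_thresh=1.0, plddt_thresh=70.0) -> bool:
    # Success requires both an accurate motif (RMSD <= 1.0 A) and a confident
    # prediction (mean pLDDT >= 70), the thresholds discussed above.
    return (kabsch_rmsd(pred_motif_xyz, ref_motif_xyz) <= rmsd_thresh
            and mean_plddt >= plddt_thresh)
```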
Recent work [1] has recognized these benchmark limitations and is developing improved evaluation protocols, specifically addressing them in Table 4 of their preprint.
It would be more informative to include a graphic overview of the model and approach as Fig 1, rather than the noise schedule.
We appreciate your suggestion about including a graphic overview of our model approach. In the final version, we are committed to adding a figure illustrating DiMA's architecture and workflow.
Distance metrics like FD-seq and other metrics should be introduced and defined in the main text.
We will also ensure that distance metrics like FD-seq and other evaluation metrics are properly introduced and defined in the main text to improve clarity.
Thank you again for your constructive feedback, which will help us improve the final manuscript.
I thank the authors for their response. They have addressed the points raised in my review and I will raise my score.
The authors have proposed a continuous diffusion framework, named DiMA. DiMA consists of three modules: 1) a frozen pLM, such as ESM-2, to extract a latent embedding for a given protein sequence; 2) a continuous diffusion module that generates latent embeddings from noise; and 3) a decoder that maps the latent embeddings to amino acid sequences. It can be seen as a kind of knowledge distillation that injects information from large pre-trained pLMs, like ESM or SaProt, into DiMA. Through comprehensive and systematic experiments, the authors claim that the performance of DiMA is consistent with (or better than) that of specialized baselines in scenarios such as unconditional generation and conditional generation, including motif scaffolding and fold-conditioned generation.
Questions for Authors
- The authors introduce DiMA, a continuous diffusion method for protein sequence generation. However, they do not adequately demonstrate the necessity of using continuous diffusion compared to the discrete diffusion approach employed in DPLM and MultiFlow, which appears more intuitive for sequence generation.
- Does DiMA have a sequence length preference? Say we separate protein sequences by their length into different bins, like [0-50], [50-100], ..., what is the performance of DiMA in each group, in terms of quality, diversity, and novelty?
- As shown in Table 3, DiMA achieves quality (i.e., pLDDT) consistent with other baselines while sacrificing diversity and novelty. Is there any approach to elevating diversity and novelty without reducing the size of the ESM encoder (so that generation quality stays at the same level)?
- Do you have a plan to release code?
Claims and Evidence
The claims are supported by clear evidence. The authors claim that the performance of DiMA matches or exceeds the baselines. Table 2 compares DiMA with other generative baselines in the unconditional scenario, Figure 3 in motif-scaffolding-conditioned generation, Table 10 in family-specific conditioned generation, and Table 14 in the fold-conditioned scenario. Besides, the authors have conducted a detailed ablation study to identify key components in Table 1.
Methods and Evaluation Criteria
The proposed continuous diffusion model for generating amino acid sequences is logical. The evaluation criteria, such as pLDDT and TM-score for structural quality and perplexity for sequence quality, are appropriate. These criteria comprehensively address quality, diversity, and novelty in generation tasks.
Theoretical Claims
The model description in Section 2 and Appendix C.1 includes details on Noise Schedule Optimization and Self-Conditioning.
Experimental Design and Analysis
The overall experimental design is complete and comprehensive. For benchmarking, DiMA is compared with five groups of baselines. An ablation study is conducted to identify the key contributing modules in DiMA. To illustrate the performance of DiMA, the authors have conducted extensive experiments on both unconditional generation and conditional generation, including motif-scaffolding, family-specific, and fold-conditioned tasks.
Supplementary Material
The supplementary material is very comprehensive, including details of the model architecture, explanations of each evaluation metric, and additional results for the ablation study and benchmarks in the conditioned scenarios.
Relation to Prior Literature
Essential prior works such as EvoDiff (Alamdari et al., 2023), DPLM (Wang et al., 2024), and MultiFlow (Campbell et al., 2024) are appropriately cited and compared with the proposed method as baselines.
Essential References Not Discussed
The discussion of related works is comprehensive.
Other Strengths and Weaknesses
Strengths:
- The experiment designs are well-done, demonstrating the ability of DiMA in different generation scenarios.
- The manuscript is well-written and easy to read.
Weaknesses:
- The paper is not particularly novel technically. Diffusion is widely applied to protein sequences in the community, as are pretrained pLMs. Techniques like noise schedule optimization and long-skip connections are mainly engineering improvements.
Other Comments or Suggestions
- The authors should explain the meaning of "sd-10" in Figure 1.
- On line 208 of page 4, the authors mention that "padding omitting" decreases the performance, but its detailed influence is not shown in Table 1.
- On line 236 of page 5, the author asserts that DPLM generates longer sequences compared to DiMA. What Length Determination strategy (Appendix C.3) is employed for DiMA inference here? Is the same strategy applied to DPLM to ensure a fair comparison?
We sincerely thank you for your thoughtful and detailed feedback. We appreciate the time you have taken to review our work and your positive comments about our paper. We aim to address your concerns and questions below.
W1. On technical novelty.
While diffusion models are not new, our work shows that naive application of continuous diffusion to proteins yields poor results (~60 vs 83.3 pLDDT, Table 1). Our systematic ablations quantify the impact of each component. As reviewer Y6i7 notes, we contribute by 'exploring training and inference strategies instead of directly copying parameterization from image domain'. This enables DiMA to achieve strong performance across diverse representation spaces (from 8M to 3B parameter sequence-based and multimodal encoders) and multiple tasks with a single architecture - outperforming specialized models like RFDiffusion in fold conditioning (TM-score 0.93 vs 0.48).
C1. On the meaning of "sd-10".
Thank you for noting the unclear meaning of "sd-10" in Figure 1. This refers to our noise schedule, in which the parameter d=10 determines the rate of information decay. We will rename this to "tan-10" in the revised manuscript for better clarity.
C2. On padding omitting.
The "w/o length sampling" entry in Table 1 corresponds to the model trained with padding and without length sampling. We will clarify this connection in the revision.
C3. On sequence length determination.
For fair comparison across all methods, we apply the same length sampling strategy during inference: sequence lengths are sampled from the empirical distribution of the training set.
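As a minimal sketch of this strategy (the training-set lengths shown are a toy placeholder, not our actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
train_lengths = np.array([112, 245, 87, 310, 198, 254, 133])  # toy placeholder

def sample_lengths(n: int) -> np.ndarray:
    # Sampling with replacement reproduces the empirical length distribution.
    return rng.choice(train_lengths, size=n, replace=True)
```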
The difference noted on line 236 refers to protein domains (distinct structural/functional units within proteins), not overall sequence length. While we control overall sequence length distribution across methods, domain length distributions emerge naturally from each model's generation process. Our analysis shows DiMA generates domain length distributions closely matching natural proteins, whereas DPLM skews toward longer domains (Figure 18C).
Q1. On the necessity of continuous vs. discrete diffusion.
Thank you for this important question. While discrete diffusion may seem more intuitive for sequences, continuous representations have proven highly effective in protein domains including representation learning (ESM, ProtT5), structure prediction (AlphaFold, ESMFold), and backbone generation. Recent work (CHEAP [1]) confirms continuous representations capture richer protein features.
Continuous diffusion offers several advantages:
- Direct application of established score-based techniques such as classifier and classifier-free guidance, without requiring discrete approximations (a minimal guidance sketch follows this list)
- Seamless integration with multimodal representations (CHEAP, SaProt) that jointly capture sequence and structure
- More stable and efficient training compared to discrete spaces
- Fine-grained optimization of diffusion parameters, as shown in our ablation studies
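As an illustration of the guidance point above, a classifier-free guidance step in a continuous latent space takes the generic form below; the denoiser interface and guidance scale are illustrative assumptions, not DiMA's specific conditioning mechanism.

```python
import torch

@torch.no_grad()
def guided_prediction(denoiser, z_t, t, cond, guidance_scale=2.0):
    pred_cond = denoiser(z_t, t, cond)      # condition-aware estimate
    pred_uncond = denoiser(z_t, t, None)    # unconditional estimate
    # Standard classifier-free guidance: extrapolate from the unconditional
    # toward the conditional prediction; larger scales trade diversity for
    # stronger adherence to the condition.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```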
Our experiments demonstrate that this approach not only achieves strong performance but also enables structure-aware generation and fold-conditioning that are challenging in purely discrete frameworks.
We believe both approaches have merits, and our work contributes to understanding how continuous diffusion can be effectively applied to protein design. We will expand this discussion in our final manuscript.
Q2. On sequence length preference.
Thank you for this question. We conducted additional experiments, and our results show that DiMA maintains consistent performance across all sequence length ranges, with quality metrics closely tracking the natural protein distribution (Figure 1 at https://tinyurl.com/icml25re). DiMA also achieves more stable diversity than DPLM-3B, which shows a drop in diversity for longer sequences.
Q3. On control over diversity and novelty.
Thank you for raising this point on balancing quality, diversity, and novelty. Our framework offers two knobs to control this trade-off without changing the encoder architecture (Figures 2 and 3 at https://tinyurl.com/icml25re), as sketched in the code after this list:
- The number of sampling steps. Reducing the number of steps increases diversity and novelty and maintains reasonable quality.
- The self-conditioning rate parameter (w), which controls how much the model relies on its previous predictions during sampling. Lower values of w (0.6-0.8) yield higher diversity and novelty with a modest quality trade-off.
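Below is a minimal sketch of how these two knobs enter a sampling loop; the denoiser, the update rule, and the interpretation of w as a per-step probability of reusing the previous prediction are illustrative assumptions, not our exact sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, num_steps=100, w=0.8, device="cpu"):
    z_t = torch.randn(shape, device=device)      # start from pure noise
    z0_prev = torch.zeros(shape, device=device)  # previous clean-latent estimate
    for i in range(num_steps, 0, -1):
        t = torch.full((shape[0],), i / num_steps, device=device)
        # Self-conditioning knob: reuse the previous estimate with probability w.
        cond = z0_prev if torch.rand(()).item() < w else torch.zeros_like(z0_prev)
        z0_hat = denoiser(z_t, t, cond)
        # Placeholder update toward the current estimate; the real sampler
        # follows the chosen diffusion parameterization and noise schedule.
        z_t = z_t + (z0_hat - z_t) / i
        z0_prev = z0_hat
    return z0_prev
```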
Q4. On the code release.
We are currently preparing our codebase for public release and plan to make it available upon publication.
We are grateful for your detailed feedback and will incorporate your suggestions in the final version of our manuscript. Please let us know if any questions still need clarification.
[1] https://www.biorxiv.org/content/10.1101/2024.08.06.606920v2
I thank the authors for their response, which has addressed my concerns. I will raise my score to 4.
The authors present a novel continuous diffusion framework, termed DiMA. DiMA is composed of three key modules: Frozen pre-trained language models (pLMs), a continuous diffusion module and a decoder. DiMA demonstrates the ability to generate novel protein samples with high pLDDT scores. Furthermore, it supports conditional generation tasks, including protein family-specific design, motif scaffolding, and sequence infilling. All reviewers vote for acceptance.