PaperHub · NeurIPS 2024 · Poster

Overall rating: 4.8/10 from 4 reviewers (scores 3, 7, 3, 6; min 3, max 7, std 1.8)
Confidence: 4.0 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.8

Absorb & Escape: Overcoming Single Model Limitations in Generating Heterogeneous Genomic Sequences

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

This paper proposes a post-training sampling method to perform compositional generation from autoregressive models and diffusion models for DNA generation.

Abstract

Keywords
Computational Biology, Genomics, Deep Learning, Generative Model

Reviews and Discussion

Review (Rating: 3)

The authors propose to generate DNA sequences using pre-trained DMs and then modify randomly selected segments using autoregressive models.

Strengths

The problem that the authors are trying to address, i.e., heterogeneity of sequences due to the presence of multiple distinct elements, is a valid question in drug discovery, specifically for very long DNA sequences.

Weaknesses

The main weakness of the paper is that the authors did not validate the proposed approach on real-world long sequences, and the simulated data is not convincing. Apart from that, it is not clear how they train the baselines for multiple tissues, and the authors generalize claims about all DMs while only considering a version they proposed themselves, which makes the claims hard to accept.

Questions

  • How did you train the model for DDSM in the case of multiple species? How did you incorporate species information?

  • How do the results look in the same dataset that DDSM and DFM used for evaluation, specifically in promoter design and enhancer design in a single species? It is unclear how the proposed DMs function initially in such scenarios.

  • What are the benefits of the proposed DMs and why can't one use previous pre-trained DMs, such as DDSM?

  • What is the effect of absorption and escape in unconditional generation?

  • How did you choose the length of segments in real-world and synthetic cases?

  • What is the effect of the segment selection?

  • Why can one not use evolution-based models for each segment, at least as a baseline?

  • Can you show a couple of generated samples for each method, showing what the segments look like in real and generated data?

Limitations

The main limitation of the paper is that it is highly dependent on the pre-trained models, as well as on selecting segments in the right manner.

Author Response

We appreciate your feedback. We have provided additional experimental results in response to your questions, including applying the A&E algorithm to Dirichlet Flow Matching on a new task, as well as answers to common questions from other reviewers. In the following, we respond to your questions about this paper.

Q1. A&E performance on the DDSM task

We ran A&E on this task with a default threshold $T_{\text{absorb}}=0.85$. The same dataloader and random seed as the DFM repository were used. The A&E model comprises AR and DM components, specifically the language model and distilled DFM checkpoints provided in the DFM repository. As shown in the table below, A&E achieved a state-of-the-art result of 0.0262 on the test split, despite using the second-best DM, distilled DFM.

| Method | MSE $\downarrow$ |
|---|---|
| Bit Diffusion (bit-encoding) | 0.0414 |
| Bit Diffusion (one-hot encoding) | 0.0395 |
| D3PM-uniform | 0.0375 |
| DDSM | 0.0334 |
| Language Model | 0.0333 |
| Linear FM | 0.0281 |
| Dirichlet FM | 0.0269 |
| Dirichlet FM distilled | 0.0278 |
| A&E (Language Model + Dirichlet FM distilled) | 0.0262 |

Q2. Validate the proposed approach in real world long sequences

The evaluation in Section 5 is based on real-world DNA sequences from the EPD database. For details on the dataset construction process, please refer to the global response. The only simulated data used is in Section 3, which illustrates the limitations of AR models and DMs in handling heterogeneous data. Additionally, we have provided results on the performance of A&E on the transcription profile conditioned promoter sequence design task from the DDSM paper (see the global response and the answer to Q3 below).

Q3. How did you train the model for DDSM in the case of multiple species?

We did not train DDSM for multi-species conditioning. DDSM is evaluated in an unconditional setting in Section 5.1, where the training data are promoter sequences from {human, rat, Macaca mulatta, mouse}.

Q4. The benefits of the proposed DMs and why can’t one use previous pre-trained DMs, such as DDSM?

The Latent Diffusion Model (LDM) proposed in this paper serves as a baseline model. We found that this simple baseline outperformed existing models such as DDSM in the unconditional generation scenario (shown in Table 3 of Section 5.2). Our aim is not to prove that the proposed baseline LDM outperforms other discrete DMs. Instead, the focus of this paper is to demonstrate the effectiveness of the proposed sampling algorithm A&E by showing that the composed model can outperform single models.

Q5. What is the effect of absorption and escape in unconditional generation?

We have included additional results below. As you can see, the A&E algorithm achieves the best performance in unconditional generation.

| Model | S-FID ↓ (256bp) | Cor_TATA ↑ (256bp) | MSE_TATA ↓ (256bp) | S-FID ↓ (2048bp) | Cor_TATA ↑ (2048bp) | MSE_TATA ↓ (2048bp) |
|---|---|---|---|---|---|---|
| VAE | 295.0 | -0.167 | 26.5 | 250.0 | 0.00 | 79.40 |
| BitDiffusion | 405 | 0.058 | 5.29 | 100.0 | 0.066 | 5.91 |
| D3PM (small) | 97.4 | 0.096 | 44.97 | 94.5 | 0.363 | 1.50 |
| D3PM (large) | 161.0 | -0.208 | 4.75 | 224.0 | 0.307 | 8.49 |
| DDSM (TimeDilation) | 504.0 | 0.897 | 13.4 | 1113.0 | 0.839 | 2673.7 |
| DiscDiff (Ours) | 57.4 | 0.973 | 0.669 | 45.2 | 0.858 | 1.74 |
| A&E (Ours) | 3.21 | 0.975 | 0.379 | 4.38 | 0.892 | 0.528 |

Q6. How did you choose the length of segments in real-world and synthetic cases?

Algorithm 1 requires segment annotations (start and end positions of each segment) to be available. However, this information is not available in most use cases. To overcome this issue, Algorithm 2 is proposed to run without knowing the true segment annotations, as detailed in line 193, Section 4.2. In both cases, the user does not need to choose the segment lengths.

For synthetic data used in the toy example, we already know the segment annotations by construction.
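To make the annotation-free procedure concrete, below is a minimal Python sketch of how one Fast A&E pass might be structured. The function name, the `ar_model.token_probs` interface, and the exact escape rule are our hypothetical stand-ins for illustration, not the paper's code; Algorithm 2 in the paper is the authoritative version.

```python
import numpy as np

def fast_absorb_escape(dm_sample, dm_probs, ar_model, t_absorb=0.85):
    """Minimal sketch of an annotation-free Absorb & Escape pass.

    dm_sample: (L,) array of token ids sampled from the diffusion model.
    dm_probs:  (L, 4) per-token emission probabilities from the DM.
    ar_model:  any object with a token_probs(prefix) method returning the
               AR distribution over {A, T, G, C} for the next position.
    All names here are illustrative placeholders.
    """
    seq = dm_sample.copy()
    L = len(seq)
    i = 0
    while i < L:
        # Absorb: enter a region where the DM has low confidence.
        if dm_probs[i, seq[i]] < t_absorb:
            j = i
            while j < L:
                ar_dist = ar_model.token_probs(seq[:j])
                proposal = int(np.argmax(ar_dist))
                # Escape: stop once the DM is again more confident
                # than the AR model about the proposed token.
                if dm_probs[j, proposal] > ar_dist[proposal]:
                    break
                seq[j] = proposal
                j += 1
            i = j
        i += 1
    return seq
```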

Q7. What is the effect of the segment selection?

Similar to Q6, neither Algorithm 1 nor Algorithm 2 requires segment selection. For line 5 of Algorithm 1, the segments can be sampled randomly.

Q8. Why can one not use evolution-based models for each segment?

Algorithm 2 handles the practical DNA generation scenario in which the user only has DNA sequences without annotations. In addition, conventional promoter design methods based on existing promoter libraries span only a short section of the promoter.

Q9. Examples of what the segments look like in real and generated data

Examples of generated sequences are available from the anonymous repo. Note that no segment annotations are available for the training data; we apply Algorithm 2 for generation.

Review (Rating: 7)

This paper addresses limitations in existing methods for generating genomic sequences, which struggle to capture the heterogeneous nature of DNA. The authors propose a new framework called Absorb & Escape (A&E) that combines the strengths of autoregressive (AR) models and diffusion models (DMs) to generate more realistic DNA sequences. They first analyze the shortcomings of AR models and DMs when used individually for heterogeneous sequence generation. Then they introduce A&E, which alternates between refining segments using an AR model (Absorb step) and updating the full sequence (Escape step). A practical implementation called Fast A&E is also presented. The authors evaluate their method on a large dataset of promoter sequences from 15 species. They compare Fast A&E against state-of-the-art baselines on metrics like motif distribution similarity, sequence diversity, and functional properties when inserted into genomes.

优点

  • Absorb & Escape is a plausible solution to addressing the limitations of autoregressive models and diffusion models
  • theoretical analysis of their method -- convergence proof
  • comparisons against previous diffusion models on promoter generation
  • comparison across multiple species DNA generation
  • deploys various metrics to assess generated sequences
  • new training data that considers multispecies

Weaknesses

  • If trained on original DDSM dataset, how would DiscDiff perform? I ask because the DDSM seems to perform quite poorly in terms of MSE in Table 3, but its performance was reasonable even when compared with Dirichlet Flow Matching (Stark et al, arxiv 2023).
  • The explanation of the autoregressive model struggling to capture the difference between p1 and p2 with a single set of parameters could be true, but hierarchical AR models could learn latent segment states. So, it is unclear why deep AR models not learning the heterogeneous structure is challenging in practice.
  • Regulatory sequences can be quite diverged across species where a promoter in one species may not be a strong promoter in another species. So, what is the benefit of multi-species promoter pretraining? Perhaps comparing vs just on human promoters (i.e. DDSM dataset using CAGE-seq) could help appreciate the nuances.
  • In 5.1, under baseline models, DNADiffusion is cited but not explored in Table 3. HyenaDNA is not benchmarked in Table 3. It would be good to get a genomic language model baseline since the purpose of DiscDiff is to resolve issues of diffusion models and autoregressive models.
  • It's not clear how metrics were calculated. How were motif distributions calculated? Also, what about the MSE of the conditional generation of the model (using Sei as an oracle)? This was done in the DDSM paper as well as DFM.
  • Dirichlet flow matching performed better than DDSM and so it would be worth benchmarking on this (but may be outside scope for this paper). At least mention the new diffusion models that beat out DDSM and the open question remains whether they require absorb and escape or whether they can also benefit.
  • Fig 3 doesn't show any comparisons with other diffusion models (eg. DDSM)
  • Table 5 should show distribution of enformers predictions to see if generated distribution are even close to natural sequences. The presented metrics are difficult to assess beyond relative comparisons.

Questions

Questions are integrated within Weaknesses (above).

Limitations

While the method is very interesting, potentially bridging the limitations of autoregressive models and diffusion models, the evaluation is a bit on the weak side. This makes it difficult to assess any advance. Better evaluations could help improve this work to demonstrate the true gains. It would also be worthwhile to explore Absorb & Escape using other SOTA diffusion models (e.g., DFM).

Author Response

We appreciate your feedback. Following your suggestions, we have provided additional results comparing A&E with DDSM and DFM in the global response, along with clarifications to address common questions from other reviewers. Below, we provide detailed responses to each of your questions.

Q1. If trained on the original DDSM dataset, how would DiscDiff perform?

We believe the DDSM task is easier than unconditional and class-conditional generation. The DDSM task uses a transcription profile, a real-valued array of the same length as the DNA sequence, as the condition, allowing the generative model to learn a direct mapping from condition to sequence.

This became evident when we applied the DDSM model to the unconditional generation task proposed in this paper during the exploration stage. When we initially used the original score net design from DDSM for unconditional generation, it could not learn the training set distribution. We scaled up the original score net by 10 times, which allowed DDSM to perform reasonably in unconditional sequence generation.

Due to time constraints, we could not adapt the DiscDiff architecture to the DDSM task. However, we provide additional experiments with the A&E algorithm on the DDSM task by using the pretrained language model as the AR component and the DFM distilled model as the DM component. As shown below, A&E achieved a new state-of-the-art on this task.

| Method | MSE $\downarrow$ |
|---|---|
| Bit Diffusion (bit-encoding) | 0.0414 |
| Bit Diffusion (one-hot encoding) | 0.0395 |
| D3PM-uniform | 0.0375 |
| DDSM | 0.0334 |
| Language Model | 0.0333 |
| Linear FM | 0.0281 |
| Dirichlet FM | 0.0269 |
| Dirichlet FM distilled | 0.0278 |
| A&E (Language Model + Dirichlet FM distilled) | 0.0262 |

Q2. Why deep AR models do not learn the heterogeneous structure

As detailed in line 120, Section 3 of the original paper, given the assumption that a sequence consists of independent segments $seg_1$ and $seg_2$, AR models struggle to learn the independence between two elements $x_k \in seg_1$ and $x_{2k} \in seg_2$. This issue becomes more challenging when the training data is insufficient. As AR models have intrinsic assumptions about token dependencies, overcoming this challenge could be difficult.

Q3. Benefit of multi-species promoter pretraining and Additional Experiment on DDSM dataset

We can train a single model for each species for promoter generation. However, for certain species (e.g., P. falciparum in EPD), we have limited data, which is insufficient for training a generative model. By training on cross-species data, our model leverages genomic similarities across different species, enabling promoter sequence generation for species with limited data. We expect future foundation models for genomics to make use of all available data, just as LLMs make use of web-scale data, and to perform downstream tasks with contextual prompts.

See the response to Q1 for the additional experimental results.

Q4. DNADiffusion is cited but not explored in Table 3

We thank the reviewer for pointing this out. This is a typo: we mislabeled DNADiffusion as BitDiffusion in Table 3. We ran the training code from the DNADiffusion repository and benchmarked it on the EPD datasets. The details of the experiments are shown in Appendix D.

Q5 a) How motif distributions were calculated

We use the EPDnew analysis tools (motif distribution) to plot the motif distributions and obtain the motif distribution data.

Q5 b) How MSE is calculated in conditional generation

We clarify the definition of the metric in the final section of the global response. Note that the Sei model is used for evaluating unconditional generation, as Sei is trained on the human genome. For conditional generation, the MSE is computed between the motif distribution of the generated sequences and that of natural sequences, as detailed in line 250, Section 5.2.

Q6. Benchmarking against DFM

See the response to Q1.

Q7. Fig 3 doesn’t show any comparisons with other diffusion models (e.g., DDSM)

One reason is that training on the whole dataset is expensive. We first select the best-performing DM and AR components via unconditional generation in Section 5.1 of the original paper. Second, Section 5.2 demonstrates the effectiveness of the A&E algorithm by showing that A&E outperforms the single models.

Q8. Distribution of enformers predictions

We are happy to include the single-track expression level prediction results in the appendix of the final draft, but due to the one-page PDF limit, we do not have additional space to include them here.

Comment

I thank the authors for their clarifications and additional experiments. I believe this is now a solid contribution and therefore will maintain my score of accept.

Review (Rating: 3)

The authors introduce a novel approach, called Absorb & Escape (A&E), for generating DNA sequences by combining the strengths of autoregressive (AR) models and diffusion models (DMs). The authors argue that existing single-model approaches struggle with the heterogeneous nature of genomic sequences, which consist of multiple functionally distinct regions. Their method initially generates a sequence using a DM and then refines sequence segments using an AR model. They evaluate their approach on various design tasks. The authors claim improved performance over existing methods, particularly in generating sequences that satisfy complex constraints and exhibit properties similar to natural genomic sequences.

Strengths

The combination of AR and DM is innovative, and these approaches do seem to excel and struggle at different aspects of the problem.

Biological sequence design is generally an interesting research area with potentially valuable applications.

Weaknesses

The biological application lacks compelling justification. The authors do not clearly explain why generating synthetic promoter sequences is necessary or advantageous compared to using and optimizing existing genomic promoters. A comparison with established promoter optimization methods is missing, which is crucial for demonstrating the practical value of this approach in genomic applications.

The paper suggests broader applicability of the A&E method beyond DNA design, but this is not sufficiently supported by the presented experiments. Additional tasks and datasets would be necessary to make a convincing case for the method's general effectiveness.

The paper lacks an analysis of how the various hyperparameters, such as the Absorb Threshold (T_Absorb), affect the model's performance. A sensitivity analysis would provide valuable insights into the robustness of the method.

The use of Sum of Squared Errors (SSE) to evaluate generated promoters against natural promoters is questionable. Given that Enformer makes predictions for a large region of sequence (896 consecutive 128 bp bins), this metric may not accurately reflect the quality or functionality of the generated promoters, which cover 1-2 bins.

Questions

Algorithm 2 assumes access to token-level probabilities from the diffusion model (p_DM), which is not a standard output for most diffusion models. The authors do not explain how they obtain these probabilities, which is a critical missing piece of information for understanding and implementing their method.

There is insufficient exploration of how the model learns to differentiate between different genomic regions (e.g., coding sequences, promoters, introns). Does the AR model produce different generations in these different regions? This analysis would strengthen the claim that the method effectively handles heterogeneous sequences.

The description of the Eukaryotic Promoter Database (EPD) content is vague. E.g. why would a promoter database contain coding sequence. A more specific explanation of the types of sequences it contains would help readers understand the dataset's composition and relevance.

In Algorithm 2 line 3, I cannot determine exactly what p^DM refers to, i.e., is this a global probability of the entire sequence or specific to a local region? Its presence within the line 2 loop suggests locality.

The convergence criteria for Algorithm 1 are not specified, which makes it difficult to assess the algorithm.

Limitations

Yes

Author Response

We appreciate your feedback on the paper. In our general response, we included additional results comparing A&E with other methods on additional datasets, a sensitivity analysis of the hyperparameter $T_{absorb}$, an explanation of how to define token-level probability from the diffusion model, and answers to other common questions raised by reviewers. We suggest reviewing the global response first. Below, we provide detailed responses to each of your questions.

Q1. Sensitivity Analysis over $T_{absorb}$

Figure 23 of the global response provides a sensitivity analysis of the A&E algorithm over $T_{absorb}$. As $T_{absorb}$ increases, the motif plot correlation between generated DNA and natural DNA first increases and then flattens. A larger $T_{absorb}$ encourages more frequent corrections by the AR model, improving the quality of generated sequences. However, a smaller $T_{absorb}$ is more computationally efficient as it requires fewer function evaluations. A&E is robust over a wide range of $T_{absorb}$. While it is best to use the validation dataset to choose the optimal value, we found that a value of 0.85 is generally appropriate for different tasks and scenarios.

Q2. Additional Task and Dataset

Following your suggestion, we have compared A&E with other SoTA discrete generation algorithms on the transcription profile conditioned promoter sequence design task from the DDSM paper. As shown in Table 6 of the uploaded PDF, A&E with the Language Model as the AR component and distilled DFM as the DM component achieves the smallest MSE of 0.0262, outperforming DFM and DDSM. This confirms the effectiveness of A&E across various tasks. Please see the global response for more details about the additional experiment.

| Method | MSE $\downarrow$ |
|---|---|
| Bit Diffusion (bit-encoding) | 0.0414 |
| Bit Diffusion (one-hot encoding) | 0.0395 |
| D3PM-uniform | 0.0375 |
| DDSM | 0.0334 |
| Language Model | 0.0333 |
| Linear FM | 0.0281 |
| Dirichlet FM | 0.0269 |
| Dirichlet FM distilled | 0.0278 |
| A&E (Language Model + Dirichlet FM distilled) | 0.0262 |

Q3. Motivation for Promoter Generation Task

Similar to prior work (ExpGAN [1]), conventional promoter design methods based on existing promoter libraries span only a short section of the promoter. In contrast, deep generative models for promoter design explore the distribution properties of promoter sequences, enabling the generation of new promoters that classical methods cannot achieve. This motivation is shared by the DDSM and DFM papers and other DNA generation tasks.

Q4. How Enformer predictions are used for Evaluation

The Enformer outputs are used to show how the promoter sequence influences the downstream target gene, rather than looking only at the bins corresponding to the promoter sequence itself. We inserted the generated promoter before the target gene, e.g., TP53, and checked how the promoter changed the expression level of the target gene. While the promoter sequence is only 128 bp, accounting for 1 bin in the prediction, the target gene is more than 10,000 bp long. For example, TP53 is about 25,000 bp, accounting for 195 bins in the Enformer output. Using SSE, we aggregate the change in expression levels across all the cell lines.
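As a rough illustration of this evaluation, the sketch below aggregates squared differences over the gene's output bins and all cell-line tracks. Here `predict_tracks` is a hypothetical stand-in for an Enformer forward pass and `gene_bins` for the slice of the 896 output bins covered by the target gene; neither name comes from the paper.

```python
import numpy as np

def promoter_sse(natural_ctx, generated_ctx, predict_tracks, gene_bins):
    """Sketch of the SSE metric: compare predicted CAGE-style tracks for
    a genomic context containing the natural promoter vs. the same
    context with the generated promoter inserted before the target gene.

    predict_tracks(seq) -> (896, n_tracks) array of Enformer-style
    predictions; the function name and signature are placeholders.
    """
    y_nat = predict_tracks(natural_ctx)[gene_bins]   # (bins, tracks)
    y_gen = predict_tracks(generated_ctx)[gene_bins]
    # Sum squared differences over all gene bins and all cell lines.
    return float(((y_nat - y_gen) ** 2).sum())
```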

Q5. Token-Level Probabilities from the Diffusion Model ($p^{DM}$)

$p^{DM}$ at line 3 of Algorithm 2 should be $p^{DM}(\mathbf{x}) \in \mathbb{R}^{L\times4}$, representing the token-level emission probability from the Diffusion Model. $p^{DM}(\mathbf{x}_i)$ can be retrieved from most discrete diffusion algorithms such as DDSM, DFM, and DiscDiff. For DNA generation, $p^{DM}(\mathbf{x})$ is usually stored as a variable logits $\in \mathbb{R}^{L\times4}$, where each row $p^{DM}(\mathbf{x}_i)$ represents the token-level probability distribution over {A, T, G, C}. $p^{DM}(\mathbf{x})$ is produced after the last sampling step. In the latent diffusion model used in this paper, it is produced by the second-stage decoder (Appendix A), while in DDSM and DFM, it is produced by a 1D-CNN (output_channel=4) in the score net.
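For instance, if the final sampling step exposes a logits array of shape (L, 4), the per-token probabilities can be recovered with a row-wise softmax. This helper is a sketch under that assumption, not code from any of the cited repositories:

```python
import numpy as np

def token_probs_from_logits(logits):
    """Row-wise softmax turning final-step logits (L, 4) into per-token
    probabilities over {A, T, G, C}; each row then sums to 1."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# Example: DM confidence in each sampled token (used by the Absorb test).
# p_dm = token_probs_from_logits(logits)
# confidence = p_dm[np.arange(len(seq)), seq]
```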

Q6. Description of EPD Dataset

We detailed the construction process of the EPD dataset in the global response. Briefly, we downloaded all the promoter records (30 million) from the EPD database. Each record is a tuple of (sequence, species, cell type, expression level). Sequences can be duplicated across records, so we aggregated the records by sequence, producing 160K unique sequences. Each sequence is a promoter-downstream sequence centered around the TSS, of length 256 or 2048.

Q7. How the Model Learns to Differentiate Between Different Genomic Regions

Following the suggestion from reviewer crFN, we performed BLASTN on the two halves of sampled sequences against promoter-like and protein-like segments from the training set. A larger score indicates higher similarity. The results confirm that A&E better handles heterogeneity than the AR model.

| Blasting Score | Nature Promoter | Nature Protein |
|---|---|---|
| A&E Promoter | 19.40 | 18.88 |
| A&E Protein | 18.89 | 19.00 |

| Blasting Score | Nature Promoter | Nature Protein |
|---|---|---|
| Hyena Promoter | 18.95 | 18.98 |
| Hyena Protein | 18.86 | 18.88 |

Q8. Convergence criteria for Algorithm 1

The convergence criterion of Algorithm 1 is that $\tilde{\mathbf{x}}^t$ converges to a fixed distribution, which can be checked via $\|\tilde{\mathbf{x}}^{t+1} - \tilde{\mathbf{x}}^t\| < \delta$.
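A minimal sketch of that stopping rule, assuming `step` wraps one Absorb/Escape iteration over a one-hot (or probability) representation of the sequence; both names are hypothetical:

```python
import numpy as np

def run_until_converged(step, x0, delta=1e-3, max_iters=100):
    """Iterate until successive sequences stop changing:
    ||x_{t+1} - x_t|| < delta (the criterion stated above)."""
    x = x0
    for _ in range(max_iters):
        x_next = step(x)
        if np.linalg.norm(x_next - x) < delta:
            return x_next
        x = x_next
    return x  # fall back to the last iterate if the budget is exhausted
```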

[1] Zrimec, J., et al., 2022. Controlling gene expression with deep generative design of regulatory DNA

Comment

We sincerely appreciate your valuable feedback. Below, we would like to provide additional clarification on how to interpret the output from Enformer.

Additional Clarification About Enformer Prediction

We understand that the CAGE assay displays aligned reads at transcription start sites (TSSs). However, many genes, including TP53, EGFR, and AKT1, possess multiple TSSs. While we could take only the bin value corresponding to one TSS, the exact number of TSSs can vary depending on the specific cell type and experimental conditions. Therefore, focusing solely on the bin corresponding to the TSS immediately following the promoter might overlook other important TSSs. This is why we consider the entire prediction range corresponding to the gene. The output from Enformer reflects the CAGE track values across all TSS positions, ensuring more complete coverage of transcription start sites.

Review (Rating: 6)

This paper presents a sampling approach called Absorb and Escape (A&E) that combines the strengths of diffusion models (DMs) and autoregressive models (AR models) for generating DNA sequences. The authors rightly point out that DNA sequences are generally composed of segments that do not follow the same distribution (i.e. they are heterogeneous sequences). Then, using a toy example, they clearly illustrate the shortcomings of DMs and AR models for modelling heterogeneous sequences but point out that they have complementary strengths, motivating them to combine these models. Their proposed A&E algorithm (and the practically useful Fast A&E algorithm) can be used at sampling time to leverage these complementary strengths - it first samples a sequence from a DM before iteratively refining it using an AR model. After training DMs and AR models on a new dataset of DNA sequences, the authors show that using Fast A&E leads to better samples when compared to using DMs or AR models alone. They also show that the sequences are sufficiently diverse with Fast A&E samples being less diverse than DM samples while being more diverse than AR model samples.

Strengths

  • Originality: The main novel contribution of this paper is the A&E algorithm for combining the strengths of DMs and AR models. The algorithm is very well-motivated, both based on prior work and the toy example. It is also simple to implement and understand. Combining DMs and AR models for generating DNA sequences has also not been attempted by previous work to the best of my knowledge.
  • Quality: The toy example is very useful and clearly illustrates the problems with current generative approaches for DNA generation. However, I have reservations about the dataset and task being used for the main evaluation although I am generally convinced of the A&E algorithm's usefulness. My questions and suggestions are listed in the next section.
  • Clarity: The paper is very well-written and easy to understand. However, a few details about the evaluations are missing, I have listed them in the next section.
  • Significance: The A&E algorithm is likely going to be useful for many computational genomics researchers. Even beyond DNA generation, I think the algorithm is an interesting and intuitive way to blend DMs and AR models and it could lead to more research in this area.

Weaknesses

Reservations about the main evaluation

  • The authors use gene sequences from the Eukaryotic Promoter Database (EPD) for training and testing various models. I do not see any details about how exactly these sequences are constructed. From Table 2 and Figure 4, I assume that the authors are extracting sequences of various lengths (either 256bp or 2048bp) centered at the TSS, but the paper needs a more comprehensive description of the data processing since this data is used in all of the main experiments.

  • I am not convinced of the utility of modelling these DNA sequences as presented in the paper. The authors train conditional diffusion models where the conditions are various species. I cannot think of a use case for sequences generated using this modelling strategy - there is no way to tune the promoter strength or target gene. Other DNA generation models use arguably better tasks. For example, DNADiffusion uses diffusion models to generate cell-type-specific regulatory elements which could be useful for synthetic biology applications. I would like to understand the authors' motivation for using this task.

  • The paper could be made much stronger by using the same evaluations as the previous papers to show that the usage of A&E with the DMs proposed in those papers (maybe in combination with HyenaDNA for AR modelling) leads to improved performance.

  • Furthermore, the authors somewhat abandon their focus on the heterogeneity of the DNA sequences being modelled in the main results. Although it is clear from Figure 4 that the motifs are located in the promoter as expected, one could perform two more simple analyses to show that A&E actually helps in modelling heterogeneous sequences:

    • If half the training sequence is promoter-like and the other half is protein-like, the authors could show that this is true in the sampled sequences as well (maybe by BLASTing the two halves against the promoter-like and protein-like segments from the training set).
    • An even simpler analysis would be to look at the protein-like segment of the samples and show that it obeys the rules of protein coding sequences - starts with a start codon and there are no premature stop codons.

Questions

In addition to the questions/suggestions above, I have the following minor ones below:

  • Since the absorb step is reliant on using an AR model to refine certain segments, is it better to use a masked language model instead of an AR model? This way, context from both sides of the segment can be incorporated and the probability approximation will be better.
  • How are probabilities for a sequence/sequence segment extracted from the DM?
  • There are discrepancies between the model names mentioned in lines 222-223 and those in Table 3.

Limitations

Apart from the assumptions in Appendix C, I could not find any other discussion of the limitations. I can think of the following other limitations:

  • The evaluation scheme is based on a single dataset, it is unclear how A&E will generalize to other datasets/settings.
  • Tuning $T_{absorb}$ seems non-trivial.

I do not foresee any negative societal impacts.

Author Response

We sincerely appreciate your insightful reviews. Following your suggestions, we have presented additional experiments on the DDSM dataset and a more detailed description of the dataset in the general response and uploaded PDF. We hope our general reply addresses your concerns regarding the selection of $T_{absorb}$ and the extraction of $p^{DM}(\mathbf{x}_i)$. Below, we provide detailed responses to each of your questions.

Q1. Using the same evaluations as previous papers

We present additional experimental results applying A&E to the transcription profile conditioned promoter sequence design task from the DDSM paper. As shown in Table 6 of the uploaded PDF, A&E with the Language Model as the AR component and DFM distilled as the DM component achieves the smallest MSE of 0.0262, outperforming DFM and DDSM. This confirms the effectiveness of A&E across various tasks.

| Method | MSE $\downarrow$ |
|---|---|
| Bit Diffusion (bit-encoding) | 0.0414 |
| Bit Diffusion (one-hot encoding) | 0.0395 |
| D3PM-uniform | 0.0375 |
| DDSM | 0.0334 |
| Language Model | 0.0333 |
| Linear FM | 0.0281 |
| Dirichlet FM | 0.0269 |
| Dirichlet FM distilled | 0.0278 |
| A&E (Language Model + Dirichlet FM distilled) | 0.0262 |

Q2: Details about how exactly these sequences are constructed

We have detailed the construction process of the EPD dataset in the global response. Briefly, we downloaded all the promoter records (30 million) from the EPD database. Each record is a tuple of (sequence, species, cell type, expression level). Sequences can be duplicated across records, so we aggregated the records by sequence, producing 160K unique sequences (as mentioned in line 218 of the original paper). Each sequence is a promoter-downstream sequence centered around the transcription start site (TSS), of length 256 or 2048.

Q3: Why species-wise generation and not cell-type conditioning?

There are two reasons for setting species conditioning as a task. First, promoter sequences from different species follow different distributions, making it an excellent testbed for benchmarking various generative algorithms. In contrast, cell-type conditioning is less distinct since all cell types share the same genome, and regulatory elements can bind to specific cell types only under certain conditions. However, if enough samples are provided, each sequence will appear in all cell types. In theory, the optimal task format should be (species, cell type, expression) co-conditioning, and this paper presents the first step towards that goal.

Additionally, while we could train a single model per species to generate promoters, limited data is available for certain species (e.g., P. falciparum in EPD), which is insufficient for training a generative model. By training on cross-species data, our model leverages genomic similarities across different species, enabling promoter sequence generation for species with limited data. One use case of unconditional promoter generation is expanding existing promoter libraries, which is crucial for synthetic biology. We expect future foundation models for genomics to make use of all available data, just as LLMs make use of web-scale data, and to perform downstream tasks with contextual prompts.

Q4. Heterogeneity of the DNA sequences in the main results

Following your suggestion, we performed BLASTN on the two halves of the sequences sampled by A&E against promoter-like and protein-like segments from the training set. A larger score indicates higher similarity. We will add these additional results to our Appendix.

| Blasting Score | Nature Promoter | Nature Protein |
|---|---|---|
| Absorb Escape Promoter | 19.40 | 18.88 |
| Absorb Escape Protein | 18.89 | 19.00 |

The results confirm the existence of heterogeneity in the generated and training sequences.
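For reproducibility, this comparison can be approximated with BLAST+ along the following lines. The helper below is our hypothetical sketch (it assumes `blastn` is on PATH and that the sampled halves and training segments have been written to FASTA files), not the exact pipeline used for the table above:

```python
import subprocess

def mean_bitscore(query_fasta, subject_fasta):
    """BLAST sampled half-sequences against training segments and
    average the bit scores (larger = more similar)."""
    out = subprocess.run(
        ["blastn", "-query", query_fasta, "-subject", subject_fasta,
         "-outfmt", "6 bitscore"],
        capture_output=True, text=True, check=True,
    ).stdout
    scores = [float(s) for s in out.split() if s]
    return sum(scores) / len(scores) if scores else 0.0
```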

Q5. Is it better to use a masked language model instead of an AR model?

We recognize this as a valuable suggestion. A masked language model could potentially improve the quality of generated sequences by considering the bidirectional context $P(x_{i:j} \mid x_{0:i-1}, x_{j+1:L})$. However, an AR model with caches (storing intermediate results for previously generated tokens) can be computationally efficient, as it requires at most as many function evaluations as the sequence length. In contrast, a masked language model needs to consider the whole context each time.

Q6. How are probabilities for a sequence/sequence segment extracted from the DM?

$p^{DM}(\mathbf{x}_i)$ can be retrieved from most discrete diffusion algorithms (such as DDSM, DFM, DiscDiff). In DiscDiff, logits are available after the second-stage decoder layer (Appendix A). In general, $p^{DM}(\mathbf{x})$ is usually stored as a variable logits $\in \mathbb{R}^{L\times4}$, where each row $p^{DM}(\mathbf{x}_i)$ represents a probability distribution over the {A, T, G, C} tokens. More details are available in our general response.

Q7. Discrepancies between the model names mentioned in lines 222-223 and those in Table 3.

We thank the reviewer for pointing this out. This is a typo: BitDiffusion in Table 3 should be DNADiffusion. We retrained DNADiffusion on our dataset and reported the results in Table 3.

Q8. How A&E generalises to other datasets/settings and evaluations

This is addressed with the additional experiment and more clarification about metrics and datasets in the general response. We believe motif distribution is essential to DNA generation, as it directly measures how DNA functions.

Q9. Tuning $T_{absorb}$ seems non-trivial.

We have found a default value of 0.85 effective in many settings. A sensitivity analysis of the algorithm on $T_{absorb}$ is provided in the general response.

Comment

I have read the authors’ rebuttal. I hope you’ll include these clarifications and additional information in your revision.

I don’t understand how you conditioned your model on the dense CAGE profile from the DDSM paper.

I appreciate the difficulty in evaluating generated DNA sequence quality, but motif composition is so simple that it's really not very compelling.

I think you’re misunderstanding Enformer predictions. The length of the target gene is irrelevant; Enformer predicts many assays but the closest to gene expression from a promoter is CAGE, and CAGE will produce aligned reads at the TSS only, unaffected by gene length.

Altogether, I’ll maintain my current scores.

Comment

Thanks for your reply. I would like to further clarify the following questions:

I’m unclear about how you conditioned your model on the dense CAGE profile from the DDSM paper.

Our evaluation was conducted using the pretrained checkpoint from the DFM repository. The A&E algorithm is a general approach that can be applied to various autoregressive (AR) models and diffusion models (DM). In the additional experiment we presented, A&E utilized the language model and distilled DFM checkpoints as the AR and DM components. Specifically, A&E can be implemented by modifying the PromoterModule class in the original DFM repository. No further changes were made to the underlying models or the score network.

Regarding the Motif Distribution as the Evaluation metric

We have indeed provided additional evaluation metrics, such as FID-based metrics, for human promoters. Although we have some reservations about the benchmark provided in the DDSM, we agree with the reviewers that it is still important to use it, as it better contextualizes our proposed approach within the existing literature. For other species, using motif distribution is a straightforward approach to assess the quality of generated sequences. For example, if a generative algorithm cannot correctly place a TATA-box in the appropriate position, it is generally unlikely to produce a valid promoter sequence.

PS: I wonder if the reviewer's comment might have been placed in the wrong thread.

Comment

Thank you for the detailed response! Most of my questions have been answered but I keep my original score as I am still unconvinced by the main evaluation task in the paper although the additional evaluations are useful. I have a few suggestions:

  1. Regarding Q3: I am still unconvinced of the utility of generating new promoter sequences as presented in the paper. I cannot think of a promoter generation scenario that would require us to only condition on the species (or even cell type) without conditioning on promoter strength. I understand that each species can have a different distribution of promoters but each promoter in the distribution has a different strength (i.e. how much expression it can drive). In most generation scenarios, we want a promoter that drives a certain level of expression and not just a random promoter from the distribution of promoters for that species. Therefore, I encourage the authors to consider incorporating expression strength into their conditioning. Alternatively, using an evaluation task similar to DNADiffusion where they only model cell-type-specific regulatory elements could be more meaningful since any random sequence from the distribution being modeled could be potentially useful.

  2. Regarding Q5: Given the length of sequences being modeled, I doubt that the efficiency improvements would be very significant when using a masked language model vs. an AR. The improvement in probability estimation might be worth the slight increase in runtime.

Comment

We sincerely appreciate your suggestions, which are indeed very helpful. However, we would like to offer some additional clarification regarding the purpose of the promoter generation task.

Further Clarification about promoter generation task

We agree that incorporating promoter strength into the conditioning is a logical extension of our work and something we plan to explore in future iterations. However, we believe that the current approach provides a necessary first step in mapping the regulatory landscape across species, which could lead to new insights that are not apparent when focusing solely on promoter strength.

In contexts such as synthetic biology or evolutionary studies, understanding the full range of available promoters within a species can provide valuable insights into the species’ inherent regulatory potential, independent of expression strength. This foundational knowledge can then inform subsequent research where specific promoter strengths are prioritized. For example, a recent study published in Science [1] demonstrated that human promoter sequences might follow relatively simple syntax. Such studies require a thorough examination of native promoter sequences, and generative models can offer significant insights in these scenarios.

Additionally, from a benchmarking perspective, if a generative algorithm struggles with species-wise generation, it would likely face challenges in other settings where only class information is used for conditioning. Conversely, strong performance in species-specific class conditioning suggests the algorithm may also perform well when conditioning on other factors, such as expression strength classes conditioning.

[1] Dudnyk, K., Cai, D., Shi, C., Xu, J. and Zhou, J., 2024. Sequence basis of transcription initiation in the human genome. Science, 384(6694), p.eadj0116.

Comment

Thank you for the further clarification! Since this paper proposes interesting methods and the benchmarking argument proposed by the authors is valid, I am increasing my score.

Author Response (Global)

We appreciate the valuable feedback from all reviewers. Overall, the reviewers agree that our proposed algorithm, A&E, is well-motivated and clearly illustrates the limitations of AutoRegressive (AR) models and Diffusion Models (DMs).

Most reviewers (crFN, m7HE, hZwW) recognize the significant real-world impact of the task of DNA generation. Both reviewers m7HE and crFN believe that the proposed method, A&E, is a plausible solution to bridge the limitations of existing AR models and DMs. Additionally, reviewer crFN suggests that A&E “could lead to more research beyond DNA generation.” While there are some concerns about the evaluation presented in the paper, the additional results below show that A&E achieves state-of-the-art performance on both existing DDSM datasets and the more challenging EPD datasets.

Benchmarking A&E against DFM and DDSM

Reviewers suggest benchmarking A&E against DDSM [1] and DFM [2] on the transcription profile conditioned promoter sequence design task used in the DDSM paper. We ran A&E on this task with a default threshold $T_{absorb}=0.85$. The same evaluation procedure as the DFM repository was used. The A&E model comprises AR and DM components, specifically the language model and distilled DFM checkpoints provided in the DFM repository. As shown in the table below (also Table 6 in the uploaded PDF file), A&E achieved a state-of-the-art result of 0.0262 on the test split. We are happy to include the results of A&E with different combinations of DM and AR components in the final draft of the paper.

| Method | MSE $\downarrow$ |
|---|---|
| Bit Diffusion (bit-encoding) | 0.0414 |
| Bit Diffusion (one-hot encoding) | 0.0395 |
| D3PM-uniform | 0.0375 |
| DDSM | 0.0334 |
| Language Model | 0.0333 |
| Linear FM | 0.0281 |
| Dirichlet FM | 0.0269 |
| Dirichlet FM distilled | 0.0278 |
| A&E (Language Model + Dirichlet FM distilled) | 0.0262 |

Sensitivity analysis of $T_{\text{Absorb}}$

In response to reviewers Y7is and crFN's request, we include sensitivity analysis results showing the influence of the hyperparameter $T_{\text{Absorb}}$ of the A&E algorithm in Figure 23 (uploaded as a one-page PDF). As $T_{\text{Absorb}}$ increases, the motif plot correlation between generated DNA and natural DNA first increases and then flattens. This is because a larger $T_{\text{Absorb}}$ encourages more frequent corrections by the AR model, which generally improves the quality of generated sequences. However, a smaller value of $T_{\text{Absorb}}$ is more computationally efficient as it requires fewer function evaluations of the AR model. In practice, we found that a value of 0.85 is generally appropriate for different tasks and scenarios, and we will add this sensitivity analysis to the appendix of our paper.

Clarification about $p^{DM}(\mathbf{x}_i)$ from the Diffusion Model

$p^{DM}$ at line 3 of Algorithm 2 should be $p^{DM}(\mathbf{x}) \in \mathbb{R}^{L\times4}$, representing the token-level emission probability from the Diffusion Model. To make it clearer: while $p^{DM}(\mathbf{x}_i)$ is difficult to obtain for continuous DMs such as DDPM, it can be retrieved from most discrete diffusion algorithms (such as DDSM, DFM, DiscDiff). For DNA generation, $p^{DM}(\mathbf{x})$ is usually stored as a variable logits $\in \mathbb{R}^{L\times4}$, where each row $p^{DM}(\mathbf{x}_i)$ represents a probability distribution over the {A, T, G, C} tokens. A concrete example is DFM's official implementation.

EPD dataset construction process

Issues with the DDSM dataset: While the DDSM dataset [1] has brought attention to the task of DNA generation, we caution against using it as a general benchmark. In the DDSM task formulation, the condition (the CAGE expression value) is an array of real-valued signals with the same length as the sequence to be generated. This differs from most existing generation tasks, where the condition is typically a class. In fact, rather than a generation task, it is more akin to a machine translation task, borrowing a metaphor from NLP. This motivated us to develop a more challenging generation task using the EPD dataset.

EPD Dataset: Figure 22 in the uploaded PDF illustrates the construction process of the EPD dataset. Promoter-downstream sequence pairs are curated from EPD. While the EPD database contains 30 million samples, we aggregated the sequences to avoid repetition. This differs from the DDSM dataset, where the same promoter could appear in different instances. Two datasets were created: 1) EPD (256): a 256bp window centered on a gene's transcription start site (TSS), split into an upstream promoter and a downstream segment; 2) EPD (2048): a 2048bp window centered on the TSS, covering broader genetic regions.

Metrics: Motif Distribution: Motif plots have been widely used in prior DNA generation works [3,4,5]. In this work, the motif plots are obtained using EPDnew analysis tools. To improve upon prior work, we compute the MSE and the correlation between two motif frequency distributions $y_1, y_2 \in \mathbb{R}^L$ with the following formulas to quantitatively evaluate the differences between two motif plots. The MSE evaluates the distance between motif distributions, while the correlation evaluates the change in the shape of the motif distributions.

$$\text{MSE} = \frac{1}{L} \sum_{i=1}^L (y_{1}^i - y_{2}^i)^2$$

$$r = \frac{\sum_{i=1}^L (y_{1}^i - \overline{y_1})(y_{2}^i - \overline{y_2})}{\sqrt{\sum_{i=1}^L (y_{1}^i - \overline{y_1})^2 \sum_{i=1}^L (y_{2}^i - \overline{y_2})^2}}$$
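In numpy, these two metrics reduce to a few lines (assuming `y1` and `y2` are equal-length motif-frequency arrays; this is a sketch, not the evaluation code itself):

```python
import numpy as np

def motif_mse(y1, y2):
    """Mean squared error between two motif frequency distributions."""
    return float(np.mean((y1 - y2) ** 2))

def motif_correlation(y1, y2):
    """Pearson correlation between the two motif plots."""
    return float(np.corrcoef(y1, y2)[0, 1])
```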

[1] Avdeyev, P., et al. DDSM.
[2] Stark, H., et al. DFM.
[3] Wang, Y., et al. Promoter design.
[4] Taskiran, I.I., et al. Synthetic enhancers.
[5] Zrimec, J., et al. ExpGAN.

Comment

Thank you so much for your insightful feedback on our paper! We hope you found our responses useful. As the discussion period is coming to a close, please feel free to ask any remaining questions you may have. We're happy to provide further clarification!

Final Decision

The paper sheds light on important weaknesses of AR- and DM-based models for genomic sequence design and proposes an intuitive and promising solution combining both model types.

The authors have provided convincing additional experiments and clarification that alleviate initial concerns on the evaluation and the methodology. We strongly urge them to incorporate these into the manuscript as these significantly strengthen the quality of the submission.

In particular, it is critical to add the benchmarking experiments and the sensitivity analysis, as well as the insightful comments on the pertinence of promoter generation and the rationale for considering the EPD dataset.