DNALONGBENCH: A Benchmark Suite For Long-Range DNA Prediction Tasks
Proposing a benchmark dataset for long-range DNA prediction tasks
Abstract
Reviews and Discussion
The paper proposes DNALongBench, a new benchmark for genomic tasks that consider long-range dependencies of up to 1 million base pairs. The benchmark provides five tasks with different requirements: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signal. The authors evaluate five models per task, including a task-specific expert model, a CNN baseline, and three DNA foundation models. The authors claim that DNALongBench is the most comprehensive benchmark tailored to long-range DNA tasks to date.
Strengths
- The paper studies a significant problem in computational biology. DNA regulation and chromatin structure appear to be among the key challenges in the field, and improvements could lead to novel insights into gene regulation and cellular regulatory networks.
- The paper is well-written and well structured.
- The effort of collecting data from different sources, curating the data, and providing it for download is a strong part of the work.
- For each individual task, the biological importance, the data, and the evaluation are well described. The authors seem to be well informed about the literature and the underlying problem.
Weaknesses
Major:
- It is unclear to me how many of the tasks really require long context to achieve strong results.
- A clear limitation of the work, which is also stated by the authors in the conclusion, is that they do not evaluate transformer-based models on the different tasks.
- The codebase is not well documented, there is only a readme and one line referring to a notebook with code for the dataloaders.
Please see more explanations below.
Regarding 1:
Some overview tables/plots in the appendix that clearly show the long-range dependencies of the tasks would make sense I think.
Regarding 2:
While it makes sense that models for genomic data require long contexts, the argument of NOT training attention-based models appears weak. The authors use a CNN baseline which obviously does not have a receptive field spanning millions of tokens / base pairs. However, even this “lightweight” baseline seems to outperform the foundation models in some of the tasks (see e.g. Table 4 RSAP Human and Avg, Table 5 WB, SNSES, MS). I would rather say that models with limited context size would be very useful to argue in favor of the development of a long-range benchmark if they really perform worse than non-attention based FMs. Otherwise, the benchmark might still add value, but probably the line of argumentation should be different.
Regarding 3:
I think a benchmark should seamlessly integrate with developer code for a new method to be useful in practice. The authors state that they envision that DNALongBench could be a valuable resource for future evaluations, however, this requires clear documentation, a user friendly interface, and in the best case a leaderboard. A benchmark should integrate into my own codebase with only a few lines of code but it seems like I would have to integrate my code into DNALongBench instead. This clearly limits the usability of DNALongBench and I strongly recommend that the authors work on their code base.
For example, an API along the following lines would be very helpful to increase usability:

    from myModel import MyModel
    from DNALongBench import Benchmark

    benchmark = Benchmark(task="someTask")
    model = MyModel()

    def prediction_wrapper(task):
        prediction = model.inference(task.x)
        return prediction

    results = benchmark(prediction_wrapper)
    print(results)
The provided dataloaders might be useful; however, I'm missing a clear description of and interface for the evaluation. Overall, the lack of documentation is an important weakness here. While the API could still be improved before the CRC deadline, I hope the authors can already show some improvements during the rebuttal period.
For a good example benchmark see e.g. [1] (featured paper at NeurIPS 2022 DBT)
Minor:
Line 363 typo: tetails -> details
Line 303: Maybe I missed it but SNPs are not introduced.
Line 479: Additional the
Questions
See also weaknesses.
- How many of the tasks really require long-range interaction predictions?
- How do attention-based models perform on the tasks?
Continuing our answer to Weakness 1 & Q1
In summary, our findings indicate that shorter sequence contexts result in diminished performance across all five tasks. The extent of this performance decline appears to be task-specific. For instance, tasks like enhancer target-gene prediction and contact map prediction show a significant decrease in performance. These outcomes align with the current understanding, emphasizing how regulatory elements rely on long-range interactions for various biological processes. We hope our benchmark will serve as a valuable resource for researchers, enabling the development of model architectures that effectively capture long-range sequence dependencies.
Weakness 2 & Q2: A clear limitation of the work, which is also stated by the authors in the conclusion, is that they do not evaluate transformer based models on the different tasks.
R2: We thank the reviewer for highlighting this limitation. All datasets included in our benchmarks require input lengths of at least 100kb, with the contact map prediction task necessitating input lengths of up to 1M. Fine-tuning traditional transformer-based models, such as Nucleotide Transformer or DNABERT-2, on these tasks is nearly impossible due to their inherent input length constraints and computational demands. While strategies to extend the context window of transformer-based models and fine-tune them for long-range genomic tasks represent an exciting avenue for future research, we believe that this is beyond the scope of our benchmark study. Our primary goal in this work is to provide biologically meaningful long-range benchmark tasks for evaluating long-range modeling capabilities. We hope that our benchmark will inspire the community to address this important challenge and develop transformer-based approaches capable of handling these tasks in DNA sequences. We hope this clarifies the reviewer's concern.
Weakness 3: The codebase is not well documented, there is only a readme and one line referring to a notebook with code for the dataloaders.
R3: We appreciate the reviewer’s comment regarding our codebase. We have updated the repository (https://anonymous.4open.science/r/DNALongBench-FB1D) to include more guidance for using our benchmark. Additionally, we have introduced a pyproject.toml file to streamline the setup and provide an API for easier integration.
Weakness 4: While it makes sense that models for genomic data require long contexts, the argument of NOT training attention-based models appears weak. The authors use a CNN baseline which obviously does not have a receptive field spanning millions of tokens / base pairs. However, even this “lightweight” baseline seems to outperform the foundation models in some of the tasks (see e.g. Table 4 RSAP Human and Avg, Table 5 WB, SNSES, MS). I would rather say that models with limited context size would be very useful to argue in favor of the development of a long-range benchmark if they really perform worse than non-attention based FMs. Otherwise, the benchmark might still add value, but probably the line of argumentation should be different.
R4: We thank the reviewer for pointing this out. First, due to current computational resource limitations, training an attention-based model capable of processing millions of tokens is infeasible. Second, as outlined in our responses to Weakness 1 & Q1, reducing the context size consistently resulted in performance degradation across all five tasks. This highlights the critical role of long-range context dependencies in addressing these biologically meaningful challenges. Thus, we believe our benchmark serves as a meaningful evaluation framework to advance long-range genomic modeling.
Weakness 5: For DNALongBench to be a valuable resource for future evaluations, it needs clear documentation, a user-friendly interface, and ideally a leaderboard.
R5: We appreciate the reviewer’s suggestions. We have incorporated your feedback by updating our GitHub repository documentation, improving the user interface, and creating a leaderboard website. We simplified the DNALongBench installation process (https://anonymous.4open.science/r/DNALongBench-FB1D) to make it more accessible. Users can now import the DNALongBench package and select tasks to evaluate via an API interface. Additionally, we launched a leaderboard website (https://dnalongbench.github.io/DNALongBench) to demonstrate the performance of different models.
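For illustration, the intended usage pattern looks roughly like the sketch below. The module, class, and method names (DNALongBench, Benchmark, evaluate, the task identifier) are placeholders for this illustration and may differ from the released API; please refer to the repository README for the exact interface.

```python
from DNALongBench import Benchmark   # hypothetical entry point; actual names may differ

import numpy as np

# Stand-in for the user's own model: any callable that maps a batch of
# one-hot encoded sequences (B, L, 4) to per-example scores works here.
def my_model(sequences: np.ndarray) -> np.ndarray:
    return sequences.mean(axis=(1, 2))   # dummy scores, one per sequence

# Load one of the five tasks; the task identifier below is illustrative.
benchmark = Benchmark(task="enhancer_target_gene")

def prediction_wrapper(batch):
    return my_model(batch["sequence"])

# The benchmark iterates over the held-out split, collects predictions,
# and returns the task-specific metric (e.g., AUROC for ETGP).
results = benchmark.evaluate(prediction_wrapper)
print(results)
```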
Minor issues: Line 363 typo: tetails -> details, Line 303: Maybe I missed it but SNPs are not introduced, Line 479: Additional the
R6: We have corrected these minor issues. Please refer to the revised manuscript, with changes highlighted in red in the uploaded version: (1) line 363: we fixed the typo; (2) line 215: we replaced “SNPs” with “single-nucleotide polymorphisms (SNPs)”; (3) we could not find “Additional the” on line 479.
I acknowledge the efforts of the authors to improve the code base. Further, I'm happy with the analysis provided by the authors regarding the importance of long-range interactions in the benchmark. Having looked at the revised version of the manuscript, I would still recommend also including these observations in the appendix. I think these are valuable insights that could be of interest to the community. However, this is my personal opinion and I would leave it to the authors to decide about further updates of the manuscript. The authors addressed my concerns and questions and I have decided to revise my score to reflect the progress made.
We sincerely appreciate the reviewer’s response and are happy that we have addressed the reviewer’s concerns. Following the reviewer’s further suggestion, we have included all of these results in our appendix. Please refer to Appendix F and line 88 of the Introduction, highlighted in red.
We appreciate the reviewer's recognition of the significance of our paper in addressing key challenges in computational biology, particularly in the area of DNA sequence foundation model evaluation. Below, we provide a point-by-point response detailing how we have improved the manuscript and addressed the concerns raised.
Weakness1 & Q1: It is unclear to me how many of the tasks really require long context to achieve strong results.
R1: We thank the reviewer for raising this important question. We evaluated all five tasks under different input context lengths and found that all of them indeed require long contexts. We explain our findings for each task below.
For the CMP task: using Caduceus-Ph as an example, we evaluated the model's performance with input sizes of 409,600, 307,200, and 204,800 base pairs (bps), corresponding to 200, 150, and 100 bins, respectively. The evaluation metric was the stratum-adjusted correlation coefficient (SCC). Our results show a clear trend: as the context length decreased, prediction performance declined.
| CMP | HFF | H1hESC | GM12878 | IMR90 | HCT116 | Average |
|---|---|---|---|---|---|---|
| Caduceus-Ph (409,600 bps) | 0.153 | 0.130 | 0.101 | 0.138 | 0.145 | 0.133 |
| Caduceus-Ph (307,200 bps) | 0.066 | 0.076 | 0.051 | 0.054 | 0.149 | 0.079 |
| Caduceus-Ph (204,800 bps) | 0.082 | 0.090 | 0.047 | 0.053 | 0.146 | 0.083 |
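For concreteness, the shorter-context inputs in the table above can be obtained by center-cropping each example around the target region; the numbers above imply a 2,048 bp bin size (409,600 bp = 200 bins). The snippet below is an illustrative sketch of such preprocessing, not the exact code used in our pipeline.

```python
import numpy as np

BIN_SIZE = 2048  # implied by 409,600 bp = 200 bins

def center_crop(seq_onehot: np.ndarray, n_bins: int) -> np.ndarray:
    """Crop a one-hot encoded sequence (L, 4) to its central n_bins * BIN_SIZE bp."""
    target_len = n_bins * BIN_SIZE
    start = (seq_onehot.shape[0] - target_len) // 2
    return seq_onehot[start:start + target_len]

# Example: reduce a 409,600 bp input (200 bins) to 100 bins = 204,800 bp.
full_input = np.zeros((200 * BIN_SIZE, 4), dtype=np.float32)
short_input = center_crop(full_input, n_bins=100)
print(short_input.shape)  # (204800, 4)
```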
For the RISP task: using HyenaDNA as an example, we evaluated this task with two different lengths: 196k and 131k bps.
| RISP | Human | Mouse |
|---|---|---|
| Context Length=196k | 0.298 | 0.396 |
| Context Length=131k | 0.268 | 0.393 |
For the ETGP task: instead of simply reducing the input size, we preserved the enhancer sequences and the sequences near gene promoter regions, as these regions are critical for the interactions. To assess the impact of context length, we randomly shuffled the central portion of the input sequence (50% of the original length). Using Caduceus-Ph as an example, we observed a notable decline in performance when this central region was shuffled, highlighting the importance of the DNA sequence between enhancers and gene promoters. This is consistent with previous observations [1].
| Model | ETGP |
|---|---|
| Caduceus-Ph | 0.826 |
| Caduceus-Ph w/ shuffled context | 0.418 |
[1] Whalen, S., Truty, R. M., & Pollard, K. S. (2016). Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature Genetics, 48(5), 488-496.
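A minimal sketch of this shuffling control is shown below (illustrative only: it shuffles single bases uniformly within the central window; the exact procedure used in our experiments may differ in details such as the window boundaries):

```python
import random

def shuffle_center(seq: str, frac: float = 0.5, seed: int = 0) -> str:
    """Shuffle the central `frac` of the input sequence, leaving the
    promoter- and enhancer-proximal flanks intact."""
    rng = random.Random(seed)
    center_len = int(len(seq) * frac)
    start = (len(seq) - center_len) // 2
    center = list(seq[start:start + center_len])
    rng.shuffle(center)
    return seq[:start] + "".join(center) + seq[start + center_len:]

# Toy example: the two flanks are preserved, the middle 50% is scrambled.
print(shuffle_center("AAAAACGTACGTACGTTTTT", frac=0.5))
```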
For the eQTLP task: similar to ETGP, shuffling the central portion of the input sequence led to a decline in Caduceus-Ph's performance, indicating that long sequence contexts benefit prediction accuracy.
| eQTLP | CCF | WB | Thyroid | SNSES | SSELL | MS | NT | AT | AS | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Caduceus-Ph | 0.597 | 0.594 | 0.527 | 0.586 | 0.574 | 0.538 | 0.588 | 0.547 | 0.541 | 0.565 |
| Caduceus-Ph w/ shuffled context | 0.519 | 0.515 | 0.507 | 0.533 | 0.539 | 0.521 | 0.542 | 0.507 | 0.504 | 0.521 |
For the TISP task: using the expert model Puffin-D, we assessed performance with the original context and with a shortened (flank-shuffled) context on the holdout chromosomes 8–10. For the original input, predictions focus on the central 50 kb within each 100 kb sequence. To study the impact of the flanking regions, we shuffled the sequences in the first and last 25 kb. Across all settings, we observed a slight performance decline with the shorter context.
On the validation chromosome 10:
| TISP | FANTOM CAGE | ENCODE CAGE | ENCODE RAMPAGE | GRO-cap | PRO-cap | Average |
|---|---|---|---|---|---|---|
| Puffin-D | 0.593 | 0.331 | 0.294 | 0.249 | 0.394 | 0.372 |
| Puffin-D with shorter context | 0.530 | 0.283 | 0.450 | 0.212 | 0.331 | 0.361 |
On the test chromosomes 8-9:
| TISP | FANTOM CAGE | ENCODE CAGE | ENCODE RAMPAGE | GRO-cap | PRO-cap | Average |
|---|---|---|---|---|---|---|
| Puffin-D | 0.808 | 0.710 | 0.749 | 0.624 | 0.774 | 0.733 |
| Puffin-D with shorter context | 0.781 | 0.699 | 0.715 | 0.604 | 0.702 | 0.700 |
This paper presents DNALongBench, which consists of 5 datasets which attempt to test a predictive model’s ability to make predictions on DNA sequences which are somewhat related to long-range interactions in the genome. Each of these datasets consists of a set of inputs and outputs, along with a well-defined performance metric. The input is a long DNA sequence and the output is a label which can be a scalar, or one or more functional-readout profiles which are parallel to the input sequence.
These 5 datasets and tasks are: 1) given a sequence with an enhancer and a promoter in it, predict whether or not the enhancer is linked to the promoter functionally; 2) given a long sequence, predict the (symmetric) 2D contact map; 3) given a DNA sequence, predict the functional profiles of several regulatory-genomic assays in both human and mouse; 4) given the reference and mutated sequences of putative eQTLs, predict whether or not the eQTL is causal/functional; 5) given a DNA sequence, predict base-pair-level profiles of transcription initiation.
For each task, the authors computed the performance of HyenaDNA and Caduceus, which have been fine-tuned for each task separately. They also trained a simple CNN as a baseline and compared against an “expert model”, i.e., the model that was trained specifically for that task. The authors report the resulting performance metrics and describe some general trends, such as the observation that the expert model generally performed the best, and that the long-range language models (e.g. HyenaDNA) typically underperformed.
Strengths
Good test of actual functional abilities of long-range DNA language models
In the recent literature, a lot of effort has been made to port over LLMs from NLP to DNA sequences (and other biological tasks). In these works, the evaluation of DNA LLMs has been very sparse, and it was unclear as to whether or not these models could handle real biological tasks of significance (instead of very simplistic predictive tasks like species prediction, which is not biologically meaningful or challenging today as a computational task).
This paper offers a much-needed evaluation of whether or not these DNA LLMs can actually handle biologically meaningful and challenging tasks which rely on long-range interactions (which was the initial promise of these models). Through various experiments, this paper shows that, in fact, these DNA LLMs are not particularly well-poised to predict meaningful biological tasks, even after fine-tuning.
Good selection of benchmarks with meaningful biological significance
The selection of the 5 benchmark tasks is well-thought-out, and they span many areas of interest, from regulatory genomics, to chromatin organization, to variant-effect prediction (for eQTLs). Together, these tasks cover a decent space in testing the ability of models to capture trans effects in the genome.
Clear writing and well-organized paper
The paper is well-written, with clear organization and motivation. Reading through, things were simple to understand, and structured in a very good way.
Weaknesses
Other than large DNA language models, not many other models are tested
The DNA LLMs that were tested (i.e. HyenaDNA and Caduceus) are good representations of this sort of model, but there aren’t that many other models that were tested in this benchmark. Other than these DNA LLMs, this work trains a very simplistic CNN, and applies a single expert model (generally, the model which was published along with the data). Together, the expert model is an example of one of the best models possible, and the CNN is an example of one of the easiest deep-learning-based baselines. The analyses in this paper show how DNA LLMs perform relative to these two endpoints, but there are not any other models that are tested.
In addition to these DNA LLMs, many researchers will use other models and architectures which can also detect long-range interactions (such as those described in section 2.2). Especially due to the computational cost of DNA LLMs, many researchers will necessarily be relying on other architectures. It would be much more informative to show the performance of these other architectures (e.g. Enformer, Borzoi, even DanQ, etc., both fine-tuned and not fine-tuned on the specific task), to see how these models also perform. In its current form, this paper is largely just a benchmark for DNA LLMs, even though several other models exist, whose performance also falls somewhere in the spectrum between the basic CNN and the expert model. It would be far more meaningful of a benchmark if those models are shown, as well.
Limited results on the various advantages and disadvantages of certain models
There are only limited results in the paper exploring and discussing the relative advantages and disadvantages of certain models. For a benchmarking paper, it would be useful to have more discussion on the various areas where certain models do better than others, and an exploration of why (including different ablation studies). Some interesting discussion points (a very non-comprehensive set of suggestions) may be:
- How useful is reverse-complement augmentation or reverse-complement-aware architectures for these tasks?
- Since the expert model really does much better, how much of its improved performance is due to its focus on a specific task (whereas the DNA LLMs are not trained on a specific task)? How much does fine-tuning the DNA LLMs allow them to perform much better on these specific tasks?
- How much performance is gained from the sheer size/expressivity of the DNA LLMs? Would a different architecture of similar size/expressivity be comparable to the DNA LLMs (e.g. testing an Enformer-like architecture of high capacity, for each of these tasks).
- Section 5.2 suggests a difference between long-range and short-range prediction performance in these models. Is the failure of DNA LLMs to predict specific peaks due to the excessively long input sequences (i.e. so the model has a hard time focusing on specific parts of a sequence)?
Suggestions for improvements in writing
There are several places where the paper has some typos or other grammatical errors. Here is a non-comprehensive list:
- An en-dash (not a hyphen, and not an em-dash) should be between words/objects which specify a pair: enhancer–target gene, enhancer–promoter, variant–promoter
- “megabase pairs” should be “mega basepairs” or “mega base-pairs” (L220)
- “complied” should be “compiled” (L278)
- “tetails” should be “details” (L363)
Additionally, there are areas which could use more clarity
- In the tables comparing performance between the models, because the expert model is effectively always by far the best one, it would be useful to underline the second-best model, as well (or use another way to emphasize the second-best model in addition to the best)
- In Table 2, the shape of the output in contact-map prediction is a bit confusing, as contact maps are typically thought of as 2D objects rather than a single 99681-vector
- The order of the datasets/tasks is not the same in the Table 2, the main text, and the supplement in different sections; there doesn’t seem to be a reason why the ordering of these 5 tasks can’t be presented in the same ordering each time
- In the data repositories at https://dataverse.harvard.edu/, it is not clear what each file represents and how the data is organized vis a vis the descriptions in the appendix; a README should be added to each repository
Questions
- What is the stratified sampling approach used in some of these tasks (i.e. enhancer–target-gene prediction, eQTL prediction); that is, what is the stratification based off of?
- In any of these datasets/tasks, is it possible that the same genomic coordinate appears in multiple dataset splits (e.g. train and test)? This could be possible if the examples arising from the same gene end up in different splits, for example
- When masking out other enhancers in the enhancer–target-gene task, what were these enhancers masked with?
- What was the reasoning for removing genes with too few positive/negative pairs (i.e. enhancer–target-gene prediction, eQTL prediction)?
Q4: What was the reasoning for removing genes with too few positive/negative pairs (i.e. enhancer–target-gene prediction, eQTL prediction)?
R4: The rationale is to ensure that the model has sufficient data to learn the sequence dependency between an enhancer and its target genes. Given the limitations of CRISPR screening, not all potential enhancer candidates for a gene are experimentally tested. As a result, we have decided to concentrate solely on genes that have an adequate number of enhancer candidates and verified positive pairs.
==
We believe that our responses have effectively addressed the reviewer's concerns. If there are any further questions or suggestions, we would be more than happy to discuss them. We look forward to any additional feedback the reviewer may provide.
We appreciate the reviewer’s thoughtful and constructive questions. We are particularly encouraged that you found our work to be a much-needed evaluation of long-range DNA LLMs’ ability to handle biologically meaningful and challenging tasks. Our answers to the specific points are given below:
Weakness1: In addition to these DNA LLMs, many researchers will use other models and architectures which can also detect long-range interactions (such as those described in section 2.2). Especially due to the computational cost of DNA LLMs, many researchers will necessarily be relying on other architectures. It would be much more informative to show the performance of these other architectures (e.g. Enformer, Borzoi, even DanQ, etc., both fine-tuned and not fine-tuned on the specific task), to see how these models also perform. In its current form, this paper is largely just a benchmark for DNA LLMs, even though several other models exist, whose performance also falls somewhere in the spectrum between the basic CNN and the expert model. It would be far more meaningful of a benchmark if those models are shown, as well.
R: We thank the reviewer for raising this important question. One of the primary goals of this work is to provide a set of biologically meaningful long-range genomic tasks and use them to evaluate the performance of current DNA LLMs. DNA LLMs are promising, but current evaluation benchmarks mostly focus on short-range genomic tasks or limited biological data types. We first compiled five long-range genomic datasets for benchmarking. To make an effective comparison, for each task we selected one of the best models specifically designed for that task. These expert models represent widely used model architectures. For instance, we evaluated the Enformer model, a CNN and transformer hybrid model, on the regulatory sequence activity prediction (RSAP) and eQTL prediction (eQTLP) tasks. Puffin-D, a CNN-based model with two upward-downward passes and residual connections similar to a U-Net, was used for the transcription initiation signal prediction (TISP) task. For the enhancer target-gene prediction (ETGP) task, we chose the activity-by-contact (ABC) model. This model is simple but has repeatedly been shown to perform well against more complicated deep learning models; for instance, the authors of the gABC model [1] found that gABC outperforms Enformer. Note that we also included gABC as an expert model on two additional CRISPRi screening-based enhancer-target gene datasets, whose performance is shown in the following tables. As suggested by the reviewer, it would be quite interesting to compare the performance of the above-mentioned best models across tasks, but this deviates from our primary goal of evaluating the DNA LLMs. In response to the reviewer, we have now conducted a series of ablation experiments to explore the advantages and limitations of DNA LLMs (also see our response to reviewer kFm8). For instance, we found that context length indeed has a large impact on model performance, suggesting that DNA LLMs can learn some long-range dependencies. However, the relatively poor performance of DNA LLMs compared with these best models indicates that there is still a lot of room for the development of DNA LLMs. Model size is important but not decisive (see the response to Weakness 2.3). We hope this response answers the reviewer's questions.
Results on the dataset curated from Schraivogel et al. [2]:
| Model | ETGP |
|---|---|
| Expert Model (ABC) | 0.896 |
| Expert Model (gABC) | 0.890 |
| CNN | 0.711 |
| HyenaDNA | 0.752 |
| Caduceus-Ph | 0.906 |
| Caduceus-PS | 0.895 |
Results on the dataset curated from Gasperini et al. [3]:
| Model | ETGP |
|---|---|
| Expert Model (ABC) | 0.824 |
| Expert Model (gABC) | 0.831 |
| CNN | 0.730 |
| HyenaDNA | 0.781 |
| Caduceus-Ph | 0.788 |
| Caduceus-PS | 0.793 |
[1] Hecker, D., Behjati Ardakani, F., Karollus, A., Gagneur, J., & Schulz, M. H. (2023). The adapted Activity-By-Contact model for enhancer–gene assignment and its application to single-cell data. Bioinformatics, 39(2).
[2] Targeted Perturb-seq enables genome-scale genetic screens in single cells. Schraivogel et al. Nature Methods. 2020.
[3] A genome-wide framework for mapping gene regulation via cellular genetic screens. Gasperini et al. Cell. 2019.
Weakness 2.1: How useful is reverse-complement augmentation or reverse-complement-aware architectures for these tasks?
R: We appreciate the reviewer’s question regarding the usefulness of reverse-complement augmentation and reverse-complement-aware architectures for these tasks. To address this, we first evaluated the effect of reverse-complement (RC) augmentation using Caduceus-Ph on the enhancer target-gene prediction (ETGP) task. We also implemented shift augmentation by randomly shifting the input by 1-3 base pairs (bps) toward either the 5’ or the 3’ end of the DNA sequence, as this is another common data augmentation practice in the field. According to our results shown below, applying reverse-complement or shift augmentation led to comparable or slightly worse performance for Caduceus-Ph compared to training without these augmentations. We think this may be because the augmentations introduce additional variability in the patterns of positive samples that the model must learn. Furthermore, promoter-enhancer orientation is important for this task: the input sequence is oriented with the gene positioned at the 5’ end and the enhancer located at the 3’ end. Because the task requires orientation-specific information, reverse-complement augmentation might not be very useful here.
| | Expert Model | CNN | Caduceus-Ph | Caduceus-Ph w/ shift augmentation | Caduceus-Ph w/ reverse-complement augmentation |
|---|---|---|---|---|---|
| ETGP | 0.926 | 0.797 | 0.826 | 0.821 | 0.820 |
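For reference, the two augmentations can be sketched as follows (illustrative; the exact shift-direction handling and padding in our training code may differ):

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Reverse-complement augmentation of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def random_shift(seq: str, max_shift: int = 3, pad: str = "N") -> str:
    """Shift the sequence by 1-3 bp in either direction, padding with N."""
    shift = random.randint(1, max_shift)
    if random.random() < 0.5:
        return seq[shift:] + pad * shift   # shift content toward the 5' end
    return pad * shift + seq[:-shift]      # shift content toward the 3' end

seq = "ACGTACGTAC"
print(reverse_complement(seq))  # GTACGTACGT
print(random_shift(seq))
```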
Weakness2.2: Since the expert model really does much better, how much of its improved performance is due to its focus on a specific task (whereas the DNA LLMs are not trained on a specific task)? How much does fine-tuning the DNA LLMs allow them to perform much better on these specific tasks?
R2.2: We have already fine-tuned the DNA LLMs on each of the classification or regression tasks; therefore, the reported performance of each model is obtained after training or tuning that model on the specific task. We think the performance gap between the expert models and the DNA LLMs is likely due to the fact that current DNA LLMs still lack the ability to fully capture the complex dependencies across long-range contexts. We hope our benchmark data can be a good resource for researchers to test model architectures that more effectively learn long-range sequence dependencies.
Weakness 2.3: How much performance is gained from the sheer size/expressivity of the DNA LLMs? Would a different architecture of similar size/expressivity be comparable to the DNA LLMs (e.g. testing an Enformer-like architecture of high capacity, for each of these tasks).
R2.3: In our setting, we utilize Enformer as the expert model for the eQTLP, ETGP, and RSAP tasks. The model sizes for Enformer and DNA LLMs are reported below. Notably, Enformer has a significantly larger model size and higher capacity compared to HyenaDNA and Caduceus. However, previous literature suggests that larger model size alone does not always lead to a better performance. For instance, in the Caduceus paper [1], the authors found that DNABERT-2, despite its larger size (117M parameters), underperforms compared to Caduceus in several long-range genomic benchmark tasks. This highlights the critical importance of architectural design tailored to the specific task, which has a substantial impact on final performance outcomes.
| | Enformer | HyenaDNA | Caduceus |
|---|---|---|---|
| Model Size | 252M | 1.6M | 1.9M |
[1] Schiff, Y., Kao, C. H., Gokaslan, A., Dao, T., Gu, A., & Kuleshov, V. (2024). Caduceus: Bi-directional equivariant long-range dna sequence modeling. arXiv preprint arXiv:2403.03234.
Weakness 2.4: Section 5.2 suggests a difference between long-range and short-range prediction performance in these models. Is the failure of DNA LLMs to predict specific peaks due to the excessively long input sequences (i.e. so the model has a hard time focusing on specific parts of a sequence)?
R2.4: We thank the reviewer for this interesting question. To answer it, we evaluated the DNA LLMs under different input sequence lengths. We chose the contact map prediction task because chromatin interaction patterns are known to exist at different length scales. We evaluated this task under three input sizes, i.e., 409,600, 307,200, and 204,800 bps, using the stratum-adjusted correlation coefficient (SCC) as the evaluation metric. We found that performance declines as the context length is reduced, indicating that, although a performance gap remains between the expert model Akita and the DNA LLMs, utilizing long-range sequences still helps DNA LLMs obtain better performance.
Contact map prediction with different context length:
Context length 409600 bps
| | HFF | H1hESC | GM12878 | IMR90 | HCT116 | Average |
|---|---|---|---|---|---|---|
| CNN | 0.025 | 0.024 | 0.010 | 0.013 | 0.001 | 0.015 |
| Caduceus-Ph | 0.153 | 0.130 | 0.101 | 0.138 | 0.145 | 0.133 |
Context length 307200 bps
| | HFF | H1hESC | GM12878 | IMR90 | HCT116 | Average |
|---|---|---|---|---|---|---|
| CNN | 0.029 | 0.032 | 0.012 | 0.022 | 0.007 | 0.020 |
| Caduceus-Ph | 0.066 | 0.076 | 0.051 | 0.054 | 0.149 | 0.079 |
Context length 204800 bps
| | HFF | H1hESC | GM12878 | IMR90 | HCT116 | Average |
|---|---|---|---|---|---|---|
| CNN | -0.003 | 0.049 | 0.012 | -0.006 | 0.003 | 0.011 |
| Caduceus-Ph | 0.082 | 0.090 | 0.047 | 0.053 | 0.146 | 0.083 |
Suggestions for writing: There are several places where the paper has some typos or other grammatical errors. Additionally, there are areas which could use more clarity.
R: Thanks for pointing out these typos and errors. We have revised the manuscript accordingly and uploaded the new version. Please refer to the revised manuscript with changes highlighted in red: (1) L220, “mega basepairs”; (2) L278, “compiled”; (3) L363, “details”.
Q1: What is the stratified sampling approach used in some of these tasks (i.e. enhancer–target-gene prediction, eQTL prediction); that is, what is the stratification based off of?
R1: We apologize for any confusion. In both enhancer target-gene and eQTL predictions, our dataset exhibits a significant class imbalance. This occurs because true enhancer-gene pairs constitute only a small fraction of the pairs tested. To address this, we employed stratified sampling rather than random sampling. This approach prevents the scenario where random sampling might allocate all negative samples to either the training or validation set, which would result in biased model training and evaluation. Specifically, we first screened out genes with fewer than two positive and two negative enhancer-gene pairs. While creating the training, validation, and test sets, stratified sampling ensures that each of the training and validation sets contains at least one positive and one negative sample. The remaining samples are then randomly divided among the training, validation, and test sets. We implemented the same strategy for eQTL prediction for similar reasons.
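A minimal sketch of this per-gene stratified split is given below (illustrative; the split ratios and exact bookkeeping in our pipeline may differ):

```python
import random
from collections import defaultdict

def stratified_split(pairs, seed=0, ratios=(0.7, 0.15, 0.15)):
    """pairs: list of (gene, label) tuples with label in {0, 1}.
    Genes with fewer than two positive or two negative pairs are removed;
    for each remaining gene, train and validation each receive at least one
    positive and one negative pair, and the rest are split randomly."""
    rng = random.Random(seed)
    by_gene = defaultdict(lambda: {0: [], 1: []})
    for gene, label in pairs:
        by_gene[gene][label].append((gene, label))

    train, val, test = [], [], []
    for gene, groups in by_gene.items():
        if len(groups[0]) < 2 or len(groups[1]) < 2:
            continue  # screen out genes with too few positives/negatives
        rest = []
        for label in (0, 1):
            rng.shuffle(groups[label])
            train.append(groups[label].pop())  # guaranteed example in train
            val.append(groups[label].pop())    # guaranteed example in val
            rest.extend(groups[label])
        rng.shuffle(rest)
        n_train = int(len(rest) * ratios[0])
        n_val = int(len(rest) * ratios[1])
        train += rest[:n_train]
        val += rest[n_train:n_train + n_val]
        test += rest[n_train + n_val:]
    return train, val, test

# Toy usage: GENE_B is screened out because it has only one positive pair.
pairs = [("GENE_A", 1)] * 3 + [("GENE_A", 0)] * 5 + [("GENE_B", 1)] + [("GENE_B", 0)] * 4
train, val, test = stratified_split(pairs)
```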
Q2: In any of these datasets/tasks, is it possible that the same genomic coordinate appears in multiple dataset splits (e.g. train and test)? This could be possible if the examples arising from the same gene end up in different splits, for example
R2: We thank the reviewer for the question. We agree that it is important to ensure there is no information leakage between the training and test sets. For the enhancer target-gene task, each gene has its own enhancer candidates within 450 kb of its transcription start site. The input to the model is the sequence between the gene's promoter and the enhancer candidate. Enhancer-gene interactions are mostly gene-specific, so holding out some genes from the training set might be very informative. To avoid scenarios where an enhancer candidate seen in the training set also appears within the input sequence of a test example, we masked all enhancer candidates with the letter N in the input sequence.
Q3: When masking out other enhancers in the enhancer–target-gene task, what were these enhancers masked with?
R3: Within each input sequence, the other enhancer candidates of the potential target gene are masked with the letter N. The reason is to eliminate confounding effects that might arise from the presence of other enhancer candidates.
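A minimal sketch of this masking step is shown below (illustrative; coordinates are relative to the input window, and the exact implementation may differ):

```python
def mask_other_enhancers(seq: str, intervals, keep):
    """Replace every enhancer-candidate interval except `keep` with N.
    intervals: list of (start, end) coordinates within the input sequence."""
    out = list(seq)
    for start, end in intervals:
        if (start, end) == keep:
            continue
        out[start:end] = "N" * (end - start)
    return "".join(out)

# Toy example: two candidate enhancers; the second one is masked with N.
seq = "ACGTACGTACGTACGTACGT"
print(mask_other_enhancers(seq, intervals=[(2, 6), (12, 16)], keep=(2, 6)))
```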
Dear reviewer FfkD,
We sincerely appreciate the time and effort you've dedicated to reviewing our paper. In our response, we have carefully and diligently addressed your thoughtful concerns and insightful questions. As the discussion deadline is approaching, please do not hesitate to let us know if you have any additional questions or need further clarification. Your insights and feedback are very important for us to improve our work.
Best, Authors of paper 5418
This paper proposes a set of five benchmarking tasks for genomic language models, with a focus on tasks that involve modeling long-range DNA dependencies, including enhancer-target gene interaction and 3D genome organization. These tasks represent a diverse set of biologically important characteristics. The authors benchmark five baseline methods on each task (task-specific expert models, CNN-based models, and three recent DNA language models) and find that the DNA language models they benchmark do not outperform the task-specific models.
Strengths
- The tasks included in the benchmark are biologically significant and diverse, especially the focus on tasks with different dimensionalities (1D or 2D) and granularities (binned or base pair level). In addition, the paper does a good job explaining the biological significance of each task. Both the increased focus on long-range tasks and base pair resolution tasks represent novel aspects of this benchmark compared to other previously published benchmarks.
- The results are presented in a clear and concise manner
- The performance of the three evaluated DNA language models — HyenaDNA, Caduceus-Ph, and Caduceus-PS — on these tasks reveal important limitations of current DNA language models at modeling long range dependencies, even when the models are able to incorporate long context lengths.
Weaknesses
- In my opinion, the main weakness of this paper is novelty. Although the benchmarking tasks are important and biologically motivated, all five of the tasks and corresponding datasets have already been used to benchmark models in previous publications and in some previous benchmarks, such as BEND (Marin, et al. 2023) and LRB (Kao, et al. 2024). The “Regulatory sequence activity” task directly uses the training, validation, and test set sequences from Enformer. Therefore, the main contribution of this paper is consolidating these tasks into one resource and comparing the performance of DNA language models and more traditional supervised approaches. One potential way to increase the novelty and contribution of this work could be to incorporate more datasets that weren’t used in previous publications to increase the amount of benchmarking data for each task. For example, for the enhancer-target gene task, the dataset is relatively small, and different experimental datasets often suffer from their own technical and experimental biases. Since a number of similar datasets exist and are publicly available, it would improve and increase the impactfulness of the benchmark to aggregate multiple similar datasets.
Questions
- The term “Expert model” is misleading. For example, L369-370 refers to “expert models tailored to each task,” but in some cases, such as using Enformer as an expert model for eQTL prediction, the expert model has not actually been tailored to this task. An alternative could be to refer to these models as “State of the art models.” In addition, in some cases the choice of Expert/SOTA model could be revisited. For example, for the enhancer-target gene task, Enformer (Avsec et al. Nature Methods (2021)) or Borzoi (Linder et al. 2023) should be used as a state of the art model instead of Activity by Contact. In addition, for the eQTL prediction task, Borzoi was shown to outperform Enformer in Linder et al. 2023.
- Use of the term DNA foundation model and which models are classified in this category is inconsistent throughout the paper. For example, on L146, Avsec et al. 2021a is cited as a DNA foundation model (although it is not actually pre-trained on only DNA sequences), but later in the paper it’s used as an “expert model” baseline in contrast to the DNA foundation models.
Minor suggestions:
- L51-53 - Karollus et al. Genome Biology (2023) could be cited with reference to this point
- L154-155 is unclear and should be reworded: “It has shown promising performance in long-range species classification tasks despite the problem itself is not well defined in real applications.”
Continuing our reply to the weakness
We also curated and processed four additional datasets for the contact map prediction task (Table A10, lines 263-265, and lines 817-822). Briefly, we downloaded Hi-C data for four different cell types from the 4DN data portal and processed them following the same procedure described in the Akita paper. Due to time constraints, we only evaluated the performance of different models on the 5% downsampled data. We observed some overfitting issues, likely due to the small size of the training data. We plan to update the table and leaderboard once the training process is complete. The data sources and results have been included in the manuscript in Tables A10 and A11 and Appendix H.
Table A10: Data sources for the additional datasets for contact map prediction
| Cell type | Accession ID on 4DN data portal |
|---|---|
| HAP1 | 4DNFIWGGYEW2 |
| HeLa | 4DNFIIT9PKS9 |
| HUVEC | 4DNFI2R1VS8E |
| K562 | 4DNFIIT9PKS9 |
Table A11: Performance on the additional datasets for contact map prediction
| | HAP1 | HeLa | HUVEC | K562 | Average |
|---|---|---|---|---|---|
| Expert Model | 0.0100 | 0.0120 | 0.0010 | 0.0010 | 0.0060 |
| CNN | 0.023 | 0.0115 | -0.00025 | 0.0008359 | 0.0087 |
| HyenaDNA | -0.0307 | 0.342 | 0.0590 | -0.00766 | 0.0906 |
| Caduceus-Ph | 0.0483 | 0.0120 | 0.0163 | 0.0257 | 0.0256 |
| Caduceus-PS | 0.0477 | 0.0121 | 0.0154 | 0.0259 | 0.0252 |
Q1: The term “Expert model” is misleading. For example, L369-370 refers to “expert models tailored to each task,” but in some cases, such as using Enformer as an expert model for eQTL prediction, the expert model has not actually been tailored to this task. An alternative could be to refer to these models as “State of the art models.” In addition, in some cases the choice of Expert/SOTA model could be revisited. For example, for the enhancer-target gene task, Enformer (Avsec et al. Nature Methods (2021)) or Borzoi (Linder et al. 2023) should be used as a state of the art model instead of Activity by Contact. In addition, for the eQTL prediction task, Borzoi was shown to outperform Enformer in Linder et al. 2023.
R: We thank the reviewer for the valuable suggestions. In response, we have replaced the phrase “expert models tailored to each task” with “expert models that have shown state-of-the-art results” on lines 369–370. However, we opted not to replace all mentions of expert models with SOTA for several reasons, as we believe this term may lead to some confusion.
For example, in the enhancer target-gene task, the Borzoi model was trained to predict gene expression, and the authors of Borzoi demonstrated its ability to predict target genes by evaluating gradient attributions. While this approach appears reasonable, recent reports, including [1] and [2], indicate that such models may underestimate distal enhancers and exhibit asymmetric bias depending on the direction relative to the transcription start site (TSS). Furthermore, a variant of the ABC model, named gABC, has been shown to outperform Enformer [3]. We included gABC as an additional expert model and reevaluated the task using two newly added datasets. Both ABC and gABC demonstrate similar performance levels.
Lastly, we want to highlight that the primary goal of this work is to provide a set of biologically meaningful and challenging long-range genomic tasks and use them to evaluate the performance of current DNA LLMs. While DNA LLMs are promising, existing evaluation benchmarks primarily focus on short-range genomic tasks or limited biological data types. Our evaluations show that the expert models selected in this study already outperform the DNA LLMs. Models like Borzoi are still under development and have not yet been published, so we chose to be conservative about referring to such models as SOTA.
[1] Karollus, A., Mauermeier, T., & Gagneur, J. (2023). Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome biology, 24(1), 56.
[2] Toneyan, S., & Koo, P. K. (2023). Interpreting cis-regulatory interactions from large-scale deep neural networks for genomics. bioRxiv.
[3] Hecker, D., Behjati Ardakani, F., Karollus, A., Gagneur, J., & Schulz, M. H. (2023). The adapted Activity-By-Contact model for enhancer–gene assignment and its application to single-cell data. Bioinformatics, 39(2).
Q2: Use of the term DNA foundation model and which models are classified in this category is inconsistent throughout the paper. For example, on L146, Avsec et al. 2021a is cited as a DNA foundation model (although it is not actually pre-trained on only DNA seqeunces), but later in the paper it’s used as an “expert model” baseline in contrast to the DNA foundation models.
R: We agree with the reviewer. We have fixed the issue by removing Avsec et al 2021a on L146.
Minor suggestions1: L51-53 - Karollus et al. Genome Biology (2023) could be cited with reference to this point
R: According to the reviewer’s suggestion, we have cited this paper. Please see the reuploaded version, where the citation has been added on line 53 and is highlighted in red.
Minor suggestions2: L154-155 is unclear and should be reworded: “It has shown promising performance in long-range species classification tasks despite the problem itself is not well defined in real applications.”
R: We thank the reviewer for pointing this out. We have revised the paper accordingly. The changes are highlighted in red in the reuploaded version on lines 154–155. The sentence has been replaced with: “It has demonstrated promising performance in long-range species classification tasks, even though the practical applications of this problem remain poorly defined.”
==
We sincerely thank the reviewer for their thoughtful and constructive comments, which have significantly improved the quality and clarity of our paper. We believe that we have thoroughly addressed all the concerns raised and made substantial revisions accordingly. If the reviewer has any additional questions or suggestions, we would be happy to discuss them further. We look forward to any additional feedback the reviewer may have.
Thanks to the authors for their detailed response and updates to the manuscript. The authors have sufficiently addressed my questions regarding choice of expert models and terminology in the manuscript. In addition, the authors have partially addressed my concerns about novelty with the addition of additional datasets for the enhancer-target gene and contact map prediction tasks. While I still believe there is significant overlap between this benchmark and prior benchmarking studies, I have updated my score to reflect the novelty gained by incorporating additional datasets not used in previous evaluations.
We appreciate the reviewer’s valuable feedback and insightful questions. These questions help improve the overall quality of our manuscript. In response to the reviewer's concerns, we have incorporated additional datasets and conducted further experiments. Our detailed answers to the specific points are provided below:
Weakness: In my opinion, the main weakness of this paper is novelty. Although the benchmarking tasks are important and biologically motivated, all five of the tasks and corresponding datasets have already been used to benchmark models in previous publications and in some previous benchmarks, such as BEND (Marin, et al. 2023) and LRB (Kao, et al. 2024). The “Regulatory sequence activity” task directly uses the training, validation, and test set sequences from Enformer. Therefore, the main contribution of this paper is consolidating these tasks into one resource and comparing the performance of DNA language models and more traditional supervised approaches. One potential way to increase the novelty and contribution of this work could be to incorporate more datasets that weren’t used in previous publications to increase the amount of benchmarking data for each task. For example, for the enhancer-target gene task, the dataset is relatively small, and different experimental datasets often suffer from their own technical and experimental biases. Since a number of similar datasets exist and are publicly available, it would improve and increase the impactfulness of the benchmark to aggregate multiple similar datasets.
R: We thank the reviewer for the valuable comments. First, we want to clarify that three of the five benchmark datasets (i.e., enhancer-target gene, contact map, and transcription initiation signal) are not included in any existing benchmarks, such as BEND and LRB. Enhancer-target gene prediction and contact maps represent well-known regulatory mechanisms involving DNA interactions that span thousands or millions of base pairs. These tasks also introduce new and more challenging task types, including 2D contact map prediction and base pair resolution regression, which were not addressed in any existing benchmarks.
We agree with the reviewer that incorporating more relevant datasets would increase the size of benchmarking data. To address this, we have collected two additional CRISPRi screening-based enhancer-target gene datasets (line 231-232, line 787-790) and evaluated the performances of different models (Table A8 and Table A9). CRISPRi screening-based methods are among the most direct and accurate methods to determine whether an enhancer regulates specific target genes. To the best of our knowledge, only three such datasets are publicly available, and we have incorporated all of them into our benchmarking task.
The following tables show the performance of different models on the enhancer target-gene task. We observe a consistent trend in which the expert models, including ABC and its variant gABC [1], outperform the CNN baseline and the three DNA language models. The authors of the gABC model found that gABC achieves higher performance than ABC and Enformer. In our evaluations on the benchmark data, we observed comparable performance between the ABC and gABC models. These results have been included in the manuscript in Tables A8 and A9, and are summarized in Appendix G.
Two additional datasets for the enhancer-target gene prediction task
Table A8 ETGP results on data curated from Schraivogel et al
| Model | ETGP |
|---|---|
| Expert Model (ABC) | 0.896 |
| Expert Model (gABC) | 0.890 |
| CNN | 0.711 |
| HyenaDNA | 0.752 |
| Caduceus-Ph | 0.906 |
| Caduceus-PS | 0.895 |
Table A9 ETGP results on data curated from Gasperini et al
| Model | ETGP |
|---|---|
| Expert Model (ABC) | 0.824 |
| Expert Model (gABC) | 0.831 |
| CNN | 0.730 |
| HyenaDNA | 0.781 |
| Caduceus-Ph | 0.788 |
| Caduceus-PS | 0.793 |
[1] Hecker, D., Behjati Ardakani, F., Karollus, A., Gagneur, J., & Schulz, M. H. (2023). The adapted Activity-By-Contact model for enhancer–gene assignment and its application to single-cell data. Bioinformatics, 39(2).
[2] Targeted Perturb-seq enables genome-scale genetic screens in single cells. Schraivogel et al. Nature Methods. 2020.
[3] A genome-wide framework for mapping gene regulation via cellular genetic screens. Gasperini et al. Cell. 2019.
We sincerely thank the reviewer for their thoughtful feedback and careful review of our responses and manuscript updates. We are encouraged to hear that our revisions have effectively addressed many of your concerns. Regarding your comments on the perceived overlap of benchmark data with BEND and LRB, we would like to take this opportunity to further clarify the distinctions between DNALongBench and these prior benchmarks. As we explain below, DNALongBench not only incorporates more diverse experimental data types but also provides significantly larger datasets, enabling more comprehensive evaluations compared to BEND and LRB.
Comparison with BEND:
BEND contains seven different classification tasks, five of which involve input DNA sequences limited to 512 bp. The input length for gene finding ranges between 1,433 and 14,000 base pairs. The enhancer annotation task is the only task that can be considered a long-range genomic task, with an input length of around 100 kb, whereas the input length in DNALongBench is up to 1M base pairs. Additionally, this task includes only 285 samples, far fewer than DNALongBench. Furthermore, the task goal in BEND differs significantly from DNALongBench: in BEND, the task is to predict whether each 128 bp bin is an enhancer, without addressing enhancer-target gene relationships, which are a more challenging core focus of DNALongBench.
Comparison with LRB:
LRB consists of seven tasks. Only CAGE and variant effect prediction have some overlap with what we used by deriving from the Enformer datasets. However, both the dataset sizes and task settings in LRB are not comparable with DNALongBench.
For instance, the authors of LRB only selected 50 CAGE tracks from the 638 available in humans. In contrast, DNALongBench includes a total of 6,991 tracks from human and mouse -- over 100 times more than LRB, as shown in the table below. This decision was made to ensure a fair and consistent comparison between DNA LMs and a strong supervised baseline (Enformer). While LRB also used Enformer data, their setup included only 50 tracks, with Enformer serving as the task-specific expert model baseline. We believe that providing the full training dataset, as used by Enformer, enables a more comprehensive and fair comparison.
The setting of the eQTL variant effect prediction task also differs significantly. In LRB, the task is to predict whether a SNP is a positive eQTL, while DNALongBench aims to predict positive eQTL-gene pairs. Additionally, LRB combines data across all tissues, whereas DNALongBench separates datasets by tissue to better capture biologically meaningful tissue-specific regulatory interactions. The GTEx paper [1] demonstrated that eQTLs exhibit consistent tissue specificity (Figure 3a), supporting our approach of treating eQTLs from different tissues as distinct datasets.
In conclusion, while we acknowledge that some overlap exists with previous benchmarks, we believe DNALongBench offers a fundamentally more comprehensive and scalable framework for evaluating long-range genomic tasks. By including larger datasets, longer input sequences, and task types that directly address biologically more meaningful gene regulatory dynamics, DNALongBench fills critical gaps left by BEND and LRB and would be very useful for the field of genomic LM development. We hope this clarifies our contributions and demonstrates the unique value of DNALongBench. We remain open to any additional questions or feedback and greatly appreciate the reviewer's insights, which have been invaluable in strengthening this work.
Table R1. Number of tracks used by DNALongBench and LRB
| | DNALongBench (Human) | DNALongBench (Mouse) | LRB |
|---|---|---|---|
| CAGE | 683 | 357 | 50 |
| ChIP-seq | 3991 | 1058 | 0 |
| DNase-seq | 674 | 101 | 0 |
| ATAC-seq | 0 | 127 | 0 |
[1] GTEx Consortium (2017). Genetic effects on gene expression across human tissues. Nature, 550(7675), 204-213
We sincerely appreciate the reviewer for raising their remaining concerns. We are pleased to report the final evaluation results for the additional datasets curated for contact map prediction, shown in the table below. Akita outperforms the other methods across all four cell types, consistent with our observations on the five datasets initially used in the Akita paper. In our initial experiments, we tried using the HUVEC Hi-C data; however, we found that this dataset has very low coverage (around 10% of that of the other cell lines), leading to lower performance on the test set. We therefore replaced it with another cell line, HepG2, and used a different, higher-coverage Hi-C dataset for HeLa cells. We believe these newly curated datasets will be quite useful for thoroughly evaluating genomic models' performance in predicting cell type-specific contact maps, and we will generate more datasets upon request to further expand the benchmark. We have updated the manuscript text and tables accordingly (lines 817-822, Tables A10 and A11).
| | HAP1 | HeLa | HepG2 | K562 | Average |
|---|---|---|---|---|---|
| Expert model (Akita) | 0.196 | 0.223 | 0.198 | 0.175 | 0.198 |
| CNN | 0.018 | 0.025 | 0.021 | 0.003 | 0.017 |
| HyenaDNA | -0.062 | 0.103 | 0.094 | 0.065 | 0.049 |
| Caduceus-Ph | 0.063 | 0.168 | 0.178 | 0.003 | 0.103 |
| Caduceus-PS | 0.063 | 0.170 | 0.178 | 0.051 | 0.115 |
This is a well-executed DNA benchmark paper. The reviewers were pretty well-aligned on their scores, in the weak-reject to accept range. Compared to previous work like BEND (published at last year's ICLR), this paper adds five long-range benchmarks.
This work definitely deserves publication. But given its incremental nature, arguably a specialized (bioinformatics) venue is a more obvious choice than ICLR.
Additional Comments from Reviewer Discussion
None.
Reject