PaperHub
Overall rating: 5.0 / 10 (Rejected). 4 reviewers; individual ratings 5, 6, 3, 6 (min 3, max 6, std 1.2).
Confidence: 3.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.0
TL;DR

A benchmark to test the long-range capabilities of genomics language models.

Abstract

Keywords
DNA, Language Models, Genomics, Benchmark

Reviews and Discussion

Official Review
Rating: 5

This paper introduces the Genomics Long-Range Benchmark (LRB), designed to evaluate DNA language models (LMs) on tasks that reflect biologically meaningful long-range interactions. The benchmark includes tasks across variant effect prediction, gene expression prediction, regulatory element detection, and chromatin feature identification.

Strengths

  1. The LRB addresses a critical gap in DNA LM evaluation by focusing on biologically relevant tasks that require long genomic contexts.
  2. The experiments on DNA LMs, including zero-shot and fine-tuning performance across multiple tasks, reveal the strengths and limitations of the models.
  3. Fine-tuning recipes and context-length extension methods provide a robust framework for DNA LM evaluation.

Weaknesses

  • Lack of in-depth analysis of the experiments. Why do certain DNA LMs perform well on specific tasks?
  • Potential unfairness in comparisons. DNABERT and HyenaDNA have significantly fewer parameters compared to NT500M, which may skew results. It would be beneficial to compare models with similar parameter counts where possible.
  • Missing key long-range baseline LMs. The benchmark lacks important long-sequence models such as Caduceus [1] and Evo [2], which would provide a more comprehensive evaluation.
  • Insufficient comparison in context extension experiments. The analysis of TSS distance effects lacks comparisons with other long-sequence models.
  • References on benchmarks. In the first paragraph of the Introduction, the reference to ProteinGym [3] should have the publication year 2023 instead of 2024. Additionally, including relevant benchmarks like GenBench [4] and BEACON [5] would improve the coverage of related literature.

[1] Caduceus: Bi-directional equivariant long-range dna sequence modeling

[2] Sequence modeling and design from molecular to genome scale with Evo

[3] Proteingym: Large-scale benchmarks for protein fitness prediction and design

[4] GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

[5] BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

Questions

See Weaknesses

Comment

We thank the reviewers for their comments and suggestions and for recognizing the utility of our benchmark and analyses. We address the concerns and questions in detail below.


Concern 1: Why do certain DNA LMs perform well on specific tasks?

The models we test vary along several axes, including context length, parameter count, length of pre-training, architecture, and tokenization. While teasing apart the effect of each of these factors would fall outside the scope of this work, some of the factors we believe are driving the results are model capacity (e.g. the NTv2 with 500M parameters performs best among all the models within the NTv2 family) and the scope of pre-training. DNABERT and HyenaDNA were both pre-trained for 200-260B tokens, whereas NTv2 was pre-trained for 900B tokens. This disparity is made even larger by the fact that NTv2 uses k-mer tokenization. Overall we find that NTv2 is the best performing LM (even looking at the smaller model sizes for NTv2 in Tables 11 and 12). The scope of the pre-training is potentially a significant driver of this.


Concern 2: Size mismatch (e.g., between NTv2 and other models) provides a potentially unfair comparison

For each family of DNA LMs (i.e., DNABERT, NTv2, and HyenaDNA), we evaluated several variants of each model. The full results are presented in Tables 11 and 12 in the appendix. Due to space constraints, in the versions of these tables in the main text, we only report the best model from within each family. Although the largest NTv2 model is significantly bigger than some of the other models in the benchmark, our goal was to provide a robust accounting of important works in this field, and so we used the available models that have been pre-trained and made available to the community. Note that we did not re-pre-train any of these models, but rather only used existing published weights.


Concern 3: Caduceus and Evo need to be added to evaluations.

This is a great suggestion. We are actively working on getting these results for Caduceus (we provide the initial results below and will update here once the full set is available). We also plan to add this model to the TSS context analysis in Figure 2. For the Evo model, given it was pre-trained on prokaryotic and phage genomic sequences and is a substantially larger model than any of the ones we have run for the current benchmark, we have restricted results to the zero-shot prediction tasks.

Pre-trained Caduceus results

| Task | Caduceus (7.7M params, 131k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|
| Causal eQTL - Zero-shot (AUROC) | 0.49 | 0.50 | 0.51 | 0.51 |
| Causal eQTL - Fine-tune (AUROC) | 0.681 | 0.73 | 0.74 | 0.71 |
| Pathogenic ClinVar - Zero-shot (AUROC) | 0.52 | 0.50 | 0.68 | 0.49 |
| Pathogenic OMIM - Zero-shot (AUPRC) | 0.002 | 0.002 | 0.003 | 0.002 |
| Bulk RNA (R²) | 0.52 | 0.51 | 0.60 | 0.46 |
| Promoters (AUPRC) | 0.75 | 0.71 | 0.79 | 0.67 |

Evo results

| Task | Evo (6.5B params, 6.5k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|
| Causal eQTL - Zero-shot (AUROC) | 0.50 | 0.50 | 0.51 | 0.51 |
| Pathogenic ClinVar - Zero-shot (AUROC) | 0.529 | 0.50 | 0.68 | 0.49 |
Comment

Concern 4: Include relevant benchmarks like GenBench [4] and BEACON [5] to improve the coverage of related literature.

Thank you for this suggestion. Below we include a discussion of GenBench and BEACON that we will add to our revised manuscript:

GenBench: This suite is composed of 43 different datasets split between “short” and “long” range tasks, where long-range tasks are defined by having a sequence length of greater than 1000 base pairs. The tasks in GenBench, spanning multiple species, are primarily binary, sequence level classification tasks but also include multi-class classification and regression tasks. The authors evaluate six different genomic language models covering both attention and convolution-based architectures. While GenBench provides a comprehensive evaluation, it lacks critical tasks like variant effect prediction in non-coding regions and zero-shot evaluations. It also omits comparisons to long-context models like Enformer and is limited in its evaluation of long-range tasks, with the longest sequence length capped at 30,000 base pairs.

BEACON: This benchmark introduces the first unified evaluation framework for RNA modeling, encompassing 13 tasks across structural analysis, functional studies, and engineering applications. It evaluates 29 models, ranging from pre-trained RNA language models to naive supervised models, and examines the influence of tokenization strategies and positional embeddings on performance. While BEACON is a valuable resource for assessing RNA-focused models, its scope is distinct from genomic benchmarks, as it targets RNA-specific tasks rather than genomic applications like regulatory element prediction, variant effect prediction, or gene expression prediction.


Concern 5: Typo in ProteinGym citation

Thank you for catching this. We’ve corrected it in our manuscript.


References:

Liu, Zicheng, et al. "GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models." arXiv preprint arXiv:2406.01627 (2024).

Notin, Pascal, et al. "Proteingym: Large-scale benchmarks for protein fitness prediction and design." Advances in Neural Information Processing Systems 36 (2024).

Nguyen, Eric, et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386.6723 (2024): eado9336.

Ren, Yuchen, et al. "BEACON: Benchmark for Comprehensive RNA Tasks and Language Models." arXiv preprint arXiv:2406.10391 (2024).

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

Comment

Thank you for the authors' response, which has addressed some of my concerns. However, I am still interested in more detailed analyses regarding why certain DNA LMs perform well on specific tasks. Additionally, I would like to see the confidence of the experimental results; this could be addressed by providing mean and variance values to ensure a more reliable performance measure.

Comment

We clarify that all of the results presented in our original manuscript include mean ± standard deviation for test set performance from 5 separate fine-tuning runs. For each run, we use a different validation chromosome, perform early stopping on the validation loss, and then compute performance on the test set. Given the limited timeframe of the rebuttal period, the newly reported results have not yet had 5 runs collected, but we are actively working on doing so. For our camera-ready revision, we will also report mean ± standard deviation for the new experiments conducted during the rebuttal period (using the same protocol described above).

Comment

We identify a number of specific factors affecting the performance of DNA LMs across tasks: model input context length, model size, training data (including data quality and number of tokens), and the expressivity of the architecture. We provide analysis and supporting experiments for each factor below, drawing from both our work as well as results reported in the literature.

Context Length

Context length generally improves DNA LM performance. For example, gene expression can be regulated by regions up to 100K bp away: accurately predicting gene expression requires the model to process inputs containing both the gene and its regulatory regions, i.e. a DNA input of up to 100K bp. Similar arguments can be made for variant classification tasks (e.g., identifying eQTLs).

Below, we report experiments showing that across architectures, increasing the context length improves performance on multiple tasks.

| Model | Input length (bp) | Causal eQTL - Fine-tune (AUROC) | Bulk RNA (R²) | CAGE (R²) |
|---|---|---|---|---|
| CNN (12M) | 2k | 0.709 | 0.470 | 0.051 |
| CNN (12M) | 32k | 0.704 | 0.461 | 0.091 |
| CNN (12M) | 65k | 0.713 | 0.466 | 0.120 |
| Caduceus (3.3M) | 2k | 0.674 | 0.506 | 0.086 |
| Caduceus (3.3M) | 32k | Running | 0.540 | 0.079 |
| Caduceus (3.3M) | 65k | Running | 0.542 | 0.100 |

We have seen evidence for this claim in other works as well. For example, in Figure 4 of the Caduceus paper (Schiff et al. 2024), we see that a Caduceus model with a larger context size of 131k bp, despite being orders of magnitude smaller, outperforms a much larger NTv2 model on a version of the eQTL variant effect expression task. Numbers from that figure are reproduced below:

| Model | Input Size (bp) | AUROC for SNPs with distance to nearest TSS > 100k bp |
|---|---|---|
| NTv2 | 12k | 0.540 |
| Caduceus | 131k | 0.586 |

Model Size

As a rule of thumb, all else equal, larger models yield improved performance. Larger models are more expressive and fit the data better. Empirically, a better data fit also yields representations that perform well on downstream tasks. More formally, a good data fit means the model more accurately identifies conserved vs. non-conserved regions, a useful predictive feature that improves downstream performance. This is especially true for tasks where sequence conservation is an important feature, e.g., variant effect prediction.

In our benchmark, the NTv2 family of models generally performed best. Below we examine more closely how performance varies across model sizes for this family, comparing the 50M to 500M models (these numbers are taken from Tables 12 and 13 of our manuscript), where we see the larger model consistently outperform the smaller one on the downstream tasks:

| Model | Causal eQTL (zero-shot; AUROC) | Causal eQTL (fine-tune; AUROC) | Pathogenic ClinVar (zero-shot; AUROC) | Pathogenic ClinVar (fine-tune; AUROC) | Pathogenic OMIM (zero-shot; AUPRC) | BulkRNA (R²) | CAGE (R²) | Promoter (AUPRC) | Enhancer (AUROC) | Histone Marks (AUPRC) | DNA Accessibility (AUPRC) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NTv2 50M | 0.72 ± 0.005 | 0.51 | 0.75 ± 0.008 | 0.53 | 0.002 | 0.52 ± 0.074 | 0.35 ± 0.030 | 0.75 ± 0.008 | 0.78 ± 0.041 | 0.34 ± 0.007 | 0.18 ± 0.005 |
| NTv2 500M | 0.72 ± 0.003 | 0.51 | 0.78 ± 0.009 | 0.68 | 0.003 | 0.60 ± 0.038 | 0.39 ± 0.011 | 0.79 ± 0.006 | 0.82 ± 0.002 | 0.38 ± 0.003 | 0.3 ± 0.007 |
Comment

Expressivity of the Architecture

For a fixed model size and training budget, certain DNA LM architectures represent a more expressive hypothesis class or have better inductive biases, which improves performance. For example, attention can express more complex functional relationships than convolutions, resulting in improved performance on both language and genomics tasks. Other architectures incorporate inductive biases that would otherwise have to be learned from data, such as reverse complement (RC) equivariance, which improves performance and data efficiency.

Below, we report performance across convolutional, Hyena, and Mamba architectures, controlling for model size and context length. The more expressive Hyena and Mamba architectures (variants of RNNs) outperform simpler convolutions. Caduceus models further add bi-directionality and RC equivariance; each step further improves performance across multiple tasks (results taken from Table 1 of Schiff et al. 2024; best values bolded, second best italicized):

| Task | CNN (264k) | HyenaDNA (436k) | Mamba (468k) | Caduceus w/o Equiv. (470k) | Caduceus-Ph (470k) | Caduceus-PS (470k) |
|---|---|---|---|---|---|---|
| Mouse Enhancers | 0.715 ± 0.087 | *0.780 ± 0.025* | 0.743 ± 0.054 | 0.770 ± 0.058 | 0.754 ± 0.074 | **0.793 ± 0.058** |
| Coding vs. Intergenomic | 0.892 ± 0.008 | 0.904 ± 0.005 | 0.904 ± 0.004 | 0.908 ± 0.003 | **0.915 ± 0.003** | *0.910 ± 0.003* |
| Human vs. Worm | 0.942 ± 0.002 | 0.964 ± 0.002 | 0.967 ± 0.002 | *0.970 ± 0.003* | **0.973 ± 0.001** | 0.968 ± 0.002 |
| Human Enhancers Cohn | 0.702 ± 0.021 | 0.729 ± 0.014 | 0.732 ± 0.029 | 0.741 ± 0.008 | **0.747 ± 0.004** | *0.745 ± 0.007* |
| Human Enhancer Ensembl | 0.744 ± 0.122 | 0.849 ± 0.006 | 0.862 ± 0.008 | 0.883 ± 0.002 | *0.893 ± 0.008* | **0.900 ± 0.006** |
| Human Regulatory | *0.872 ± 0.005* | 0.869 ± 0.012 | 0.814 ± 0.211 | 0.871 ± 0.007 | *0.872 ± 0.011* | **0.873 ± 0.007** |
| Human OCR Ensembl | 0.698 ± 0.013 | 0.783 ± 0.007 | 0.815 ± 0.002 | *0.818 ± 0.003* | **0.828 ± 0.006** | *0.818 ± 0.006* |
| Human NonTATA Promoters | 0.861 ± 0.009 | 0.944 ± 0.002 | 0.933 ± 0.007 | 0.933 ± 0.006 | **0.946 ± 0.007** | *0.945 ± 0.010* |

Training Data: Number of Tokens

For a fixed model and architecture, training the model longer on more tokens typically improves performance. This is primarily because longer training further minimizes the loss and improves the data fit. A better fit to the data yields better representations for reasons described above. Typically overfitting is not a problem as the pre-training datasets of most models are often so large that seeing each token more than 3-4 times is computationally intractable. For example, Figure 3 in Caduceus (Schiff et al. 2024) and Figure S4 in Evo (Nguyen et al. 2024) show that models’ pre-training loss on the test set continues to decrease even as training progresses.

Training Data: Quality

The quality of the training data can matter even more than its quantity. This is especially true for DNA LMs, given that a large percentage of many genomes consists of repetitive regions that have little or no functional role. Training on these regions may be at best a waste of computation, and at worst may bias the model towards certain less important repetitive regions of the genome at the expense of others. In both protein language models and natural language applications, sequence deduplication is a key step. We think that in DNA LMs, data pre-processing will be at least as important.

While many DNA LMs (e.g., the Hyena and NT families) train on all genomic data, recent work (e.g., GPN; Benegas et al. 2023) has reported significant performance improvements from sampling different genomic regions with different frequencies while keeping the training budget constant. Similarly, the recent PlantCaduceus models (Zhai et al. 2024) observed significant performance improvements from training on a larger diversity of plant genomes. We think genomic data curation is an under-explored area that will significantly impact DNA LMs.

Comment

Fine-tuning Algorithms

Once a model is trained, it has to be used on downstream tasks, oftentimes by being fine-tuned on a small labeled dataset using gradient descent. Most works fix the output embeddings from the model and train a simple classifier on top of them. Alternatively, we may fine-tune the weights of the full model on the downstream task. The latter approach yields a significantly more expressive hypothesis class for the supervised problem and, in our observations, improves performance.
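
To make the distinction concrete, below is a minimal PyTorch sketch contrasting the two strategies: freezing the backbone and training only a linear head versus fine-tuning all weights. The backbone and all names here are illustrative stand-ins, not the actual models or training code used in our benchmark.

```python
import torch
from torch import nn

class DummyBackbone(nn.Module):
    """Stand-in for a pre-trained DNA LM encoder (illustrative only)."""
    def __init__(self, vocab_size=5, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, tokens):            # (batch, length) -> (batch, length, hidden_dim)
        return self.embed(tokens)

class ClassificationHead(nn.Module):
    """Linear probe on top of a backbone; optionally fine-tunes the backbone too."""
    def __init__(self, backbone, hidden_dim, n_classes, full_finetune=False):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(hidden_dim, n_classes)
        if not full_finetune:
            # Frozen-embedding setup: only the linear head receives gradients.
            for p in self.backbone.parameters():
                p.requires_grad = False

    def forward(self, tokens):
        pooled = self.backbone(tokens).mean(dim=1)   # mean-pool over sequence positions
        return self.classifier(pooled)

# Full fine-tuning leaves every parameter trainable; the frozen variant trains only the head.
model = ClassificationHead(DummyBackbone(), hidden_dim=128, n_classes=2, full_finetune=True)
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
logits = model(torch.randint(0, 5, (4, 512)))        # toy batch: 4 sequences of length 512
```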

Below, we report experiments demonstrating that across most tasks and models, full fine-tuning yields significantly better results than freezing embeddings. We attribute the few cases where it does not to the susceptibility of the full model to overfitting. We report the delta between full fine-tuning and freezing the backbone embeddings (results reproduced from Table 15 of our work):

| Model | Causal eQTL (AUROC) | Pathogenic ClinVar (AUROC) | Bulk RNA (R²) | CAGE (R²) | Promoter (AUPRC) | Enhancer (AUROC) | Histone Marks (AUPRC) | DNA Accessibility (AUPRC) |
|---|---|---|---|---|---|---|---|---|
| NTv2 50M | +1.13 | +9.30 | +30.23 | +71.60 | +1.93 | -2.05 | +32.03 | +33.43 |
| NTv2 100M | +0.98 | +6.24 | +13.70 | +27.72 | +2.16 | +2.83 | +32.70 | +40.54 |
| NTv2 250M | +0.36 | +3.57 | +21.70 | +40.41 | +2.07 | +3.71 | +31.01 | +54.44 |
| NTv2 500M | +0.49 | +4.27 | +24.45 | +42.14 | -1.45 | +0.90 | +22.46 | +47.96 |
| HyenaDNA 1K | +0.95 | +15.39 | +16.50 | +45.22 | +7.13 | +4.68 | +23.61 | +22.65 |
| HyenaDNA 16K | +0.21 | +22.81 | +75.53 | +133.52 | +6.19 | -1.10 | +42.83 | -9.62 |
| HyenaDNA 32K | +0.35 | +11.58 | +82.46 | +102.91 | -18.21 | -6.02 | +14.43 | -22.67 |
Comment

Training Hyper-Parameters

Implementing training and fine-tuning also involves optimizing the model using gradient descent. This procedure is sensitive to hyper-parameters such as batch size and learning rate. As shown below, we observed that tuning these parameters has non-trivial effects on model performance and requires careful consideration when applying a DNA LM.

| Model | LR | Batch size | Causal eQTL (AUROC) | Bulk RNA (R²) |
|---|---|---|---|---|
| NTv2 500M | 1e-5 | 32 | 0.723 ± 0.006 | 0.597 ± 0.050 |
| NTv2 500M | 1e-5 | 64 | 0.722 ± 0.003 | 0.588 ± 0.048 |
| NTv2 500M | 1e-5 | 128 | 0.718 ± 0.010 | 0.596 ± 0.015 |
| NTv2 500M | 3e-5 | 32 | 0.717 ± 0.006 | 0.580 ± 0.079 |
| NTv2 500M | 3e-5 | 64 | 0.717 ± 0.007 | 0.566 ± 0.016 |
| NTv2 500M | 3e-5 | 128 | 0.721 ± 0.006 | 0.585 ± 0.047 |
| DNABERT-2 | 1e-5 | 32 | 0.726 ± 0.005 | 0.483 ± 0.135 |
| DNABERT-2 | 1e-5 | 64 | 0.719 ± 0.008 | 0.503 ± 0.068 |
| DNABERT-2 | 1e-5 | 128 | 0.725 ± 0.002 | 0.484 ± 0.085 |
| DNABERT-2 | 3e-5 | 32 | 0.687 ± 0.067 | 0.480 ± 0.063 |
| DNABERT-2 | 3e-5 | 64 | 0.713 ± 0.016 | 0.507 ± 0.050 |
| DNABERT-2 | 3e-5 | 128 | 0.720 ± 0.005 | 0.501 ± 0.055 |
| HyenaDNA 160K | 1e-5 | 32 | 0.703 ± 0.016 | 0.459 ± 0.010 |
| HyenaDNA 160K | 1e-5 | 64 | 0.708 ± 0.010 | 0.450 ± 0.006 |
| HyenaDNA 160K | 1e-5 | 128 | 0.708 ± 0.012 | 0.439 ± 0.016 |
| HyenaDNA 160K | 3e-5 | 32 | 0.701 ± 0.006 | 0.456 ± 0.018 |
| HyenaDNA 160K | 3e-5 | 64 | 0.699 ± 0.010 | 0.457 ± 0.006 |
| HyenaDNA 160K | 3e-5 | 128 | 0.696 ± 0.011 | 0.445 ± 0.020 |

References

Benegas, Gonzalo, Sanjit Singh Batra, and Yun S. Song. "DNA language models are powerful predictors of genome-wide variant effects." Proceedings of the National Academy of Sciences 120.44 (2023): e2311219120.

Nguyen, Eric, et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386.6723 (2024): eado9336.

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

Zhai, Jingjing, et al. "Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model." bioRxiv (2024).

Official Review
Rating: 6

The authors present a new benchmark for evaluating DNA language models (LMs) with a focus on context size studies. They compiled a set of both long-range and short-range downstream tasks for DNA LMs, including variant effect prediction, gene expression, regulatory element, chromatin feature predictions, etc. The tasks and datasets are well-documented, and the benchmark comes with user-friendly features such as customizable context size downloads and visualization tools. Although this benchmark has potential utility for the field, it lacks some important results and discussions. Therefore, I currently recommend a weak rejection of this paper. I am open to raising my score if these issues are adequately addressed.

Strengths

  1. The benchmark has a clear theme centered on the study of context sizes, which is currently an important topic in the development and application of DNA LMs. This benchmark can be expected to yield valuable insights for the field.
  2. The authors selected a biologically relevant array of downstream tasks.
  3. The paper includes a thorough review of existing benchmarks.
  4. Dataset details are well-documented.
  5. The visualization tool appears useful and well-designed.

Weaknesses

  1. I see the most significant contribution/novelty of this benchmark to be facilitating studies on context size, a point emphasized in both the title and introduction. However, the results section provides only a general and superficial discussion on this topic, and the study of context length is largely restricted to NT. To provide more insight, there should be an in-depth analysis of the impact of context sizes on individual tasks and models. For instance, does each model empirically benefit from longer context lengths, and to what extent? Do certain tasks show a stronger dependence on longer context sizes as expected? There are more detailed evaluation results in Tables 11 and 12 in the appendix, but they lack an interpretation of the data. I would suggest creating some figures/tables to summarize these results and adding more discussion on the impact of context size.
  2. There is a missing discussion regarding alignment-based DNA LMs, which could have different context-size dependencies than single-sequence DNA LMs. Including this aspect is crucial for a complete and accurate narrative. This work builds on the GPN-MSA ClinVar and OMIM benchmarks but strangely does not include (nor even mention) the GPN-MSA model itself. GPN-MSA achieves SOTA on these tasks, performing better than CADD, and therefore better than all the DNA language models considered. It is important to include GPN-MSA in the benchmark (at least for zero-shot evaluations, if the authors deem fine-tuning to be cumbersome), since this could change the narrative in fundamental ways. First, it's not true that DNA language models do worse than CADD. This only applies to single-sequence models. Second, GPN-MSA achieves a good performance even with the smallest context. One interpretation is that with evolutionary context, one doesn’t need as much spatial context. Given that GPN-MSA is alignment-based, it could be reported by itself with a separator line in Table 3. Finally, even if the authors decide to restrict the discussion to single-sequence DNA LMs, it is conventional to compare with the actual SOTA on each task.

Questions

  1. The sections on context length extrapolation read a bit disconnected from the rest of the manuscript. It appears to be an improvement on the NT model rather than directly related to the benchmark. If the authors claim it to be a generally applicable method to other DNA LMs, it should be made clear in the writing and preferably applied to at least another model. If it is for a focal investigation on the impact of context size on NT, the results should be more carefully analyzed and discussed.
  2. In Table 3, Enformer is not a good baseline for the ClinVar task, since this set only contains missense variants and Enformer is a model focused on the non-coding genome.
  3. In Section 3, the authors discussed why several tasks should be considered long-range, but did not discuss why the others are not. It would have been better to also include brief discussions on why those tasks are expected to be performed well with short-range models.
  4. Section 3.1.3 and Line 1042: I’d like to point out that missense VEP is not necessarily a short-context task. Coding variants require a small protein context but a large genomic context, since the coding sequence is distributed across exons (which are very far away due to large introns).
Comment

We thank the reviewer for their detailed feedback and for recognizing the novelty and importance of our work. Below we address the concerns and comments raised.


Concern 1: More analysis of dependence on context length is required.

This is a great suggestion. We have run ablations on the effect of sequence length by training supervised baselines on our “long-range” tasks with varying input context sizes.

  1. CNN baseline (12M param model with residual connections) that is trained in a supervised manner. The CNN is inspired by the one used in GPN (Benegas et al. 2023), with dilation removed.
  2. Caduceus baseline (3.3M param model) that is trained in a supervised manner. This model is trained from scratch on the datasets.

We present initial results below (note that some runs for the Caduceus model have not completed).

| Model | Input length (bp) | Causal eQTL - Fine-tune (AUROC) | Bulk RNA (R²) | CAGE (R²) |
|---|---|---|---|---|
| CNN (12M) | 2k | 0.709 | 0.470 | 0.051 |
| CNN (12M) | 32k | 0.704 | 0.461 | 0.091 |
| CNN (12M) | 65k | 0.713 | 0.466 | 0.120 |
| Caduceus (3.3M) | 2k | 0.674 | 0.506 | 0.086 |
| Caduceus (3.3M) | 32k | Running | 0.540 | 0.079 |
| Caduceus (3.3M) | 65k | Running | 0.542 | 0.100 |

While results are being collected, we do observe a positive association between context size and performance on these hypothesized long-range tasks for both architectures.

Details about the models / experiments: For the CNN, we use an 8-layer convolutional model with skip connections between layers and a hidden dimension of 512. We use an input context of 2,048 base pairs. The same LR and batch size are used as for the DNA LM benchmarking, but since we train from scratch, we train the models for 10-20 epochs depending on the task (as opposed to 1-3 for the DNA LMs). For the Caduceus from-scratch model, we use 8 layers and a hidden dimension of 256 with an input context size of 2,048 base pairs. The LR is set to 1e-4 with a linear warmup of 500 steps. We use the same number of epochs as when training the CNN baseline.
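
For illustration, a rough PyTorch sketch of this kind of residual CNN baseline is shown below. Details such as kernel size, pooling, and the embedding layer are assumptions for the sketch, not the exact configuration used in our experiments.

```python
import torch
from torch import nn

class ResidualConvBlock(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, x):
        # Skip connection around a conv + norm + non-linearity.
        return x + torch.relu(self.norm(self.conv(x)))

class CNNBaseline(nn.Module):
    def __init__(self, n_layers=8, dim=512, n_outputs=1, vocab_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # A, C, G, T, N
        self.blocks = nn.Sequential(*[ResidualConvBlock(dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_outputs)

    def forward(self, tokens):                           # tokens: (batch, length)
        x = self.embed(tokens).transpose(1, 2)           # (batch, dim, length)
        x = self.blocks(x)
        return self.head(x.mean(dim=-1))                 # mean-pool over positions

out = CNNBaseline()(torch.randint(0, 5, (2, 2048)))      # two toy 2,048 bp sequences
```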


Concern 2: Add alignment-based model (namely GPN-MSA) to zero-shot variant effect task

We thank the reviewer for this suggestion as well. We have run GPN-MSA on the zero-shot tasks and are adding this model to our revised manuscript. The results for GPN-MSA are presented below (with CADD as a reference). We see that GPN-MSA outperforms all DNA LMs and is competitive with CADD; a discussion regarding the importance of alignment and GPN-MSA as a useful baseline will be added to our updated manuscript.

| Model | Causal eQTL (zero-shot; AUROC) | Pathogenic ClinVar (zero-shot; AUROC) | Pathogenic OMIM (zero-shot; AUPRC) |
|---|---|---|---|
| DNABERT-2 | 0.50 | 0.50 | 0.002 |
| NTv2 | 0.51 | 0.68 | 0.003 |
| HyenaDNA | 0.51 | 0.49 | 0.002 |
| CADD | 0.56 | 0.97 | 0.25 |
| GPN-MSA | 0.55 | 0.97 | 0.35 |

Concern 3: Enformer is the wrong baseline for ClinVar task

This point is well taken. With the new GPN-MSA results that we will be reporting for the variant effect prediction tasks, we agree that removing Enformer as a baseline for this task is more appropriate, and the combination of CADD and GPN-MSA can serve as strong reference points against which to compare other DNA LMs.


Concern 4: Discussion of why some tasks are “short-range” is missing

This discussion is present in Appendix B. For each task that is deemed “short-range,” we include a discussion, similar to that for the “long-range” tasks in the main text, as to why we hypothesize long-context is less important in these settings. We originally opted to include the short-range discussion in the appendix due to page limit concerns.


Concern 5: Missense VEP should not be categorized as a “short-range” task

The reviewer raises an interesting question / discussion here. We will update our manuscript to reflect this hypothesis. Below we also report supporting evidence for the reviewer's point by examining the effect of context size on the zero-shot performance of two long-range models for this task. We report results for HyenaDNA and Caduceus pre-trained models:

| Model | Input Context (bp) | VEP ClinVar Zero-Shot AUROC |
|---|---|---|
| HyenaDNA | 1k | 0.4918 |
| HyenaDNA | 2k | 0.4920 |
| HyenaDNA | 8k | 0.4920 |
| HyenaDNA | 32k | 0.4916 |
| HyenaDNA | 131k | 0.4949 |
| Caduceus | 1k | 0.5216 |
| Caduceus | 2k | 0.5232 |
| Caduceus | 8k | 0.5267 |
| Caduceus | 32k | 0.5277 |
| Caduceus | 131k | 0.5285 |
Comment

References

Benegas, Gonzalo, Sanjit Singh Batra, and Yun S. Song. "DNA language models are powerful predictors of genome-wide variant effects." Proceedings of the National Academy of Sciences 120.44 (2023): e2311219120.

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

Comment

I thank the authors for the explanation and the added experiments. The authors adequately addressed most of my concerns and therefore I have raised my score to 6. I will consider further adjusting my score upon reading the revised manuscript.

The added experiments in Concern 1 are very informative. I look forward to seeing the full results, and a thorough discussion on context size dependency in the revised manuscript.

One question: what's the main difference between the pathogenic OMIM dataset in this work and the one in the GPN-MSA paper? The performance of CADD appears to be very different.

Comment

We will post the revised manuscript by this evening.

The OMIM dataset in our work is equivalent to the one used in GPN-MSA. There are two reasons for the observed difference in CADD performance between our work and GPN-MSA. The first is that in our work we evaluated CADD using version 1.7, which added new features to improve scores for certain variant effects, as opposed to version 1.6 that was used in GPN-MSA. Secondly, for the sake of inference time, especially when evaluating models with extended context lengths, we report the AUPRC on a subset of the OMIM dataset as outlined in Appendix D.2.2. This subset, along with the complete dataset, can be loaded via our HuggingFace dataset.

Comment

Thanks for the explanation. Could you share this dataset anonymized? I don't find it in the manuscript.

I have concerns about the results presented. The 0.11 AUPRC reported for GPN-MSA matches the value from their version 1 manuscript, where the full negative set was used. However, the results presented here are supposed to use the subsampled negative set. Given the greatly reduced negative set, it is surprising to see the same AUPRC. I would appreciate it if the authors could double-check this.

Comment

We thank the reviewer for their response and for catching this mistake. We have identified the error in the initial AUPRC computation of GPN-MSA on the OMIM subset, where we were indeed not correctly subsampling and were including all the pathogenic variants. The value has been updated (from 0.11 to 0.35). Unfortunately, the current dataset format would break anonymity, but we are investigating how to grant the reviewer access to an anonymized version of the dataset.

Additionally, the updated manuscript has been posted here to OpenReview with all the changes highlighted in blue text. We look forward to any additional feedback and questions from the reviewer.

Official Review
Rating: 3

This paper proposes a new benchmark for evaluating DNA LMs with an emphasis on long-range prediction, unlike other benchmarks which focus on short-sequence tasks (<2k bp). The authors benchmark 4 DNA LMs on 4 task categories in their benchmark (variant effect prediction, gene expression prediction, cis-regulatory element detection, and chromatin feature identification). They compare performance in both zero-shot and full fine-tune settings. The authors also fine-tune NucleotideTransformer with extended context length.

Strengths

  • Benchmarks for DNA language models are currently sparse, and this paper does a good job of curating a diverse set of tasks and benchmarking recent DNA LMs.

Weaknesses

  • A main claim of this paper is that the tasks they curate are long-range tasks, and that they expect that model performance increases with longer context input. The claim that these tasks require long context would be strengthened by an ablation study over different input lengths.

  • The authors should be clear that this is a human-only benchmark, ideally in the title and abstract. This is not mentioned until Section 3, and limits the usefulness of the benchmark as many DNA LMs like Evo are trained primarily on microbial sequences.

Questions

  • Section 4 describes context length extension, specifically using NTK and an attention implementation with sqrt(L) chunks. The latter is not explained in the paper or in the supplement.

  • The authors do not explain how the train/test splits are generated. How is train/test leakage avoided? Do they split by sequence similarity thresholds?

Comment

We thank the reviewer for their detailed feedback and for recognizing the value of the tasks we curate. Below we respond to the concerns and questions that the reviewer raised.


Concern 1: Add length-dependence ablation experiments

This is a great suggestion. In response to this comment, we have run ablations on the effect of sequence length by training supervised baselines on our “long-range” tasks with varying input context sizes.

  1. CNN baseline (12M param model with residual connections) that is trained in a supervised manner. The CNN is inspired by the one used in GPN (Benegas et al. 2023), with dilation removed.
  2. Caduceus baseline (3.3M param model) that is trained in a supervised manner. This model is trained from scratch on the datasets.

We present initial results below (note that some runs for the Caduceus model have not completed).

| Model | Input length (bp) | Causal eQTL - Fine-tune (AUROC) | Bulk RNA (R²) | CAGE (R²) |
|---|---|---|---|---|
| CNN (12M) | 2k | 0.709 | 0.470 | 0.051 |
| CNN (12M) | 32k | 0.704 | 0.461 | 0.091 |
| CNN (12M) | 65k | 0.713 | 0.466 | 0.120 |
| Caduceus (3.3M) | 2k | 0.674 | 0.506 | 0.086 |
| Caduceus (3.3M) | 32k | Running | 0.540 | 0.079 |
| Caduceus (3.3M) | 65k | Running | 0.542 | 0.100 |

While the results are still being collected, we do observe a positive association between context size and performance on these hypothesized long-range tasks for both architectures.

Details about the models / experiments: For the CNN, we use an 8-layer convolutional model with skip connections between layers and a hidden dimension of 512. We use an input context of 2,048 base pairs. The same LR and batch size are used as for the DNA LM benchmarking, but since we train from scratch, we train the models for 10-20 epochs depending on the task (as opposed to 1-3 for the DNA LMs). For the Caduceus from-scratch model, we use 8 layers and a hidden dimension of 256 with an input context size of 2,048 base pairs. The LR is set to 1e-4 with a linear warmup of 500 steps. We use the same number of epochs as when training the CNN baseline.

Concern 2: Emphasize human-centricity of the benchmark

We will rename our paper to be titled “The Human Genomics Long-Range Benchmark: Advancing DNA Language Models” and have added an explicit reference to the tasks being drawn from the human genome to our abstract. The updated abstract reads as follows (changes highlighted in bold):

… Here, we present the **Human** Genomics Long-Range Benchmark (LRB), which focuses on biologically meaningful tasks and supports long-range contexts. We complement our benchmark with fine-tuning recipes that meaningfully improve performance and affect model evaluation. We evaluate DNA LMs across nine compiled **human genome** tasks…

The human genome focus of our benchmark is also already highlighted in the intro and is explicitly marked in Table 1 as well.

Comment

Question 1: How is the sqrt(L) context length extension performed?

We follow the implementation / algorithm from Rabe et al. 2021. Below are more details about this method which we will add to our Appendix:

The algorithm in Rabe et al. 2021 leverages a “lazy softmax” approach where key-value pairs are processed sequentially, maintaining only two vectors in memory: one for the accumulated weighted values and another for the cumulative sum of weights. This method significantly reduces memory usage by avoiding the storage of all pairwise attention scores. To optimize performance on modern hardware accelerators, which rely on parallelization for efficiency, the implementation processes attention in chunks. Rabe et al. (2021) empirically determined that using a chunk size of sqrt(L) strikes a balance between memory savings and computational overhead. Larger chunks increase memory requirements, while smaller chunks can lead to excessive re-computation of activations during the backward pass. Additionally, the implementation is numerically stable and functions as a drop-in replacement for the standard attention module, making it highly practical for tasks requiring extended context lengths.
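
As a concrete illustration of the chunked, numerically stable attention described above, below is a minimal NumPy sketch. It is a simplified single-head version under our own assumptions (no masking, no dropout) and is not the implementation from Rabe et al. (2021) or the one used in our experiments.

```python
import numpy as np

def chunked_attention(q, k, v, chunk_size=None):
    """Single-head attention processed in key/value chunks of size ~sqrt(L).

    Instead of materializing the full L x L score matrix, keep a running maximum,
    a running sum of exp-weights, and a running weighted sum of values ("lazy softmax").
    """
    L, d = k.shape
    if chunk_size is None:
        chunk_size = max(1, int(np.sqrt(L)))
    scale = 1.0 / np.sqrt(d)

    acc_values = np.zeros((q.shape[0], v.shape[1]))    # running sum of exp(score) * value
    acc_weights = np.zeros((q.shape[0], 1))            # running sum of exp(score)
    running_max = np.full((q.shape[0], 1), -np.inf)    # running max score, for stability

    for start in range(0, L, chunk_size):
        k_chunk, v_chunk = k[start:start + chunk_size], v[start:start + chunk_size]
        scores = (q @ k_chunk.T) * scale               # (L_q, chunk)

        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)     # rescale previously accumulated stats
        probs = np.exp(scores - new_max)

        acc_values = acc_values * correction + probs @ v_chunk
        acc_weights = acc_weights * correction + probs.sum(axis=-1, keepdims=True)
        running_max = new_max

    return acc_values / acc_weights

# Sanity check against standard full-matrix attention.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(8)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
reference = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(chunked_attention(q, k, v), reference, atol=1e-6)
```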


Question 2: How are train/test splits generated?

We provide a detailed overview of train/test splits in Appendix B, Table 6. For most tasks in the benchmark, train and test splits are performed by chromosome (following previous works that introduced these datasets, e.g., Enformer for the eQTL variant effect prediction task) to minimize sequence overlap and ensure that performance reflects the models' ability to generalize to unseen genomic regions. The exception is the CAGE task, where the split was done randomly in order to better compare to the results in the original Enformer paper, which followed a similar protocol.
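
As a small illustration of the chromosome-held-out protocol described above, the sketch below splits a table of examples by chromosome. The column name and chromosome choices are hypothetical, not the benchmark's actual schema.

```python
import pandas as pd

def split_by_chromosome(df, val_chroms=("chr22",), test_chroms=("chr20", "chr21")):
    """Hold out whole chromosomes for validation and test (column names are hypothetical)."""
    held_out = set(val_chroms) | set(test_chroms)
    train = df[~df["chrom"].isin(held_out)]
    val = df[df["chrom"].isin(val_chroms)]
    test = df[df["chrom"].isin(test_chroms)]
    return train, val, test

toy = pd.DataFrame({"chrom": ["chr1", "chr20", "chr22", "chr2"], "label": [0, 1, 0, 1]})
train, val, test = split_by_chromosome(toy)
```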


Question 3: Add Caduceus (Schiff et al. 2024) and Evo (Nguyen et al. 2024) models to benchmark

This is a great suggestion. We are actively working on getting these results for Caduceus (we provide the initial results below and will update here once the full set is available). For the Evo model, given it was pre-trained on prokaryotic and phage genomic sequences and is a substantially larger model than any of the ones we have run for the current benchmark, we have restricted results to the zero-shot prediction tasks.

New Baseline results

| Task | CNN (12M params, 2k bp inputs) | Caduceus (from scratch; 3.3M params, 2k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|---|
| Causal eQTL - Fine-tune (AUROC) | 0.71 | 0.674 | 0.72 | 0.72 | - |
| Pathogenic ClinVar - Fine-tune (AUROC) | 0.61 | 0.61 | 0.74 | 0.78 | - |
| Bulk RNA (R²) | 0.47 | 0.51 | 0.51 | 0.60 | - |
| CAGE (R²) | 0.05 | 0.09 | - | 0.39 | - |
| Promoters (AUPRC) | 0.84 | 0.89 | 0.71 | 0.79 | - |
| Enhancer (AUROC) | 0.81 | 0.85 | 0.81 | 0.82 | - |
| Histone Marks (AUPRC) | 0.11 | 0.14 | 0.24 | 0.38 | - |
| DNA Accessibility (AUPRC) | 0.10 | 0.10 | 0.15 | 0.30 | - |

Pre-trained Caduceus results

| Task | Caduceus (7.7M params, 131k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|
| Causal eQTL - Zero-shot (AUROC) | 0.49 | 0.50 | 0.51 | 0.51 |
| Causal eQTL - Fine-tune (AUROC) | 0.681 | 0.73 | 0.74 | 0.71 |
| Pathogenic ClinVar - Zero-shot (AUROC) | 0.52 | 0.50 | 0.68 | 0.49 |
| Pathogenic OMIM - Zero-shot (AUPRC) | 0.002 | 0.002 | 0.003 | 0.002 |
| Bulk RNA (R²) | 0.52 | 0.51 | 0.60 | 0.46 |
| Promoters (AUPRC) | 0.75 | 0.71 | 0.79 | 0.67 |

Evo results

| Task | Evo (6.5B params, 6.5k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|
| Causal eQTL - Zero-shot (AUROC) | 0.50 | 0.50 | 0.51 | 0.51 |
| Pathogenic ClinVar - Zero-shot (AUROC) | 0.529 | 0.50 | 0.68 | 0.49 |

References:

Benegas, Gonzalo, Sanjit Singh Batra, and Yun S. Song. "DNA language models are powerful predictors of genome-wide variant effects." Proceedings of the National Academy of Sciences 120.44 (2023): e2311219120.

Nguyen, Eric, et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386.6723 (2024): eado9336.

Rabe, Markus N., and Charles Staats. "Self-attention does not need O(n²) memory." arXiv preprint arXiv:2112.05682 (2021).

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

Comment

Dear reviewer,

We wanted to follow up on our rebuttal and see if the experiments and answers we provided addressed your concerns. If so, would you be willing to adjust your score? Please let us know if there is any additional clarification or discussion we can provide.

Sincerely, The authors

Comment

Given the new results of Section 5.3 ("Importance of Context lengths for Long-range tasks"), I am not convinced that this primary claim is valid. The only analysis that the authors provide is not sufficient, and it is misleading given that 2 of the 3 long-range tasks perform best with the shortest input length.

“In Table 5, we see a positive association between input context length and performance across both architectures. These findings validate our characterization of these tasks as ‘long-range.’”

The only task that does improve with long context is CAGE, but the authors mention this task has a random train/test split. Random splits should never be performed for biological sequences (for example, see [1]), and we cannot draw any conclusions from this result, as it is likely due to long-range models overfitting on the test set.

Following previous work is not a reason to use poor methodology, and I encourage the authors to use sequence similarity based splits.

Because of the reasons above, I do not recommend this paper as an accept, and will be keeping my score.

[1] https://github.com/aqlaboratory/proteinnet/blob/master/docs/splitting_methodology.md

Comment

We thank the reviewer for their continued engagement with our work.

Increased context length improves performance on all long-range tasks

On 3 out of 3 long-range tasks, increased context length improves performance. Copied below are the sequence-length ablation results. The only outlier is the Bulk RNA result for the CNN architecture, but Caduceus is clearly able to leverage long context on that task. On every other task, every model improves with longer inputs.

| Model | Input length (bp) | Causal eQTL - Fine-tune (AUROC) | Bulk RNA (R²) | CAGE (R²) |
|---|---|---|---|---|
| CNN (12M) | 2k | 0.709 | 0.470 | 0.051 |
| CNN (12M) | 32k | 0.704 | 0.461 | 0.091 |
| CNN (12M) | 65k | 0.713 | 0.466 | 0.120 |
| Caduceus (3.3M) | 2k | 0.697 | 0.506 | 0.086 |
| Caduceus (3.3M) | 32k | 0.699 | 0.540 | 0.079 |
| Caduceus (3.3M) | 65k | Running | 0.542 | 0.100 |

Our CAGE train/test splits are principled

We apologize for not making this clear above: the train/test splits for CAGE are not done purely randomly; they are split in a principled manner. The Enformer protocol, which we follow, avoids homologous sequences appearing in different splits. First, regions of the genome are grouped by sequence similarity, and then groups of similar sequences are randomly split into train/val/test. The Enformer protocol is as follows:

  • They divide the human and mouse data into 1M bp regions
  • They grouped the 1M bp regions into groups that had >100kb of aligning sequences between them
  • Then they randomly split the homology groups into the training, validation, and test sets
  • We further subset this data to include only human data, using these same similarity-based groups.

Therefore, in our work, homologous groups, not sequences, are randomly assigned across the train/validation/testing splits just as in Enformer.
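
A minimal sketch of this group-level splitting is shown below: homology groups, rather than individual sequences, are shuffled and assigned to splits. The function and data layout are illustrative assumptions, not the Enformer or LRB code.

```python
import random

def split_homology_groups(groups, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign homology groups (not individual regions) to train/valid/test.

    groups: dict mapping group_id -> list of region ids belonging to that group.
    """
    ids = list(groups)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    assignment = {
        "train": ids[:n_train],
        "valid": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
    # Expand back from group ids to the regions they contain.
    return {split: [r for g in gids for r in groups[g]] for split, gids in assignment.items()}

toy_groups = {"g1": ["region_a", "region_b"], "g2": ["region_c"], "g3": ["region_d"], "g4": ["region_e"]}
splits = split_homology_groups(toy_groups)
```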

Additional sequence similarity analysis

We agree with the reviewer that similarity based data leakage is a real concern. Therefore we conducted further analysis to ensure that there are no significantly similar sequences between the training and test sets.

Overall, we find little overlap: We aligned every sequence in the test set to every sequence in the training set using MMSeq to identify any conserved sequences between them. Using a conservative E-value cutoff of 0.01 and considering any quality alignment, the median sequence similarity of sequences in the test set to sequences in the training set is 10%. If we consider only high quality alignments (alignments matching by at least 90%), then the median sequence similarity falls to 2.7%. We will add this sequence alignment analysis to the camera ready revision.

Given that the splitting methodology we originally followed is principled and that we do not identify significant similarity between the sequences in the training and test sets, we believe that the CAGE results are valid and do not represent overfitting.

Evidence from the literature

Finally, we also emphasize that our claims about the long-range nature of these tasks are rooted in the literature. In addition to the biological rationale provided in our paper, we observe a trend on these tasks where models with increasingly long context sizes continuously improve. Looking, for example, at the trajectory from Basenji (Kelley et al. 2018) with 20k bp inputs, to Enformer (Avsec et al. 2021) with 196k bp inputs, to Borzoi (Linder et al. 2023) with 524k bp inputs, we see that for these tasks, as model input length grows, so does performance (although admittedly there are confounding factors, such as architectural and training data/recipe differences across these works). Even within a single model we see the importance of context length on long-range tasks; see for example Extended Data Fig. 5 from the Enformer paper (although again this is somewhat confounded by model size).


References

Avsec, Žiga, et al. "Effective gene expression prediction from sequence by integrating long-range interactions." Nature methods 18.10 (2021): 1196-1203.

Kelley, David R., et al. "Sequential regulatory activity prediction across chromosomes with convolutional neural networks." Genome research 28.5 (2018): 739-750.

Linder, Johannes, et al. "Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation." bioRxiv (2023).

Official Review
Rating: 6

The paper presents the Genomics Long-Range Benchmark (LRB), a new suite of biologically meaningful tasks designed to evaluate DNA language models with a focus on long-range genomic contexts. The authors argue that existing benchmarks are limited by their emphasis on short sequences and sometimes biologically irrelevant tasks. They provide fine-tuning recipes to improve model performance, introduce a visualization tool for detailed analysis, and explore methods for extending the context length of transformer-based models.

Strengths

The paper identifies a significant gap in the evaluation of DNA LMs by focusing on long-range genomic interactions, which are key for understanding complex biological processes.

The LRB includes nine tasks covering variant effect prediction, gene expression prediction, regulatory element detection, and chromatin feature identification. This breadth ensures that models are tested on a variety of biologically relevant tasks.

Allowing users to select arbitrary sequence lengths for each task is very relevant for the field and facilitates the exploration of context length effects on model performance.

The authors demonstrate that full fine-tuning of DNA LMs significantly enhances performance compared to previous methods that froze the backbone model weights.

Exploring techniques to extend the context size of transformer-based models is a valuable contribution, especially given the computational challenges associated with long sequences.

Weaknesses

Although the paper compares DNA LMs to supervised baselines like Enformer and DeepSEA, it could include more recent or diverse models to strengthen the evaluation.

The results are presented with mean and standard deviation across folds, but there's no discussion of statistical significance. Including statistical tests would provide more confidence in the reported improvements.

Questions

I highly encourage the authors to:

  • Report statistical comparisons between metrics.
  • Include more models in the benchmark. A good example is https://arxiv.org/abs/2406.10391

Comment

We thank the reviewer for their feedback and for recognizing the novel aspects of our work. Below we respond to the concerns and questions raised in detail.


Concern 1: Adding more supervised baselines.

We have added two new baselines:

  1. CNN baseline (12M param model with residual connections) that is trained in a supervised manner. The CNN is inspired by the one used in GPN (Benegas et al. 2023), with dilation removed.
  2. Caduceus baseline (3.3M param model) that is trained in a supervised manner. This model is trained from scratch on the datasets.

We present initial results below (note some runs for the Caduceus model have not completed).

New Baseline results

| Task | CNN (12M params, 2k bp inputs) | Caduceus (from scratch; 3.3M params, 2k bp inputs) | DNABERT-2 | NTv2 | HyenaDNA |
|---|---|---|---|---|---|
| Causal eQTL - Fine-tune (AUROC) | 0.71 | 0.674 | 0.72 | 0.72 | - |
| Pathogenic ClinVar - Fine-tune (AUROC) | 0.61 | 0.61 | 0.74 | 0.78 | - |
| Bulk RNA (R²) | 0.47 | 0.51 | 0.51 | 0.60 | - |
| CAGE (R²) | 0.05 | 0.09 | - | 0.39 | - |
| Promoters (AUPRC) | 0.84 | 0.89 | 0.71 | 0.79 | - |
| Enhancer (AUROC) | 0.81 | 0.85 | 0.81 | 0.82 | - |
| Histone Marks (AUPRC) | 0.11 | 0.14 | 0.24 | 0.38 | - |
| DNA Accessibility (AUPRC) | 0.10 | 0.10 | 0.15 | 0.30 | - |

On most tasks, other than the gene expression and chromatin feature tasks, we find that these supervised baselines perform competitively with the strongest-performing models from our benchmark, even outperforming the DNA LMs on the task of promoter identification. These results underscore the fact that DNA LMs are still in the early stages of development and are not yet mature enough to replace traditional supervised methods.

Details about the models / experiments: For the CNN, we use an 8-layer convolutional model with skip connections between layers and a hidden dimension of 512. We use an input context of 2,048 base pairs. The same LR and batch size are used as for the DNA LM benchmarking, but since we train from scratch, we train the models for 10-20 epochs depending on the task (as opposed to 1-3 for the DNA LMs). For the Caduceus from-scratch model, we use 8 layers and a hidden dimension of 256 with an input context size of 2,048 base pairs. The LR is set to 1e-4 with a linear warmup of 500 steps. We use the same number of epochs as when training the CNN baseline.

Concern 2: Adding statistical comparison

To the best of our knowledge, it is standard practice to report mean and standard deviations (e.g., a similar practice is seen in the BEACON reference provided by the reviewer). We are not aware of works that also perform statistical / hypothesis testing to distinguish benchmarked model performance. If the reviewer has any references or specific tests they had in mind, we will do our best to perform these in the coming days.


References

Benegas, Gonzalo, Sanjit Singh Batra, and Yun S. Song. "DNA language models are powerful predictors of genome-wide variant effects." Proceedings of the National Academy of Sciences 120.44 (2023): e2311219120.

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

Comment

Dear authors,

Thank you for including more baselines in the study. I appreciate Caduceus' retraining from scratch. As for the statistical testing, you are absolutely right: it is not standard practice in this field, but it should be, especially moving away from the "bold is best" table reporting. Please consider adding multiple t-tests with corrections in future work, as this would improve the comparison of actual performances.

Comment

We thank the reviewer for engaging in discussion with us.

We follow up on the reviewer's suggestion for more comprehensive statistical testing. For each task, we conduct a Welch's t-test between each pair of models. We control the FDR across the multiple tests by applying the Benjamini-Hochberg correction. Below we summarize some highlights of significant differences (using a p < 0.05 threshold) and attach an example of the underlying table of p-values for one of the tasks. Similar tables are available for each task, and we are happy to share them with the reviewer if that would be of interest. We will include these significance test results in our camera-ready manuscript.
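
For reference, below is a minimal Python sketch of the protocol described above: pairwise Welch's t-tests followed by Benjamini-Hochberg correction. The helper name and the per-run scores in the example are illustrative placeholders, not our actual results.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def pairwise_welch_bh(scores_by_model, alpha=0.05):
    """Pairwise Welch's t-tests with Benjamini-Hochberg correction over all model pairs.

    scores_by_model: dict mapping model name -> list of per-run metrics (e.g., AUROC
    from the 5 fine-tuning runs). Returns (pair, raw p, BH-adjusted p, significant).
    """
    pairs = list(combinations(scores_by_model, 2))
    raw_p = np.array([
        stats.ttest_ind(scores_by_model[a], scores_by_model[b], equal_var=False).pvalue
        for a, b in pairs
    ])

    # Benjamini-Hochberg: adjusted p for the i-th smallest p-value is
    # min over j >= i of p_(j) * m / j, capped at 1.
    m = len(raw_p)
    order = np.argsort(raw_p)
    adjusted = np.empty(m)
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, raw_p[idx] * m / rank)
        adjusted[idx] = running_min

    return [(pairs[i], raw_p[i], adjusted[i], adjusted[i] < alpha) for i in range(m)]

# Toy example with made-up per-run AUROC values (not results from the paper).
example = {
    "Model A": [0.723, 0.719, 0.728, 0.721, 0.725],
    "Model B": [0.726, 0.718, 0.727, 0.724, 0.722],
    "Model C": [0.703, 0.695, 0.710, 0.701, 0.699],
}
for (a, b), p, p_adj, sig in pairwise_welch_bh(example):
    print(f"{a} vs {b}: p={p:.3g}, BH-adjusted p={p_adj:.3g}, significant={sig}")
```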

Significant Differences

ClinVar:

  • NTv2 500M is significantly better than the rest of the evaluated models
  • HyenaDNA 160k is significantly worse than the rest of the models

BulkRNA:

  • The Enformer baseline is significantly better than any DNA LM
  • NTv2 500M is significantly better than DNABERT-2
  • HyenaDNA 160k is significantly worse than either NTv2 Model

CAGE:

  • The Enformer baseline is significantly better than any DNA LM
  • NTv2 is significantly better than any other DNA LM
  • NTv2-500M-ext is significantly better than HyenaDNA LM

Promoter:

  • The Enformer baseline is significantly better than any DNA LM except DNABERT-2
  • NTv2-500M is significantly better than any DNA LM except DNABERT-2

Enhancer

  • The Enformer baseline is significantly better than any DNA LM
  • DNABERT-S is significantly better than any DNA LM except DNABERT-2

Histone:

  • NTv2 500M and NTv2 500M-ext are significantly better than any other DNA LM (except each other)

DNA Accessibility:

  • NTv2 500M and NTv2 500M-ext are significantly better than any other DNA LM (except each other)

Causal eQTL p-values

Below we provide an example of the statistical significance analysis for the Causal eQTL fine-tuning task (similar tables for all the tasks have been computed and used to generate the summary above; we are happy to share those with the reviewer as well and will include them in our camera-ready manuscript). We observe that the difference between the Enformer baseline and the rest of the models is statistically significant; the baseline is better than all DNA LMs. In addition, we see that the NTv2 500M model with 96K context is significantly better than the rest of the models, including the NTv2 500M model with 12K context.

| | DNABERT-2 | DNABERT-S | NTv2 500M | NTv2 500M - Ext | HyenaDNA 160K | Enformer |
|---|---|---|---|---|---|---|
| DNABERT-2 | - | - | - | - | - | - |
| DNABERT-S | 1.13E-01 | - | - | - | - | - |
| NTv2 500M | 1.00E+00 | 6.54E-02 | - | - | - | - |
| NTv2 500M - Ext | 5.19E-03 | 6.58E-02 | 1.20E-04 | - | - | - |
| HyenaDNA 160K | 1.49E-01 | 1.51E-02 | 1.15E-01 | 2.79E-03 | - | - |
| Enformer | 6.07E-04 | 1.78E-03 | 4.98E-07 | 2.17E-04 | 7.08E-04 | - |
Comment

We thank the reviewers for their time and useful comments on our work.

We would like to summarize comments that were common to several reviewers and our high level responses here. For details please refer to the specific responses posted to each reviewer.

Adding more models: 3 new DNA LMs and 2 new supervised baselines

  • We added results for a pre-trained Caduceus model (Schiff et al. 2024) on our benchmark
  • We added Evo (Nguyen et al. 2024) on the zero-shot tasks
  • We added GPN-MSA (Benegas et al. 2024) for the zero shot-tasks
  • We added two supervised baselines to all fine-tuning tasks: a CNN model and a Caduceus model trained from scratch

Probing the effect of context length:

We added an experiment that examined the effect of context length for the long-range tasks in our benchmark. We found that for both of the new supervised training baselines, there is a strong positive association between context length and performance, validating our hypothesis about the importance of long-range modeling for these tasks.

Other improvements include:

  • Adding statistical significance tests, per Reviewer 1nci’s suggestion.
  • Adding an extensive discussion about the drivers of varying DNA LM performance on downstream tasks, per Reviewer 39ce’s suggestion.

References

Benegas, Gonzalo, et al. "GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction." bioRxiv (2023).

Nguyen, Eric, et al. "Sequence modeling and design from molecular to genome scale with Evo." Science 386.6723 (2024): eado9336.

Schiff, Yair, et al. "Caduceus: Bi-directional equivariant long-range dna sequence modeling." arXiv preprint arXiv:2403.03234 (2024).

AC Meta-Review

This is a well-executed DNA benchmark paper. The reviewers were mostly well-aligned, with scores in the reject-to-accept range. Compared to previous work like BEND, published at last year's ICLR, this paper adds five long-range benchmarks.

This work definitely deserves publication, but given its incremental nature, a specialized (bioinformatics) venue is arguably a more obvious choice than ICLR.

Additional Comments from Reviewer Discussion

None.

Final Decision

Reject