SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model

ICML 2025 · Poster · 4 reviewers · overall rating 5.5/10
Individual ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
Submitted: 2025-01-22 · Updated: 2025-07-24

Keywords

DNA foundation models, mixture of experts, genomic profile prediction, DNA, biology

Reviews and Discussion

Review
Rating: 3

The paper challenges the recent paradigm of self-supervised genomic language models (gLMs), which are trained on the DNA sequence alone, and proposes the supervised genomic profile prediction (GPP) task as a more effective pre-training method. The motivation stems from the number and complexity of factors (epigenetic modifications, chromatin accessibility patterns, etc.) that, in addition to the raw sequence, determine the function of DNA. To overcome the limitations of current genomic profile prediction models (GPPMs) in learning the relationship between DNA sequences across species and genomic profiles, the authors implement Species-Profile Adaptive Collaborative Experts (SPACE), which extends Enformer by prepending a species embedding token and adding two Mixture-of-Experts (MoE) modules: (1) a Transformer encoder whose FFN component is replaced by an MoE layer with a shared pool of experts and a per-species gating function, allowing species-aware weighting of expert contributions, and (2) an MoE-based decoder for the genomic profiles. MoE training follows standard practice, with noisy top-k selection to ensure sparsity. In addition, a mutual information loss is used, which encapsulates load balancing between experts (to avoid collapse) and species diversity in the training data, while encouraging species-aware preference for a specific expert subset. SPACE is evaluated on two benchmarks that are commonly used in the gLM literature (GUE and Nucleotide Transformer's downstream tasks) and compared to three gLM baselines (HyenaDNA, DNABERT2, and Nucleotide Transformer with its variants) and a popular GPPM baseline (Enformer).
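
For concreteness, below is a minimal PyTorch sketch of the kind of species-gated MoE layer with noisy top-k routing that this summary describes. The class name, expert widths, and noise scale are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeciesMoE(nn.Module):
    """Sketch: a shared pool of expert FFNs with one gating matrix per
    species, plus noisy top-k selection for sparsity. Illustrative only."""

    def __init__(self, dim: int, num_experts: int, num_species: int,
                 k: int = 2, noise_std: float = 1.0):
        super().__init__()
        self.k, self.noise_std = k, noise_std
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # Per-species gating matrices over the shared expert pool.
        self.gate = nn.Parameter(torch.randn(num_species, dim, num_experts) * 0.02)

    def forward(self, x: torch.Tensor, species_id: int) -> torch.Tensor:
        # x: (batch, seq_len, dim); one species per batch for simplicity.
        logits = torch.einsum("bld,de->ble", x, self.gate[species_id])
        if self.training:
            # Noisy gating encourages exploration before the top-k cut.
            logits = logits + torch.randn_like(logits) * self.noise_std
        top_val, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_val, dim=-1)  # renormalise over the k chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Total routing weight this expert receives at each position (0 if unchosen).
            w_e = (weights * (top_idx == e)).sum(dim=-1, keepdim=True)
            out = out + w_e * expert(x)  # dense for clarity; real MoEs dispatch sparsely
        return out
```

The mutual information loss mentioned above would then be computed from the gating probabilities, rewarding balanced expert load overall while letting each species concentrate on its own expert subset.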

Update after rebuttal

The authors have addressed my questions. I thus keep my initial rating and lean to accept this paper.

Questions for Authors

  1. Have you tried simply increasing the capacity of Enformer by adding a decoder module and a species embedding (instead of MoEs)?

  2. Do you have an intuition for why SPACE outperforms the selected gLMs on some tasks and less so on others?

Claims and Evidence

The authors present two main claims:

  • Supervised GPP is more effective for pre-training than self-supervised gLMs trained on raw DNA sequences
  • An MoE-based GPPM architecture mitigates the inferior performance of previous models

While the motivation for challenging the current sequence-only self-supervised gLM paradigm makes sense, there are several issues that need to be addressed:

  • While SPACE consistently outperforms the selected baselines on chromatin profile and regulatory element prediction tasks (Table 3), it shows inferior performance on splicing tasks and some prediction tasks in the GUE dataset (Table 3; Tables 9-10), where Nucleotide Transformer (NT) and DNABERT2 perform better. The authors suggest the model capacity of NT as a possible reason for its better performance in some cases, but DNABERT2 is on the same order of magnitude as SPACE and employs a simple Transformer encoder architecture.
  • In addition, the comparison to gLMs is missing recent strong baselines, including GPN-MSA, Evo, and Caduceus.

The MoE modules are, to the best of my knowledge, a novel extension to GPPMs (and gLMs), and the authors also present results on emerging expert capabilities (species-specific and cross-species). However, it is not completely clear whether the improvement is gained simply from additional model capacity or from the MoE addition. This is important to understand, as MoE training at scale can introduce complexities due to load balancing and bandwidth bottlenecks.

Methods and Evaluation Criteria

The authors employ well-accepted criteria on common benchmarks for evaluating gLMs.

Theoretical Claims

The paper does not contain any theoretical proofs.

Experimental Design and Analysis

The paper employs standard and well-accepted experimental design and analysis for gLMs, as well as an ablation study that examines the contribution of the main architectural components (Table 3 in the main text, Section E in the supplementary). The Results section is well structured and is supplemented with additional results in the appendix of the paper.

Supplementary Material

I have reviewed the supplementary material, giving special attention to sections B-E.

Relation to Existing Literature

The paper makes two main contributions:

  • Pointing out the difference between natural language, where the sequence typically includes all the information needed to determine its function, and DNA, where additional factors collaboratively determine the phenotype, thereby challenging the paradigm of many gLMs, which are trained on raw DNA sequences without additional information using self-supervised tasks (next- or masked-token prediction).
  • Proposing supervised GPPMs as an alternative to self-supervised gLMs and extending a current popular baseline with an MoE architecture, which outperforms several gLM baselines.

Missing Important References

The comparative analysis in the paper is missing some recent competitive gLM baselines, including:

  • Caduceus [1]
  • Grover [2]
  • Evo [3]
  • GPN-MSA [4]

[1] Schiff, Y., et al. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. ICML 2024.

[2] Sanabria, M., et al. DNA language model GROVER learns sequence context in the human genome. Nat Mach Intell 6, 911–923 (2024). https://doi.org/10.1038/s42256-024-00872-0

[3] Nguyen, E., et al. Sequence modeling and design from molecular to genome scale with Evo. Science (2024). DOI: 10.1126/science.ado9336

[4] Benegas, G., et al. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv (2023).

Other Strengths and Weaknesses

Strengths:

  • Presenting an interesting discussion, which challenges the popular paradigm of self-supervised gLMs.
  • Introducing a novel MoE-based GPPM, which outperforms several gLMs on commonly used benchmarks.
  • The paper is well written and well structured.

Weaknesses:

  • Leading gLMs are missing from the comparison, and the model does not consistently outperform the selected baselines, which weakens the main claim of the paper.
  • The key claim does not address the challenge of collecting the supervised signal. While using the raw sequence alone has its disadvantages, it is readily available, and related information such as conservation metrics has been shown to greatly improve performance (GPN-MSA).
  • The exact contribution of MoE (as opposed to a simple addition of capacity) is not completely clear.

Other Comments or Suggestions

  1. Perhaps it is worth considering a “softer” claim, which points to the importance of adding information beyond the raw sequence (either as a prior or as a supervision signal) as a means to improve current genomic language models. This is supported by the results but does not necessitate clear and consistent superiority over self-supervised gLMs. In addition, it is worth discussing challenges such as modelling long-range interactions.

  2. It is worth adding a comparison, or at the very least references, to recent gLMs (see previous comments).

  3. Minor: the running title is incorrect (I think it still uses the default template string).

Author Response

Thank you for your thoughtful review. Additional experiments are presented at https://anonymous.4open.science/r/charts-CDE6. We respond to your concerns below.

Q1: Performance on Splicing Tasks and the GUE Dataset

We clarify two key points:

  • SPACE outperforms DNABERT2 across all splicing benchmarks (see Table 7 and Table 10). While SPACE does not yet surpass Nucleotide Transformer (NT) in absolute terms, it achieves substantial gains over Enformer:
    • GUE splice task: 0.8748 (SPACE) vs. 0.8593 (DNABERT2), a 0.0593 improvement relative to Enformer
    • NT splice tasks: average 0.917 (SPACE) vs. 0.851 (DNABERT2), a 0.161 average improvement relative to Enformer
  • Our analysis shows that SPACE underperforms DNABERT2 only on a subset of Epigenetic Marks Prediction tasks in the GUE benchmark. This can be attributed to SPACE encountering a novel species (yeast) in these tasks. Notably, SPACE demonstrates significant improvements over Enformer under the same conditions, which provides strong evidence for its generalization capability.

Q2: Comparison to gLM Baselines

  • Grover: We directly referenced Grover's results on the GUE dataset. As Grover used only human-annotated datasets, we limited our comparisons accordingly (Section 1.1 of the supplementary materials). Our results demonstrate significant outperformance over Grover across all these tasks. Furthermore, we fine-tuned Grover using the official hyperparameters on the revised NT downstream tasks (Section 1.2 of the supplementary materials).
  • Caduceus: All baseline comparisons were derived from Caduceus' reported performance on Genomic Benchmarks (as shown in their paper). Notably, Caduceus did not evaluate the Drosophila task; consequently, the CNN baseline results were extracted from Genomic Benchmarks (shown in Supplementary Materials Section 1.3).
  • GPN-MSA: We compared SPACE against GPN-MSA on mutation effect prediction tasks within the BEND benchmark. While SPACE underperforms GPN-MSA, we emphasize that SPACE uses only reference sequences during inference, aligning with standard foundation model evaluation protocols. In contrast, GPN-MSA requires multiple sequence alignments (MSAs) as input, introducing an unfair advantage for this baseline.
  • Evo: While we acknowledge Evo's contributions, integrating its results during the rebuttal faced challenges: (1) Evo's 7B parameters exceed the largest baseline's capacity (NT-2.5B), and current resources cannot support timely replication; (2) no official benchmark data exists for Evo; (3) Evo focuses on generation tasks and lacks results on mainstream benchmarks.

Q3: Data Availability Challenges

We acknowledge that raw sequence data is more accessible than genomic profiles. However: (1) Multi-omic data is increasingly available, and we've demonstrated SPACE's advantages using only limited data, suggesting future improvements as these resources expand; (2) While profile data is needed during pre-training, downstream applications require only sequence input, since profile understanding is embedded in the model parameters, unlike GPN-MSA, which requires additional MSA input for both pre-training and inference.

Q4: Increasing Capacity via a Decoder and Species Embedding

We appreciate your suggestion. We have incorporated this consideration in our new ablation studies, with the results presented in Section 2 of the supplementary materials. The corresponding analysis is provided in Q3 of Reviewer oB6U.

Q5: Task Performance

Yes, we hypothesize that SPACE demonstrates superior performance on tasks requiring modeling of gene regulatory relationships. However, for purely sequence-based tasks, SPACE may underperform compared to gLMs pretrained on extensive sequence data, as its training data is more limited in this regard.

Q6: Other Comments

We fully agree with the reviewer's perspective and will soften our claims. Our work primarily validates the potential of supervised pre-training for DNA foundation models and has already achieved SOTA results. Both our results and those of GPN-MSA demonstrate that reference-sequence pre-training alone is insufficient. However, we think that integrating extra information as supervision signals is particularly meaningful, as it enables downstream tasks to rely solely on raw sequences while implicitly embedding biological knowledge into model parameters, whereas GPN-MSA requires additional inputs. In the protein LM field, recent work has similarly adopted MSA information as supervision [1], and [2] has demonstrated the advantages of using AlphaFold2's supervised-trained backbone for pLMs. Therefore, we believe incorporating additional information as a supervision signal provides advantages for foundation models.

Detailed discussion on long sequence modeling is in our response to Q5 of Reviewer ceSM.

References:

[1] Evolution-inspired loss functions for protein representation learning.

[2] Exploring evolution-aware & -free protein language models as protein function predictors.

Review
Rating: 3

This paper introduces a novel approach that uses genomic profile prediction models (GPPMs) as foundation models for biological sequence modeling. The authors propose SPACE, which leverages a species Mixture-of-Experts (MoE) and an enhanced category operator to improve predictive performance. They conduct ablation studies to analyze the necessity of individual components and demonstrate that SPACE outperforms alternatives on existing benchmarks.

Questions for Authors

  • What specific biological knowledge is incorporated into the category operator, and how does it influence model decisions?
  • Given the marginal performance differences in Table 3, how do the authors justify the necessity of the species MoE and enhanced components?

Post-Rebuttal: I'm updating the score from Weak Reject to Weak Accept based on the authors' clarifications.

Claims and Evidence

  • Overall, I think this is a solid paper providing new perspectives on using a GPPM as a foundation model.
  • The paper claims that SPACE, with its category operator and species MoE, significantly improves structured prediction performance. However, the ablation studies (Table 3) show only marginal performance differences between the full model and its variants, raising concerns about the necessity of these components. The authors do not provide sufficient discussion on the results of these ablations, making it difficult to assess their importance.
  • The paper states that the category operator decomposes the base prediction into Q profile types based on domain-specific biological knowledge. It is unclear what is meant by "domain-specific biological knowledge" and how it is incorporated.

Methods and Evaluation Criteria

Evaluation is performed using standard benchmarks, but the paper lacks a qualitative analysis of why certain components contribute to performance improvements.

Theoretical Claims

Not applicable.

Experimental Design and Analysis

  • The ablation studies lack interpretability; Table 3 alone is insufficient to justify the necessity of different components.
  • Additional case studies or qualitative insights would enhance understanding of why specific design choices matter.

Supplementary Material

No.

Relation to Existing Literature

The key contributions are very specific to the application, and it would be difficult for the broader scientific community to benefit from this paper.

Missing Important References

Not applicable.

Other Strengths and Weaknesses

Not applicable.

Other Comments or Suggestions

Not applicable.

Author Response

Thank you for your thoughtful review and for highlighting areas where our manuscript can be improved. We have supplemented the paper with additional experiments, presented at https://anonymous.4open.science/r/charts-CDE6. Below, we address your concerns and outline revisions to strengthen the paper.

Q1: Category Operator

The category operator leverages biological priors derived from the Enformer dataset, which groups genomic profiles into four categories based on distinct measurement technologies:

  1. Transcription factor (TF) chromatin immunoprecipitation and sequencing (ChIP–seq)
  2. Histone modification ChIP–seq
  3. DNase-seq or ATAC-seq
  4. CAGE tracks

These categories reflect critical biological distinctions. For example, chromatin accessibility (DNase/ATAC-seq) and transcription initiation (CAGE signals) are functionally linked: chromatin accessibility establishes permissive 3D topological environments necessary for transcription initiation, with transcription start site (TSS)-associated promoters and enhancers overlapping spatially with accessible chromatin domains. This biological rationale guides expert specialization in the decoder—profiles with strong interdependencies (e.g., CAGE and ATAC-seq) share experts, while distinct profiles (e.g., TF ChIP-seq) activate specialized experts.
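
For illustration, the sketch below shows one way such category-level grouping could be realized: gating is indexed by profile category rather than by individual track, so related categories (e.g., CAGE and DNase/ATAC-seq) can route to overlapping experts. The class name, shapes, and the `tracks_per_category` argument are illustrative assumptions, not our released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProfileGroupedDecoder(nn.Module):
    """Sketch of a decoder whose expert routing is indexed by profile
    category rather than by individual track. Illustrative only."""

    CATEGORIES = ("tf_chip", "histone_chip", "dnase_atac", "cage")

    def __init__(self, dim: int, num_experts: int, tracks_per_category: dict):
        super().__init__()
        # Shared pool of decoder experts, visible to every category.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_experts)
        )
        # One gate per category: interdependent categories can learn to
        # co-activate the same experts, distinct ones to use their own.
        self.gates = nn.ModuleDict({c: nn.Linear(dim, num_experts) for c in self.CATEGORIES})
        # One output head per category, sized by its number of tracks.
        self.heads = nn.ModuleDict(
            {c: nn.Linear(dim, n) for c, n in tracks_per_category.items()}
        )

    def forward(self, h: torch.Tensor) -> dict:
        # h: (batch, bins, dim) embedding from the sequence encoder.
        expert_out = torch.stack([e(h) for e in self.experts], dim=-2)  # (B, bins, E, dim)
        preds = {}
        for c in self.CATEGORIES:
            w = F.softmax(self.gates[c](h), dim=-1).unsqueeze(-1)       # (B, bins, E, 1)
            fused = (w * expert_out).sum(dim=-2)                        # category-specific mix
            preds[c] = F.softplus(self.heads[c](fused))                 # non-negative signals
        return preds

# Example with hypothetical track counts:
# decoder = ProfileGroupedDecoder(dim=768, num_experts=8,
#     tracks_per_category={"tf_chip": 2000, "histone_chip": 1000,
#                          "dnase_atac": 700, "cage": 600})
```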

Q2: Necessity of the Components

The evaluation of DNA language models (LMs) typically exhibits substantial variance, and no existing DNA LM achieves absolute dominance across all datasets. For instance, DNABERT2 adopted mean ranking as its primary metric in its publication. We note that while the ablation results in Table 3 of our original paper report averaged outcomes, the complete results in Appendix E (Table 11) demonstrate robust improvements over Enformer across nearly all evaluated tasks. To further validate the architecture, we conducted targeted ablation studies, which demonstrate that:

• The species-aware MoE encoder enables cross-species knowledge transfer.
• The profile-grouped decoder captures category-specific regulatory mechanisms.

These components are biologically grounded and critical for generalizability, particularly in data-scarce scenarios. To further strengthen our claims, we have included and analyzed new ablation results in Q3.

Q3: Ablation Studies

Additional ablation studies are documented in Section 2 of our supplementary materials. Section 2.1 presents comprehensive ablation results on Nucleotide Transformer downstream tasks. All ablation models were pre-trained with their hidden dimension parameters halved, as detailed in Section 4.6 of the main text.

SPACE demonstrates comparable or superior performance to the decoder-removed variant in 14/18 tasks, and still outperforms in 11/18 tasks even when the decoder is replaced by a parameter-matched MLP. Notably, for regulatory element classification tasks, SPACE achieves better results in 4/5 datasets, with the only exception being the TATA box dataset, which primarily examines TATA-box sequence motifs and does not require complex understanding of regulatory mechanisms. This suggests that while our decoder does not explicitly improve direct chromatin profile prediction accuracy, the MoE architecture implicitly captures cross-profile regulatory interactions by modeling their dependencies. This capability provides critical advantages for tasks requiring integrated understanding of multiple profiles, such as regulatory element prediction.

To further validate cross-species generalization, we evaluated ablation variants on the GUE benchmark (yeast and virus tasks, detailed in Section 2.1). Results reveal that the MLP-based decoder variant shows markedly weaker generalization to novel species compared to SPACE with its enhanced decoder architecture.

Q4: Qualitative Analysis and Interpretability

In Section 4.4 of the main text, we analyze expert specialization patterns:

  • Encoder: Two shared experts capture cross-species mechanisms, while one species-specific expert adapts to unique genomic contexts.
  • Decoder: Complex profiles (e.g., TF ChIP-seq) activate dedicated experts, whereas interdependent profiles (e.g., CAGE and ATAC-seq) co-activate shared experts, aligning with their biological relationships.

We agree that deeper case studies could strengthen interpretability and will prioritize this in revisions.

Q5: Broader Impact

We have re-examined the viability of supervised learning paradigms for DNA foundation models and proposed specifically designed optimizations. This approach effectively addresses the core limitation of traditional unsupervised pre-training: inadequate modeling of biological functional associations. We believe these insights could have profound implications for advancing research in genomic language models (gLMs).

Reviewer Comment

Re-posting as a rebuttal comment

Thank you for the detailed and thoughtful response. The supplementary ablation results, particularly those in Appendix E, help clarify the value of the proposed architectural components. I now better appreciate how the species-aware MoE encoder and profile-grouped decoder contribute to improved cross-species generalization and category-specific modeling.

  1. Component Justification: The expanded ablation study and qualitative analysis in Section 4.4 demonstrate that the architectural choices are biologically motivated and result in consistent gains across diverse tasks. While some improvements are modest, they appear meaningful given the complexity of genomic prediction.

  2. Biological Interpretability: The explanation of how expert activation aligns with known biological groupings adds useful context. While more case studies would further strengthen interpretability, I find the current discussion reasonably convincing.

The authors' revisions and clarifications have addressed my key concerns. I am updating my score to weak accept.

Author Comment

Dear Reviewer oB6U,

We sincerely appreciate the insightful and constructive comments, which have significantly enhanced the overall quality of our manuscript. We are particularly grateful for the positive recognition of our ablation studies, as well as the favorable assessment of both the component justification and biological interpretability of our approach.

We confirm that all supplementary experimental results, including additional ablation analyses, will be systematically integrated into the final version of the manuscript. Furthermore, we fully agree that incorporating more comprehensive case studies would substantially improve the interpretability of our findings. Accordingly, we will prioritize this enhancement in our revision process to ensure a more robust and transparent presentation of our results.

Review
Rating: 3

The paper claims that self-supervised pre-training on DNA alone does not provide a good prior for later generalization to downstream tasks. Instead, this paper revisits Genomic Profile Prediction Models (GPPMs), such as Enformer, that are trained directly to predict genomic profiles. The paper proposes a further refinement of the Enformer model, incorporating a species-aware encoder and a profile-grouped decoder. Experimentation shows that the refined version of Enformer surpasses the original in several tasks.

Questions for Authors

Are the authors using an initial pretrained model from Enformer, or are you performing the pre-training yourselves?

Claims and Evidence

The paper claims that the architecture of current GPPMs is not optimal, in the sense that the encoder is shared across different species while the prediction heads are independent.

Methods and Evaluation Criteria

The paper proposes a new GPPM architecture with (1) a species-aware encoder and (2) a profile-grouped decoder, using MoE in both cases.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimentation is performed on two datasets: the NT downstream tasks and the GUE benchmark. In both cases, the model seems to improve on the performance of Enformer and in some cases becomes SoTA. Although the improvement over Enformer is not very significant in most cases, the experimentation is sufficient. Perhaps the most notable improvement is in the underrepresented mouse experiments.

Supplementary Material

Yes, I reviewed the experimentation details.

Relation to Existing Literature

The contribution of this paper might help in designing multi-genome foundation models.

Missing Important References

The paper references are correct and up to date.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Ethics Review Concerns

N/A

Author Response

Thank you for the constructive feedback. We appreciate your acknowledgment of our work's potential contributions to multi-genome foundation models. We have supplemented the paper with additional experiments, presented at https://anonymous.4open.science/r/charts-CDE6. Below, we address your points to clarify and strengthen the manuscript:

1. Response to the Question

Reviewer’s Question:
Are the authors using a pretrained model from Enformer or performing pretraining from scratch?

Response: SPACE is trained from scratch on genomic profile prediction tasks, without initializing weights from Enformer. We sincerely appreciate this insightful inquiry, as it suggests that initializing each expert FFN's weights with Enformer's pretrained parameters could be a promising direction for future exploration.
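
As a minimal sketch of that direction, the hypothetical helper below clones a pretrained FFN into every expert of a generic MoE layer; the function name and the assumption that the layer exposes an `experts` ModuleList are ours, for illustration only.

```python
import copy
import torch.nn as nn

def init_experts_from_pretrained(moe_layer: nn.Module, pretrained_ffn: nn.Module) -> None:
    """Hypothetical helper: copy a pretrained FFN (e.g. from Enformer) into
    every expert of an MoE layer, so routing starts from a shared,
    well-initialised solution rather than from random weights."""
    moe_layer.experts = nn.ModuleList(
        copy.deepcopy(pretrained_ffn) for _ in range(len(moe_layer.experts))
    )
```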

2. Significance of Improvements

Reviewer’s Observation:
Although the improvement over Enformer is not very significant in most cases, the experimentation is sufficient. Perhaps, the most notable improvement is in the underrepresented mouse experiments.

Response:
Our improvements yield significant gains over Enformer in cross-species generalization and new genomic tasks, demonstrating that SPACE learns more robust DNA representations. As detailed in Table 2 of the main text, we systematically quantify SPACE's cross-taxa generalization superiority over Enformer. This is further empirically validated through the Drosophila melanogaster enhancer classification task (Section 1.3 of the supplementary materials). Notably, on splice prediction benchmarks, SPACE demonstrates marked improvements of 0.0593 on the GUE splice dataset and an average of 0.161 across NT splice tasks (Table 7 and Table 10 of the main text). These quantitative enhancements provide rigorous evidence of SPACE's enhanced generalization capacity on both evolutionarily divergent species and functionally distinct genomic tasks.

The strong performance on underrepresented species (e.g., mouse) highlights SPACE’s ability to generalize in data-scarce scenarios—a strength enabled by its species-aware encoder. By decoupling shared and species-specific regulatory mechanisms via MoE, SPACE effectively adapts to species with limited training data.

Review
Rating: 3

The paper introduces SPACE, a supervised DNA foundation model that predicts genomic profiles (e.g., chromatin accessibility) to learn effective DNA sequence representations. The authors argue that unsupervised DNA foundation models (DFMs) lack biological context, leading to suboptimal generalization. SPACE addresses this via a Species-Profile Adaptive Collaborative Experts architecture, which combines a species-aware encoder (using Mixture of Experts, MoE) and a profile-grouped decoder to capture cross-species and cross-profile dependencies.

Generally, key contributions include advocating supervised genomic profile prediction as a superior pre-training objective and proposing a biologically inspired MoE-based architecture.

Questions for Authors

What do you think is important for developing DNA LMs?

Claims and Evidence

The claims:

  1. Supervised pre-training > unsupervised DNA LMs
  2. SPACE’s architecture improves over GPPMs

However,

  1. Since the method uses supervised pre-training, it is very important to evaluate generalization capability. The authors demonstrate that SPACE is better than Enformer on yeast and viral genomes in Tab. 2, but it would be more promising to demonstrate generalization across more species on more tasks.
  2. The newly proposed model architecture is rather complex compared to transformers or state-space models, while the performance improvement is not so significant.

Methods and Evaluation Criteria

Methods: The MoE design for species-specific and shared features is biologically motivated (species-specific regulatory mechanisms). The profile-grouped decoder leverages functional dependencies (e.g., chromatin accessibility ↔ transcription).

Evaluation: Benchmarks (NT, GUE) and metrics (MCC, Pearson correlation) are standard.

However,

  • Benchmarking on GUE and NT only lags behind the development of DNA LMs. For example, BEND [3] develops more biologically important tasks on which many DNA LMs fail.
  • BEND [3] also shows that many DNA LMs cannot outperform simple supervised models trained from scratch using a ResNet or CNN. SPACE is also trained in a supervised manner, so it is important to consider supervised expert models as well.
  • Important baselines like Caduceus [1] and Evo [2] are missing.

[1] Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

[2] Evo: DNA foundation modeling from molecular to genome scale

[3] BEND: Benchmarking DNA Language Models on biologically meaningful tasks

Theoretical Claims

N/A

Experimental Design and Analysis

N/A

Supplementary Material

Reviewed

Relation to Existing Literature

N/A

Missing Important References

As mentioned above, important baselines like Caduceus [1] and Evo [2] are missing.

[1] Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

[2] Evo: DNA foundation modeling from molecular to genome scale

Other Strengths and Weaknesses

  • The novelty of this paper is unclear.
    • If it is demonstrating that supervised pretraining is a better choice for DNA LMs, the performance improvement is not so significant.
    • If it is proposing a new architecture applying MoE to DNA with a cross-species and genomic-profile design for the gate, it is important to focus on comprehensive downstream tasks for benchmarking, rather than simply adopting GUE and NT (see the next point on more benchmarks).

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate the constructive feedback. We have supplemented the paper with additional experiments, presented at https://anonymous.4open.science/r/charts-CDE6. We address each key issue and outline manuscript improvements below.

Q1: Generalization Across Species and Tasks

We concur that evaluating generalization capacity is paramount for supervised DNA foundation models. To comprehensively address this concern, we performed extended benchmarking using the Genomic Benchmarks dataset, which represents the only mainstream benchmark encompassing species beyond those investigated in our previous experiments, including human-or-worm classification and Drosophila enhancer classification. The results are presented in Section 1.3 of the supplementary materials. SPACE achieves robust performance on these tasks, demonstrating cross-species generalizability. While it does not achieve SOTA on the human-or-worm classification, it significantly outperforms Enformer.

Q2: Architectural Complexity vs. Performance Gain

SPACE's architecture is biologically motivated:

  • Species-aware Encoder: Explicitly models shared and species-specific regulatory mechanisms to enable cross-species knowledge transfer.
  • Profile-Grouped Decoder: Captures mutual regulatory influences among distinct genomic profiles.

SPACE consistently improves performance across most tasks while maintaining biological interpretability. Additional ablation studies on module effectiveness are provided in Q3 of Reviewer oB6U.

Q3: Missing Baselines

We acknowledge this limitation and have addressed it in Q2 of Reviewer ueQp.

Q4: Novelty and Contribution Clarity

We consider both aspects as key contributions:

As the first work to rigorously explore supervised pretraining for DNA foundation models, SPACE achieves SOTA over NT on most benchmarks despite using significantly fewer parameters and less training data. This validates the effectiveness of supervised pretraining for genomic tasks. While computational limitations and the scarcity of curated multi-species, multi-omics datasets prevent scaling SPACE to the extremes of NT or Evo, our work highlights supervised pretraining as a promising scale-up direction for future research.

The Species-Aware Encoder and Profile-Grouped Decoder collectively enable substantial improvements over Enformer, particularly in cross-species tasks and new biological applications (e.g., splice site prediction). The detailed experimental results presented in Appendix D consistently demonstrate that our proposed modules achieve superior performance compared to Enformer. The MoE design also enhances interpretability: expert selection frequencies align with foundational biological principles. While we acknowledge the reviewer's concern about limited downstream benchmarks, we have validated SPACE on additional benchmarks, including Genomic Benchmarks and BEND (see Q2 of Reviewer ueQp).

Q5: Keys for Developing DNA LMs

First, based on the scaling laws observed in NT [1] and Evo [2], expanding both model parameter size and training data scale — following practices in the LLM field — is critical for developing DNA language models.

Second, increasing the model's context window length is theoretically essential to prevent information loss by accommodating full DNA sequences. However, current DNA models have not yet empirically validated this advantage. As noted in Evo 1.5 [3], training with 8K context lengths outperforms 131K configurations. While longer input lengths technically allow processing full DNA sequences, they currently lack consistent performance improvements.

Additionally, most existing DNA LMs scale training data by aggregating reference genome datasets, which inherently fail to capture mutation patterns. This explains why DNA LMs underperform protein LMs [5] in zero-shot variant effect prediction [4]. Protein LMs benefit from more diverse training data (e.g., multiple sequence alignments), suggesting that incorporating population-level mutation data into DNA LM pretraining — as proposed in [6] — holds significant promise.

Finally, as emphasized in our work, genomic profiles are indispensable for advancing DNA LMs. DNA sequence functionality is regulated through cell-type-specific interactions with diverse genomic profiles (e.g., chromatin accessibility, histone modifications). These profiles collectively shape chromatin environments that govern gene expression patterns across cellular contexts.

References:

  1. Nucleotide Transformer: building and evaluating robust foundation models for human genomics
  2. Sequence modeling and design from molecular to genome scale with Evo
  3. Semantic mining of functional de novo genes from a genomic language model
  4. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
  5. Evolutionary Scale Modeling: Pretrained language models for proteins
  6. Pre-training Genomic Language Model with Human Variants for Better Understanding Functional Genomics
Reviewer Comment

Thank you for the authors' efforts. The generalization experiment helps to better clarify the model's performance, and thank you as well for adding the BEND benchmark. I would encourage the authors to gather all supplementary results into the main paper or the appendix to present comprehensive results across the different benchmarks. Personally, I am still deeply concerned about the complex model design and its influence on the future development of this DNA FM research direction. Thanks also for sharing the insights. Overall, I would like to increase my rating from 2 (weak reject) to 3 (weak accept).

Author Comment

Dear Reviewer ceSM,

Thank you for your thoughtful feedback, which has undoubtedly enhanced the quality of our work. We appreciate your positive response to our additional experiments. We commit to incorporating all supplementary experimental results, including the BEND benchmark, into the final version of our paper.

Regarding your ongoing concerns about our approach, we would like to offer the following clarifications:

  1. On model complexity: From a model architecture design perspective, our approach primarily builds upon MoE, which aligns with the widely used MoE designs in modern LLMs [1, 2]. We extended this by introducing task-specific gating networks [3] to address the multi-task nature of our problem (where tasks correspond to different species and genomic profiles); a sketch of the associated mutual-information routing objective follows this list. While MoE approaches are not yet common in genomic models, they have become a fundamental component of modern LLMs. In contrast, models like AlphaFold2 [4] introduced truly complex, specialized architectures that are quite uncommon in the broader ML community.

  2. On future research direction: Self-supervised DNA FMs and sequence-to-function models represent the two most critical model types in genomic modeling. While the ML community has actively advanced architectural improvements for self-supervised DNA FMs in recent years, few researchers have focused on architectural innovations for sequence-to-function models (typically emphasizing data expansion instead [5,6]). Given that genomic profile prediction is a complex task spanning multiple species and profiles, designing more suitable architectures is necessary rather than relying on the conventional approach of a unified encoder with parallel profile decoders. Our work demonstrates that bio-inspired modifications to the Enformer architecture can transform sequence-to-function models into competitive DNA foundation models. We hope our research encourages more ML researchers to participate in optimizing sequence-to-function models.
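
To make point 1 concrete, below is a minimal sketch of a mutual-information routing objective in the spirit of Mod-Squad [3]. The function name, the input convention, and the uniform prior over tasks are illustrative assumptions, not our exact loss.

```python
import torch

def mutual_information_loss(gate_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Sketch: maximise I(T; E) = H(E) - H(E|T), where T indexes tasks
    (species / profile categories) and E indexes experts. A high marginal
    entropy H(E) keeps the expert load balanced; a low conditional entropy
    H(E|T) lets each task prefer its own expert subset.

    gate_probs: (num_tasks, num_experts) average routing probabilities per
    task; a uniform prior over tasks is assumed. Illustrative only.
    """
    p_e_given_t = gate_probs / gate_probs.sum(dim=-1, keepdim=True)  # rows sum to 1
    p_e = p_e_given_t.mean(dim=0)                                    # marginal P(E)
    h_e = -(p_e * (p_e + eps).log()).sum()                           # H(E)
    h_e_given_t = -(p_e_given_t * (p_e_given_t + eps).log()).sum(-1).mean()  # H(E|T)
    return h_e_given_t - h_e  # equals -I(T; E); minimising it maximises the MI
```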

References:

[1] Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).

[2] "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation," https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[3] Chen, Zitian, et al. "Mod-squad: Designing mixtures of experts as modular multi-task learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[4] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.

[5] Chen, Kathleen M., et al. "A sequence-based global map of regulatory activity for deciphering human genetics." Nature genetics 54.7 (2022): 940-949.

[6] Linder, Johannes, et al. "Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation." Nature Genetics (2025): 1-13.

Final Decision

This paper proposes SPACE, a supervised genomic profile prediction approach that improves upon purely sequence-based DNA language models. Through a Mixture-of-Experts design, SPACE incorporates species-specific and profile-specific modules to capture diverse regulatory mechanisms and shows strong generalization across species. The authors addressed reviewer concerns by providing new experiments on additional benchmarks and clarifying biological motivations for each architectural component. While some gaps were noted—such as missing baselines and relatively modest gains on certain tasks—the authors demonstrated meaningful improvements over Enformer and showed that including genomic profile supervision helps the model learn more robust DNA representations, particularly for complex regulatory tasks. Overall, the approach and supplementary results convincingly argue that supervised training with genomic profiles can yield a competitive and biologically grounded DNA foundation model, and I recommend acceptance.