PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 5, 5, 4, 4 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 3.3 · Quality: 3.5 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Multimodal 3D Genome Pre-training

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose the first multimodal foundation model for 3D genomics, integrating Hi-C contact maps and chromatin accessibility to achieve unified semantic representation, outperforming state-of-the-art methods on diverse tasks.

Abstract

Keywords
3D Genomics · Multimodal Foundation Model · Hi-C data · Chromatin Accessibility

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a multi-modal foundation model that takes as input Hi-C data and epigenomic data. The model itself includes some interesting techniques to capture cross-modal information while preserving both shared and distinct information from the two modalities. After pre-training, the model is fine-tuned for several downstream tasks, where it outperforms state-of-the-art methods.

Strengths and Weaknesses

One fairly significant omission is any reference to or comparison to the existing Hi-C foundation model, HiCFoundation (Wang et al. Biorxiv 2024). The model proposed here is applicable to more tasks, since it takes in both Hi-C data and epigenomics, whereas HiCFoundation operates only on Hi-C data. But for loop calling, both models could be compared head-to-head. And certainly in the related work section, this existing Hi-C foundation model should be discussed.

One nice feature is that you have done train/test splitting at the level of both chromosomes and cell lines. On the other hand, I think you should only report the cross-cell line + cross-chromosome results. I.e., the experiments in Section 5.1 all should have been done in the test set cell lines and the test set chromosomes. Later sections (e.g., results in Table 5) should also use this setting. The cross-chromosomal prediction task is not really practically interesting.

You have also done a good job of identifying state-of-the-art methods to compare against (with the obvious exception of the other Hi-C foundation model mentioned above).

I also like the model design, and in particular the idea of maintaining cell type-specific and cell type-agnostic latent representations of each of the inputs.

The sentence starting in line 398 does not make sense to me: "Notably, the Hi-C contact map prediction task inherently predicts contact maps, thus lacking some experiments with Hi-C input." Similarly, this sentence is hard to parse: "Further constraint for orthogonal constaint [sic] is provided in Appendix C.4." What is a constraint for a constraint?

Questions

The data set is described as if it's very large, but it's really not. There are only six cell lines being used here, even though there are many more available paired datasets in ENCODE and 4DN. Why was such a small number of cell lines used?

The text largely discusses the model inputs in terms of "epigenomic tracks." Only with some effort did I figure out that the inputs are DNase-seq and ATAC-seq. This seems like a weird choice, because these two are largely redundant with one another. Why not use a histone modification instead? If you stick with DNase-seq and ATAC-seq, then you should just refer to this as "chromatin accessibility" rather than the more general term "epigenomic."

The paper claims (line 229) that MIX-HIC can infer a missing modality based on one modality. I would like to see empirical evidence that this inference is accurate, both from Hi-C to chromatin accessibility and vice versa.

I was pretty surprised by the shape of the learning curves in Figure 4. I would expect that if you eliminate 99.99% of the training data the performance should drop precipitously for most methods. I would also expect the gap between a pre-trained model and a non-pre-trained model to get much larger when the training set is tiny. Why don't we see these trends?

Limitations

Yes

Final Rating Justification

The authors did a good job of responding to my questions. I still think this is a good candidate for acceptance.

Formatting Issues

None

Author Response

We thank the reviewer for a thorough and rigorous examination of our paper. We aim to address the questions and provide clarifications to further improve the work.

Q1. HiCFoundation comparison and discussion.

Thank you for highlighting this relevant work. The omission is due to its concurrent publication as a preprint. HiCFoundation excels at learning deep structural patterns intrinsic to the Hi-C contact matrix. However, it can only infer the regulatory state that drives loop formation from these structural patterns. In contrast, MIX-HIC is designed to learn both structural patterns from Hi-C contact maps and the causal regulatory signals from chromatin accessibility. We benchmark MIX-HIC against HiCFoundation on loop detection in K562 cells.

| Methods | F1 | AUROC |
| --- | --- | --- |
| HiCFoundation | 0.8064 | 0.8978 |
| MIX-HIC | 0.8267 | 0.9194 |

The results confirm that MIX-HIC outperforms HiCFoundation by approximately 2%, validating the benefit of our multimodal design. By integrating chromatin accessibility, MIX-HIC is inherently more versatile and applicable to a broader range of tasks. We will incorporate these results and a discussion of HiCFoundation into the manuscript.
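For readers less familiar with the AUROC numbers above: AUROC is the probability that a randomly chosen positive loop candidate is scored above a randomly chosen negative one. A minimal pure-Python sketch (the labels and scores below are illustrative, not the actual benchmark data):

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count half)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfectly ranked toy example scores 1.0
auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])  # -> 1.0
```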

Q2. I think you should only report the cross-cell line + cross-chromosome results in Section 5.1 and later sections.

Thank you for this excellent suggestion. We agree that the combined cross-cell line and cross-chromosome setting is the most rigorous test of generalization. With this in mind, our evaluation is designed in two complementary stages. The cross-chromosome setting (Section 5.1) serves as a foundational benchmark [1, 2], essential for validating that the model has learned the universal principles of chromatin folding rather than memorizing chromosome-specific features.

Crucially, we already performed the most rigorous test: the experiment in Section 5.3 (Figure 5) is precisely the combined cross-cell-line and cross-chromosome evaluation, though we acknowledge our original description may not have made this sufficiently explicit. The results demonstrate that MIX-HIC consistently outperforms other baseline methods in this rigorous setting, providing compelling evidence of MIX-HIC's ability to generalize across entirely unseen cellular and chromosomal contexts simultaneously.

While re-running every experiment from scratch under this combined setting is computationally prohibitive for the rebuttal period, the definitive success of MIX-HIC on this most challenging test gives us high confidence in its robust performance across all tasks. We will revise the manuscript to explicitly label the Section 5.3 results as the combined cross-cell-line and cross-chromosome benchmark and to clarify our overall evaluation strategy. Furthermore, we will provide comprehensive results for all tasks under this setting in our public code repository upon publication.

[1] Yang et al. "Epiphany: predicting Hi-C contact maps from 1D epigenomic signals." Genome Biology, 2023.

[2] Salameh et al. "A supervised learning framework for chromatin loop detection in genome-wide contact maps." Nature Communications, 2020.

Q3. Two confusing sentences (line 398 and line 389).

Thank you for pointing these out. We apologize for the confusion and have revised both sentences for clarity. The first sentence now reads: "For the Hi-C contact map prediction task, the model is trained to predict the contact map using only 1D chromatin accessibility as input. The Hi-C contact map itself is the prediction target (output), making its use as an input feature methodologically invalid." The second sentence contained a typo and has been corrected to "Further details for the orthogonal constraint are provided in Appendix C.4."

Q4. Why was such a small number of cell lines used?

Thank you for the comment. The number of cell lines is limited because our dataset required pairing Hi-C data from 4DN with corresponding chromatin accessibility from ENCODE, and all cell lines meeting this criterion have been included in our study. While the number of cell lines is currently limited, our work pioneers the first multimodal 3D genome foundation model. Generating a large-scale, high-quality paired dataset requires meticulous effort, and we have successfully curated over 1.27 million rigorously validated sample pairs. This resource provides the necessary scale for robust pre-training and a valuable foundation for future research in 3D genome analysis.

Despite the limited number of source cell lines, MIX-HIC demonstrates strong cross-cell-line + cross-chromosome generalization, as we detailed in our response to Q2. This robust performance in the most rigorous evaluation setting indicates that our pre-training on this large sample set successfully captured fundamental and generalizable biological representations.

We agree that increasing cell line diversity is a crucial next step to enhance the model's generalization across a wider range of biological contexts. Expanding this dataset is a key priority for our ongoing work, and we plan to release larger and more diverse datasets to the community in the future.

Q5. Reasons for choosing DNase-seq and ATAC-seq as inputs. Why not use a histone modification instead? If you stick with DNase-seq and ATAC-seq, then you should just refer to this as "chromatin accessibility".

Thanks for this excellent feedback. Regarding the choice of DNase-seq and ATAC-seq, while both assays probe chromatin accessibility, their distinct enzymatic biases can provide complementary, rather than purely redundant, signals, creating a more robust combined representation. In addition, we select these two tracks over histone modifications due to data availability. Our rigorous cross-database pairing process requires finding high-quality assays that are consistently available across our selected cell lines, and ATAC-seq and DNase-seq best fulfill this criterion. Moreover, MIX-HIC is inherently flexible and designed to be agnostic to specific 1D genomic track inputs. Integrating key histone modifications like H3K27ac is an important direction for future work to build a more comprehensive model. Finally, we agree that "chromatin accessibility tracks" is a more precise term, and we will revise the manuscript accordingly.

Q6. Missing modality inference from Hi-C to chromatin accessibility and vice versa.

We thank the reviewer for this insightful question. We perform an ablation experiment to assess the cross-modal inference ability of MIX-HIC. The results on the loop detection task are shown below.

| Epi | Hi-C | Pre-trained | GM12878 | K562 |
| --- | --- | --- | --- | --- |
| ✓ | – | – | 0.8236 | 0.8054 |
| ✓ | – | ✓ | 0.8494 | 0.8226 |
| – | ✓ | – | 0.9065 | 0.9072 |
| – | ✓ | ✓ | 0.9135 | 0.9159 |
| ✓ | ✓ | ✓ | 0.9209 | 0.9194 |

The results demonstrate that augmenting a single modality with its inferred counterpart consistently boosts performance, narrowing the gap to the full bimodal model. This ability stems from the shared latent space learned during large-scale pre-training. We will incorporate the related results into the manuscript.
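Conceptually, such shared-latent-space inference can be pictured as encoding the available modality into the shared space and decoding the missing one from it. The following is a deliberately simplified linear sketch (MIX-HIC's actual encoders are Transformer-based; all weights and dimensions below are random placeholders, not learned parameters):

```python
import numpy as np

# Illustrative dimensions and random placeholder weights (assumptions, not
# the model's actual parameters).
rng = np.random.default_rng(0)
d_epi, d_hic, d_latent = 32, 64, 16
W_epi_enc = rng.normal(size=(d_epi, d_latent))   # accessibility -> shared latent
W_hic_dec = rng.normal(size=(d_latent, d_hic))   # shared latent -> Hi-C features

def infer_hic_from_epi(epi_features):
    """Given only accessibility features, produce a pseudo Hi-C representation
    by routing through the shared latent space."""
    z = epi_features @ W_epi_enc
    return z @ W_hic_dec
```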

Q7. Performance should drop for most methods and the gap between a pre-trained model and a non-pre-trained model should get much larger in low-data regimes.

Thank you for this sharp observation. The performance of most methods remains robust, primarily stemming from the tractable nature of the task and the employed architectures. Powerful and repetitive biological signatures of chromatin loops make them relatively easy for most models to detect. However, architectural differences explain the performance variations. Peakachu, a Random Forest model, is inherently stable, as it relies on engineered features that are less sensitive to data volume. In contrast, DLoopCaller's CNN architecture is more data-hungry and shows a steeper decline. Our MIX-HIC architecture, even without pre-training, proves more data-efficient: the self-attention mechanism is better suited for capturing the global and long-range dependencies in contact maps than local CNNs, and its bimodal input provides complementary information, enhancing robustness even in low-data regimes.

The reason why the gap between pre-trained and non-pre-trained models does not appear to widen dramatically may be largely attributed to our robust backbone architecture and the limited variety of cell lines available for pre-training. Our backbone architecture has proven notably effective, enabling the non-pre-trained model to already achieve strong performance—this in turn helps narrow the potential gap. As we noted in our response to Q4, while the dataset boasts a large sample size, the limited number of cell lines available under our rigorous pairing process may, to some extent, constrain the performance gains from pre-training.

Crucially, pre-training provides a dramatic boost in model stability. For instance, in the 0.01% data experiment on GM12878, the pre-trained MIX-HIC-Bimodal exhibits a standard deviation of just 0.0052—nearly three-fold lower than the 0.0151 of its non-pre-trained counterpart. This means pre-training yields far more reliable and consistent results when data is scarce. Furthermore, the full power of pre-training becomes most apparent in data-rich scenarios (Table 3), where it allows MIX-HIC to overcome the saturation seen in other methods by leveraging its comprehensive learned representations.

Comment

Q4: You said you had to have Hi-C data from 4DN and corresponding chromatin accessibility data from ENCODE. But there is a lot of Hi-C data in ENCODE (more than 4DN, I think). Why not use that?

Q5: You say you chose DNase-seq and ATAC-seq due to data availability. Aren't there more biosamples with CTCF ChIP-seq than both of these assays? I recognize that DNase-seq and ATAC-seq are not perfectly redundant, but they do basically measure the same phenomenon. It still seems better to do this using some other type of measurement.

I'm satisfied with the answers to my other questions.

Comment

We sincerely thank the reviewer for the constructive feedback. We are glad that we could address your other concerns and appreciate the opportunity to clarify the final two points.

Q4: Concern about choice of Hi-C data

We appreciate the reviewer’s astute observation about ENCODE’s rich Hi-C resources. Our initial choice to use the Hi-C data from the 4DN portal was driven by its highly standardized data processing pipelines, ensuring data uniformity. However, we acknowledge that prioritizing simplified processing led us to overlook the extensive and valuable ENCODE Hi-C datasets, which require additional preprocessing. We agree ENCODE’s breadth is invaluable. In the revised manuscript, we will detail plans to integrate ENCODE data by reprocessing raw reads with 4DN’s pipeline, expanding our training set with more cell lines to enhance generalization.

Q5: Concern about choice of input assays

We appreciate you raising this important point. We fully acknowledge the abundance of publicly available CTCF ChIP-seq datasets. As the first attempt to create our multimodal foundation model, we prioritized chromatin accessibility because it captures the genome's foundational regulatory information. While DNase-seq and ATAC-seq have some inherent redundancy, they also offer complementary insights due to their distinct enzymatic biases. The enzymes' different cutting biases cause each assay to detect unique accessible sites, which together yield a more complete accessibility landscape [1].

The flexible design of MIX-HIC makes the integration of other epigenomic tracks like CTCF ChIP-seq entirely feasible. Incorporating distinct and complementary epigenomic tracks will enable MIX-HIC to capture a more comprehensive understanding across various epigenomic layers.

We truly appreciate the time and effort you have dedicated to improving our manuscript.

[1] Meyer et al. "Identifying and mitigating bias in next-generation sequencing methods for chromatin biology." Nature Reviews Genetics, 2014.

Official Review
Rating: 5

This work proposes to pre-train models for better 3D genomics representation, jointly training models to integrate genome structure information (Hi-C) and epigenomic tracks. They curate a pre-training dataset of roughly 1 million samples (after cleaning). After training, the model performed best across tasks such as 3D chromatin organization prediction, loop detection, and gene expression prediction.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written and the idea presented is very clear.
  2. Bringing forward a dataset of this magnitude is important for this AI4Science domain.
  3. The evaluation is pretty comprehensive and the baselines are strong enough to robustly support the superiority of the proposed model.

Questions

  1. Details and citations in Section 3. As the reviewer is not familiar with preprocessing the Hi-C contact maps and the epigenomic tracks, can the authors add citations regarding preprocessing? Also, a few questions:
  • Are the log-transformation and 50 x 50 matrices standard? Would a 5 kb resolution be too coarse?
  • Why filter based on 10% non-zero Hi-C interactions? How much data is filtered out during this process?
  2. Error analysis:
  • What are the main obstacles that stop MIX-HIC-Bimodal from achieving even higher performance? Have the authors considered performing some error analysis or stratified analysis to dive further into how and why the proposed model is better than others?
  3. Open source:
  • Do the authors plan to open-source the pre-training data?

Limitations

yes

Final Rating Justification

The authors answered my questions and resolved my concerns. I recommend acceptance of this work.

Formatting Issues

no

Author Response

We thank the reviewer for all the comments. We believe we have addressed all the concerns and are happy to follow up in the discussion phase.

Q1. Add citation regarding preprocessing? Is log-transformation and 50 x 50 matrices standard? Would a 5kb resolution be too rough? Why filtering out 10% non-zero Hi-C interactions? How many data is filtered out during this process?

Thank you for these detailed questions. We will revise Section 3 to include more detailed justifications and citations for our data processing pipeline. Our parameter choices and preprocessing steps follow common practices in the 3D genomics field.

First, due to the high cost of deep sequencing required for Hi-C experiments [1], publicly available datasets are typically provided at resolutions such as 5 kb, 10 kb, or coarser. A 5 kb resolution is a fine-grained and effective choice for deep learning models [2, 3]. The 50 x 50 matrix size is a direct result of our choice of a 250 kb genomic window (50 bins * 5 kb/bin = 250 kb). We chose this window size because it captures the typical length scale of key regulatory structures like chromatin loops, ensuring that most functional units are fully contained within the inputs [4, 5].

Second, raw Hi-C contact values are count data with a highly skewed distribution and a large dynamic range. Applying variance-stabilizing KR normalization [6] and a log-transformation is a critical step to make the data more stable for modeling with standard deep learning architectures [4, 6, 7].

Finally, the zero Hi-C contacts filtering is a standard quality control step to remove uninformative regions [5, 7]. Hi-C contact maps are inherently sparse, particularly for long-range interactions far from the diagonal. Including these extremely sparse, low-signal windows would introduce noise and degrade model training. As shown in Table 1, this filter removed approximately 30% of the raw windows. Crucially, this process retains a massive, high-quality dataset of over 1.2 million sample pairs, which is more than sufficient for robust pre-training.
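The windowing, transformation, and filtering steps described above can be sketched as follows. This is a minimal illustration under our own assumptions (function and parameter names are ours, non-overlapping diagonal windows, KR normalization assumed already applied), not the authors' actual pipeline:

```python
import numpy as np

def extract_windows(contact_map, window_bins=50, min_nonzero_frac=0.10):
    """Slide a 250 kb window (50 bins x 5 kb/bin) along the diagonal of a
    chromosome-wide Hi-C contact map, log-transform the counts, and drop
    windows that fail the sparsity quality filter."""
    n = contact_map.shape[0]
    windows = []
    for start in range(0, n - window_bins + 1, window_bins):
        w = contact_map[start:start + window_bins, start:start + window_bins]
        # quality control: keep windows with at least 10% non-zero contacts
        if np.count_nonzero(w) / w.size >= min_nonzero_frac:
            # log(1 + x) stabilizes the skewed, high-dynamic-range counts
            windows.append(np.log1p(w))
    return windows
```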

We will integrate these justifications and the relevant citations into the manuscript to make our pipeline clearer and reproducible.

[1] Chang, Lei, et al. "Droplet Hi-C enables scalable, single-cell profiling of chromatin architecture in heterogeneous tissues." Nature Biotechnology (2024): 1-14.

[2] Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research, 2023.

[3] Karbalayghareh et al. "Chromatin interaction–aware gene regulatory modeling with graph attention networks." Genome Research, 2022.

[4] Yang et al. "Epiphany: predicting Hi-C contact maps from 1D epigenomic signals." Genome Biology, 2023.

[5] Salameh et al. "A supervised learning framework for chromatin loop detection in genome-wide contact maps." Nature Communications, 2020.

[6] Kaul et al. "Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2." Nature Protocols, 2020.

[7] Wang et al. "DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes." PLoS Computational Biology, 2022.

Q2. Performance Bottlenecks and Error Analysis.

Thank you for this valuable question. We hypothesize that the primary obstacle preventing MIX-HIC from achieving even higher performance is not in the model architecture but the inherent noise and sparsity of the input Hi-C data. Hi-C experiments are known to be affected by factors like sequencing depth and biases, which can obscure true biological signals [1, 2].

Thus, we perform an error analysis by simulating low-coverage sequencing scenarios. Specifically, we corrupt Hi-C contact maps by perturbing different ratios of non-zero contacts with sparsity and Gaussian noise. We then evaluate both our pre-trained MIX-HIC and the supervised baseline, Peakachu, on the chromatin loop detection task.

| Varying ratio | 0.0 | 0.5 | 0.7 | 0.9 |
| --- | --- | --- | --- | --- |
| Peakachu | 0.8833 | 0.7659 | 0.5091 | – |
| MIX-HIC | 0.9194 | 0.8899 | 0.8754 | 0.8486 |

As shown in the table, the performance of Peakachu degrades sharply as data quality degrades, falling to near-random performance (0.5091 AUROC) when 70% of the contacts are disturbed. In contrast, MIX-HIC exhibits remarkable robustness, with its performance declining by less than 7% across all ratios. We attribute this resilience to the power of our large-scale pre-training strategy, which learns the fundamental principles of 3D genome organization from over a million examples. This experiment demonstrates that data quality is a key performance bottleneck and provides a clear explanation of the benefits of pre-training.
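A corruption procedure of this kind can be sketched as below. The exact split between dropped and noised contacts and the noise scale are our assumptions for illustration, not the paper's protocol:

```python
import numpy as np

def corrupt_contacts(contact_map, ratio, noise_std=1.0, seed=0):
    """Perturb a fraction `ratio` of the non-zero Hi-C contacts to simulate
    low-coverage data: half of the picked contacts are zeroed out (sparsity)
    and half receive additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    corrupted = contact_map.astype(float).copy()
    rows, cols = np.nonzero(contact_map)
    n_pick = int(ratio * len(rows))
    idx = rng.choice(len(rows), size=n_pick, replace=False)
    half = n_pick // 2
    corrupted[rows[idx[:half]], cols[idx[:half]]] = 0.0            # dropout
    keep = idx[half:]
    corrupted[rows[keep], cols[keep]] += rng.normal(0.0, noise_std, len(keep))
    return corrupted
```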

[1] Yardımcı et al. "Measuring the reproducibility and quality of Hi-C data." Genome Biology, 2019.

[2] Spill et al. "Binless normalization of Hi-C data provides significant interaction and difference detection independent of resolution." Nature Communications, 2019.

Q3. Do the authors plan to open-source pre-training data?

Thank you for your interest in our work! We are committed to releasing the complete pre-training dataset and the source code for MIX-HIC with detailed documentation upon publication. We believe that providing this large-scale dataset alongside our open-source model will create a valuable foundational resource for the community, ultimately accelerating progress in 3D genomics research.

Comment

I thank the authors for their clarifications. Having read the reviews from reviewers, and the rebuttals from the authors, I keep my initial assessment and rating of the paper.

Comment

We sincerely thank the reviewer for their valuable time and constructive feedback throughout the review process. We are pleased to know that our rebuttal has addressed your initial concerns. Thank you again for your review!

Official Review
Rating: 4

This paper proposes a deep neural network (DNN) architecture called MIX-HIC for DNA analysis. The model takes as input Hi-C data alongside several genomic assays—including ATAC-seq, DNase-seq, CAGE-seq expression data, and CTCF ChIA-PET. It is pretrained with a representation-learning objective and then fine-tuned for downstream tasks. The contributions are as follows:

  • A model for learning joint representations of Hi-C and genomic assay tracks.
  • A large-scale paired dataset comprising Hi-C maps and common genomic assay tracks.
  • Demonstrated applications of the model to genomic track prediction.

Strengths and Weaknesses

Strengths

  • The paper is well-motivated; Hi-C is an important data type that should be integrated into DNA-sequence modelling.
  • The model architecture, training-loss design, and ablation study are comprehensive.
  • The experiments include both conventional and deep-learning baselines to validate performance.

Weaknesses

  • The absence of a DNA-sequence modality restricts the model’s use cases, especially because many available tasks use DNA sequence as the sole input.
  • The terms “3D genome” and “Hi-C” are not interchangeable. While Hi-C provides a good indicator of the genome’s 3D conformation, it does not fully capture it. Previous work predicts 3D genome structure from both Hi-C and DNA sequence. The DNA sequence itself determines physical properties—e.g., bond rigidity between adjacent nucleotides and the ease with which DNA bends—that may be important.
  • See the Questions section for detailed inquiries about the experiments.

Questions

  • What is the total parameter count for the MIX-HIC model?
    It would be helpful if the authors showed how model size affects the pre-training loss. According to the ablation study in Figure 8, performance appears insensitive to model size, which is counter-intuitive. If that observation is accurate, why does model size have so little impact on downstream performance?

  • Why was MSE loss used for CAGE-track prediction?
    The common practice is to employ the Poisson negative log-likelihood loss, as in [1] and [2].

  • Could you report Enformer’s zero-shot performance on the CAGE-track prediction task?
    Including Enformer [1] as a baseline would clarify how MIX-HIC compares with sequence-based models.

[1] Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J.R., Grabska-Barwinska, A., Taylor, K.R., Assael, Y., Jumper, J., Kohli, P. and Kelley, D.R., 2021. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10), pp.1196-1203.

[2] Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).

Limitations

Yes

Final Rating Justification

I will maintain my original score. The authors’ rebuttal provides additional clarification on the model training details. Although the responses to Q4 and Q5 were somewhat unclear, the overall quality of the paper remains fair.

Formatting Issues

No.

Author Response

Thank you for your constructive feedback! We treasure the opportunity to address your concerns.

Q1. The absence of a DNA-sequence modality restricts the model’s use cases.

Thank you for this forward-looking suggestion. We agree that incorporating DNA sequence is a critical next step to create a truly comprehensive 3D genome model. We'd like to clarify the strategic rationale for our current focus and our clear roadmap for future integration.

Our primary goal is to pioneer the first multimodal foundation model for 3D genome, integrating 3D architecture (Hi-C contact maps) and cell-type-specific regulatory state (epigenomics). As established by prior works [1, 2], the interplay between these two modalities provides potent signals for predicting dynamic genomic functions like gene expression and chromatin looping across different cell types.

However, modeling long-range DNA sequences encounters two critical challenges: cell-type invariance and computational inefficiency. Since the DNA sequence is static, models trained on DNA sequences would struggle to generalize across different cell types. Moreover, processing megabase-scale DNA inputs demands substantial computational resources, creating a significant bottleneck for large-scale pre-training and downstream applications. The current MIX-HIC circumvents these challenges while establishing a strong baseline for multimodal 3D genome modeling. We view this as a pragmatic and foundational first step.

Recent breakthroughs in long-range DNA modeling (e.g., HyenaDNA [3] and AlphaGenome [4]) make the direct integration of sequence data into our framework more feasible and promising. As we mentioned in the "Broader Impacts and Limitations" section, we plan to further incorporate DNA sequence into MIX-HIC for capturing the full hierarchy of genome regulation. The related descriptions will be added in the camera-ready manuscript.

[1] Tan et al. "Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening." Nature Biotechnology, 2023.

[2] Yang et al. "Epiphany: predicting Hi-C contact maps from 1D epigenomic signals." Genome Biology, 2023.

[3] Nguyen et al. "HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution." Advances in Neural Information Processing Systems, 2023.

[4] Avsec et al. "AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model." bioRxiv, 2025.

Q2. The terms “3D genome” and “Hi-C” are not interchangeable. Previous work predicts 3D genome structure from both Hi-C and DNA sequence.

Thank you for this crucial feedback. We completely agree that the terms "3D genome" and "Hi-C" are not interchangeable. As stated in our introduction, our intent is to define the Hi-C technique as a method to "quantify high-resolution 3D chromatin interactions," thus positioning it as a powerful measurement of 3D genome organization, not the entity itself.

To address your feedback, we have performed a thorough review of the manuscript to eliminate any potential ambiguity. For instance, the phrase "...integrates both 3D genome structure and epigenomic tracks" has been revised to "...integrates both Hi-C contact maps and epigenomic tracks."

We want to assure the reviewer that this reflects a potential for misinterpretation in our phrasing, not a conceptual misunderstanding in our methodology. The model's architecture and experimental design are soundly based on the distinct roles of Hi-C data and epigenomic tracks, with the ultimate goal of integrating these views to form a comprehensive representation of 3D genome organization. We also concur on the importance of the DNA sequence, and as discussed in our response to Q1, its integration is a central pillar of our future work.

Q3. Total parameter count for the MIX-HIC model. How model size affects pre-training loss. Why does model size have so little impact on downstream performance?

Thank you for these insightful questions. Our primary configuration (feature dimension 128, 2 Transformer layers) has approximately 74 million parameters. The pre-training loss across different model sizes is summarized below.

| Layers | Epoch 1 | Epoch 20 | Epoch 50 | Epoch 100 |
| --- | --- | --- | --- | --- |
| 2 | 3.0648 | 0.0898 | 0.0416 | 0.0224 |
| 4 | 3.0519 | 0.0867 | 0.0383 | 0.0171 |
| 8 | 3.0227 | 0.0838 | 0.0329 | 0.0108 |

While larger models achieve lower pre-training loss, this does not translate to better downstream performance (as demonstrated in Figure 8), aligning with previous observations [1]. We argue this is a key strength, demonstrating our model's robustness and parameter efficiency. The performance suggests our architecture reaches a "sweet spot" [2] with a moderate parameter count and computational efficiency, sufficient to capture the essential biological patterns without incurring the high risk of overfitting during fine-tuning on smaller, task-specific datasets.

As our pre-training corpus expands, we anticipate larger architectures will become beneficial. We thank the reviewer for this suggestion and will add this analysis to the final manuscript.

[1] Liu et al. "Same pre-training loss, better downstream: Implicit bias matters for language models." International Conference on Machine Learning, 2023.

[2] Nakkiran et al. "Deep double descent: Where bigger models and more data hurt." Journal of Statistical Mechanics: Theory and Experiment, 2021.

Q4. Why was MSE loss used for CAGE-track prediction instead of the Poisson negative log-likelihood loss?

Thank you for your question regarding our choice of loss function. The Poisson negative log-likelihood loss is suitable for modeling raw genomic counts. However, such data are characterized by a high dynamic range and significant overdispersion [1, 2], which may create challenges during model training and may lead to a model that is overly sensitive to extreme outliers.

To address this, we apply both RPGC normalization and a log transformation to the CAGE-seq expression data following previous works [3, 4]. These transformations are critical for stabilizing the variance across the spectrum of expression values and compressing the highly skewed distribution of the raw counts data. After transformation, the data distribution more closely approximates a Gaussian distribution instead of a Poisson distribution. Consequently, the MSE loss becomes a more appropriate and robust choice for these continuous values [3, 4].
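A minimal sketch of the rationale above (illustrative, not the paper's exact pipeline): depth-normalize raw CAGE-like counts, then log-transform them, so that squared-error regression is no longer dominated by extreme outliers. The `scale` argument stands in for the RPGC factor, which in practice depends on sequencing depth and effective genome size.

```python
import numpy as np

def normalize_counts(counts, scale=1.0):
    """Hypothetical normalization helper: RPGC-style depth scaling
    followed by a log(1 + x) transform to stabilize variance."""
    return np.log1p(counts * scale)

raw = np.array([0.0, 3.0, 27.0, 450.0, 12000.0])  # heavy-tailed counts
y = normalize_counts(raw)

# The transform collapses ~4 orders of magnitude into roughly one,
# making the MSE loss between prediction and target well behaved.
dynamic_range_raw = raw.max() - raw.min()        # 12000.0
dynamic_range_log = y.max() - y.min()            # about 9.39
mse = np.mean((y - normalize_counts(raw)) ** 2)  # 0.0 for a perfect fit
```

After this transform the targets are continuous and approximately Gaussian, which is why the MSE loss is the natural choice over a Poisson likelihood that assumes integer counts.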

We will incorporate related clarification of this rationale into the Methods section of the final manuscript.

[1] Robinson et al. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 2010.

[2] Love et al. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biology, 2014.

[3] Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research, 2023.

[4] Karbalayghareh et al. "Chromatin interaction–aware gene regulatory modeling with graph attention networks." Genome Research, 2022.

Q5. Compare with sequence-based models on the CAGE-track prediction task, e.g., by performing zero-shot evaluation of Enformer.

We appreciate the reviewer's constructive comments. Comparing MIX-HIC to a landmark sequence-based model like Enformer would indeed provide valuable context. However, a direct, quantitative benchmark is challenging due to a fundamental mismatch in the task formulation. Enformer is designed to predict raw CAGE-seq expression counts, while MIX-HIC predicts normalized and log-transformed expression values for variance stabilization. Directly comparing these results would not be a fair or meaningful evaluation. While retraining Enformer on our specific task formulation would be computationally prohibitive for this rebuttal period, we will provide such a comparison in our public code repository.

Despite the absence of a direct Enformer benchmark, our experiments have offered strong insights in comparison with sequence-based models. Our CAGE-seq prediction benchmark (Table 4) includes a comprehensive comparison against EPCOT, another powerful DNA sequence-based model.

| Methods | GM12878 | K562 |
| --- | --- | --- |
| EPCOT-LSTM | 0.4723 | 0.8704 |
| EPCOT-Transformer | 0.8578 | 0.8230 |
| MIX-HIC-Bimodal (Ours) | 0.8833 | 0.9077 |

The results demonstrate that MIX-HIC outperforms the best EPCOT variant, with gains of 3.0% on GM12878 and 4.3% on K562. Crucially, fine-tuning MIX-HIC for this task takes merely 1-2 hours on a single GPU, whereas fine-tuning the sequence-based EPCOT model requires over a day. These results demonstrate that MIX-HIC achieves robust performance with significantly greater computational efficiency. We will add a detailed discussion of Enformer and efficiency comparison results to the manuscript to further highlight the practical advantage of MIX-HIC.

Comment

Thank you for your reply. Regarding Q4 and Q5, I think the authors may not be aware of the implementation details of existing Sequence2CAGE models. Existing models rarely work with raw CAGE-seq expression counts: Basenji2, Enformer, and Borzoi all normalize the prediction target to log scale with RPGC normalization and a log transformation, the same as what you have in the paper. With that being said, thank you for your rebuttal.

Comment

We sincerely thank the reviewer for this crucial clarification and apologize for the misunderstanding in our previous response. The reviewer is correct that models like Basenji2, Enformer, and Borzoi work with log-transformed and normalized expression counts.

This log-transformation and normalization is a standard setting adopted in previous works [1, 2]. This process converts discrete counts into continuous values, making them incompatible with Poisson-based loss functions (which assume integer counts) and far more amenable to regression with MSE loss, as variance is stabilized and the dynamic range is compressed. Although reimplementing and fine-tuning Enformer under identical evaluation protocols is computationally intensive within the rebuttal period, we will provide a rigorous head-to-head benchmark using the same RPGC normalization, log-transformation, and evaluation metrics (e.g., R² and MSE) in our open-source GitHub repository. This will enable an unbiased comparison of our model’s performance relative to these baselines.

We sincerely appreciate the reviewer’s careful reading and constructive feedback, which significantly strengthens the clarity and rigor of our work.

[1] Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research, 2023.

[2] Karbalayghareh et al. "Chromatin interaction–aware gene regulatory modeling with graph attention networks." Genome Research, 2022.

Review
4

This paper introduces MIX-HIC, a multimodal pretraining framework designed to learn rich representations of 3D genome structure and epigenomic signals. The method integrates Hi-C data (3D genome conformation) and epigenomic tracks (e.g., DNase, ChIP-seq signals) across hundreds of cell types to train a transformer-based model.

Strengths and Weaknesses

Strengths

  1. Innovative Multimodal Integration: Combines epigenomic tracks and Hi-C contact maps, which are typically modeled separately.

  2. Strong Pretraining Design: The contrastive learning objective aligns the 1D (track) and 2D (Hi-C) modalities, improving cross-view consistency.

  3. Empirical Performance: Demonstrates state-of-the-art results on multiple downstream tasks.

Weaknesses

  1. Lack of Interpretability: The model is powerful but opaque — biological interpretability of learned embeddings is not deeply explored.

  2. Limited Ablation on Modalities: The contribution of each modality (Hi-C vs. epigenomic) is not independently quantified in every downstream task.

  3. Sparse Evaluation on Rare Cell Types: Performance in rare or noisy datasets (e.g., low-coverage Hi-C) is not discussed.

Questions

We consider the prediction task: Use epigenomic tracks to predict the Hi-C contact map. How many benefits can you get from using contrastive learning? I.e., if you directly train a model (the same architecture as shown in the paper) using epigenomic tracks as inputs to predict the Hi-C contact map, what is the performance compared with your model?

Limitations

The authors should provide more biological background in the paper, as this is a conference primarily focused on machine learning; the biological parts are hard for reviewers to understand.

Sparse Evaluation on Rare Cell Types: Performance in rare or noisy datasets (e.g., low-coverage Hi-C) is not discussed.

Final Justification

Thanks for the response from the authors. All my concerns are solved, and I assign equal weight to each of them. I appreciate the authors' contribution; this paper is very interesting and meaningful. I keep my score at 4 and support the acceptance of this paper.

Formatting Issues

No

Author Response

We thank the reviewer for the constructive comments. We provide the following responses regarding the concerns.

Q1. Lack of Interpretability.

We thank the reviewer for the insightful feedback. The presence and strength of certain epigenomic signals are highly associated with the formation of chromatin loops [1, 2]. For example, CTCF binding sites are typically located within regions of open chromatin, identifiable as peaks in assays such as DNase-seq or ATAC-seq.

To validate that MIX-HIC learns the fundamental principle linking epigenomic signals to 3D genome architecture, we design an in silico perturbation experiment. We hypothesize that the model's loop predictions should be governed by the underlying epigenomic signals at the loop anchors. We focus our analysis on 118 high-confidence chromatin loops in the K562 cell line, each characterized by convergent CTCF motifs at their anchors (identified via FIMO [3]). We then systematically attenuate the input epigenomic tracks (ATAC-seq and DNase-seq) by down-sampling the signal intensity of peaks within these anchor regions at varying ratios. These perturbed epigenomic profiles are then fed into the MIX-HIC-InferMap model (trained on GM12878) to assess the impact on loop prediction. The results are shown below.

| Varying ratio | 0.0 | 0.5 | 0.7 | 0.8 | 0.9 |
| --- | --- | --- | --- | --- | --- |
| MIX-HIC-InferMap | 100% (118) | 98% (116) | 61% (72) | 15% (18) | 0% (0) |

Using the unaltered epigenomic data, MIX-HIC successfully recalls all 118 CTCF-mediated loops. As we progressively degrade the epigenomic signals at the loop anchors, the model's recall for these loops decreases. This demonstrates that MIX-HIC's predictions are mechanistically grounded in biologically pertinent epigenomic features, confirming that the model has captured a fundamental principle of genome organization and supporting its biological interpretability.
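The perturbation described above can be sketched as follows (function and variable names are illustrative, not the authors' implementation): the epigenomic signal inside loop-anchor windows is attenuated by a given down-sampling ratio while the rest of the track is left untouched.

```python
import numpy as np

def attenuate_anchors(track, anchor_windows, ratio):
    """Attenuate the 1D epigenomic signal inside anchor windows by
    `ratio`, mimicking down-sampling of peak intensity (a sketch)."""
    perturbed = track.astype(float).copy()
    for start, end in anchor_windows:
        perturbed[start:end] *= (1.0 - ratio)  # scale down peak signal
    return perturbed

track = np.array([1.0, 8.0, 9.0, 1.0, 1.0, 7.0, 6.0, 1.0])
anchors = [(1, 3), (5, 7)]            # two hypothetical anchor windows
mild = attenuate_anchors(track, anchors, ratio=0.5)
severe = attenuate_anchors(track, anchors, ratio=0.9)
# Background bins are unchanged; anchor peaks shrink with the ratio,
# which is what progressively erodes the model's loop recall.
```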

[1] Zhao et al. "Chromatin loops associated with active genes and heterochromatin shape rice genome architecture for transcriptional regulation." Nature Communications, 2019.

[2] Sahin et al. "HiC-DC+ enables systematic 3D interaction calls and differential analysis for Hi-C and HiChIP." Nature Communications, 2021.

[3] Grant et al. "FIMO: scanning for occurrences of a given motif." Bioinformatics, 2011.

Q2. Limited Ablation on Modalities.

We thank the reviewer for this important question. Evaluating the contribution of each modality is fundamental to demonstrating the value of our multimodal approach. We have performed direct single-modality ablations for every task where such a comparison is both methodologically sound and scientifically informative.

For the chromatin loop detection task, a direct comparison is highly informative. As shown in Table 6, we explicitly quantify the performance of epigenomics-only, Hi-C-only, and MIX-HIC-Bimodal. On the GM12878 dataset, for instance, MIX-HIC-Bimodal achieves the best performance with an AUROC of 0.9209, surpassing the epigenomics-only and Hi-C-only variants by approximately 9% and 2%, respectively, demonstrating a clear synergistic gain.

| Epi. | Hi-C | Pre-trained | GM12878 | K562 |
| --- | --- | --- | --- | --- |
| ✓ | – | – | 0.8236 | 0.8054 |
| – | ✓ | – | 0.9065 | 0.9072 |
| ✓ | ✓ | ✓ | 0.9209 | 0.9194 |

However, such a single-modality ablation is conceptually inappropriate for the other two tasks. For the Hi-C contact map prediction task, predicting the Hi-C contact map from itself would be a trivial identity mapping, not a meaningful ablation. For the CAGE-seq expression prediction task, a Hi-C-only baseline may not be scientifically meaningful. CAGE-seq expression is a direct functional readout driven by local epigenomic features (e.g., promoter accessibility), which is why state-of-the-art methods (e.g., EPCOT [1], GraphReg [2]) only focus on these 1D signals. In contrast, Hi-C provides the long-range genomic 'wiring diagram' but lacks the direct signal for transcriptional activity level [3]. Using the Hi-C contact map in isolation is therefore an indirect and weak predictor, making it an inappropriate baseline. Therefore, MIX-HIC respects this biological framing, using the Hi-C contact map to provide structural context to the primary epigenomic predictors, rather than as a standalone source [4].

In summary, we provide direct modality ablation only where it is informative and will add this rationale to the manuscript.

[1] Zhang et al. "A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome." Nucleic Acids Research, 2023.

[2] Karbalayghareh et al. "Chromatin interaction–aware gene regulatory modeling with graph attention networks." Genome Research, 2022.

[3] Nora et al. "Spatial partitioning of the regulatory landscape of the X-inactivation centre." Nature, 2012.

[4] Zhang et al. "Computational methods for analysing multiscale 3D genome organization." Nature Reviews Genetics, 2024.

Q3. Performance in rare or noisy datasets is not discussed.

We thank the reviewer for this perceptive question. The inherent noise and sparsity of Hi-C data [1, 2] present a critical challenge for developing robust and generalizable genomic models. To systematically assess the robustness of MIX-HIC, we conduct a controlled experiment simulating low-coverage and noisy scenarios. We corrupt Hi-C contact maps by perturbing different ratios of non-zero contacts with sparsity and Gaussian noise. Under these varying noise levels, MIX-HIC and the supervised baseline Peakachu are evaluated on the chromatin loop detection task using the K562 cell line dataset.

| Varying ratio | 0.0 | 0.5 | 0.7 | 0.9 |
| --- | --- | --- | --- | --- |
| Peakachu | 0.8833 | 0.7659 | 0.5091 | – |
| MIX-HIC | 0.9194 | 0.8899 | 0.8754 | 0.8486 |

As shown in the table, the performance of Peakachu degrades sharply as noise increases, falling to near-random performance (0.5091 AUROC) when 70% of the contacts are disturbed. In contrast, MIX-HIC exhibits remarkable robustness, with its performance declining by less than 7% across all noise ratios.

We attribute this resilience to the power of our pre-training paradigm, which learns the fundamental principles of 3D genome organization from over 1 million samples. This experiment demonstrates that MIX-HIC develops a robust biological representation and is thus particularly suitable for analyzing noisy or low-coverage datasets. We will add this analysis to our manuscript to further strengthen the paper.
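A minimal sketch of the corruption procedure described above (names are illustrative, not the authors' code): a fraction `ratio` of the non-zero contacts is selected, half of the selection is zeroed out (sparsity) and the other half receives additive Gaussian noise.

```python
import numpy as np

def corrupt_contacts(contact_map, ratio, noise_std=1.0, seed=0):
    """Perturb a fraction `ratio` of the non-zero Hi-C contacts with a
    mix of sparsity (drop to zero) and Gaussian noise (a sketch)."""
    rng = np.random.default_rng(seed)
    corrupted = contact_map.astype(float).copy()
    nonzero = np.argwhere(corrupted > 0)
    n_pick = int(len(nonzero) * ratio)
    picked = nonzero[rng.choice(len(nonzero), size=n_pick, replace=False)]
    for k, (i, j) in enumerate(picked):
        if k < n_pick // 2:
            corrupted[i, j] = 0.0                          # sparsity
        else:
            corrupted[i, j] += rng.normal(0.0, noise_std)  # noise
    return np.clip(corrupted, 0.0, None)  # contacts stay non-negative

hic = np.triu(np.full((6, 6), 5.0))  # toy upper-triangular contact map
clean = corrupt_contacts(hic, ratio=0.0)
noisy = corrupt_contacts(hic, ratio=0.7)
```

Evaluating loop detection on maps corrupted at increasing ratios, as in the table above, isolates how gracefully each method degrades under low coverage and noise.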

[1] Yardımcı et al. "Measuring the reproducibility and quality of Hi-C data." Genome Biology, 2019.

[2] Spill et al. "Binless normalization of Hi-C data provides significant interaction and difference detection independent of resolution." Nature Communications, 2019.

Q4. Quantifying the benefit of pre-training vs. supervised baseline (i.e., epigenomic tracks).

Thank you for the question. The reviewer asks for a direct comparison against a model with an identical architecture but trained from scratch on the downstream task, which is precisely the ablation we perform to isolate the benefit of our pre-training. Comparison results are shown in Table 6.

| Epi. | Hi-C | Pre-trained | GM12878 | K562 |
| --- | --- | --- | --- | --- |
| ✓ | – | – | 0.8481 | 0.7709 |
| ✓ | – | ✓ | 0.8724 | 0.8001 |

The first row is a model with the exact same architecture but with weights initialized randomly and trained directly (end-to-end) on the downstream prediction task. The second row is the MIX-HIC-InferMap model that benefits from our full pre-training pipeline. Experimental results highlight the efficacy of our pre-training, increasing the R² score by approximately 3% on GM12878 and 4% on K562. In the revised manuscript, we will add related descriptions to highlight the gain from pre-training against an end-to-end supervised baseline.

Q5. Provide a more biological background.

We sincerely thank the reviewer for this constructive suggestion. Following this advice, we have revised the manuscript to integrate more explicit biological context. Here, we offer some examples to demonstrate how we have integrated this valuable feedback throughout the manuscript:

  1. Clarifying the function of chromatin loops (line 17): We define loops by their function: "Key topological features of the 3D genome, such as chromatin loops that bring distant regulatory elements into close physical proximity with their target genes [1], are essential for cell-type-specific transcriptional regulation."

  2. Defining key biological terms and assays (line 116): We add concise definitions for key inputs and labels: "...while the epigenomic tracks (ATAC-seq and DNase-seq, which measure how 'open' or accessible DNA is for transcription), CAGE-seq expression data (which directly quantifies gene activity levels), and CTCF ChIA-PET chromatin loops (which identify high-confidence interactions mediated by the key architectural protein CTCF) are downloaded from the ENCODE Portal..."

  3. Clarifying the input/output setup for the Hi-C contact map prediction task (line 398): To avoid confusion about the task setup, we explicitly state: "For the Hi-C contact map prediction task, the model is trained to predict the contact map using only 1D epigenomic tracks as input. The Hi-C contact map itself is the prediction target (output), making its use as an input feature methodologically invalid."

We are confident these revisions make the biological motivation clear and allow the conference reader to better appreciate the context and significance of our technical contributions.

Comment

Thanks for the additional efforts. The responses address all my questions!

Comment

Thank you for your insightful review and constructive feedback! We truly appreciate the time and care you took in evaluating our work, as well as your valuable contributions to the field. Your thoughtful comments were instrumental in strengthening the paper, and we will carefully incorporate both your feedback and our responses into our revised manuscript.

Final Decision

This paper proposes MIX-HIC, a multimodal pretraining framework integrating Hi-C contact maps with epigenomic tracks to learn joint representations for 3D genomics. The model demonstrates strong performance on chromatin loop detection, contact map prediction, and gene expression tasks.

Strengths

  • First large-scale multimodal integration of Hi-C and epigenomics.
  • Solid pretraining design with contrastive objectives and extensive ablations.
  • Strong empirical results with robustness under noise and low coverage.
  • Rebuttal addressed concerns with new analyses (e.g., interpretability, modality ablations, comparison with HiCFoundation).
  • Open-sourcing data and code enhances community impact.

Weaknesses

  • Limited biological interpretability, though rebuttal added meaningful evidence.
  • Evaluation restricted to a small set of cell lines and assay types.
  • Comparisons to sequence-based models remain incomplete.

A technically sound and impactful paper that convincingly addresses reviewer concerns. The work is novel, well-executed, and of clear value to the AI4Science community. I recommend acceptance.