PaperHub
7.8/10 · Poster · 4 reviewers
Ratings: 4, 6, 5, 4 (mean 4.0; min 4, max 6, std 0.8)
Confidence
Novelty 3.3 · Quality 3.3 · Clarity 3.3 · Significance 2.8
NeurIPS 2025

ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
computational biology · scATAC-seq · cell representation

Reviews and Discussion

Review
Rating: 4

The authors introduce a foundation model specifically designed for single-cell chromatin accessibility (scATAC-seq) data. The authors propose a hybrid encoder architecture that integrates Mamba blocks with local windowed self-attention and introduces a genome-aware tokenization scheme for encoding open chromatin regions (OCRs). Trained on 1.97 million cells from over 30 tissues and 6 disease conditions, ChromFound is evaluated on a broad set of downstream tasks: zero-shot cell representation, cell type annotation, cross-omics prediction, and enhancer-gene link inference.

Strengths and Weaknesses

Strengths:

  • This is a pioneering foundation model explicitly for scATAC-seq.
  • To address the high dimensionality of scATAC-seq data, the authors propose a novel architectural design combining Mamba and windowed attention.
  • The model demonstrates strong performance across multiple tasks (representation, annotation, prediction).

Weaknesses:

  • The novelty of the model architecture is limited; the main difference between ChromFound and other single-cell foundation models is its adoption of Mamba to process long sequential data.
  • The training loss may not be well suited to scATAC-seq data, since with dropout events the contextual signals from other regions are not significant. scATAC-seq data is even sparser than scRNA-seq data, yet scRNA-seq foundation models do not achieve robust performance when compared to simple linear regression.
  • Unlike scRNA-seq data, where the gene set is predefined, scATAC-seq fragment sets differ across datasets and sequencing depths. The authors do not explain how they unify the formats of different datasets for ChromFound.
  • To my knowledge, another scATAC-seq foundation model, Epifoundation (https://www.biorxiv.org/content/10.1101/2025.02.05.636688v1), predates ChromFound.
  • Missing references, e.g., Epifoundation, ATAC-Diff (https://arxiv.org/abs/2408.14801), and GET (https://www.nature.com/articles/s41586-024-08391-z).
  • In the ablation study, ChromFound's performance remains close even with a much smaller training set. I wonder whether fewer OCRs would decrease performance.

Questions

  1. The authors should compare ChromFound with linear regression on cell type prediction and cross-omics prediction.
  2. How do the authors unify the peaks to adapt ChromFound to different datasets?
  3. Does ChromFound achieve different performance across tissues on all tasks?
  4. Do the authors compare the performance and computational complexity of WPSA and linear attention?
  5. Statistical significance: did you run multiple random seeds or significance tests for the modest performance gains?

Limitations

  1. The novelty of the model architecture is limited. There is no module specifically designed for scATAC-seq or tabular data beyond the position embeddings and the adoption of Mamba.
  2. The training loss may not be well suited to scATAC-seq data, since with dropout events the contextual signals from other regions are not significant.
  3. The downstream tasks are not comprehensive. The authors should conduct more research on perturbation prediction and temporal scATAC-seq prediction.
  4. More ablation studies should be conducted.

Final Justification

The authors have addressed all my concerns.

Formatting Issues

There are no formatting issues.

Author Response

We sincerely thank the reviewer for the constructive feedback and for recognizing our work’s novelty and robust performance across downstream tasks. We would like to clarify and elaborate on the following points of concern:

[W1,L1] Model Novelty

As our submission is under the Machine Learning for Sciences area, our goal is to tailor the architecture to scATAC-seq characteristics for biologically meaningful representation and practical applications. We clarify the design of each of ChromFound’s components as follows:

  • Unified OCR tokenization (Sec 3.1) encodes chromosomal identity, genomic coordinates, and continuous-valued accessibility, providing a generalizable architecture for large-scale pretraining of heterogeneous scATAC-seq datasets.
  • WPSA (Sec 3.2.1) captures local cis-regulatory dependencies within ±200 kb of transcriptional start sites, aligning with biological enhancer-promoter interactions.
  • Mamba (Sec 3.2.2) complements WPSA by modeling long-range interactions across hundreds of thousands of OCRs per cell, enabling ChromFound to learn both local and global chromatin accessibility patterns.
  • Pretraining objective (Sec 3.3) reconstructs both masked zero and non-zero OCRs using an MSE loss, which ensures robustness to the high sparsity of scATAC-seq data.
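
To make the tokenization idea concrete, here is a minimal sketch of how an OCR from any dataset could be mapped to a coordinate-based token; the function and field names are hypothetical, not ChromFound's actual API:

```python
# Hypothetical sketch of coordinate-based OCR tokenization (cf. Sec 3.1).
# Field names are illustrative only.

def tokenize_ocr(peak_id: str, accessibility: float) -> dict:
    """Parse a peak string like 'chr1:1000-1800' into a coordinate token."""
    chrom, coords = peak_id.split(":")
    start, end = (int(x) for x in coords.split("-"))
    return {
        "chrom": chrom,          # chromosomal identity
        "start": start,          # genomic start on GRCh38
        "end": end,              # genomic end on GRCh38
        "length": end - start,   # OCR span, which varies across datasets
        "value": accessibility,  # continuous accessibility signal
    }

token = tokenize_ocr("chr1:1000-1800", 2.5)
```

Because the token carries raw coordinates rather than an index into a fixed peak vocabulary, peaks from heterogeneous datasets need no pre-alignment.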

[Q1] Shallow Model Baseline Comparison

We thank the reviewer for the helpful suggestion of including shallow baselines for both cell type annotation and cross-omics prediction.

  1. Cell Type Annotation

    We implement two logistic regression models (sklearn.linear_model.LogisticRegression) using either full or HVG-filtered OCRs (5,000 peaks), trained with default hyperparameters. These baselines are evaluated on the same datasets as in Fig. 3 and compared against ChromFound. Results are reported as Accuracy/F1 score below:

| Tissue/train/test | Linear | Linear (HVG5k) | ChromFound |
| --- | --- | --- | --- |
| PBMC/EPF_hydrop_1/VIB_10xv1_1 | 0.8132/0.6741 | 0.4527/0.3189 | 0.8863/0.8135 |
| PBMC/EPF_hydrop_3/VIB_10xv1_2 | 0.8227/0.6709 | 0.4791/0.3355 | 0.9003/0.8134 |
| Bone/Batch27/Batch26 | 0.7198/0.6241 | 0.3884/0.3150 | 0.8230/0.5477 |
| Bone/Batch43/Batch44 | 0.7854/0.7873 | 0.4618/0.4414 | 0.8368/0.8335 |
| Cortex/Batch2/Batch1 | 0.8367/0.6031 | 0.4880/0.2403 | 0.9366/0.7715 |
| Cortex/Batch3/Batch2 | 0.8455/0.6598 | 0.4295/0.2804 | 0.9188/0.8063 |
| Retina/D19D003/D018_13 | 0.9217/0.8781 | 0.7737/0.5235 | 0.9792/0.9780 |
| Retina/D021_13/D19D003 | 0.9278/0.8862 | 0.8542/0.7473 | 0.9786/0.9762 |
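
For readers who wish to reproduce this shallow baseline, the protocol can be sketched as follows, with synthetic data standing in for the real OCR matrices (the actual evaluation uses the datasets above):

```python
# Sketch of the logistic-regression baseline (synthetic data; the real
# inputs are full or HVG-filtered OCR matrices with true cell-type labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n_cells, n_peaks = 200, 50
X = rng.poisson(1.0, size=(n_cells, n_peaks)).astype(float)
y = (X[:, 0] + X[:, 1] > 2).astype(int)        # toy "cell type" labels

clf = LogisticRegression(max_iter=1000).fit(X[:100], y[:100])
pred = clf.predict(X[100:])
acc = accuracy_score(y[100:], pred)
f1 = f1_score(y[100:], pred, average="macro")  # Accuracy/F1, as reported
```
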
  2. Cross-Omics Prediction

    For cross-omics prediction, we implement a shallow two-layer MLP that maps OCR inputs to genome-wide gene expression profiles. Given the complexity of predicting over 30,000 gene targets, we used a hidden layer size of 4096, ReLU activation, learning rate of 1e-4, and 50 training epochs. Results are reported in the format of PCC/CCC as below:

| Model | Cortex_Zhu45K | Bone_To326K | BMMC_multiome_2021 | BMMC_atac2gex_2022 | Cell_lines_Zhu11K |
| --- | --- | --- | --- | --- | --- |
| MLP | 0.1022/0.0721 | 0.0946/0.0709 | 0.1009/0.0627 | 0.1023/0.0558 | 0.2333/0.1729 |
| ChromFound | 0.8316/0.8064 | 0.8304/0.7472 | 0.4249/0.4032 | 0.9293/0.8818 | 0.9449/0.9071 |
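
A minimal numpy sketch of this shallow baseline's forward pass, with dimensions shrunk for illustration (the real model uses a 4096-unit hidden layer and roughly 30,000 gene targets):

```python
# Illustrative forward pass of the two-layer MLP baseline; sizes are
# shrunk and weights are random, so this only shows the architecture.
import numpy as np

rng = np.random.default_rng(0)
n_peaks, n_genes, hidden = 100, 30, 16   # real task: ~30,000 gene targets

W1 = rng.normal(0, 0.1, (n_peaks, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, n_genes))
b2 = np.zeros(n_genes)

def mlp_forward(x):
    """OCR accessibility vector -> predicted gene expression profile."""
    h = np.maximum(x @ W1 + b1, 0.0)     # ReLU hidden layer
    return h @ W2 + b2

expr = mlp_forward(rng.poisson(1.0, n_peaks).astype(float))
```
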

Key Conclusions:

  1. ChromFound consistently outperforms logistic regression, a strong baseline for annotation.
  2. Full OCR input significantly outperforms HVG-filtered input, reinforcing the necessity of ChromFound's genome-wide modeling of regulatory elements.
  3. The MLP baseline performs poorly on cross-omics prediction, revealing the limitations of simple linear or shallow models in capturing the complex cis-regulatory logic underlying chromatin-to-expression mapping. ChromFound is specifically designed to model this complexity through deep representation learning and large-scale self-supervised pretraining.

[W2,L2] Training Loss Suitability

We appreciate the reviewer’s concern regarding sparsity in scATAC-seq data. As noted in Sec. 3.3, ChromFound selectively reconstructs both non-zero and zero OCRs, and the masking strategy ensures that the training procedure is not dominated by dropout events. Following xTrimoGene, we use an MSE loss for reconstructing the masked peaks, which is suitable for continuous-valued accessibility data. Fig. 2 further demonstrates that ChromFound maintains stable performance under severe downsampling, showcasing its robustness to sparsity.
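
The masked-reconstruction objective described above can be sketched as follows; this is an illustrative simplification, not the paper's exact implementation:

```python
# Hedged sketch of a masked MSE objective: the loss is computed only on
# masked positions, which can include both zero and non-zero OCRs so that
# dropout-driven zeros do not dominate training.
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error restricted to masked OCR positions."""
    return float(((pred - target)[mask] ** 2).mean())

rng = np.random.default_rng(0)
target = rng.poisson(0.5, 1000).astype(float)   # sparse accessibility
mask = np.zeros(1000, dtype=bool)
mask[rng.choice(1000, 150, replace=False)] = True
loss = masked_mse(np.zeros(1000), target, mask)
```
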

[W3,Q2] OCR Format Unification Across Datasets

Thank you for the question. ChromFound adopts a coordinate-based tokenization strategy (Sec. 3.1), where each OCR is represented by its chromosome ID, start/end positions on GRCh38, and continuous accessibility. This enables ChromFound to unify peaks across datasets without requiring pre-aligned peak sets. Notably, the cross-tissue OCR overlap is quite low, highlighting the necessity of position-aware encoding for robust generalization.

[W4,W5] References to Related Works

We sincerely thank the reviewer for pointing out these three important related works: Epifoundation, ATAC-Diff, and GET.

  • Epifoundation is an impressive work that builds a foundation model by leveraging paired scATAC-seq and scRNA-seq data to learn peak-gene alignments. However, its reliance on paired multi-omics data limits scalability, as such paired data remains scarce. In contrast, ChromFound adopts a self-supervised approach that does not require paired modalities, enabling pretraining on over 2.6 million cells from diverse tissues and disease conditions. Furthermore, ChromFound extends beyond Epifoundation’s scope by supporting biological applications such as enhancer-gene link inference and perturbation response prediction.
  • ATAC-Diff is designed for identifying differential chromatin accessibility across conditions instead of supporting zero-shot transfer learning.
  • GET, as we have already acknowledged in Section 2, Line 83 of the manuscript, is a transcriptomic foundation model using pseudo-bulk pairs of chromatin accessibility and sequence data, differing in resolution and modality from ChromFound.

In response to your professional suggestion, we will add citations to Epifoundation and ATAC-Diff, and update Section 2 to include a more thorough comparison. In addition, we clarify the positioning of ChromFound as the first genome-wide, self-supervised foundation model for scATAC-seq with zero-shot generalization across tissues.

[Q3] Performance Variation Across Tissues

Thank you for the insightful question. We have thoroughly evaluated ChromFound across various tissue types on all downstream tasks. While minor variations in performance exist due to inherent differences in tissue complexity and data sparsity, ChromFound consistently achieves strong results across all tasks and tissue types. If any specific aspect of the results remains unclear, please do not hesitate to let us know. We would be happy to provide further clarification.

[Q4] Comparison Between WPSA and Linear Attention

Thank you for the question. ChromFound’s WPSA is implemented using FlashAttention, significantly improving efficiency compared to vanilla self-attention. To further compare with linear attention, we implement Performer and Linformer under the same conditions, using identical hidden dimensions, batch size of 4, and FP16 mixed-precision training. The following results are obtained on the cell clustering task using the PBMC VIB_10xv1_1 dataset. As shown in the table below, WPSA outperforms both alternatives in terms of model performance and computational efficiency.

| Model | Parameters | FLOPs | Inference speed (s/batch) | ARI (trained on 0.02 million cells) |
| --- | --- | --- | --- | --- |
| Ours | 450,305 | 7.88E+11 | 3.46 | 0.6142 |
| Performer | 1,838,337 | 6.01E+12 | 3.89 | 0.5978 |
| Linformer | 920,833 | 1.62E+12 | 3.74 | 0.5751 |
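
For intuition on why WPSA scales well, here is an illustrative numpy sketch of window-partitioned self-attention (a simplification of the FlashAttention-based implementation): attention is restricted to fixed-size windows, so cost grows linearly with the number of OCRs rather than quadratically.

```python
# Illustrative window-partitioned self-attention: tokens attend only
# within non-overlapping windows along the genomic sequence.
import numpy as np

def wpsa(x, window):
    """x: (seq_len, dim); attention restricted to fixed-size windows."""
    seq_len, dim = x.shape
    out = np.empty_like(x)
    for s in range(0, seq_len, window):
        w = x[s:s + window]
        scores = w @ w.T / np.sqrt(dim)              # scaled dot-product
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
        out[s:s + window] = attn @ w
    return out

rng = np.random.default_rng(0)
y = wpsa(rng.normal(size=(64, 8)), window=16)
```
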

[Q5] Statistical Significance and Random Seed Control

We thank the reviewer for this professional and important question.

For cell clustering (Table 1) and low-count denoising (Figure 2), we perform each evaluation 20 times with different random seeds and report the mean and standard deviation to reflect performance stability. In the perturbation response prediction task (Figure 4), we report p-values for the PCC metric to establish the statistical significance of the predictions. For other tasks, we ensure that all benchmark methods are implemented and evaluated under the same random seed settings for fair comparison.

Together, these strategies ensure that all reported performance gains are both robust and statistically meaningful. We hope these clarifications have addressed the reviewer's questions.

[W6,L4] Ablation Study with Fewer OCRs

We thank the reviewer for raising this important point. We interpret the comment on "less OCR" in two possible ways, and we have conducted targeted ablation experiments to address both:

  1. Reduce OCR input per cell: In Section 4.4 (Question 3), we reduce the maximum number of OCRs per cell to one-half and one-quarter of the full set by highly variable peaks filtering. We observe a clear performance drop in both clustering and annotation tasks, demonstrating that genome-wide OCR input is critical for capturing rich regulatory information and ensuring robust generalization.
  2. Reduce pretraining data size: In Section 4.4 (Question 2), we also perform controlled experiments by reducing the pretraining corpus to one-tenth and one-hundredth of the original size. We observe a progressive decline in downstream performance, confirming the benefit of scaling up training data size.

[L3] More Downstream Tasks

We sincerely thank the reviewer for this suggestion. In fact, we have already performed perturbation response prediction and presented the results in Figure 4. As the reviewer mentioned, temporal scATAC-seq prediction is quite challenging, and we are currently working on extending ChromFound to support this task.

Conclusion

We sincerely thank the reviewer for your valuable time and constructive feedback. We hope our responses have addressed your concerns. If there are any further questions, we would be happy to provide further clarification.

Comment

Dear reviewer:

We sincerely thank the reviewer for your detailed and thoughtful feedback. Your comments have helped us better articulate the design motivations behind ChromFound, clarify technical choices, and position our work within the broader context of scATAC-seq modeling. We appreciate your recognition of the strengths of our approach, as well as your critical perspective on its limitations.

In this rebuttal, we have carefully addressed each of your concerns through additional experiments, expanded comparisons, and clearer explanations. We have also updated our discussion of related work to reflect your suggestions, and we will revise the manuscript accordingly. We hope these responses provide a clearer understanding of our contributions and the rationale behind our methodological decisions.

If there are any remaining questions or further points for clarification, please do not hesitate to reach out. Thank you again for your time, expertise, and engagement.

Comment

Thank you for the authors’ response, which addresses several of my earlier concerns. However, I still have reservations regarding the unified token representation across different datasets. While it is possible to encode chromatin accessibility using chromosome IDs, start/end positions (on GRCh38), and continuous accessibility values, the varying fragment lengths across datasets may inherently reflect differences in signal strength or resolution. Additionally, as you mentioned, the overlap of OCRs across tissues is relatively low. This raises concerns that the model may struggle to capture or infer the appropriate granularity of regulatory elements, especially when token representations are not fully harmonized.

Moreover, although the model is capable of predicting both non-zero and zero OCRs, the class imbalance remains a significant issue—most regions are labeled as zero. This could bias the model toward predicting zeros to minimize loss, potentially resulting in a trivial solution. Compounding this, many zero-labeled regions may not correspond to truly closed chromatin, but rather to low signal or dropout events (e.g., promoters active in rare cell types). Treating these ambiguous regions as negatives could penalize the model unfairly and degrade generalization performance.

I am also confused about the reported trend in Figure 2, where performance appears to improve as more aggressive downsampling is applied. This seems counterintuitive and warrants further clarification—does this suggest that the model benefits from reduced data complexity, or are there other confounding factors at play?

Comment

Concern 2: Dropout Effect

Concern 2.1: Zero and Non-zero OCRs Imbalance

We appreciate the reviewer’s insightful concern. To mitigate the imbalance between zero and non-zero OCRs, we design our pretraining strategy with the following considerations:

  1. Balanced masking: We apply a symmetric masking strategy that samples equal numbers of zero and non-zero OCRs during pretraining. This ensures that the model learns from both accessible and inaccessible regions, thereby avoiding class imbalance in the reconstruction objective.
  2. Normalized regression loss: Instead of predicting raw count data, we apply a log-transform followed by dataset-specific normalization to the accessibility values. We then use MSE as the reconstruction loss. This loss function improves stability and avoids the sensitivity of Poisson or ZINB losses to extreme count variability across datasets.
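
The two points above can be sketched as follows; the function names are ours for illustration, not the codebase's:

```python
# Hedged sketch of symmetric masking plus log-normalization: sample equal
# numbers of zero and non-zero OCRs for the reconstruction objective.
import numpy as np

def balanced_mask(values, n_per_class, rng):
    """Return indices of n_per_class zero and n_per_class non-zero OCRs."""
    zeros = np.flatnonzero(values == 0)
    nonzeros = np.flatnonzero(values > 0)
    return np.concatenate([
        rng.choice(zeros, n_per_class, replace=False),
        rng.choice(nonzeros, n_per_class, replace=False),
    ])

rng = np.random.default_rng(0)
vals = rng.poisson(0.3, 5000).astype(float)   # mostly zeros, as in scATAC
idx = balanced_mask(vals, 100, rng)
normed = np.log1p(vals)                        # log-transform before MSE
```
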

We hope this clarifies how our design addresses the potential bias introduced by the imbalance of zero and non-zero OCRs. Please let us know if any concerns remain.

Concern 2.2: Low Signal or Dropout Events instead of Truly Closed Chromatin

We fully agree that many zero-labeled OCRs may result from dropout or insufficient coverage, especially for active regions in rare cell types. This ambiguity is a known limitation of scATAC-seq and is inherently difficult to resolve within any single dataset.

ChromFound does not rely on individual sample-level ground truth to resolve ambiguous zeros. Instead, it captures population-level statistical patterns from diverse datasets, enabling it to implicitly distinguish consistent regulatory signals from dataset-specific sparsity or noise. Our comprehensive experiments (Table 1, Fig.2, Table 2, Fig.3, Table 3) demonstrate that ChromFound achieves strong performance across diverse tissues and tasks. For example, a promoter that is dropped out in one dataset may still be reliably learned from others where it is consistently detected. This cross-context learning is a key strength of foundation models, capturing robust biological signals beyond dataset-specific artifacts.

We hope this explanation addresses your concern regarding dropout-induced zeros and demonstrates how ChromFound mitigates such ambiguity through large-scale cross-context learning. We welcome any further questions or suggestions.

Concern 3: Improvements as more aggressive downsampling

We thank the reviewer for raising this point. To clarify, this trend is only observed in the retina datasets in Figure 2, not across all tissues. Our analysis, shown in the following table, suggests that this is due to differences in sequencing protocols. Retina datasets contain a higher proportion of OCRs that are frequently accessible across cells. These high-frequency OCRs likely reflect technical redundancy or sequencing noise, which reduces the signal-to-noise ratio.

| Dataset | #Cells | #Peaks | % OCRs open in ≥20% of cells | % OCRs open in ≥10% of cells | % OCRs open in ≥5% of cells |
| --- | --- | --- | --- | --- | --- |
| Retina(D19D008) | 4575 | 232354 | 0.36 | 4.13 | 10.76 |
| Cortex(batch 2) | 1416 | 212793 | 0.02 | 0.38 | 2.61 |
| Heart(av 3) | 2785 | 429828 | 0.15 | 1.60 | 4.42 |
| PBMC(BIO_ddseq_1) | 5708 | 412490 | 0.21 | 2.27 | 5.17 |

In this case, random downsampling (via Bernoulli sampling) suppresses these abundant OCRs, effectively reducing noise and enhancing signal clarity for downstream representation learning. This explains the observed performance improvement in the retina dataset in Figure 2.
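
The Bernoulli downsampling mentioned above amounts to binomial thinning of the count matrix; a minimal sketch, for illustration only:

```python
# Each read is kept independently with probability p, which thins
# high-frequency OCRs proportionally more in absolute terms.
import numpy as np

def bernoulli_downsample(counts, p, rng):
    """Binomial thinning of a count matrix at keep-probability p."""
    return rng.binomial(counts.astype(int), p)

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 500))    # toy cells x peaks matrix
thinned = bernoulli_downsample(counts, 0.2, rng)
```
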

Thank you for the professional suggestion. We will clarify this point in the revised manuscript. If you have any further questions, please do not hesitate to let us know.

Comment

Hi reviewer VAFg,

Thanks for your engagement in the lively discussion. If you would like to respond to the authors' most recent messages, please do so by end of the day today AOE.

Comment

Concern 1: Unified Token Representation

We thank the reviewer for highlighting this important concern. Traditional scATAC-seq studies often rely on a fixed vocabulary of OCRs (e.g., from a reference atlas) to harmonize scATAC-seq data across datasets. While this offers alignment consistency, it introduces two major limitations:

  1. a loss of OCR length information, since OCR length inherently reflects differences in signal strength or resolution,
  2. limited generalizability, as novel or tissue-specific OCRs may be excluded.

These limitations restrict downstream applications, particularly in tissues with low overlap or previously unseen regulatory regions. To address this, ChromFound introduces a genomic-coordinate-based tokenization strategy that dynamically represents OCRs using their actual start and end positions. This approach explicitly encodes both the genomic span and location of each OCR, enabling the model to adapt to variable-length OCRs and capture regulatory heterogeneity across tissues.

Theoretically, our unified token representation is designed to resolve two core challenges:

  1. Varying fragment lengths: By incorporating start and end coordinates via sinusoidal positional embeddings (Section 3.1), the model learns the full genomic extent of each OCR.
  2. Low OCR overlap across tissues: Without relying on a fixed OCR vocabulary, ChromFound can flexibly accommodate tissue-specific or novel OCRs, mitigating misalignment and dropout issues.
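
As an illustration of point 1, a Transformer-style sinusoidal embedding of raw genomic coordinates might look like the following; the `temp` base and dimension here are hypothetical placeholders for the paper's actual hyperparameters:

```python
# Illustrative sinusoidal embedding of a genomic coordinate; embedding
# both start and end implicitly encodes the OCR's length.
import numpy as np

def coord_embedding(pos, dim, temp=10000.0):
    """Transformer-style sinusoidal embedding of one coordinate."""
    i = np.arange(dim // 2)
    angles = pos / temp ** (2 * i / dim)
    return np.concatenate([np.sin(angles), np.cos(angles)])

start_emb = coord_embedding(1_000_000, dim=64)   # OCR start on GRCh38
end_emb = coord_embedding(1_000_800, dim=64)     # OCR end
```
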

ChromFound further harmonizes scATAC-seq data by:

  • Applying log-transformation and dataset-specific normalization to accessibility values, which reduces inter-dataset signal discrepancies.
  • Retaining genomic coordinates during masked OCR reconstruction, so that ChromFound learns the relative spatial relationships between OCRs and adapts to varying fragment lengths.
  • Training on a large and heterogeneous corpus (>1.9M cells across 30+ tissues), promoting generalization across a wide spectrum of fragment lengths and OCR compositions.

Empirically, this design yields superior performance across various tissue types. We evaluate ChromFound on cPeaks, a standardized reference from "A generic reference defined by consensus OCRs for scATAC-seq data analysis", which defines 1,657,194 peaks across the human genome. We use a 20,000-cell PBMC subset for training and 8 evaluation datasets from various tissues in Table 1. As shown in the table below, ChromFound (Ours) consistently outperforms the version trained on the aligned datasets with cPeaks (Ours with cPeaks). These results demonstrate ChromFound's strong generalization ability across tissue types (e.g., 30% of peaks in retina, 35% in cortex, and 70% in heart overlap with the cPeaks-aligned PBMC training peaks) and its capability to handle varying OCR lengths (mean OCR lengths range from 500 to 1321 across datasets).

| Dataset | Model (trained on 0.02 million cells) | ARI | FMI | NMI | AMI |
| --- | --- | --- | --- | --- | --- |
| Cortex(batch 1) | Ours with cPeaks | 0.5858 | 0.7244 | 0.7519 | 0.7499 |
| Cortex(batch 1) | Ours | 0.6475 | 0.7503 | 0.7701 | 0.7689 |
| Cortex(batch 2) | Ours with cPeaks | 0.5751 | 0.6720 | 0.7145 | 0.7123 |
| Cortex(batch 2) | Ours | 0.6092 | 0.7015 | 0.7276 | 0.7259 |
| Heart(av 3) | Ours with cPeaks | 0.5400 | 0.6347 | 0.6588 | 0.6567 |
| Heart(av 3) | Ours | 0.5642 | 0.6601 | 0.6981 | 0.6953 |
| Heart(av 10) | Ours with cPeaks | 0.6150 | 0.7695 | 0.6968 | 0.6963 |
| Heart(av 10) | Ours | 0.6432 | 0.7869 | 0.7220 | 0.7223 |
| Retina(D026_13) | Ours with cPeaks | 0.4888 | 0.7081 | 0.7241 | 0.7238 |
| Retina(D026_13) | Ours | 0.5802 | 0.7549 | 0.7410 | 0.7403 |
| Retina(D19D008) | Ours with cPeaks | 0.5368 | 0.6839 | 0.7664 | 0.7660 |
| Retina(D19D008) | Ours | 0.5956 | 0.7202 | 0.7832 | 0.7868 |
| PBMC(VIB_10xv1_1) | Ours with cPeaks | 0.5999 | 0.6827 | 0.7434 | 0.7424 |
| PBMC(VIB_10xv1_1) | Ours | 0.6142 | 0.6995 | 0.7354 | 0.7345 |
| PBMC(BIO_ddseq_1) | Ours with cPeaks | 0.4347 | 0.6237 | 0.5867 | 0.5861 |
| PBMC(BIO_ddseq_1) | Ours | 0.4579 | 0.6421 | 0.5902 | 0.5910 |

In summary, building a foundation model with position-aware, unified OCR representation enables robust modeling across tissues and unlocks broader applicability compared to approaches based on pre-defined OCR references. We hope our detailed explanation and empirical evidence help address your concerns.

Review
Rating: 6

The authors proposed ChromFound, an encoder-decoder framework for building a foundation model for scATAC-seq data. Genomic coordinate information and chromatin accessibility profiles are used to represent open chromatin regions (OCRs). The authors propose a window partition self-attention (WPSA) module to capture local dependencies among OCRs and a Mamba layer to handle ultra-long OCR sequences. The decoder is an MLP layer.
ChromFound shows improved cell representation performance under low counts and batch effects, both common in scATAC-seq data. ChromFound also improves cell type annotation and cross-omics data prediction. Applying ChromFound to enhancer-gene link prediction and perturbation response prediction yields results that make biological sense.

Strengths and Weaknesses

The submission is technically sound. The experimental results support the authors' claims well. The methods used are appropriate. This is an almost complete piece of work. The authors describe the strengths of their work carefully and honestly.

The submission is clearly written and well organized, and it adequately informs the reader. No source code was provided.

The authors propose, for the first time, a foundation model for scATAC-seq and achieve promising results compared with existing approaches. It is very likely that other researchers will utilize or extend the approach to address more challenges in scATAC-seq data analysis.

The authors offer a novel combination of existing techniques, and the reasoning behind this combination is well-articulated. The work addresses challenges in generating foundation model in scATAC-seq with suitable techniques and adaption of existing technique when needed. The authors clearly distinguish their contributions with previous contributions.

Questions

  1. It would be interesting to see how the model performs on gene regulatory network prediction.
  2. A natural extension is to incorporate DNA sequence and/or scRNA-seq information into the model, since the DNA sequence is already available and scRNA-seq data is more widely available.
  3. Minor comments: the reference to Table 4.4 should be Table 4. In Figure 6, the order of the panels for Harmony is reversed compared with the other methods.

Limitations

  1. The authors forgot to include limitations, e.g., that the model is trained only on human data; it could be extended to include mouse and other species.

Final Justification

The authors have provided insights on future work on incorporating DNA sequence, scRNA-seq, and gene regulatory network to address my comments.

Formatting Issues

None.

Author Response

We sincerely thank you for your thoughtful and constructive feedback. We are encouraged by your recognition of the technical soundness, comprehensive evaluation, and potential impact of ChromFound. Below we address your comments in detail.

[W1] Source Code

We appreciate your comment regarding code availability. We are fully committed to transparency and reproducibility. Upon acceptance, we will release the complete source code, pretrained weights, and tutorials for all experiments.

[Q2] DNA sequence modeling

Thank you for this insightful suggestion. We fully agree that DNA sequence information can complement chromatin accessibility data by providing base-resolution regulatory context. We note that scBasset also incorporates local DNA sequence to predict binary chromatin accessibility and has demonstrated strong performance in cell representation and batch correction.

While ChromFound currently does not explicitly encode DNA sequence, our tokenization schema can be extended to include local sequence features around each OCR. This would allow the model to jointly learn from static genomic sequences and dynamic chromatin accessibility profiles. One possible way of integrating DNA sequence is to fuse the DNA sequence embeddings from DNA foundation models (e.g., Evo2, Nucleotide Transformer, Enformer, etc.) into the ChromFound architecture. This approach would allow ChromFound to benefit from both dynamic chromatin activity and static regulatory code, potentially improving its ability to resolve fine-grained enhancer-gene interactions. We will explore this direction in future work.

[Q2] scRNA-seq modeling

Thank you for your professional suggestion. We fully agree that incorporating scRNA-seq is both natural and valuable. Many current methods jointly model scATAC-seq and scRNA-seq by using scATAC-seq as input to predict scRNA-seq expression (e.g., GET) or by performing multi-omics integration through shared embeddings (e.g., GLUE, scMODAL).

However, these approaches typically treat peaks and genes as separate modalities linked only in downstream alignment. A current limitation in existing approaches is the lack of modeling under a unified genomic scale. We hope that the unified genomic scale will place peaks and genes on the same coordinate framework to enable end‑to‑end training from the genome sequence to transcriptomic output.

In our future work, we aim to bridge this gap by pretraining ChromFound in a truly genomic-centered manner: jointly embedding OCRs and genes in the same latent space from the start, enabling the model to internalize the central dogma (i.e., the information flow from DNA to RNA to protein) via self‑supervised objectives. While paired single-cell scATAC‑seq and scRNA‑seq data at sufficient scale remain limited for pretraining, we are actively pursuing large‑scale collection and simulation strategies to realize this vision.

[Q1] GRN prediction

We appreciate the reviewer’s suggestion on GRN prediction. ChromFound is currently designed to infer cis-regulatory interactions, such as enhancer-gene links and perturbation responses. Looking forward, we aim to expand ChromFound to support broader gene regulatory network (GRN) inference.

As discussed in the response regarding scRNA-seq modeling, we envision extending ChromFound to jointly model scATAC-seq and scRNA-seq data in a truly genomic-centered manner. This unified framework would allow the model to capture not only local cis-regulatory dependencies, but also trans-regulatory interactions. While data limitations currently constrain pretraining across paired multi-omics at single-cell resolution, we are actively exploring strategies to realize this integrative modeling paradigm in future work.

[Q3] Minor comments

Thank you for pointing this out. We will correct the table reference to Table 4 and revise Figure 6 in the revised manuscript to maintain consistent panel ordering.

[L1] Limitations on Human Data

We appreciate your suggestion to more explicitly discuss limitations. While we acknowledge that mouse datasets are often more abundant and widely available, our motivation is to prioritize human data due to its direct relevance for studying disease mechanisms and interpreting non-coding variants associated with human traits. We believe that this choice has greater significance for advancing biomedical research and understanding human-specific regulatory programs.

Nevertheless, extending ChromFound to other species is an important future direction. Our genome-aware tokenization is inherently adaptable, but cross-species modeling will require additional algorithmic revisions, such as species-aware OCR tokenization and the alignment of genomic coordinates across organisms. We plan to explore multi-species pretraining and cross-species transfer learning in future work.

Conclusion

Once again, we truly thank the reviewer for the valuable comments and strong endorsement of our work. We hope our responses have addressed your concerns, and we are happy to provide further clarifications if needed.

Comment

Thanks authors for your response and addressing the comments.

Review
Rating: 5

This paper presents ChromFound, the first foundation model developed specifically for single-cell chromatin accessibility (scATAC-seq) data. The authors propose a novel hybrid architecture that leverages a Mamba block for long-range genomic context and a windowed self-attention (WPSA) module for local regulatory interactions. A core contribution is the genome-aware tokenization scheme that encodes genomic coordinates and continuous accessibility values, addressing the challenge of heterogeneous data inputs. After pre-training on a large-scale corpus of 1.97 million human cells, the model is comprehensively evaluated across six downstream tasks, where it demonstrates state-of-the-art performance and notable zero-shot capabilities.

Strengths and Weaknesses

Strengths

  • The work successfully establishes the first foundation model for a challenging and important biological data modality. This is a novel and impactful contribution that charts a new direction for the field.
  • The scale of the pre-training is impressive and a key factor in the model's strong performance. The subsequent evaluation is both rigorous and comprehensive, providing convincing evidence for the model's capabilities across a wide array of relevant tasks.
  • The hybrid Mamba-WPSA architecture is not merely a concatenation of popular methods; it's a well-justified design choice that reflects the multi-scale nature of genomic regulation, which is a commendable feature.

Weaknesses

While the work is strong overall, there are several areas where a deeper analysis would strengthen the paper's conclusions and provide a more complete picture for the research community. I've ordered these from minor points to more substantial considerations.

  • On a minor note, while key hyperparameters are provided, the paper would benefit from a brief discussion on the model's sensitivity to these choices (e.g., the temp value in positional embeddings or the Mamba projection dimension). This would provide readers with a better sense of the model's robustness and tuning requirements.
  • A more significant point is that the ablation study, while useful, could be more comprehensive. The current study primarily demonstrates that removing major architectural blocks is detrimental. However, it leaves open important questions about the sources of performance gain. For instance, a comparison to simpler, strong baselines (like a pure Mamba or a pure Transformer architecture) trained on the same massive dataset would be invaluable. Such an analysis would help disentangle the benefits of the proposed hybrid design from the undeniable power of the large-scale pre-training data.
  • Finally, the most substantial area for improvement relates to the model's generalizability and its dependency on upstream analytical choices. The model's performance is intrinsically linked to the input set of Open Chromatin Regions (OCRs), which can vary significantly based on the peak-calling algorithm used. A sensitivity analysis or at least a discussion on this crucial variable would be necessary to fully substantiate the model's claim of universality. Similarly, the model's impressive performance may not fully generalize to disease contexts (e.g., autoimmune disorders) that are not well-represented in the pre-training corpus, a key consideration for its application as a foundational tool for discovery.

Questions

  1. Could you elaborate on the design choice to project to a lower-dimensional space (D_low = 32) for the Mamba block? What were the primary trade-offs you considered between performance and computational efficiency (e.g., parameter count, inference speed)?
  2. The zero-shot performance for cell representation is quite strong, yet fine-tuning yields considerable gains for tasks like cell type annotation. What is your interpretation of this? Does it suggest that the pre-trained model captures general cell state well, but that cell-type-specific regulatory grammars are only fully resolved during task-specific fine-tuning?
  3. Regarding the positional embeddings, could you comment on the model's sensitivity to the temp hyperparameter? Is performance stable across a range of values, or was this a carefully tuned parameter?
  4. The fixed window for WPSA is well-justified for typical enhancer interactions. Have you explored its potential limitations for capturing regulatory phenomena over larger genomic distances, and is the Mamba block intended as the primary mechanism for these cases?
  5. Given the pre-training corpus, what are your expectations for ChromFound's zero-shot or few-shot performance on cell types from disease contexts not well-represented in the training data, such as autoimmune disorders?

Limitations

yes

The authors appropriately acknowledge the model's human-centric focus as a primary limitation. For a more comprehensive discussion, I would suggest also including a brief commentary on two other practical points: (1) the model's implicit reliance on the quality of user-provided peak calls, which is an important contextual factor for anyone applying the model, and (2) the practical compute resources required for fine-tuning, which is relevant for assessing the model's accessibility to the broader research community.

Justification for Final Rating

Authors have effectively addressed the key issues I raised:

Hyperparameter Analysis: The systematic evaluation of both the positional embedding temperature and Mamba projection dimension provides exactly the type of sensitivity analysis I was looking for. The clear trade-off between performance and computational efficiency (particularly the memory constraints at d_proj=128) gives readers valuable guidance for practical implementation.

Enhanced Ablation Study: Your comparison with pure Mamba and the creative approach to Transformer baselines (using Geneformer and scGPT as proxies) convincingly demonstrates that the performance gains stem from your hybrid architectural design, not merely from large-scale pretraining. The ~15% ARI drop when removing WPSA and the fundamental limitations of existing Transformer-based models on this data modality strengthen your architectural claims significantly.

Robustness and Generalization: Your clarification on coordinate-based tokenization and the existing disease representation in your evaluation datasets addresses my concerns about generalizability. The downsampling experiments provide good evidence for robustness to preprocessing variations.

The addition of computational profiling results also addresses practical deployment considerations, which will be valuable for the research community.

Your rebuttal demonstrates not only technical rigor but also a deep understanding of the biological context and practical implications of your work. The responses show that ChromFound represents a well-engineered solution to genuine challenges in single-cell chromatin accessibility analysis.

Updated Rating: 5 (Accept) - This is technically sound work with clear contributions that will be valuable to the research community. The comprehensive evaluation and your responsive improvements to the analysis strengthen confidence in the work's impact and reliability.

Formatting Concerns

No concerns

Author Response

We sincerely thank the reviewer for the thoughtful and detailed feedback. We appreciate your recognition of ChromFound’s novelty, the biologically informed architectural design, and the breadth of downstream evaluations. Below, we address each of your suggestions and concerns.

[W3,L1] Discussion on Peak-calling Methods

Thank you for the important point. We agree that different upstream data pipelines can lead to variability in the cell-by-peak matrix and peak sets, particularly in sparsity and peak boundary definitions. ChromFound is designed to be robust to these variations.

  • To assess robustness to such sparsity differences, we perform downsampling experiments (Figure 2) simulating varying levels of accessible peaks per cell. ChromFound maintains stable performance across all conditions, indicating strong resilience to sparsity variation introduced by upstream preprocessing.

  • To address potential peak shift caused by different peak calling methods, ChromFound encodes OCRs using sinusoidal positional embeddings of their genomic coordinates. This design enables the model to capture relative spatial relationships between peaks, enhancing robustness to peak boundary shifts. Unlike dictionary-based approaches, our coordinate-based tokenization reduces reliance on exact peak definitions. We indeed conduct experiments aligning peaks across datasets to validate ChromFound’s robustness to peak shift. Due to space limitations, we kindly refer the reviewer to our response to Reviewer dMbP’s comment 2 for detailed results. We appreciate your understanding.

In summary, while the pretraining and evaluation datasets are from different upstream analytical tools, ChromFound still achieves strong performance and generalization. We will include a discussion of these points in the revised manuscript. Please feel free to discuss further if you have additional concerns.
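For intuition, the coordinate-based tokenization can be sketched as a sinusoidal embedding of genomic positions. This is a minimal illustration of the idea, not ChromFound's exact implementation; the function name, embedding dimension, and use of a single coordinate per OCR are our own assumptions, with `temp` playing the role of the temperature hyperparameter discussed below:

```python
import numpy as np

def genomic_positional_embedding(coords, dim=64, temp=1e5):
    """Sinusoidal embedding of genomic coordinates (illustrative sketch).

    coords: 1-D array of peak positions in bp (e.g. OCR midpoints).
    Peaks that are close on the genome receive similar embeddings, which
    is what makes the representation tolerant to the small peak-boundary
    shifts produced by different peak-calling pipelines.
    """
    coords = np.asarray(coords, dtype=np.float64)[:, None]       # (N, 1)
    freqs = temp ** (-np.arange(0, dim, 2) / dim)                # (dim/2,)
    angles = coords * freqs                                      # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# A 50 bp boundary shift yields a nearby embedding, while a peak 1 Mb
# away yields a much more distant one.
e = genomic_positional_embedding([100_000, 100_050, 1_100_000])
d_near = np.linalg.norm(e[0] - e[1])
d_far = np.linalg.norm(e[0] - e[2])
```

Because nearby coordinates map to nearby embeddings regardless of exact peak boundaries, no shared peak dictionary is required across datasets.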

[W3,Q5] Generalization to Disease Contexts

We appreciate your interest in ChromFound's generalization to disease contexts. As detailed in Appendix B Table 5, our pretraining corpus does span 6 disease types, including Alzheimer’s, Parkinson’s, leukemia, glioma, myocardial infarction and breast cancer.

Moreover, we have already included two disease benchmark datasets in downstream tasks: Morabito130K, derived from Alzheimer’s disease, and Kuppe139K, a heart dataset from patients with myocardial infarction. ChromFound demonstrates strong performance on both datasets (Table 1, Figure 2, Table 2), supporting the model’s robustness in disease contexts.

We hope these results address the reviewer’s concerns. In addition, we have not found a scATAC-seq dataset of autoimmune disorders with explicit cell labels. If the reviewer has specific datasets in mind, we would be happy to explore them further.

[W2] More Comprehensive Ablation Study

Thank you for the insightful suggestion. As requested, we compare ChromFound to both a pure Mamba and a pure Transformer baseline under the same pretraining corpus.

  1. The pure Mamba model (Row 3 of Table 4) removes WPSA and uses Mamba alone. This results in a ~15% ARI drop, confirming that local attention is critical and that the hybrid design contributes synergistically to performance.
  2. Training a vanilla Transformer on scATAC-seq inputs containing nearly one million peaks is practically infeasible due to its quadratic scaling in memory and computation. To approximate a pure Transformer baseline under practical settings, we train two representative single-cell foundation models, Geneformer and scGPT, on the same pretraining corpus with increasing peak lengths (4k–32k) and the same hyperparameters as WPSA. The results of comparison on the cell clustering task using the PBMC169K (batch VIB_10xv1_1) dataset are detailed in the table below.
    • Geneformer sorts peaks by accessibility value before input, which disrupts the native genomic order. As a result, it fails to learn meaningful representations.
    • scGPT applies a highly variable OCR selection strategy, limiting inputs to a small subset of peaks. Performance saturates at ARI ≈ 0.39, constrained by the maximum input length that Transformers can handle efficiently.
| Method | Peak Length | ARI |
| --- | --- | --- |
| Geneformer | 4096 | 0.0451 |
| Geneformer | 16384 | 0.0457 |
| Geneformer | 32768 | 0.0460 |
| scGPT | 4096 | 0.3075 |
| scGPT | 16384 | 0.3774 |
| scGPT | 32768 | 0.3868 |
| ChromFound | 363,066 | 0.6953 |
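To make the local-attention component concrete, below is a minimal single-head sketch of window-partitioned self-attention over coordinate-ordered peak embeddings. It is a simplification under our own assumptions (non-overlapping windows, no masking of zero-padded positions, no learned Q/K/V projections), not the paper's WPSA module:

```python
import numpy as np

def windowed_self_attention(x, window=64):
    """Single-head self-attention restricted to non-overlapping windows.

    x: (N, d) peak embeddings ordered by genomic coordinate.
    Cost is O(N * window * d) instead of O(N^2 * d) for full attention,
    which is what makes hundreds of thousands of peaks tractable.
    Zero-padding the last window is a simplification; real
    implementations mask the padded key positions.
    """
    n, d = x.shape
    pad = (-n) % window                        # pad so N divides evenly
    xp = np.pad(x, ((0, pad), (0, 0)))
    blocks = xp.reshape(-1, window, d)         # (num_windows, window, d)
    scores = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ blocks
    return out.reshape(-1, d)[:n]

out = windowed_self_attention(np.random.default_rng(0).normal(size=(200, 16)), window=64)
```

The linear scaling in sequence length is the reason a windowed layer can be paired with a long-range mixer, whereas a vanilla Transformer over the full peak sequence cannot fit in memory.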

[W1,Q1,Q3] Model Sensitivity to Hyperparameters

Thank you for this helpful suggestion. We have evaluated both hyperparameters during early-stage model development on the cell clustering task using PBMC169K (batch VIB_10xv1_1) dataset. The results and findings are summarized below.

  1. Positional embedding temperature: This parameter controls the frequency of the sinusoidal position encodings and can be viewed as a proxy for “genomic resolution.” As shown below, performance drops slightly when temp is in the 1e3–1e4 range, and degrades more substantially when temp is too large, likely due to loss of relative position sensitivity.
| temp | ARI | NMI | AMI | FMI |
| --- | --- | --- | --- | --- |
| 1000 | 0.6535 | 0.7764 | 0.7755 | 0.7259 |
| 10000 | 0.6512 | 0.7709 | 0.7701 | 0.7243 |
| 100000 (Ours) | 0.6953 | 0.7860 | 0.7852 | 0.7601 |
| 1000000 | 0.6036 | 0.7344 | 0.7334 | 0.6860 |
| 10000000 | 0.5852 | 0.7269 | 0.7258 | 0.6714 |
  2. Mamba projection dimension: This parameter controls the compression within the Mamba block. As shown below, D_low = 32 achieves a strong balance between performance and efficiency. Our choice of 32 is primarily motivated by the trade-off between performance and computational efficiency: larger values yield marginal gains but significantly increase FLOPs and memory, with D_low = 128 exceeding the memory limit of an A100 80G GPU.
| D_low | Parameter Count | FLOPs | Inference Speed (s/sample) | GPU Memory (GB) | ARI |
| --- | --- | --- | --- | --- | --- |
| 16 | 422,593 | 7.41E+11 | 3.4027 | 48.8 | 0.6158 |
| 32 (Ours) | 450,305 | 7.89E+11 | 3.4624 | 60.7 | 0.6953 |
| 64 | 518,785 | 9.09E+11 | 4.0053 | 72.3 | 0.6927 |
| 128 | 707,969 | 1.24E+12 | OOM | OOM | OOM |
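The role of the projection dimension can be illustrated with a generic down-project / sequence-mix / up-project sketch. This is a hedged simplification in which a plain linear recurrence stands in for the state-space scan; it is not ChromFound's actual Mamba block, and all names here are our own:

```python
import numpy as np

def low_rank_sequence_mixer(x, w_down, w_up, decay=0.9):
    """Down-project, run a simple linear recurrence, up-project.

    x: (L, d_model). w_down: (d_model, d_low). w_up: (d_low, d_model).
    The recurrence h_t = decay * h_{t-1} + u_t stands in for a
    state-space scan; the point is that the per-step recurrent state
    costs O(d_low), not O(d_model).
    """
    u = x @ w_down                      # (L, d_low)
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t, u_t in enumerate(u):
        h = decay * h + u_t             # sequential mixing in the low-dim space
        out[t] = h
    return out @ w_up                   # (L, d_model)

rng = np.random.default_rng(0)
d_model, d_low, L = 256, 32, 100
y = low_rank_sequence_mixer(rng.normal(size=(L, d_model)),
                            rng.normal(size=(d_model, d_low)) / 16,
                            rng.normal(size=(d_low, d_model)) / np.sqrt(d_low))
```

Raising the projection dimension grows both the projection parameters and the per-token activation memory, consistent with the out-of-memory failure reported for D_low = 128.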

We hope these results address the reviewer's concerns. We will include this analysis in the revised manuscript.

[Q2] Interpretation of Fine-tuning Benefits

Thank you for the thoughtful and professional insight, and we strongly agree with your interpretation. The strong zero-shot performance indeed suggests that ChromFound learns generalizable representations of cellular states and underlying regulatory structures during pretraining. These representations are already informative for unsupervised clustering and can reveal biologically relevant subpopulations.

As you rightly pointed out, fine-tuning further improves performance by enabling the model to specialize in cell-type-specific regulatory features. Such features often involve subtle patterns or rare combinations of regulatory elements that are better resolved with supervision. This fine-tuning complements the global regulatory landscape already captured during pretraining. In addition, we have provided a detailed analysis of the cell type annotation confusion matrix in Appendix F.2, demonstrating ChromFound’s advantage in recognizing rare and long-tail cell types and cell-type-specific regulatory patterns.

We appreciate your expert perspective and will incorporate this interpretation into our revised discussion. Your feedback greatly enhances the clarity of our presentation and strengthens our understanding of how pretraining and fine-tuning synergize in ChromFound.

[Q4] Fixed WPSA Window

Thank you for raising this important point. Both Hi-C[1] and CRISPRi[2] studies have shown that the vast majority (over 90%) of validated enhancer-gene interactions occur within 200 kb of the transcription start site (TSS). In principle, the Mamba block, which models genome-wide dependencies, allows ChromFound to infer potential distal enhancer-gene links. That said, we acknowledge that experimental validation of distal enhancer–gene links (>200 kb) remains limited, and we view this as an important direction for future work. If the reviewer is aware of suitable datasets, we would be happy to incorporate such experiments.

[1] Bonev, B. & Cavalli, G. Organization and function of the 3D genome. Nat Rev Genet 17, 661–678 (2016).

[2] Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat Genet 51, 1664–1669 (2019).

[L2] Compute Resources for Fine-tuning

We thank the reviewer for highlighting the importance of computational efficiency. We provide profiling results on cell type annotation and cross-omics prediction tasks using a single A100 80G GPU:

| Task | Dataset | Peak Length | Batch Size | Training Speed (s/step) | Peak Memory Usage (GB) |
| --- | --- | --- | --- | --- | --- |
| Cell Type Annotation | Bone 43/44 | 495,416 | 4 | 0.535 | 67.8 |
| Cross-omics Prediction | BMMC multiome 2021 | 269,544 | 4 | 1.060 | 56.7 |
We note that the primary driver of memory consumption is the length of the input peak sequence, which can reach several hundred thousand tokens per cell. In contrast, the model architecture itself remains lightweight (parameter count: 450305). All current resource metrics will be included in the appendix to guide users in estimating compute requirements. In future work, we aim to further optimize memory efficiency and training speed for broader applications.

Conclusion

We hope these responses address your concerns and questions. We sincerely appreciate your valuable feedback and constructive suggestions. If you have any additional questions or suggestions, please feel free to let us know.

Comment

Thank you for your comprehensive and thoughtful rebuttal. I appreciate the substantial effort you put into addressing my concerns with concrete experimental evidence and detailed analysis.

Your responses have effectively addressed the key issues I raised:

Hyperparameter Analysis: The systematic evaluation of both the positional embedding temperature and Mamba projection dimension provides exactly the type of sensitivity analysis I was looking for. The clear trade-off between performance and computational efficiency (particularly the memory constraints at d_proj=128) gives readers valuable guidance for practical implementation.

Enhanced Ablation Study: Your comparison with pure Mamba and the creative approach to Transformer baselines (using Geneformer and scGPT as proxies) convincingly demonstrates that the performance gains stem from your hybrid architectural design, not merely from large-scale pretraining. The ~15% ARI drop when removing WPSA and the fundamental limitations of existing Transformer-based models on this data modality strengthen your architectural claims significantly.

Robustness and Generalization: Your clarification on coordinate-based tokenization and the existing disease representation in your evaluation datasets addresses my concerns about generalizability. The downsampling experiments provide good evidence for robustness to preprocessing variations.

The addition of computational profiling results also addresses practical deployment considerations, which will be valuable for the research community.

Your rebuttal demonstrates not only technical rigor but also a deep understanding of the biological context and practical implications of your work. The responses show that ChromFound represents a well-engineered solution to genuine challenges in single-cell chromatin accessibility analysis.

Updated Rating: 5 (Accept) - This is technically sound work with clear contributions that will be valuable to the research community. The comprehensive evaluation and your responsive improvements to the analysis strengthen confidence in the work's impact and reliability.

Thank you again for the thorough responses and congratulations on this solid contribution to the field.

Comment

We sincerely thank you for your kind response. We truly appreciate your recognition of our work and your thoughtful feedback. Your suggestions have helped us better clarify our motivations, strengthen the experimental evidence, and improve the overall presentation.

As the discussion process is still ongoing, we may reach out to other reviewers in the coming days to invite further discussion. We apologize in advance for any inconvenience this may cause and greatly appreciate your understanding.

Once again, thank you for your kind support and for taking the time to review our work.

Comment

Dear reviewer:

We sincerely thank the reviewer for the thoughtful, detailed, and constructive feedback. We appreciate your recognition of ChromFound’s novelty, biologically grounded architecture, and the comprehensive downstream evaluations. In this rebuttal, we have carefully addressed each of your suggestions, including analysis of upstream dependencies, generalization to disease contexts, expanded ablation studies, and sensitivity to key hyperparameters.

We hope our responses have addressed your concerns and clarified the rationale behind our design choices. If any questions remain or if you would like to discuss any points further, we would be very glad to continue the conversation. Your feedback has been instrumental in strengthening our work, and we are truly grateful for your time and insight.

Review
4

ChromFound is a foundation model for single-cell ATAC-seq data pretrained on ~2 million cells that aims to learn batch corrected representations of cells that (a) cluster by cell type, (b) can be used to predict cell types for new cells, and (c) can be used to predict other modalities (such as RNA-seq). Across all three aforementioned tasks, the authors show that ChromFound outperforms baseline methods that train on specific datasets and are not pre-trained.

The model is pre-trained in a self-supervised fashion to predict masked accessibility values. Crucially, the model parametrizes open chromatin regions (OCRs) by their start and end location, as opposed to having a trainable embedding for a specific OCR, allowing the model to work on OCRs from new datasets, which will not exactly match OCRs from the pre-training datasets.

优缺点分析

To my knowledge, ChromFound is the first foundation model for scATAC-seq data. In my opinion, the paper's primary contribution is the demonstration that pre-training on millions of scATAC-seq cells enables zero-shot representation learning for completely new datasets, without requiring additional training or fine-tuning. (I do not think the model itself minus the pre-training is a general advance given the results in Fig. 3 and Table 3, which show that baselines are competitive with a non-pretrained ChromFound model.) This type of pre-training has mostly not been done before because different datasets have different peak/OCR boundaries and harmonizing peak sets is not straightforward. The authors cleverly solve this problem by parametrizing OCRs by a function of their start and end positions, as opposed to using an embedding table. This I find to be their primary computational contribution.

However, there are other methods of harmonizing peak sets that would enable other architectures (such as a standard VAE) to be used for pre-training that the authors do not compare to. Computational biologists routinely (i) aggregate fragment files from multiple datasets to re-call a unified set of peaks or (ii) adopt the peak set of a large reference atlas. If fragment files are available for the new dataset, then peak counts can be called specifically against the reference peak set. But even if fragment files are not available and only a count matrix is, counts from the new peak set can be transferred to the reference peak set by examining peak overlaps. Demonstrating that ChromFound outperforms (a) both itself with a different harmonization strategy and (b) other methods pre-trained using one of the aforementioned harmonization strategies would substantially strengthen the paper.

The other notable weakness of the paper is the benchmarking to baselines. The authors omit several standardly used tools, do not systematically compare against the same tools for all tasks even though most methods are generally applicable, and crucially do not detail how they implement the baselines. For cell clustering and batch-effect correction, Signac, peakVI, and SnapATAC2 are the most commonly used tools and should be compared against. For cross-omics predictions, Signac and multiVI should also be compared against. Additional recommendations are listed in the "Questions" section.

Questions

Major suggestions/questions:

  1. Please compare to non-pretrained ChromFound for all experiments, not just Fig. 3, Table 3, and Table 4.
  2. Benchmark against Signac (TF-IDF + LSI), peakVI/poissonVI, and SnapATAC2 for cell clustering tasks, as these are the methods most commonly used by practitioners. For each, the appendix should indicate the exact pre-processing done before applying the method (e.g. are peaks filtered by their observation frequency before passing them into peakVI?).
  3. Do not use scVI/scANVI for batch-effect correction, as they are designed for RNA-seq. Use peakVI/poissonVI instead.
  4. What input representations do you pass to Harmony for batch-effect correction? PCA or LSI or neither?
  5. What l2 regularization parameter do you use for scBasset for batch-effect correction? How do you choose this regularization parameter?
  6. A small linear layer on top of low-dimensional representations learned by Signac, peakVI, and SnapATAC2 can also be used for cell type annotation. Please compare to those baselines.

Minor suggestions/questions:

  1. At test time, when cell type labels are unavailable, how does OCR filtering (as described on lines 611-612) work?
  2. Why do you use an MSE loss after log transformation instead of a Poisson loss on the raw counts, which likely better fits the generative process?
  3. In Fig. 4, are you using ABC predictions as ground-truth? This seems suspect as ABC itself is far from perfect and has several biases (i.e. overly discounting distal enhancers).
  4. Question 3 in the ablations asks if long-context modeling using Mamba layers helps, but I'm not sure if the experiment performed actually answers that question. My reading of Table 4 is that half and three-fourths of the OCRs are just removed from the input data, but that doesn't test the utility of the Mamba layers directly. A better test would be to use all OCRs, but pool after the WPSA layer without having any Mamba layers.

Limitations

Yes

Justification for Final Rating

I've increased my score from a 3 to a 4 based on the authors' re-running of the baselines with more suitably chosen hyperparameters. It seems like this method is at least as good as existing methods. And even if further hyperparameter optimization of the baselines reveals that this method is only slightly better than them (which I suspect to be the case), it at least offers a novel and interesting approach to merge OCRs from different datasets that will be valuable to the single-cell ATAC-seq community.

Formatting Concerns

None

Author Response

We thank the reviewer for recognizing ChromFound’s novelty and for the constructive feedback. Below, we summarize our key revisions and additional analyses in response:

[1]. Benchmark completeness

  1. We expand the clustering benchmarks (Table 1) by adding Signac (Sig), SnapATAC2 (SAT2), peakVI (pkVI), and poissonVI (psVI). All methods use the same preprocessing (Appendix C). Signac, peakVI, and poissonVI skip log normalization; scBasset uses binarized input.
| Dataset | Model | ARI | FMI | NMI | AMI |
| --- | --- | --- | --- | --- | --- |
| Cortex (batch 1) | Sig | 0.5013 | 0.7035 | 0.5580 | 0.5534 |
| Cortex (batch 1) | SAT2 | 0.5118 | 0.6740 | 0.7202 | 0.7179 |
| Cortex (batch 1) | pkVI | 0.4558 | 0.6286 | 0.6964 | 0.6940 |
| Cortex (batch 1) | psVI | 0.3673 | 0.5673 | 0.5878 | 0.5845 |
| Cortex (batch 1) | Ours | 0.6890 | 0.7943 | 0.7779 | 0.7760 |
| Cortex (batch 2) | Sig | 0.5069 | 0.6478 | 0.5888 | 0.5851 |
| Cortex (batch 2) | SAT2 | 0.5325 | 0.6380 | 0.7038 | 0.7015 |
| Cortex (batch 2) | pkVI | 0.5512 | 0.6530 | 0.7050 | 0.7028 |
| Cortex (batch 2) | psVI | 0.5334 | 0.6388 | 0.6004 | 0.5983 |
| Cortex (batch 2) | Ours | 0.6278 | 0.7156 | 0.7321 | 0.7299 |
| Heart (av 3) | Sig | 0.3123 | 0.5093 | 0.2294 | 0.2281 |
| Heart (av 3) | SAT2 | 0.4571 | 0.7285 | 0.6867 | 0.6863 |
| Heart (av 3) | pkVI | 0.4159 | 0.6383 | 0.6149 | 0.6144 |
| Heart (av 3) | psVI | 0.3969 | 0.5929 | 0.4472 | 0.4464 |
| Heart (av 3) | Ours | 0.5828 | 0.6757 | 0.7207 | 0.7187 |
| Heart (av 10) | Sig | 0.1159 | 0.4601 | 0.5272 | 0.5238 |
| Heart (av 10) | SAT2 | 0.5515 | 0.5564 | 0.6625 | 0.6502 |
| Heart (av 10) | pkVI | 0.4301 | 0.5434 | 0.6024 | 0.6000 |
| Heart (av 10) | psVI | 0.3606 | 0.5232 | 0.5717 | 0.5688 |
| Heart (av 10) | Ours | 0.6774 | 0.8109 | 0.7369 | 0.7365 |
| Retina (D026_13) | Sig | 0.5538 | 0.7451 | 0.4931 | 0.4919 |
| Retina (D026_13) | SAT2 | 0.5606 | 0.7207 | 0.7045 | 0.7041 |
| Retina (D026_13) | pkVI | 0.2885 | 0.5543 | 0.5644 | 0.5638 |
| Retina (D026_13) | psVI | 0.3162 | 0.5766 | 0.6081 | 0.6076 |
| Retina (D026_13) | Ours | 0.6668 | 0.8149 | 0.7644 | 0.7641 |
| Retina (D19D008) | Sig | 0.5146 | 0.7339 | 0.5235 | 0.5222 |
| Retina (D19D008) | SAT2 | 0.5814 | 0.7173 | 0.7881 | 0.7877 |
| Retina (D19D008) | pkVI | 0.4702 | 0.6318 | 0.7229 | 0.7224 |
| Retina (D19D008) | psVI | 0.4978 | 0.6527 | 0.7400 | 0.7395 |
| Retina (D19D008) | Ours | 0.6688 | 0.7767 | 0.8183 | 0.8179 |
| PBMC (VIB_10xv1_1) | Sig | 0.1832 | 0.4543 | 0.3064 | 0.3027 |
| PBMC (VIB_10xv1_1) | SAT2 | 0.6246 | 0.7085 | 0.7551 | 0.7541 |
| PBMC (VIB_10xv1_1) | pkVI | 0.5417 | 0.6340 | 0.6639 | 0.6627 |
| PBMC (VIB_10xv1_1) | psVI | 0.6314 | 0.7073 | 0.7353 | 0.7343 |
| PBMC (VIB_10xv1_1) | Ours | 0.6953 | 0.7601 | 0.7860 | 0.7852 |
| PBMC (BIO_ddseq_1) | Sig | 0.1885 | 0.5249 | 0.2852 | 0.2839 |
| PBMC (BIO_ddseq_1) | SAT2 | 0.4126 | 0.6149 | 0.5761 | 0.5754 |
| PBMC (BIO_ddseq_1) | pkVI | 0.2915 | 0.5041 | 0.4623 | 0.4615 |
| PBMC (BIO_ddseq_1) | psVI | 0.4103 | 0.5994 | 0.5656 | 0.5649 |
| PBMC (BIO_ddseq_1) | Ours | 0.4835 | 0.6604 | 0.5950 | 0.5944 |
  2. We expand batch correction evaluation (Table 2), including peakVI (pkVI) and poissonVI (psVI) to replace scVI and scANVI.
| Tissue | Model | AvgBIO | AvgBATCH |
| --- | --- | --- | --- |
| Bone | pkVI | 0.2981 | 0.9173 |
| Bone | psVI | 0.2936 | 0.9178 |
| Bone | Ours | 0.6408 | 0.9289 |
| Heart | pkVI | 0.5657 | 0.7564 |
| Heart | psVI | 0.5621 | 0.7719 |
| Heart | Ours | 0.8180 | 0.8679 |
| PBMC | pkVI | 0.6172 | 0.7889 |
| PBMC | psVI | 0.6029 | 0.8081 |
| PBMC | Ours | 0.6443 | 0.8217 |
| Cortex | pkVI | 0.7256 | 0.9254 |
| Cortex | psVI | 0.7240 | 0.9279 |
| Cortex | Ours | 0.7440 | 0.9565 |
  3. The descriptions for all benchmark methods are provided in Appendix H. The APIs and tools used are as follows:

    • scBasset, peakVI, poissonVI, scVI: scvi.model
    • Liger: the pyliger Python package
    • Scanorama, Harmony: scanpy.external
    • scANVI: scib.integration
    • Cross-omics benchmarks: DANCE.modules.multi_modality.predict_modality
    • Others: official source codes and tutorials from their respective publications

    To directly address the reviewer’s question:

    • L2 regularization parameter for scBasset is set to 1e-8;
    • Input representation passed to Harmony is PCA.
  4. We include logistic regression results for cell type annotation (Fig. 3) using Signac (Sig), peakVI (pkVI), and SnapATAC2 (SAT2) embeddings. The baseline results are much worse than ChromFound's.

| Tissue | Train | Test | Method | Accuracy | F1 |
| --- | --- | --- | --- | --- | --- |
| PBMC | EPF_hydrop_1 | VIB_10xv1_1 | Sig | 0.3546 | 0.1643 |
| PBMC | EPF_hydrop_1 | VIB_10xv1_1 | pkVI | 0.2756 | 0.1430 |
| PBMC | EPF_hydrop_1 | VIB_10xv1_1 | SAT2 | 0.1256 | 0.1751 |
| PBMC | EPF_hydrop_3 | VIB_10xv1_2 | Sig | 0.0841 | 0.0488 |
| PBMC | EPF_hydrop_3 | VIB_10xv1_2 | pkVI | 0.1286 | 0.1961 |
| PBMC | EPF_hydrop_3 | VIB_10xv1_2 | SAT2 | 0.3548 | 0.1565 |
| Bone | batch_27 | batch_26 | Sig | 0.2665 | 0.2477 |
| Bone | batch_27 | batch_26 | pkVI | 0.2107 | 0.1939 |
| Bone | batch_27 | batch_26 | SAT2 | 0.4897 | 0.3525 |
| Bone | batch_43 | batch_44 | Sig | 0.1597 | 0.1764 |
| Bone | batch_43 | batch_44 | pkVI | 0.2118 | 0.1795 |
| Bone | batch_43 | batch_44 | SAT2 | 0.4479 | 0.4540 |
| Cortex | batch_2 | batch_1 | Sig | 0.0240 | 0.0216 |
| Cortex | batch_2 | batch_1 | pkVI | 0.0973 | 0.0717 |
| Cortex | batch_2 | batch_1 | SAT2 | 0.4007 | 0.2635 |
| Cortex | batch_3 | batch_2 | Sig | 0.4930 | 0.3629 |
| Cortex | batch_3 | batch_2 | pkVI | 0.5145 | 0.2610 |
| Cortex | batch_3 | batch_2 | SAT2 | 0.0880 | 0.0843 |
| Retina | D19D003 | D018_13 | Sig | 0.0353 | 0.1037 |
| Retina | D19D003 | D018_13 | pkVI | 0.0935 | 0.0763 |
| Retina | D19D003 | D018_13 | SAT2 | 0.0000 | 0.0000 |
| Retina | D021_13 | D19D003 | Sig | 0.8088 | 0.5642 |
| Retina | D021_13 | D19D003 | pkVI | 0.0267 | 0.0393 |
| Retina | D021_13 | D19D003 | SAT2 | 0.3032 | 0.2500 |
  5. Clarification on non-pretrained ChromFound:

    In the manuscript, we have already included non-pretrained ChromFound as a baseline in Fig. 3 and Table 3. We emphasize that for tasks such as cell clustering and denoising, ChromFound is evaluated in zero-shot settings. A non-pretrained ChromFound, which would use randomly initialized weights in these scenarios, is not a meaningful or fair baseline. We hope the reviewer understands our rationale, and we are happy to discuss this point further if needed.

[2]. Other peak calling methods and VAE pretraining

We sincerely thank the reviewer for these insightful suggestions. In response, we perform an experiment to directly compare ChromFound with VAE-based models trained on reference-peak-aligned data. Specifically, we adopt the cPeaks reference set proposed by the recent work "A generic reference defined by consensus peaks for scATAC-seq data analysis", which defines 1,657,194 peaks across the human genome. We use a 20,000-cell PBMC subset (same as in Table 4, Row 6) as the training data. After mapping peaks to the cPeaks reference and filtering out peaks and cells with all-zero values, we obtain a training set of 306,784 peaks and approximately 18,000 cells.

We train four VAE-based baselines on the aligned data and compare their performance on zero-shot cell clustering (Table 1) against ChromFound trained on the same aligned data (Ours). Evaluation metrics and datasets follow Table 1, all aligned to 306,784 peaks. Results are summarized below:

| Dataset | Model | ARI | FMI | NMI | AMI |
| --- | --- | --- | --- | --- | --- |
| Cortex (batch 1) | SCALE | 0.2932 | 0.5050 | 0.4127 | 0.4081 |
| Cortex (batch 1) | SCALEX | 0.4230 | 0.6078 | 0.5543 | 0.5513 |
| Cortex (batch 1) | pkVI | 0.3690 | 0.5643 | 0.5023 | 0.4983 |
| Cortex (batch 1) | psVI | 0.3364 | 0.5408 | 0.4636 | 0.4594 |
| Cortex (batch 1) | Ours | 0.5858 | 0.7244 | 0.7519 | 0.7499 |
| Cortex (batch 2) | SCALE | 0.2484 | 0.4052 | 0.3657 | 0.3609 |
| Cortex (batch 2) | SCALEX | 0.4481 | 0.5679 | 0.5742 | 0.5710 |
| Cortex (batch 2) | pkVI | 0.3610 | 0.5035 | 0.4876 | 0.4835 |
| Cortex (batch 2) | psVI | 0.2820 | 0.4374 | 0.4104 | 0.4058 |
| Cortex (batch 2) | Ours | 0.5751 | 0.6720 | 0.7145 | 0.7123 |
| Heart (av 3) | SCALE | 0.3543 | 0.4817 | 0.5228 | 0.5195 |
| Heart (av 3) | SCALEX | 0.2795 | 0.4150 | 0.4787 | 0.4752 |
| Heart (av 3) | pkVI | 0.3023 | 0.4437 | 0.4816 | 0.4779 |
| Heart (av 3) | psVI | 0.3778 | 0.5025 | 0.5446 | 0.5414 |
| Heart (av 3) | Ours | 0.5400 | 0.6347 | 0.6588 | 0.6567 |
| Heart (av 10) | SCALE | 0.4051 | 0.6506 | 0.4461 | 0.4453 |
| Heart (av 10) | SCALEX | 0.3013 | 0.5685 | 0.4199 | 0.4191 |
| Heart (av 10) | pkVI | 0.3966 | 0.6080 | 0.5018 | 0.5011 |
| Heart (av 10) | psVI | 0.3939 | 0.5997 | 0.5010 | 0.5003 |
| Heart (av 10) | Ours | 0.6150 | 0.7695 | 0.6968 | 0.6963 |
| Retina (D026_13) | SCALE | 0.3065 | 0.5771 | 0.4447 | 0.4440 |
| Retina (D026_13) | SCALEX | 0.2430 | 0.5149 | 0.4127 | 0.4117 |
| Retina (D026_13) | pkVI | 0.2514 | 0.5260 | 0.4651 | 0.4644 |
| Retina (D026_13) | psVI | 0.2347 | 0.5122 | 0.4063 | 0.4055 |
| Retina (D026_13) | Ours | 0.4888 | 0.7081 | 0.7241 | 0.7238 |
| Retina (D19D008) | SCALE | 0.3143 | 0.5092 | 0.4220 | 0.4210 |
| Retina (D19D008) | SCALEX | 0.2490 | 0.4542 | 0.3920 | 0.3909 |
| Retina (D19D008) | pkVI | 0.2664 | 0.4693 | 0.4745 | 0.4736 |
| Retina (D19D008) | psVI | 0.2396 | 0.4473 | 0.3937 | 0.3926 |
| Retina (D19D008) | Ours | 0.5368 | 0.6839 | 0.7664 | 0.7660 |
| PBMC (VIB_10xv1_1) | SCALE | 0.5769 | 0.6756 | 0.7066 | 0.7057 |
| PBMC (VIB_10xv1_1) | SCALEX | 0.5718 | 0.6514 | 0.7205 | 0.7196 |
| PBMC (VIB_10xv1_1) | pkVI | 0.5306 | 0.6249 | 0.7057 | 0.7046 |
| PBMC (VIB_10xv1_1) | psVI | 0.5593 | 0.6479 | 0.6742 | 0.6732 |
| PBMC (VIB_10xv1_1) | Ours | 0.5999 | 0.6827 | 0.7434 | 0.7424 |
| PBMC (BIO_ddseq_1) | SCALE | 0.4031 | 0.5797 | 0.5380 | 0.5374 |
| PBMC (BIO_ddseq_1) | SCALEX | 0.3326 | 0.5430 | 0.5465 | 0.5458 |
| PBMC (BIO_ddseq_1) | pkVI | 0.3830 | 0.5804 | 0.5737 | 0.5731 |
| PBMC (BIO_ddseq_1) | psVI | 0.3959 | 0.5917 | 0.5509 | 0.5503 |
| PBMC (BIO_ddseq_1) | Ours | 0.4347 | 0.6237 | 0.5867 | 0.5861 |

We highlight two key observations from this experiment:

  1. Baseline models perform poorly on unseen tissues. We find that only 30% of peaks in retina, 35% in cortex, and 70% in heart overlap with the cPeaks-aligned PBMC training peaks, suggesting that cross-tissue peak heterogeneity is a major factor limiting generalization.
  2. Using cPeaks reduces ChromFound's performance compared to using its native OCRs, suggesting that reference-based harmonization cannot capture novel or shifted OCRs introduced by batch effects or disease-specific variation. In contrast, ChromFound’s dynamic OCR tokenization allows the model to implicitly learn genomic distance and neighborhood structures, enhancing both robustness and extensibility.
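For readers unfamiliar with reference-based harmonization, the peak-to-reference mapping step can be sketched as a sorted interval-overlap pass. The function name and the maximal-overlap assignment rule are our own illustrative choices; real pipelines (e.g. mapping onto cPeaks) also handle per-chromosome grouping and count aggregation:

```python
def map_peaks_to_reference(peaks, reference):
    """Assign each query peak to the reference peak with maximal bp overlap.

    peaks, reference: lists of (start, end) tuples on one chromosome,
    both sorted by start. Returns, for each query peak, the index of the
    best-overlapping reference peak, or None if there is no overlap.
    """
    out, j = [], 0
    for s, e in peaks:
        # advance past reference peaks that end before this query starts
        while j < len(reference) and reference[j][1] <= s:
            j += 1
        best, best_ov = None, 0
        k = j
        while k < len(reference) and reference[k][0] < e:
            ov = min(e, reference[k][1]) - max(s, reference[k][0])
            if ov > best_ov:
                best, best_ov = k, ov
            k += 1
        out.append(best)
    return out

# A query peak spanning two reference peaks is assigned to the one with
# the larger base-pair overlap; peaks with no overlap map to None.
ref = [(100, 300), (400, 700)]
assert map_peaks_to_reference([(250, 500), (800, 900)], ref) == [1, None]
```

The limitation noted above follows directly from this scheme: any accessibility signal in a query peak that overlaps no reference interval is discarded.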

[3]. Minor questions

  1. OCR filtering when cell type labels are unavailable

    When cell labels are unavailable, we apply TF-IDF + LSI to infer pseudo labels for OCR filtering. We will clarify this in the manuscript.
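For readers unfamiliar with this preprocessing, a minimal sketch of TF-IDF + LSI pseudo-labeling on a binarized cells-by-peaks matrix might look as follows. All function names are our own, and the actual pipeline presumably uses sparse matrices and standard toolkit implementations.

```python
import numpy as np

def tfidf_lsi(counts, n_components=2):
    """TF-IDF transform of a binarized cells x peaks matrix followed by
    truncated SVD (latent semantic indexing). Illustrative sketch."""
    X = (counts > 0).astype(float)
    tf = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    idf = np.log(1.0 + X.shape[0] / np.maximum(X.sum(axis=0, keepdims=True), 1.0))
    U, S, _ = np.linalg.svd(tf * idf, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kmeans_pseudo_labels(emb, k=2, iters=20):
    """Tiny deterministic k-means (farthest-point init) producing
    pseudo labels for OCR filtering when annotations are missing."""
    centers = [emb[0]]
    for _ in range(k - 1):
        d = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = emb[labels == j].mean(axis=0)
    return labels
```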

  2. Choice of loss function

    Due to the high sparsity of scATAC-seq data, we adopt a dual masking strategy (Sec. 3.3) during pretraining, masking an equal number of nonzero and zero peaks. Under this setting, Poisson/ZINB losses often cause unstable optimization because the gradients are dominated by nonzero entries. Instead, MSE loss on log-transformed, normalized signals ensures stability. This choice is also supported by xTrimoGene, as discussed in both their manuscript and rebuttal.
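A hedged sketch of the balanced (dual) masking objective described above, assuming log1p-transformed targets; the names and details are ours and not the exact ChromFound implementation.

```python
import numpy as np

def dual_mask_indices(x, n_per_side, rng):
    """Sample an equal number of nonzero and zero peak positions to mask,
    so the loss is not dominated by the overwhelming zero entries."""
    nonzero = np.flatnonzero(x > 0)
    zero = np.flatnonzero(x == 0)
    take = min(n_per_side, len(nonzero), len(zero))
    return np.concatenate([rng.choice(nonzero, take, replace=False),
                           rng.choice(zero, take, replace=False)])

def masked_mse(pred, target, mask_idx):
    """MSE restricted to masked positions, on log1p-normalised signals."""
    t = np.log1p(target)
    return float(((pred[mask_idx] - t[mask_idx]) ** 2).mean())
```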

  3. Ground truth in Fig. 4

    We would like to clarify that the ground truth labels in Fig. 4 are not ABC predictions, but experimental CRISPRi perturbation data in the Fulco4K dataset, as stated in lines 235–236 of the manuscript. Additionally, the citation for the Fulco4K dataset will be corrected to "Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations".

  4. Ablation study on the Mamba layer

    Thank you for the comment. To clarify, Question 3 in our ablation study is designed to assess the importance of long input peak sequences, not the effectiveness of the Mamba layers. Many existing scATAC-seq methods apply dimensionality reduction (e.g., highly variable peaks or LSI), but we aim to show that modeling long-range chromatin context leads to better cell representations and cross-omics performance.

    The effectiveness of the Mamba layers is directly evaluated in Question 1. The significant performance drop in this setting highlights the contribution of each component, especially the Mamba layers.

Conclusion

We sincerely thank the reviewer for these thoughtful and professional suggestions. We hope our responses have addressed the reviewer’s concerns. Please let us know if further clarification is needed.

Comment

Many thanks to the authors for performing these additional experiments in such a short period of time. The results are quite impressive, though at points unintuitive (not necessarily wrong, just surprising given my previous usage of these methods). I have a few additional questions (apologies if they're already included in the appendix or your response and I missed them) that might help me build some intuition for these results:

  1. Let's focus on a specific evaluation, Cortex (batch 1) for cell clustering in experiment [1] of your response. My understanding is the same peaks are passed to all methods. How many peaks are there? How many cells? I find it surprising that VAEs do so poorly on a simple cell clustering task unless there are very few cells (so pre-training becomes essential) or too many peaks (such that they can't handle the high-dimensionality well).

  2. It's still unclear to me if the baselines are run with reasonable hyperparameters. I am most familiar with peakVI/poissonVI, which is why I will focus on those methods. How many training epochs were they run for/what learning rate/is early stopping used?

  3. I am not sure why it is unfair for non-pretrained ChromFound to be included in Table 1. I think there should be three versions of ChromFound evaluated: (1) pre-trained and fine-tuned on the dataset of interest, (2) only pre-trained, and (3) only trained on on the dataset of interest.

  4. Thank you for running experiment [2]. These results seem to indicate that VAE models pre-trained on a small amount of PBMC data and applied zero-shot to tissues as different as retina and the cortex do poorly. This makes perfect sense. Somehow ChromFound does a lot better. So well in fact that VAEs trained specifically on the dataset of interest (taking numbers from experiment [1]) do worse than ChromFound pre-trained on PBMCs and applied zero-shot to this very different tissue. Is that a correct interpretation of the results? If so, can you provide some intuition as to how ChromFound has such impressive generalization capabilities? I understand that OCR tokenization helps and can allow the model to treat peaks in the test set similar to close by peaks in the pre-training set, but I am shocked it helps that much.

  5. The cell type annotation results of baselines are shockingly poor. But I think this is because of batch effects between the training and test sets. Do you use something like scArches to integrate the test (query) dataset into the training (reference) dataset?

  6. I'm still quite confused how exactly the ablations for Question 3 are run. How are the number of OCRs per cell reduced?

Comment

Dear reviewer:

We sincerely thank you for the time, expertise, and thoughtful suggestions provided throughout the review process. In this rebuttal, we have done our best to carefully address every concern raised, including expanded benchmarking, detailed comparisons with alternative peak harmonization strategies, and clarification of modeling and implementation details. These revisions were made with the goal of strengthening the work and making our contributions clearer.

We greatly value your feedback and would be very glad to further discuss any remaining questions or concerns. Please do not hesitate to reach out. We truly welcome continued dialogue and are committed to engaging with all comments in depth. Thank you again for your time, insight, and consideration.

Comment

We sincerely thank the reviewer for the follow-up questions and for acknowledging the additional experiments. We address each point below.

[Q1] Details of evaluation datasets

We appreciate your insightful questions. Below we provide detailed information about the evaluation datasets in experiment [1]:

| Dataset | #Cell | #Peak |
|---|---|---|
| Cortex(batch 1) | 1,119 | 141,389 |
| Cortex(batch 2) | 1,416 | 212,793 |
| Heart(av 3) | 3,733 | 333,393 |
| Heart(av 10) | 2,785 | 343,027 |
| Retina(D026_13) | 7,529 | 235,366 |
| Retina(D19D008) | 4,575 | 232,354 |
| PBMC(BIO_ddseq_1) | 5,592 | 192,229 |
| PBMC(VIB_10xv1_1) | 2,707 | 363,066 |

[Q2] Hyperparameters of peakVI/poissonVI

We thank the reviewer for asking about the hyperparameter settings of peakVI and poissonVI. We follow the default settings in the scvi-tools library. Specifically, we detail the hyperparameters you mentioned below:

| Model | Training epochs | Learning rate | Early stopping patience |
|---|---|---|---|
| peakVI | 500 | 0.0001 | 50 |
| poissonVI | 500 | 0.0001 | 50 |

We observe that peakVI typically runs through all 500 epochs, whereas poissonVI usually stops after fewer than 100 epochs due to early stopping. Thank you again for your professional suggestion. We hope this clarification of the hyperparameter settings addresses your question.

[Q3] Non-pretrained ChromFound

Thank you for your suggestions and clarification. We include additional results below, covering both pre-training followed by continued training on the dataset of interest (ChromFound with continuous training on evaluation datasets) and training from scratch on the dataset of interest (ChromFound trained from scratch on evaluation datasets).

| Dataset | Model | ARI | FMI | NMI | AMI |
|---|---|---|---|---|---|
| Cortex(batch 2) | ChromFound with continuous training on evaluation datasets | 0.6306 | 0.7176 | 0.7306 | 0.7285 |
| Cortex(batch 2) | ChromFound trained from scratch on evaluation datasets | 0.6100 | 0.6997 | 0.7292 | 0.7271 |
| Cortex(batch 2) | ChromFound | 0.6278 | 0.7156 | 0.7321 | 0.7299 |
| Retina(D19D008) | ChromFound with continuous training on evaluation datasets | 0.6710 | 0.7783 | 0.8189 | 0.8186 |
| Retina(D19D008) | ChromFound trained from scratch on evaluation datasets | 0.6362 | 0.7534 | 0.8038 | 0.8035 |
| Retina(D19D008) | ChromFound | 0.6688 | 0.7767 | 0.8183 | 0.8179 |

As shown in the table above, ChromFound with continuous training on the evaluation datasets achieves better performance than both ChromFound and ChromFound trained from scratch on the evaluation datasets. Nevertheless, training on the evaluation datasets incurs significant additional computational cost. We report the training and inference speed on one A100 80GB GPU, along with GPU memory usage.

| Dataset | #Cell | #Peak | Setting | Batch size | Total time (s) | GPU memory (GB) |
|---|---|---|---|---|---|---|
| Cortex(batch 2) | 1,416 | 212,793 | Training | 8 | 2292 | 63.53 |
| Cortex(batch 2) | 1,416 | 212,793 | Inference | 8 | 336 | 13.9 |
| Retina(D19D008) | 4,575 | 232,354 | Training | 8 | 8270 | 68.31 |
| Retina(D19D008) | 4,575 | 232,354 | Inference | 8 | 1090 | 15.12 |

Given the additional computational cost and the limited performance gain, we recommend generating cell representations in a zero-shot setting without further training. Due to time constraints, we could not cover every dataset here; we will include results for all datasets in the revised manuscript, along with the corresponding computational resource requirements, as part of the usage instructions for ChromFound. We once again thank the reviewer for the valuable suggestion.

Comment

[Q4] Correct interpretation of the results

We sincerely thank the reviewer for suggesting the correct interpretation of the results. As you noted that "VAEs may do poorly on very few cells or too many peaks", we also observe that the dataset used in the peakVI tutorial in scvi-tools has 33,142 peaks, which is much smaller than the number of peaks in our datasets. Therefore, we believe the poor performance of peakVI/poissonVI in our experiments is likely due to the large number of peaks in our datasets. We ran additional experiments that filter peaks by highly variable selection and retrain peakVI/poissonVI on the filtered datasets. The results are summarized below:

| Dataset | Model | #Peak | ARI | FMI | NMI | AMI |
|---|---|---|---|---|---|---|
| Cortex(batch 2) | peakVI | 212,793 | 0.5512 | 0.6530 | 0.7050 | 0.7028 |
| Cortex(batch 2) | peakVI(hv) | 30,000 | 0.5867 | 0.6814 | 0.7221 | 0.7200 |
| Cortex(batch 2) | poissonVI | 212,793 | 0.5334 | 0.6388 | 0.6004 | 0.5983 |
| Cortex(batch 2) | poissonVI(hv) | 30,000 | 0.5723 | 0.6703 | 0.7203 | 0.7181 |
| Cortex(batch 2) | ChromFound trained on PBMC cPeaks | 212,793 | 0.5751 | 0.6720 | 0.7145 | 0.7123 |
| Cortex(batch 2) | ChromFound | 212,793 | 0.6278 | 0.7156 | 0.7321 | 0.7299 |
| Retina(D19D008) | peakVI | 232,354 | 0.4702 | 0.6318 | 0.7229 | 0.7224 |
| Retina(D19D008) | peakVI(hv) | 30,000 | 0.5907 | 0.7395 | 0.7841 | 0.7836 |
| Retina(D19D008) | poissonVI | 232,354 | 0.4978 | 0.6527 | 0.7400 | 0.7395 |
| Retina(D19D008) | poissonVI(hv) | 30,000 | 0.6037 | 0.7443 | 0.7966 | 0.7961 |
| Retina(D19D008) | ChromFound trained on PBMC cPeaks | 232,354 | 0.5368 | 0.6839 | 0.7664 | 0.7660 |
| Retina(D19D008) | ChromFound | 232,354 | 0.6688 | 0.7767 | 0.8183 | 0.8179 |

As shown in the table above, peakVI and poissonVI perform much better after highly variable peak selection, confirming their strength as baselines under reduced dimensionality. However, as the input dimensionality increases, their performance exhibits a marked decline, suggesting they may struggle to model long-range chromatin dependencies. ChromFound’s hybrid WPSA-Mamba architecture addresses this limitation by integrating local cis-regulatory context with efficient long-range modeling, enabling broader genome-wide applications such as enhancer–gene link inference and perturbation effect prediction (Section 4.3).
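To make the local-context half of this design concrete, a generic windowed self-attention mask (each token attends only within a fixed window of positions) can be sketched as below. This is the standard windowed-attention pattern, not ChromFound's exact WPSA implementation.

```python
import numpy as np

def windowed_attention_mask(seq_len, window):
    """Boolean (seq_len x seq_len) mask: position i may attend to j
    iff |i - j| <= window. Generic windowed self-attention sketch."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Restricting attention this way keeps the cost linear in sequence length for a fixed window, while state-space blocks such as Mamba can carry information across windows.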

Due to time constraints, we are unable to extend this experiment to all datasets. We will be glad to include results for all datasets and additional baselines with highly variable peak selection in the revised manuscript. Thank you again for your professional suggestion.

[Q5] Batch effect in the cell type annotation

We sincerely thank the reviewer for this professional and insightful suggestion. We did not integrate the test (query) dataset into the training (reference) dataset in our previous response. Following your suggestion, we implement scArches for scPoli to compare cell type annotation performance with and without scArches-based integration between the training (reference) and test (query) datasets. The integrated setting yields a substantial performance improvement, fully consistent with your observation that batch effects largely account for the poor baseline results.

| Dataset | Train | Test | Method | Accuracy | F1 |
|---|---|---|---|---|---|
| Bone | batch_43 | batch_44 | scPoli | 0.2500 | 0.1822 |
| Bone | batch_43 | batch_44 | scArches-integrated scPoli | 0.8137 | 0.8005 |
| Bone | batch_43 | batch_44 | ChromFound | 0.8368 | 0.8335 |
| Cortex | batch_2 | batch_1 | scPoli | 0.4120 | 0.2286 |
| Cortex | batch_2 | batch_1 | scArches-integrated scPoli | 0.8640 | 0.6973 |
| Cortex | batch_2 | batch_1 | ChromFound | 0.9366 | 0.7715 |

Due to time constraints, we are unable to extend this integration experiment to all datasets. We will be glad to include results for all datasets and additional baselines with train/test integration by scArches in the revised manuscript. We greatly appreciate your suggestion, which has helped us optimize and strengthen our work.

[Q6] The number of OCRs per cell reduced in ablation study Question 3

We sincerely thank the reviewer for the follow-up questions. For pretraining, we set the maximum OCR sequence length to 440,000. If a dataset contains more than 440,000 OCRs, we retain the top 440,000 most variable OCRs; if it contains fewer, we apply zero padding to match the maximum length. In the ablation for Question 3, we compare maximum lengths of 220,000 and 110,000 OCRs. The reduction in OCRs for these settings is performed via dataset-specific highly variable OCR selection, ensuring that the retained peaks capture the most informative variability within each dataset.
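A minimal sketch of the length-fitting rule described above (truncate to the most variable OCRs, otherwise zero-pad), with names of our own choosing and dense arrays for clarity:

```python
import numpy as np

def fit_to_max_length(counts, max_len):
    """Keep the top `max_len` most variable OCRs (columns) when there
    are too many, or zero-pad when there are too few. Illustrative
    sketch; the real pipeline would operate on sparse matrices."""
    n_cells, n_ocr = counts.shape
    if n_ocr > max_len:
        keep = np.sort(np.argsort(counts.var(axis=0))[::-1][:max_len])
        return counts[:, keep]
    pad = np.zeros((n_cells, max_len - n_ocr))
    return np.hstack([counts, pad])
```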

Summary

We are deeply grateful to the reviewer for their thoughtful and professional feedback. Your comments not only demonstrate a thorough understanding of our work, but also reflect a high level of expertise and academic rigor. Your insightful suggestions significantly improve the clarity and depth of our manuscript. We feel privileged to have our work evaluated by someone with such discernment and scholarly acumen. We hope that our response has addressed all your concerns. Thank you again for your invaluable contribution to the refinement of our research.

Comment

In this paper, we present ChromFound, a foundation model for scATAC-seq data that integrates genome-aware tokenization, specialized long-range modeling, and robust cross-tissue generalization. We sincerely thank the AC and all reviewers for your constructive engagement, and are encouraged by the strong recognition of our work, including one Strong Accept and another score raised to Accept. In response, we add extensive experiments, expanded benchmarking, hyperparameter sensitivity analysis, and methodological clarifications, further strengthening ChromFound's technical soundness, clarity, and impact.

We summarize the key points of our responses during rebuttal and discussion:

  1. Comprehensive baseline expansion
    • Signac, SnapATAC2, peakVI, and poissonVI are included across clustering, batch-effect correction, and annotation tasks for fair and widely recognized comparisons. (requested by reviewer dMbp)
    • Shallow baselines, including logistic regression and a two-layer MLP, confirm ChromFound’s gains even over strong alternatives. (requested by reviewer dMbp and VAFg)
    • Evaluations on continuous training and training-from-scratch variants of ChromFound confirm its consistent superiority. (requested by reviewer dMbp)
    • We optimize peakVI/poissonVI via highly variable peak selection, which is beneficial in low-dimensional settings but degrades with full OCR lengths, highlighting ChromFound’s ability to model long-range chromatin dependencies without accuracy loss. (requested by reviewer dMbp)
  2. Harmonization and peak-calling methods comparison:
    • Compared with a pre-defined OCR reference (cPeaks), dynamic OCR tokenization outperforms on unseen tissues, showing robustness of genome-aware OCR representation. (in response to reviewer dMbp and VAFg)
    • Aligned-data experiments with VAE-based models confirm generalization mainly stems from unified OCR tokenization. (in response to reviewer dMbp)
    • Downsampling experiments in Fig. 2, along with strategies such as log-transformation, dataset-specific normalization, genome-aware reconstruction, and large heterogeneous pretraining, jointly support robust cross-tissue modeling and broader applicability over pre-defined OCR references. (in response to dMbp, ybr7 and VAFg)
  3. Architectural validation and hyperparameter sensitivity analysis:
    • Each component of ChromFound is tailored to scATAC-seq for biologically meaningful representation and practical applications. (in response to reviewer VAFg)
    • Transformer-based proxies (Geneformer, scGPT) saturate well below ChromFound’s accuracy on full-length OCR inputs, indicating that its architecture, rather than data scale alone, enables effective modeling of ultra-long genomic sequences. (in response to reviewer ybr7)
    • WPSA outperforms linear attention variants (Performer, Linformer) in balancing scalability and representation. (in response to reviewer VAFg)
    • Hyperparameter sensitivity experiments on the positional embedding temperature temp and the Mamba projection dimension D_low identify optimal settings that balance accuracy, memory efficiency, and computational cost. (in response to reviewer ybr7)
  4. Loss function design and sparsity handling
    • We emphasize that the balanced masking strategy prevents a trivial zero-prediction bias, and that MSE loss on log-normalized accessibility values avoids the instability of Poisson/ZINB losses across diverse datasets, together ensuring robustness to the imbalance between zero and nonzero OCRs. (in response to reviewer ybr7, VAFg and dMbp)
    • We highlight that ChromFound captures population-level statistical patterns from diverse datasets, enabling it to implicitly distinguish consistent regulatory signals from dataset-specific sparsity or noise. (in response to reviewer VAFg)
  5. Robustness to biological diversity
    • Strong performance on Alzheimer’s and myocardial infarction datasets validates ChromFound’s applicability to diverse disease contexts beyond healthy tissues. (in response to reviewer ybr7)
    • We explain the retina-specific performance improvement under more aggressive downsampling in Figure 2 by showing that retina datasets contain a higher proportion of high-frequency OCRs, where downsampling reduces redundancy and improves signal-to-noise ratio. (in response to reviewer VAFg)
  6. Practical application
    • Training and inference runtimes as well as memory usage on A100 GPUs provide transparent resource requirements for reproducibility and deployment planning. (in response to dMbp and ybr7)

We sincerely thank the AC for ensuring a fair and constructive process, and all reviewers for their insightful feedback. Your comments have directly strengthened our experiments, refined our methodology, and more clearly demonstrated ChromFound’s robustness, scalability, and broad applicability. We also deeply appreciate your dedication and contributions to advancing the AI for Science community, which have greatly enriched and elevated this review process.

Final Decision

The submission describes a foundation model for scATAC-seq data which can be pretrained on data from diverse cell types and tissues and used in a zero-shot or fine-tuned fashion on downstream tasks. Reviewers had high appraisal for the novelty and significance for this application domain. Questions about the technical details of the experiments led to a lively discussion, in which authors provided further experimental comparisons. 3/4 reviewers indicated positive appraisal of these new results presented in the rebuttal and the resolution of most concerns.