ViTally Consistent: Scaling Biological Representation Learning for Cell Microscopy
We develop new methods to train scientific foundation models and evaluate a new 1.9B parameter MAE-G/8 (and others) on a large set of novel benchmarking tasks, demonstrating effective performance for many drug discovery use cases.
Summary
Reviews and Discussion
High content screening (HCS) involves subjecting cells to thousands of perturbations in parallel, and capturing subsequent morphological changes via fluorescent imaging. The scale of data generated by modern experimental workflows has necessitated automated analysis. In light of the success of foundation models, which leverage generic datasets to create representations that are more useful than representations derived from task-specific data alone, the field of machine learning has developed a number of foundation models for HCS. Experimental data, however, is inherently limited, and hence methods are needed to improve model performance without simply scaling to ever larger amounts of data.
This work provides evidence to support three key claims: (i) that careful curation of HCS data, balancing the dataset in terms of morphological variation while keeping diversity high, can improve the performance of HCS foundation models; (ii) that intermediate layers of HCS models can provide stronger representations than the commonly used final layer; and (iii) that the authors train the largest HCS model to date, which significantly improves on a number of challenging biological benchmarks, and compare its performance with models trained on non-biological images and models trained on alternative HCS data.
Update after rebuttal
I have reviewed the paper, the reviews, and rebuttals again. I am inclined to keep my score at a strong accept - this is a well-written paper, with results that I think are noteworthy for the community working in machine learning for high throughput screening data.
I believe that curating a dataset by filtering images with respect to perturbation consistency and diversity is sufficiently novel for publication at ICML. I also think the work showing that the linear probing task is correlated with performance on biological recall is noteworthy, and adds merit for publication for an Applications-Driven submission.
Questions for the Authors
The claim that the curated dataset allows effective, higher-performing models to be trained by using a curated set of images for more epochs would be more clearly supported had the authors also trained a MAE-L/8 on the 93M dataset for more epochs. In particular, the recall performance of the MAE-L/8 models trained on PP-16M and 93M is similar in Table 1. How much impact did training length have on replicate consistency?
Claims and Evidence
There are four main claims in this paper. The experiment design and choice of benchmarks supports these claims with evidence that is sufficient for acceptance.
Claim 1: by curating the images in the proposed manner, and training for longer, better performance can be achieved from fewer images.
- The selection of baselines includes MAE-L/8 trained on either PP-16M or 93M. The performance in Table 1 shows little change in the recall of known biological relationships between these two models, despite the large difference in the number of images and epochs used in training. More importantly, the KS and CM statistics reflecting replicate consistency show large improvements when using the curated dataset.
- These claims would be better supported had the authors trained a MAE-L/8 on 93M for longer. How much impact did training length have on replicate consistency?
Claim 2: Intermediate layers can provide better representations for downstream tasks
- This claim is well supported throughout the manuscript, where representations from earlier blocks are found in many cases to outperform final layer representations. This is not only noteworthy for performance gains, but also reducing inference costs.
Claim 3: linear probing performance on a subset of genetic perturbations correlates strongly with downstream performance on whole-genome benchmarks
- Figure 4 shows a clear linear relationship between linear probing and whole-genome benchmarks.
Claim 4: MAE-G/8 outperforms SOTA across biologically relevant benchmarks
- MAE-G/8 (trimmed) is consistently shown to outperform the well-selected baselines.
Methods and Evaluation Criteria
This work includes comparisons between a number of models. Each is a variant of the Vision Transformer (ViT), differing in the number of parameters and the training dataset. Including a MAE-L/8 trained on RPI-93M and comparing this to a MAE-L/8 trained on PP-16M provides good evidence for the claim that effective models for HCS can be produced with curated datasets. It would have been interesting to include a comparison with a MAE-G/8 trained on the 93M dataset, albeit for fewer epochs, to show that it is the PP-16M dataset which is the decisive factor in scaling to the ‘giga’ family of ViTs for microscopy.
The use of linear probing to evaluate the performance of embeddings at earlier layers of a trained model is well motivated in the paper, as it is computationally infeasible to perform a whole-genome evaluation. Using the 1139-class RxRx1 dataset for genetic perturbation prediction is a challenging task, and demonstrates the claim that earlier layers provide better embeddings with sufficient evidence.
Perhaps most crucially, the model is evaluated on external data, namely the JUMP-CP dataset, which the authors describe as likely to be generated by different assay protocols. Recall of biological relationships using any model is lower, but their MAE-G/8 still shows best recall overall.
Overall, high quality datasets and models are used to provide evidence to support the authors' claims. At a high level, this work is about developing a high-capability model for data generated by high content screening. Given the variety of sources of batch effects that challenge machine learning in this domain, the combination of evaluation on high quality datasets and a focus on replicate consistency and recall of biological relationships positions this paper well in the literature.
Theoretical Claims
Theoretical proofs are not a point of discussion within this paper.
Experimental Design and Analysis
I checked the validity of the experiment testing linear probing across different ViT blocks. A new dataset, different to the pretraining datasets used to train the models, was used for linear probing. Comparing each of these pretrained models on a new evaluation dataset is a fair comparison, and the set of baselines used for comparison are strong. Hence, the experiment seemed to be well designed and supports the relevant claim that representations from earlier model layers can be more useful than the output layer.
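To make this layer-wise probing setup concrete, here is a minimal sketch of how such an experiment could be run, assuming a timm-style ViT whose transformer blocks are exposed as `model.blocks`; the helper names (`block_embeddings`, `probe_block`) are illustrative and are not the authors' code.

```python
# Hypothetical sketch of per-block linear probing (not the authors' implementation).
# Assumes a timm-style ViT whose transformer blocks live in `model.blocks`.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

@torch.no_grad()
def block_embeddings(model, images, block_idx):
    """Mean-pool the token outputs of a chosen transformer block."""
    captured = {}
    handle = model.blocks[block_idx].register_forward_hook(
        lambda _m, _inp, out: captured.update(tokens=out)
    )
    model(images)                       # forward pass; the hook stores the block output
    handle.remove()
    return captured["tokens"].mean(dim=1).cpu().numpy()   # (batch, dim)

def probe_block(model, train_imgs, train_y, val_imgs, val_y, block_idx):
    """Fit a linear probe on one block's embeddings and return validation accuracy."""
    X_tr = block_embeddings(model, train_imgs, block_idx)
    X_va = block_embeddings(model, val_imgs, block_idx)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_y)
    return accuracy_score(val_y, clf.predict(X_va))

# Sweep all blocks and keep the best-performing ("trimmed") depth:
# accs = [probe_block(model, Xtr, ytr, Xva, yva, i) for i in range(len(model.blocks))]
# best_block = int(np.argmax(accs))
```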
Supplementary Material
I reviewed the appendices provided in the manuscript.
This included a detailed account of how the initial dataset was curated. This revealed the under/over-sampling that was applied with respect to the negative and positive controls of the screen that generated the data. The treatment of negative/positive controls in balancing the dataset is not a point that I think has been a focus in the literature, and I think the impact of this manuscript would be improved if this detail were described in the main text. Perhaps rephrase lines 134-141 to be more specific about the issue that positive/negative controls introduce to the balance of a dataset generated from HCS.
The details on perturbation and replication consistency are also welcome, and could set a new standard for benchmarking foundation models for HCS.
Relation to Prior Literature
This paper provides a new model for HCS that scales to 1.9 billion parameters. Previous work [1] has shown that the downstream performance of HCS models is correlated with the number of FLOPs used in training. This work shows that this scaling continues into the regime of billion-parameter models. Crucially, to achieve this, they did not have to scale the size of their dataset by scaling their HCS data generation pipeline, but instead only needed to curate their dataset and scale the number of model parameters and training time.
This model is a ViT and is not channel agnostic as in works [1] and [2]. Previous work has shown that regular ViTs can outperform channel agnostic ViTs and it remains an open question whether a channel-agnostic model can be developed with SOTA performance.
Previous work has shown that representations from layers prior to the output layer can have better downstream performance, for example [3]. This work also demonstrates that earlier layers of a trained ViT can provide representations that perform better on downstream tasks. As such, this is not a novel result in and of itself, but it is the first time it has been demonstrated in models for HCS. However, using linear probing to more efficiently search for the best performing model is a novel result, as it was not clear that linear probing performance on a genetic perturbation classification task would correlate with the whole-genome biological relationship recall task.
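As a concrete illustration of this proxy-search idea, the check amounts to correlating cheap probe scores with expensive genome-wide recall across model/layer choices. The sketch below uses placeholder numbers, not the paper's results, purely to show the computation.

```python
# Minimal sketch of validating a cheap proxy task against the expensive benchmark.
# The values below are placeholders for illustration, not reported results.
from scipy.stats import pearsonr, spearmanr

probe_acc     = [0.41, 0.48, 0.55, 0.60, 0.66]   # hypothetical linear-probe accuracies
genome_recall = [0.22, 0.25, 0.29, 0.31, 0.34]   # hypothetical whole-genome recall values

print(pearsonr(probe_acc, genome_recall))    # linear association across model/layer choices
print(spearmanr(probe_acc, genome_recall))   # rank agreement: does the proxy pick the same winner?
```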
- [1] Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology, O. Kraus et al., 2024
- [2] ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images, N. Bourriez et al., 2024
- [3] MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Masked Image Modeling Representations, B. Alkin et al., 2025
Essential References Not Discussed
No essential references come to mind. The related work sections correspond to the core contributions of the paper, and mention the relevant literature.
Other Strengths and Weaknesses
The overall quality of the writing is very high; this was a good paper to read. The potential impact of this work is high, as it addresses the issue of needing to generate HCS data at ever greater scales in order to scale ML models.
Other Comments or Suggestions
Figure 5 is a little difficult to read due to the large number of overlapping lines. Would a table work as an alternative to highlight its main points?
Thank you for the thoughtful and encouraging review.
On training ViT-G/8 on RPI-93M: We agree this would have been valuable. Due to compute constraints, and the insight gained from training ViT-L/8 on both RPI-93M and PP-16M, we prioritized training ViT-G/8 on the curated PP-16M dataset only. Based on the consistent improvements we observed with ViT-L/8 across replicate consistency metrics, we hypothesized that scaling model size on PP-16M would be the most effective use of resources.
On recall differences on JUMP-CP: Thank you for highlighting this point. The comparison to RxRx3 is complicated by both the reduced set of gene KOs in JUMP-CP (∼8,000 vs ∼17,000 in RxRx3) and the increased variability due to assay differences and batch effects. As such, recall values across datasets are not directly comparable, but we agree it's a valuable direction to further explore model generalization across assay types.
On dataset curation and positive/negative controls: We appreciate this suggestion and agree that the handling of positive and negative controls is important for shaping the final training distribution. We’ll clarify these decisions in the main text for the camera-ready version, particularly in the section describing dataset construction (lines 134–141).
On Figure 5 readability: Thank you for pointing this out. We’ll include a summary table in the appendix to complement Figure 5 in the camera-ready version. This benchmark is also now described in detail in RxRx3-core [Kraus et al. 2025].
On impact of longer training for MAE-L/8 on RPI-93M: This is a great question. While we did not extend training for MAE-L/8 on RPI-93M beyond what was reported in Kraus et al. 2024, the contrast between RPI-93M and PP-16M models trained for the same number of epochs suggests that dataset composition has a major effect on replicate consistency. We agree that further experiments exploring longer training on RPI-93M would help isolate the contributions of training length vs. data curation.
Thank you again for your supportive feedback and helpful suggestions.
References:
- Kraus et al. 2025 [RxRx3-core] – https://arxiv.org/abs/2503.20158
This paper proposes a three-stage framework for pretraining foundation models on large-scale microscopy datasets to address measurement errors and enhance biological signal extraction. The framework involves (1) curating diverse and self-consistent training samples, (2) scaling a vision transformer architecture (ViT-G/8 MAE) trained on 8 billion microscopy image crops, and (3) evaluating intermediate model layers to optimize representations for downstream tasks. The authors introduce the largest known foundation model for cell microscopy (1.9B parameters), demonstrating a 60% improvement in linear separability of genetic perturbations compared to the prior ViT-L/8 MAE. The model achieves state-of-the-art performance across four benchmarks: whole-genome relationship recall, batch effect correction consistency, compound-gene activity prediction, and perturbation analysis. Key innovations include systematic error mitigation through pretraining, intermediate layer feature optimization, and scaling laws applied to biological imaging.
Update after rebuttal
Thanks for the clarification. I have no further questions and will keep my score.
Questions for the Authors
I have no more questions.
Claims and Evidence
All the claims are well-supported.
Methods and Evaluation Criteria
I am not familiar with the application and evaluation criteria, but they seem to make sense. The evaluation leverages multiple biologically relevant benchmarks, including: linear probing tasks (RxRx1 gene classification, Anax functional group classification) to assess representation quality; whole-genome relationship recall using public databases (e.g., CORUM, StringDB) to measure biological relevance; and replicate consistency tests (Kolmogorov-Smirnov and Cramér-von Mises) to ensure model robustness. These benchmarks are well-suited for assessing biologically meaningful representations and align well with the intended application.
Some potential issues are:
The computational cost of whole-genome benchmarking is extremely high (e.g., 80M forward passes). While this provides strong empirical evidence, an analysis of whether a subset of perturbations could yield similar insights with lower computational demands would be valuable.
The study provides some interpretability through linear probes and analysis of intermediate ViT layers, which is a notable contribution. While the correlation between linear probing results and whole-genome benchmarks is strong, further discussion on the biological interpretability of learned features would enhance the paper. For example, are certain gene classes more separable than others? Do embeddings capture known biological pathways?
Theoretical Claims
There are no theoretical claims made in the paper.
Experimental Design and Analysis
The use of self-supervised masked autoencoder (MAE) training aligns with state-of-the-art methods in computer vision and biological imaging. The authors explore different model sizes and configurations, providing valuable insights into scaling behavior. Some potential issues are that the justification for certain hyperparameters (e.g., patch size, mask ratio) is not explicitly discussed. A brief ablation study or sensitivity analysis on these parameters would strengthen the experimental design. Furthermore, the computational cost of training the 1.86B parameter model (48,000 GPU hours) is substantial, and an efficiency analysis comparing model performance versus computational cost would be beneficial.
The study uses a diverse set of evaluation metrics, but some alternative distance metrics (e.g., Euclidean, Mahalanobis) could be explored to ensure robustness.
Supplementary Material
The paper does not include supplementary material.
Relation to Prior Literature
This work makes significant contributions to the field of biological representation learning by scaling self-supervised Vision Transformer (ViT) models for cell microscopy and improving the quality of learned embeddings. The work is related to the areas of self-supervised learning (SSL), biological image analysis, and model scaling in deep learning.
Essential References Not Discussed
The paper appears self-contained, with citations covering relevant literature.
Other Strengths and Weaknesses
The paper presents the largest-scale foundation model for cell microscopy to date, leveraging self-supervised ViTs and a curated dataset (Phenoprints-16M). While self-supervised learning (SSL) has been explored in biological imaging, the combination of large-scale MAE training, dataset curation, and systematic layer-wise probing represents a novel and impactful approach.
My major concern is about technical innovation. Given that the study primarily applies existing methodologies, its contribution in terms of methodological novelty may be limited.
The study focuses on cell microscopy images, but its applicability to other biological imaging modalities (e.g., histopathology, electron microscopy, organoid imaging) is not discussed. A brief analysis of transferability would improve the paper’s broader impact.
The paper focuses on ViT-based MAE models but does not compare against CNN-based architectures (e.g., EfficientNet, ResNet, U-Net), which are still widely used in biomedical imaging. While ViTs have advantages in self-supervised learning and scaling, a baseline comparing MAE-G/8 to a strong CNN or hybrid transformer-CNN model would provide additional perspective.
Other Comments or Suggestions
I have no other comments or suggestions.
Thank you for the thoughtful review and supportive comments.
On the cost of whole-genome benchmarking: This was a key motivation for developing lightweight classification proxy tasks. Selecting a small subset of perturbations for genome-wide benchmarks is challenging due to the random distribution of gene KOs across HCS plates, which limits the number of relationships we can evaluate without embedding thousands of full plates. That said, a promising direction is RxRx3-core [Kraus et al 2025], a curated subset of RxRx3 that may provide more efficient evaluation going forward.
On interpretability: We appreciate this suggestion. Related work explored mechanistic interpretability in these embeddings and directly motivated our use of intermediate layer analysis [Donhauser et al. 2024]. We agree that further investigation into which biological pathways or gene classes are best captured by these models is an important next step.
On hyperparameter choices and CNN baselines: Patch size, mask ratio, CNN baselines (e.g., ResNet, EfficientNet), and SSL vs. supervised training regimes were thoroughly benchmarked in Kraus et al. [2024]. Our work builds directly on those findings by focusing instead on scaling ViTs, refining the training dataset, and evaluating intermediate representations.
On computational cost: We agree the training cost is substantial (48K GPU hours), but we believe the investment is justified. Improved microscopy representations can accelerate early biological discovery and drug development. We are exploring ways to make training and evaluation more efficient in future work.
On distance metrics: We considered alternatives but chose cosine distance due to its popularity and strong performance in deep embedding–based image retrieval. It is scale-invariant and more robust in high-dimensional spaces compared to Euclidean or Mahalanobis distance (Zhe et al. 2018, Deng et al. 2018).
On applicability to other biological imaging modalities: We would have loved to explore this further but were limited by space. Recent work has trained ViT-G with DINO on histopathology data (Virchow 2), which suggests promising transferability to other imaging domains. We're excited by the possibility of adapting our approach to other modalities like H&E, organoids, or EM.
On novelty and contribution: While our techniques build on existing methods, we identify three underexplored strategies that, when combined, significantly improve performance on microscopy-specific benchmarks:
- Curation: Our dataset curation focuses on selecting diverse, experimentally consistent conditions across replicates. This contrasts with typical SSL curation strategies that aim to reduce redundancy by removing similar examples. We show that this approach enhances model consistency and downstream performance.
- Scaling: We demonstrate that further scaling ViTs beyond ViT-L/8 continues to yield improvements, validating the neural scaling hypothesis in microscopy representation learning.
- Intermediate representations: While intermediate layers have been used in CNN transfer learning (Moshkov et al. 2024), they remain under-utilized in ViTs trained with SSL. We show that selecting the right layer, determined via a fast, low-cost proxy task, leads to meaningful improvements in zero-shot biological recall.
Together, these findings offer a generalizable framework for building scalable, biologically relevant foundation models across experimentally derived datasets. We hope our study serves as a foundation for further work at the intersection of self-supervised learning and foundation model development for scientific datasets.
Thank you again for your constructive feedback and support.
References
- Kraus et al. 2024 – https://arxiv.org/abs/2404.10242
- Kraus et al. 2025 [RxRx3-core] – https://arxiv.org/abs/2503.20158
- Donhauser et al. 2024 – https://arxiv.org/abs/2412.16247
- Zhe et al. 2018 – https://arxiv.org/pdf/1802.09662
- Deng et al. 2018 – https://arxiv.org/abs/1801.07698
- Zimmermann et al. 2024 [Virchow2] – https://arxiv.org/html/2408.00738v1
- Moshkov et al. 2024 – https://www.nature.com/articles/s41467-024-45999-1
The authors present a framework for training large-scale computer vision models for microscopy imaging data.
Questions for the Authors
What is the "recall % @ 0.05-0.95 cosine threshold"?
Claims and Evidence
The authors claim to train a large-scale ViT model that should work better than a previous model by Kraus et al., 2024.
The evidence for that is quite unclear and scarce: the tables report performance values for a set of architectures, where it is unclear which architectures come from the authors and which are from previous works. Furthermore, almost all performance metrics are presented without error bars and confidence intervals (and also without statistical tests), such that it is unclear which method performs best. The effect size (difference) between methods is very small. For some metrics it remains unclear even what they are (e.g. "KS" and "CM" in Table 1).
The claimed methodological contributions are also unclear: it is not clear what the main contributions of this framework are, or how they are justified and motivated. It appears that the proposed steps (data curation, scaling, and selecting a block for features) are ad-hoc decisions.
Methods and Evaluation Criteria
The work does not propose a method, but a framework.
The benchmark datasets make sense, like RxRx3 and JUMP-CP, because those are large imaging datasets. However, the evaluation metrics are not well motivated, or even completely unclear: e.g. Table 1 reports only recall without mentioning precision (a method can always trade off recall against precision); Table 2 has "Reactome" and "StringDB" as metrics, where it is unclear what kind of metric this is and why it is relevant; Table 3 again reports only precision and not recall (or AUC-PR). Across the whole paper, it remains unclear what the main evaluation criteria for foundation models for microscopy images are.
For comparison, ref [1] sets up a battery of zero- and few-shot downstream tasks, for which clear evaluation criteria and metrics exist. The authors should set up a set of zero- and few-shot downstream tasks (together with metrics and evaluation criteria), which should be solved by the foundation models. Also Kraus et al [2], should be an inspiration for the authors.
References: [1] Sanchez-Fernandez, A., Rumetshofer, E., Hochreiter, S., & Klambauer, G. (2023). CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical structures. Nature Communications, 14(1), 7339. [2] Kraus, O., Kenyon-Dean, K., Saberian, S., Fallah, M., McLean, P., Leung, J., ... & Earnshaw, B. (2024). Masked autoencoders for microscopy are scalable learners of cellular biology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11757-11768).
Theoretical Claims
There are no theoretical claims in this work.
Experimental Design and Analysis
The experimental analysis mostly checks how good intermediate layers of the ViTs are w.r.t. performance. This is a reasonable analysis.
Question: Is this analysis done on a separate validation set (distinct from all other downstream tasks), such that selecting the best layer here does not introduce a bias for the later comparisons and benchmarks?
Supplementary Material
No supplementary material provided. Appendix read, but not reviewed in detail.
Relation to Prior Literature
The paper is well placed within the broader scientific literature on ViTs, MAEs, etc. However, early works on using deep neural nets for microscopy images are completely absent, and other self-supervised learning approaches for microscopy images are also missing (see below).
Essential References Not Discussed
There are many earlier works on deep learning, e.g. CNNs, for microscopy images, e.g. ref [3], which are not mentioned or referred to. I only provide an exemplary reference here, but the authors should redo the literature research to give a better view of this field:
[3] Ciresan, D., Giusti, A., Gambardella, L., & Schmidhuber, J. (2012). Deep neural networks segment neuronal membranes in electron microscopy images. Advances in neural information processing systems, 25.
Also, contrastive learning approaches (e.g. SimCLR, CLIP, etc) are hardly mentioned, which have also been successfully applied to microscopy imaging data. The authors should provide at least a paragraph in related work on papers around that topic.
Other Strengths and Weaknesses
Strengths:
- Effort to scale ViTs to large-scale microscopy datasets
- Substantial computational resources used to develop the model
Weaknesses:
- Unclear what the novelty is or just a framework proposed
- Limited relevance because of missing error bars and statistical tests; unclear benchmark tasks and metrics
- Main analysis steps are not well justified and motivated
- Unclear what the machine learning aspect of this work is; maybe better suited for a bio-venue
Other Comments or Suggestions
Citing Micrographia by Hooke is really a great one!
Thank you for the detailed review and thoughtful comments.
On tables and metrics clarity: We apologize for not clearly marking the Kraus et al. 2024 model in the tables. As noted in the “Models/prior work” section, MAE-ViT-L/8+ trained on RPI-93M is from Kraus et al. In Tables 1 and 3, we report mean ± standard deviation computed from 3 and 100 random seeds, respectively. The Table 2 caption notes a maximum standard deviation of ±0.0023. Perturbation consistency is evaluated using Kolmogorov-Smirnov (KS) and Cramér-von Mises (CVM) statistics, which compare replicate similarities to an empirical null. These are detailed on line 281 and in Appendix 7, and were introduced in Celik et al. [2024].
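For readers unfamiliar with these statistics, a simplified sketch of how such replicate-consistency scores could be computed is below. It follows the general idea of comparing replicate similarities against an empirical null, not the exact procedure of Celik et al. [2024], and the function names are illustrative.

```python
# Simplified sketch of replicate-consistency statistics (KS and Cramér-von Mises).
# Compares same-perturbation replicate similarities to a null of unrelated pairs.
import numpy as np
from scipy.stats import ks_2samp, cramervonmises_2samp

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def consistency_stats(replicate_pairs, null_pairs):
    """replicate_pairs: (emb_i, emb_j) from the same perturbation across batches;
    null_pairs: embeddings of unrelated (e.g. control) pairs forming the empirical null."""
    rep_sims = np.array([cosine(a, b) for a, b in replicate_pairs])
    null_sims = np.array([cosine(a, b) for a, b in null_pairs])
    ks = ks_2samp(rep_sims, null_sims).statistic
    cvm = cramervonmises_2samp(rep_sims, null_sims).statistic
    return ks, cvm   # larger separation from the null = more consistent replicates
```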
On evaluation and metrics: Our relationship recall task evaluates whether embeddings can zero-shot identify known interactions between perturbations based on cosine similarity. Because we use genome-wide CRISPR KO screens, many true interactions may be missing from annotation databases. Thus, precision is not meaningful in this context, as we cannot assume non-annotated gene pairs are false positives. The “recall % @ 0.05–0.95 cosine threshold” measures the fraction of annotated gene-gene relationships found in the most similar (top 5%) or dissimilar (bottom 5%) embedding pairs. We use multiple knowledge bases (CORUM, hu.MAP, Reactome, StringDB) and report results across them. These metrics are described in Appendix 6 and used in both Celik et al. and Kraus et al. 2024.
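A rough sketch of this recall metric, as described above, might look like the following; the thresholds, variable names, and helpers are illustrative rather than the authors' implementation.

```python
# Sketch of "recall % @ 0.05-0.95 cosine threshold": the fraction of annotated
# gene-gene pairs whose cosine similarity falls in the top 5% or bottom 5% of
# all pairwise similarities. Illustrative only.
import numpy as np

def relationship_recall(gene_embs, annotated_pairs, lo=0.05, hi=0.95):
    """gene_embs: dict gene -> aggregated embedding vector;
    annotated_pairs: set of (gene_a, gene_b) from e.g. CORUM or StringDB."""
    genes = sorted(gene_embs)
    E = np.stack([gene_embs[g] for g in genes])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    iu = np.triu_indices(len(genes), k=1)            # all unordered gene pairs
    lo_thr, hi_thr = np.quantile(sims[iu], [lo, hi])
    idx = {g: i for i, g in enumerate(genes)}
    hits = sum(
        1 for a, b in annotated_pairs
        if a in idx and b in idx
        and (sims[idx[a], idx[b]] >= hi_thr or sims[idx[a], idx[b]] <= lo_thr)
    )
    known = sum(1 for a, b in annotated_pairs if a in idx and b in idx)
    return hits / max(known, 1)
```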
On methodological contributions and motivation: This work extends efforts to scale ViT architectures and microscopy datasets. We found three key strategies that meaningfully improve performance:
- Curation: Surprisingly, training ViT-L/8 on a smaller, curated dataset improved performance vs. RPI-93M. The curation focuses on selecting diverse experimental conditions with consistent replicate phenotypes, rather than removing redundancy as done in typical SSL data prep. This led us to train MAE-G/8 on the curated PP-16M set.
- Scaling: Training larger ViTs continues to yield benefits, consistent with the neural scaling hypothesis. Our largest model, MAE-G/8, outperforms smaller ones in both replicate consistency and biological relationship recall.
- Intermediate representations: We found intermediate layers often outperform penultimate layers for zero-shot tasks. Since fine-tuning is not feasible in our setting, we propose lightweight proxy tasks (RxRx1, Anax classification) to select the best layer. We show these proxy tasks strongly correlate with more expensive genome-wide evaluations. These insights, while based on well-known techniques in general vision or language settings, are novel in their application and combination for microscopy-specific representation learning. They provide a practical and reproducible framework applicable to any experimental dataset with replicate measurements.
On prior work and references: We appreciate the suggestion to cite earlier microscopy-specific deep learning work. Due to space constraints, we omitted a drafted section discussing such papers and instead prioritized prior work most relevant to our methods: dataset curation for SSL, layer selection, and embedding evaluation in microscopy. While we did not focus on contrastive approaches like CLOOME or MolPhenix, we agree they are valuable contributions. Our work differs in focus as we do not relate images to molecular structures, but instead aim to build general-purpose embeddings for microscopy data.
We hope this clarifies the design and motivation behind our choices, and the contributions our framework makes to the field. Thank you again for your detailed and constructive feedback.
References:
- Celik et al. 2024 – https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012463
- Kraus et al. 2024 – https://arxiv.org/abs/2404.10242
- Fradkin et al. 2024 [MolPhenix] – https://arxiv.org/abs/2409.08302
This paper provides a framework to improve biological representation learning on large-scale microscopy datasets. Three steps are introduced: (1) Curating the training dataset to have a better distribution of samples over the phenotypic spectrum, (2) scaling training to larger models, and (3) evaluating intermediate representations to identify the best representation for downstream tasks.
Superior performance of the largest model trained on the curated dataset is shown over previous baselines in terms of biological relationship recall and zero-shot generalization. It is shown that finding the best layer for downstream tasks using linear probing improves accuracy, and linear probing accuracy on a smaller validation set correlates well with whole genome evaluation results.
Questions for the Authors
NA
Claims and Evidence
I will address each claim one by one:
(1) Curating the training dataset to have a better distribution of samples over the phenotypic spectrum leads to improved performance: I’m not sure if this is necessarily true. Imbalanced datasets do affect the model’s ability to learn about certain semantic concepts, but I doubt this is true at the scale of models and datasets we are talking about. It might happen that certain phenotypes have a very low frequency and the model will not learn about these, but in general imbalances in the number of samples per phenotype should not affect the semantic quality of the model’s embeddings. The results in Table 1 reflect this – the differences between the MAE L/8 trained on RPI-93M and PP-16M are minor, and in fact are the same in terms of biological relationship recall for the untrimmed versions. Overall I don’t think there is enough evidence to suggest that the curation described here leads to meaningful gains in performance.

(2) Scaling model size leads to improved performance: This is not surprising, but the improvements shown in Tables 1, 2, and 3 are pretty minor given that the model size increases 6 times between MAE L/8 and MAE G/8. It is hard to evaluate whether these differences are meaningful, and even if they are, whether they are worth the increase in model size.

(3) Evaluating intermediate representations to choose the best layer for downstream tasks improves performance: Again, this is not really a novel idea. It has been well known since the early days of representation learning that successive layers learn information at different levels of abstraction. While the penultimate layer is generally used for downstream tasks, it is not uncommon to use another intermediate layer. That said, mostly consistent, if somewhat minor, increases in accuracy are shown for most models, which does support the claim that trimming models is useful in the context of representation learning for cell microscopy.
Methods and Evaluation Criteria
The methods generally make sense for the problem of representation learning for microscopy, and the evaluation is shown on multiple datasets and for multiple models. None of the methods in the paper are necessarily novel, but experiments are performed at scale and a novel foundation model is introduced.
I would prefer it if the evaluation of model scaling and dataset curation were done independently of each other in Tables 1, 2, 3, i.e. train models of the same size on the 3 datasets (RxRx3, RPI-93M, Phenoprints-16M) and also train models of different sizes on all 3 datasets. The scale of improvement is not drastic enough for me to determine whether either of these contributes significantly.
The addition of out-of-distribution generalization on JUMP-CP and RxRx3 is a good complement to the evaluation section. Biological validation is also mostly robust. While I will say that it is not surprising that linear probing accuracy on the manually curated Anax dataset correlates well with the whole-genome score, I think this is a good addition and could be a benchmark for future studies.
Theoretical Claims
There are no theoretical claims made in the paper.
Experimental Design and Analysis
NA
Supplementary Material
I did not review the supplementary material in detail
Relation to Prior Literature
NA
Essential References Not Discussed
NA
Other Strengths and Weaknesses
Strengths:
I think the main strength of this paper is that it provides baseline results related to the effect of 3 components of the model building pipeline in the context of representation learning for microscopy – dataset curation, scale, and best intermediate layer to choose for downstream tasks. The trained foundation model as well as the results would provide a good benchmark for future work in this area. The paper is well written and understandable, and generally thorough in terms of evaluation and metrics.
Weaknesses:
I have two main issues with the paper: (1) The methodological contributions are minimal and the paper is pretty bland in terms of novel ideas. All of the methods discussed here have been widely evaluated in different settings, and I’m not sure if it has enough methodological novelty to be published at this venue. (2) The performance improvements on the whole seem pretty minor to me, for example ~1-2% for the microscopy MAEs in Table 1. Combined with the lack of novelty, this makes me question whether the contribution is that strong.
Other Comments or Suggestions
NA
Thank you for the detailed review and thoughtful feedback.
On the significance of data curation: Filtering is critical even in large language models. Penedo et al [2023] highlight the value of de-duplication in RefinedWeb. While such techniques (e.g., MinHash) don't directly apply to images, our curation serves a similar role. Dataset curation had a strong impact on replicate consistency (KS): ViT-L/8 improved from .52 (RPI-93M) to .59 (PP-16M). We hypothesize that RPI-93M contains many control/inactive perturbations, leading to overfitting to subtle batch effects even after alignment. In contrast, PP-16M better captures phenotypic diversity and yields more consistent representations.
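To illustrate one plausible reading of this curation step, the sketch below keeps only perturbations whose replicate embeddings agree more strongly than a control-derived null; the threshold choice and helper names are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of consistency-based curation (not the exact PP-16M procedure):
# keep perturbations whose replicate embeddings agree with each other more than
# a null built from unrelated control wells.
import numpy as np

def curate(pert_to_replicates, control_null_sims, quantile=0.9):
    """pert_to_replicates: dict perturbation -> list of replicate embedding vectors;
    control_null_sims: cosine similarities between unrelated control wells (the null)."""
    threshold = np.quantile(control_null_sims, quantile)    # assumed cutoff
    keep = []
    for pert, reps in pert_to_replicates.items():
        E = np.stack(reps)
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = (E @ E.T)[np.triu_indices(len(reps), k=1)]   # replicate-replicate similarities
        if len(sims) and np.median(sims) > threshold:       # consistent across batches
            keep.append(pert)
    return keep   # perturbations whose images enter the curated training set
```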
On the significance of model scaling: We show that combining curation, scaling, and layer selection yields strong gains. In Table 1, ViT-G/8 on PP-16M-trimmed improves replicate consistency (KS: .52 → .63, +21%; CM: 12.3 → 18.2, +48%) over prior SOTA (ViT-L/8 on RPI-93M). This matters as biological recall is less meaningful without consistent perturbation representations. Table 2 shows improved zero-shot recall on an OOD dataset (JUMP-CP): ViT-G/8 trimmed recalls more interactions in the 0.05/0.95 cosine thresholds [CORUM (134), HuMAP (65), Reactome (42), STRING (157)] demonstrating generalization across experimental labs. Table 3 reinforces this with gains on a benchmark for compound-gene relationships, showing ViT-G/8 more effectively encodes biological interactions.
On the novelty of evaluating intermediate representations: Layer selection has precedent in supervised transfer learning, but its use in SSL-trained transformers is rare. Prior work typically fine-tunes the penultimate layer, but in microscopy, supervised fine-tuning often reduces performance [Kraus et al. 2024]. We show that selecting intermediate layers in ViTs trained with SSL improves zero-shot performance without retraining. This holds across models, including those pre-trained on natural images, and offers a practical contribution to microscopy-based representation learning. The only related work we’re aware of is MIM-Refiner [Alkin et al. 2024], which appeared contemporaneously. While intermediate layers were known to sometimes yield better representations, their inference compute advantages became significant with recent model sizes. For ViT-G/8, metrics from layer 38 required ~3,000 L4 GPU hours vs. ~4,000 from the final layer.
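As a rough illustration of what "trimming" could look like in practice, the sketch below drops the blocks after a selected depth in a timm-style ViT so that inference stops early; the block index and helper names are illustrative assumptions rather than the released model code.

```python
# Illustrative sketch of "trimming" a ViT at a selected block to cut inference cost.
# Assumes a timm-style model whose transformer blocks live in `model.blocks`; the
# chosen depth would come from the linear-probe sweep over layers.
import torch.nn as nn

def trim_vit(model, keep_blocks):
    """Drop transformer blocks after `keep_blocks`; the usual forward then runs only
    the retained blocks, saving compute roughly in proportion to the removed depth."""
    model.blocks = nn.ModuleList(list(model.blocks)[:keep_blocks])
    return model

# e.g. trimmed = trim_vit(vit_model, keep_blocks=38)
# feats = trimmed.forward_features(images).mean(dim=1)   # mean-pooled token embedding
```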
On model/dataset ablations: We appreciate the suggestion. ViT-G/8 training required 256 H100 GPUs for over a week, so full cross-comparisons were infeasible. However, we provide targeted ablations:
- Layer selection: All models were evaluated at multiple layers. Intermediate layers consistently outperform penultimate. Proxy tasks like RxRx1 and Anax classification correlate with genome-wide evaluations, offering cost-effective benchmarks.
- Dataset comparison: We compare ViT-L/8 trained on RPI-93M vs. PP-16M across all metrics, motivating our choice to train ViT-G/8 on PP-16M.
- Model scaling: Scaling benefits in microscopy SSL have been demonstrated [Kraus et al. 2024, Fig. 5], supporting our decision to go beyond ViT-L/8.
On the novelty of contributions: While curation, scaling, and layer selection have been explored individually in other domains, we demonstrate their combined effectiveness for representation learning in microscopy. These insights also apply to other biological settings with repeated measurements.
- Curation strategy: Unlike prior SSL curation that removes duplicates, we identify experimental conditions with consistent phenotypes across replicates, improving signal and reducing batch effects. This is a novel form of statistical deduplication.
- Model scaling: Scaling ViTs beyond ViT-L/8 provides measurable gains and supports the neural scaling hypothesis in microscopy. At this scale, sharing what works is crucial. Reproducing one ViT-G/8 run would cost ~$470K (43,000 H100 GPU hours × $11/hour). Prior scaling papers (e.g., Touvron et al. 2023, DeepSeek-V3) have proven valuable by sharing such insights.
- Layer selection: While intermediate layers are used in supervised CNNs [Moshkov et al. 2024], they’ve not been systematically explored in SSL ViTs. We show that layer-wise evaluation improves performance without fine-tuning, highlighting a practical and underused method for biology-specific tasks.
We appreciate your recognition that the paper provides strong baselines and a new foundation model, and hope our responses clarify the significance and generalizability of our contributions.
References:
- Alkin et al. 2024 – https://arxiv.org/abs/2402.10093
- Kraus et al. 2024 – https://arxiv.org/abs/2404.10242
- Moshkov et al. 2024 – https://www.nature.com/articles/s41467-024-45999-1
- Touvron et al. 2023 – https://arxiv.org/abs/2302.13971
- DeepSeek-V3 – https://arxiv.org/pdf/2412.19437
- Penedo et al. 2023 – https://arxiv.org/abs/2306.01116
This paper received very divergent scores, with a strong accept, an accept, a weak reject, and a reject.
Specifically, the reviewers generally acknowledge the effort to train a foundation model on large-scale microscopy datasets, which could become a benchmark for future work in microscopy image analysis, as well as the extensive evaluation. While the proposed framework, consisting of curating the training dataset, scaling to larger models, and evaluating intermediate representations, is appreciated by some reviewers, concerns about novelty are raised. In the initial reviews, the reviewers generally agree that the individual components used in the proposed framework are not novel, limiting the technical and methodological novelty of the work. Specifically, while two reviewers acknowledge the novelty and impact in the specific field of microscopy image analysis, the other two question the limited improvements in performance, which, together with the limited novelty, they feel weaken the overall contribution of the paper.
In the rebuttal, the authors provide more information on the specific significance of the three steps of the pre-training approach, pointing out the novelty with respect to the domain of application, and after discussion, all the reviewers maintain their original scores.
After reading the paper, the reviews, and the rebuttal, and considering the overall scores, the AC, despite agreeing that the individual components are not novel as pre-training strategies, believes that the main contributions may be impactful in the specific domain of application, including the release of the trained foundation model, which can potentially be exploited for further applied work. On top of this, considering that this is an application paper that is well written and clear overall, the AC leans toward acceptance of this work.