How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
We address the challenge of contrastive phenomic molecular retrieval. We demonstrate that pre-trained uni-modal representation methods can be used in a variety of ways to significantly improve zero-shot molecular retrieval rates.
Abstract

Reviews and Discussion
The paper introduces MolPhenix for tackling the contrastive phenomolecular retrieval problem. MolPhenix employs a uni-modal pretrained phenomics model, an inter-sample similarity-aware loss function, and conditions on the representation of molecular concentration. This approach effectively addresses challenges such as experimental batch effects, inactive molecule perturbations, and encoding perturbation concentration. Experimental results demonstrate the model's effectiveness in the retrieval task.
Strengths
- The paper is well-written and the motivation is clear.
- The identified challenges are highly relevant and significant in the biological context, and the proposed methods are logical and well-conceived.
- The experimental results indicate substantial improvements over baselines across various settings in the retrieval task.
Weaknesses
- In Table 2 and Table 4, there is a significant performance increase for DCL, CWCL, SigLIP, and S2L compared to other baselines. The source of these improvements is unclear. Conducting an ablation study for each of the three components of the method, corresponding to the three challenges, would provide more insights.
- The paper primarily compares MolPhenix with CLOOME and a few other general domain objectives. However, numerous related studies specifically within the molecular-phenotype contrastive learning domain [1-6] are not discussed or compared.
- The paper emphasizes the retrieval task, with experimental results showing only the top 1% recall accuracy. This limits the overall impact of the findings. Considering broader application scenarios based on the proposed method could enhance the paper's significance.
[1] Cross-modal graph contrastive learning with cellular images. bioRxiv 2022.
[2] Contrastive learning of image- and structure-based representations in drug discovery. ICLR MLDD 2022.
[3] Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation. ICML CompBio 2023.
[4] MMCL-CDR: enhancing cancer drug response prediction with multi-omics and morphology images contrastive representation learning. Bioinformatics 2023.
[5] Removing Biases from Molecular Representations via Information Maximization. ICLR 2024.
[6] Learning Molecular Representation in a Cell. arXiv 2024.
Questions
- The retrieval accuracy results in the paper are reported as top 1% accuracy. However, this metric may be less informative when the retrieval set is very large, as the top 1% can still include a substantial number of molecules. Could the authors also provide results for top N accuracy, where N = 1, 10, and 100?
Limitations
The authors have noted several limitations in their methods and experiments. Additional limitations are listed in the Weaknesses and Questions sections.
We thank the reviewer for providing detailed feedback on our paper. Below, we aim to address your concerns point by point.
Concern #1: In Table 2 and Table 4, there is a significant performance increase for DCL, CWCL, SigLIP, and S2L compared to other baselines. The source of these improvements is unclear.
We assess the effectiveness of the S2L, SigLIP, and DCL losses by analyzing the gradient flow of the InfoNCE and DCL formulations. In particular, we analyze the cases where inactive molecules cause the gradient to vanish, inhibiting training with InfoNCE and its variants.
The decoupled contrastive learning (DCL) loss is an effective alternative to the InfoNCE loss due to its removal of the positive term from the denominator:

$$\mathcal{L}_{\text{DCL}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{k \neq i} \exp(s_{ik}/\tau)},$$

where $s_{ij}$ denotes the cross-modal similarity between sample $i$ in one modality and sample $j$ in the other, and $\tau$ is a temperature.
The authors show that when computing the gradient of InfoNCE, a multiplicative factor, which they name the negative-positive coupling (NPC) term, modulates the gradient of each sample; in cases where this term becomes small, gradient flow is inhibited:

$$q_i = 1 - \frac{\exp(s_{ii}/\tau)}{\sum_{k=1}^{N} \exp(s_{ik}/\tau)}, \qquad \frac{\partial \mathcal{L}_{\text{InfoNCE}}}{\partial s_{ii}} \propto -\,q_i.$$
The NPC term can be small when the positive samples are too close to one another, when there is a small number of negative samples, or when the negative samples are too easy to discriminate from the positive pair. By removing the positive term from the denominator, DCL simplifies training, removing the NPC term from the gradient calculation.
We hypothesize that in our case inactive molecules tend to be easy negatives, thus inhibiting the gradient flow that would otherwise be helpful to model training. Training the model with a loss that separates the positive gradient term (attracting paired samples from the two modalities) from the negative term (repelling negative pairs) achieves higher overall performance. Like DCL, the SigLIP and S2L losses are computed for each pair of samples independently, thus separating the positive and negative loss terms. As a result, the gradient calculation for informative samples is unaffected by the NPC term, yielding higher overall performance.
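To make the NPC effect concrete, here is a minimal numeric sketch (illustrative only, not the paper's implementation; the similarity values, temperature, and function name are invented) showing how the NPC coefficient shrinks when negatives are easy:

```python
import math

def npc_coefficient(sims, pos_idx, tau=0.1):
    """Negative-positive coupling (NPC) multiplier for one anchor.

    sims: similarities of the anchor to all candidates (positive + negatives).
    The InfoNCE gradient for this anchor is scaled by q = 1 - p_pos, where
    p_pos is the softmax probability of the positive pair. Easy negatives
    push p_pos toward 1 and hence q toward 0, inhibiting gradient flow; DCL
    drops the positive term from the denominator, removing this coupling.
    """
    exps = [math.exp(s / tau) for s in sims]
    p_pos = exps[pos_idx] / sum(exps)
    return 1.0 - p_pos

# Easy negatives (e.g. inactive molecules): the positive dominates, q -> 0.
q_easy = npc_coefficient([0.9, -0.8, -0.9, -0.7], pos_idx=0)
# Hard negatives: the coupling term stays large and the gradient flows.
q_hard = npc_coefficient([0.9, 0.85, 0.88, 0.8], pos_idx=0)
assert q_easy < q_hard
```

Under this sketch, `q_easy` is near zero while `q_hard` remains large, matching the intuition that simple negatives suppress the InfoNCE gradient.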
Concern #2: The paper primarily compares MolPhenix with CLOOME and a few other general domain objectives… numerous related studies specifically within the molecular-phenotype contrastive learning domain.
We thank the reviewer for an in-depth exploration of relevant works. We note that the second paper in the suggested list ([2]) is an early presentation of CLOOME (2023), to which we extensively compare. In addition, we utilized a number of strong baselines from the Image-Text multi-modality literature, such as CWCL (2023), SigLIP (2023), and DCL (2021).
We agree with the reviewer that we can strengthen the pheno-molecular multi-modal related-works section. We focused our assessment on works that innovate on methodological components, but agree it is important to capture a broader perspective. We note that these additional works support the importance of MolPhenix, as they demonstrate the value of learning an effective joint embedding of phenomic experiments and molecular structures. MolPhenix's contributions can be used to further improve the use of phenomics and cell-profiling experiments for cancer cell line drug response prediction ([4] in the suggested list) and gene knockout predictions ([6] in the suggested list). As the authors of “Cross-modal graph contrastive learning with cellular images” conclude in their paper: “There are still some challenges that need to be addressed, such as inherent data noise and batch effect. These could be resolved by designing specific encoders for cellular images, optimizing cross-modal fusion mechanisms, and introducing heterogeneous cross-modal data”. Our work aims to introduce general methods for extracting additional information from phenomic data, such as k-patch averaging, the S2L inter-sample similarity-aware loss, and concentration encoding.
We have updated our related Molecular-Phenomic Contrastive Learning section to reference and discuss the listed works.
Concern #3: Could the authors also provide results for top N accuracy, where N = 1, 10, and 100?
In the Image-Text domain it is common to report N = 1, 10, 100 zero-shot class retrieval results. Since the number of classes for a given dataset is consistent across studies, top-N results always correspond to a semantically meaningful metric.
However, in our case we believe these statistics might be misleading, since different works evaluate on datasets of different sizes, making the N = 1 recall task of varying difficulty in a dataset of 100 molecules vs. 100,000. Other published works resort to a similar strategy: for example, “Cross-modal graph contrastive learning with cellular images” reports Hit@1, 5, 10 in Table 1, which actually correspond to percentages due to subsampling of the overall data to 100 data points. Similarly, “Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation” reports top 1, 5, and 10% retrieval in Figure 2.
We note that in the external-dataset, held-out-dose evaluation setting, the dataset consists of 1,639 molecules, making the top-1% set include just 16 molecules in total.
We thank you for a detailed and thorough assessment of this work. Please let us know if there are any additional points we can discuss, to improve your assessment of our work.
I want to thank the authors for their detailed responses to my questions.
Please let us know if there are any additional clarifications that we can provide that could increase your support for the acceptance of the paper.
The paper makes a significant step forward in the task of phenotype-molecular retrieval: finding the molecule that was applied to perturb a set of cells, given a microscopy readout. This problem can be modeled using multi-modal contrastive learning. The proposed MolPhenix method leverages pre-trained foundation models to encode the microscopy images and molecular structures. A novel contrastive loss is then used, which takes into account domain-specific issues such as batch effects and different concentrations of the applied molecules. Through thorough evaluations and ablations, the authors show increased performance over a baseline.
Strengths
- The paper tackles a task relevant for drug discovery and shows a significant step forward in predictive performance
- The paper is very well written
- The authors perform extensive experiments on held-out molecules, phenotypes, and datasets
- The CLOOME baseline is tuned convincingly
- The proposed guidelines for the task are supported by ablation studies
- The paper combines and improves upon SOTA methods in multi-modal contrastive learning
- The authors show domain knowledge by incorporating reasonable data processing and training strategies (batch effect removal + undersampling of inactive molecules)
Weaknesses
- The authors focus on retrieval instead of evaluation of the latent space. Given the expressiveness of the leveraged foundation models, generating phenotypes or molecules should be possible. It would be great if the authors could comment on this or other explorations of the phenotypic space.
- The authors train on a private dataset which makes the paper almost not reproducible. However, due to detailed model descriptions, the guidelines could be evaluated on other datasets.
Questions
- Please elaborate on generative or mechanistic opportunities of the model
- You argue MolPhenix cannot be applied to images. Please elaborate since instead of ph-1 a simple non domain-specific image feature extractor could be used. Have you studied the results?
- Please comment on which model weights of ph-1, Mol-1, and MolPhenix are publicly available or you will be releasing
Limitations
- Limited reproducibility due to private datasets and the model weights likely not being released
We thank the reviewer for the positive, insightful comments and the detailed feedback. We believe your perspective will help shape this into a stronger work. Below, we aim to discuss each of the suggestions and discussion points individually.
Discussion point #1: “Generating phenotypes or molecules should be possible”
We agree with the reviewer that this is a really exciting future direction for the work. As a first step in establishing pheno-molecular multi-modal learning, we decided to focus on the retrieval task due to its straightforward quantitative evaluation. Our goal in this work was to establish a set of design decisions and guidelines that we can quantitatively investigate for effective latent space stratification.
High pheno-molecular retrieval rates would indicate to us, that the model learns effective stratification of the latent space, opening up the door for future experiments with molecular and phenomics generation. This is a critical research direction with applications in drug-discovery such as identifying phenotypically similar molecular analogs with applications to generic molecule research.
There is a unique set of challenges in evaluating high quality, diverse molecule designs. The state of the art for assessing molecule quality are still biological experiments, which we are excited to explore in future works.
Discussion #2: The authors focus on retrieval instead of evaluation of the latent space.
We certainly agree with the reviewer that MolPhenix embeddings should be able to provide additional interesting assessments of the quality of the model's latent space. To that end, we have multiple experiments described in Appendices E1 and E2 which we could not include in the main text due to space constraints. In addition, we conduct experiments assessing the quality of the learned molecular representation across 35 downstream tasks, discussed in more detail in the general response (Table 4 of the attached PDF).
Some brief details on the appendix experiments: we were interested in assessing whether our molecular encoder captures pheno-activity throughout training. We found that by specifying only the molecular structure and the corresponding concentration, we are able to predict molecular activity levels with an ROC-AUC of 0.9. Visualization of the model latent space can be found in supplementary Figure 7. These results open the door for in-silico activity screening that has previously been infeasible. In addition, these predictions can be used as in-silico dose-response curve evaluations, finding an activity/dosage trade-off for a previously unscreened molecule.
In appendix E2 we evaluate MolPhenix’s ability to identify previously known concordant perturbations between small molecules and genetic knockouts. From a database of known relationships, we encode the molecular perturbation with the MolPhenix molecular encoder and assess whether it is able to match them with a corresponding embedding of a genetic knockout. To create an embedding of a genetic perturbation, we embed results from a phenomic experiment of cell lines with a corresponding gene knocked out. We find that this in-silico perturbation concordance experiment is able to provide strong results relative to a fully experimental baseline.
We believe these findings are important initial experiments demonstrating downstream utility applications of MolPhenix learned embeddings.
Concern #3: The authors train on a private dataset which makes the paper almost not reproducible. However, due to detailed model descriptions, the guidelines could be evaluated on other datasets.
We are unfortunately unable to release the training dataset for MolPhenix, but aim to disseminate generalizable findings that can be helpful to other scientists working in this domain. To that end, we provide pseudo-code and algorithmic descriptions for the S2L loss. Additionally, our algorithm implementation is kept in PyTorch-like syntax for easier reproducibility. In addition, we evaluate our models on the large, openly accessible, independent RxRx3 dataset, which can be used by the community for evaluating other models and benchmarking other design choices.
Concern #4: Please elaborate on generative or mechanistic opportunities of the model
This is an important point that we hope that we’ve addressed in sufficient detail in discussion points 1 and 2.
Concern #5: You argue MolPhenix cannot be applied to images. Please elaborate since instead of ph-1 a simple non domain-specific image feature extractor could be used.
We thank the reviewer for bringing up this important point, and would like to clarify that any sufficiently expressive image feature encoder can be used. In the general reviewer response, for example, we demonstrate that we can use an alternative encoder that is trained in a supervised fashion to predict identity of genetic perturbations (as an alternative to θPh-1). We hope that these experiments in addition to use of an ensemble of publicly accessible fingerprints provide sufficient evidence that the guidelines are generalizable across a number of public and private encoders.
We have updated the Table 1 caption to describe our training pipeline more clearly: “We note that MolPhenix’s main components, such as S2L and embedding averaging, rely on having a pre-trained uni-modal phenomics model.”
Concern #6: Please comment on which model weights of ph-1, Mol-1 and MolPhenix are publicly available or you will be releasing?
We note that the code and training data for Mol-1 are available, and we will provide references to them in the final version of the paper. In addition, a public version of θPh-1 will be available for inference.
Thank you for your positive review and your interest in our work. We will be happy to further discuss and answer any additional points of interest.
Thank you for your comprehensive response. I have increased my score.
The paper introduces MolPhenix, a framework for contrastive phenomolecular retrieval that integrates phenomic data and molecular structures into a joint embedding space. Key contributions include combining phenomic and molecular data for improved retrieval accuracy, proposing effective training guidelines, and addressing cumulative concentrations and label noise. The framework demonstrates significant performance improvements over existing methods, particularly in zero-shot settings, and is supported by comprehensive experiments and ablation studies.
Strengths
The paper is well-structured and comprehensive, showcasing rigorous experimentation through extensive ablation studies, comparisons with baseline methods, and evaluations. Its strength lies in the detailed and methodical approach to validating the MolPhenix framework, providing strong evidence for its effectiveness and potential applications. The experimental design is robust, demonstrating the framework's superiority in various scenarios and contributing valuable insights to the field. Additionally, the integration of molecular and phenomic data into a joint multi-modal embedding offers a fresh perspective, enhancing the overall impact and originality of the work.
Weaknesses
- The clarity of this work could be significantly improved. The task and background are not well stated, and the related work is not adequately introduced. Many biological terminologies are mentioned, creating a large gap between the introduction and the real dataset.
- The paper could benefit from including comparisons with a broader range of state-of-the-art methods, particularly those from adjacent fields such as multi-modal representation learning in genomics and proteomics. This would highlight the specific innovations of MolPhenix more distinctly.
- The significance of the work could be more explicitly articulated and demonstrated. Although terms like drug discovery are mentioned, the real-world impact is not clear for readers.
Questions
- For molecular representations, did you consider combining GNNs and fingerprints? In other tasks, such as property prediction, their combination often surpasses the performance of each method individually.
- Did you consider to report and compare metrics for top-5 and top-10 recall?
Limitations
The authors have adequately addressed the limitations.
We thank the reviewer for providing detailed feedback on our paper. We aim to address the feedback point by point below:
Concern #1: The clarity of this work could be significantly improved
We thank the reviewer for this constructive feedback. We believe this work is best assessed with the required molecular biology context, so we aim to enhance the accessibility of our paper by adding an explanation of biological terms in the appendix. In particular, we will define the following terms: phenomics, cell morphology, cell line (ARPE-19), molecular concentration, phenomolecular retrieval, cell staining, molecular perturbations, inactive molecule perturbations, batch effects, molecular fingerprints, and the initial cell state assumption.
While we agree the clarity of the paper can always be further improved, we note that the presentation of our paper was a highlight for the other reviewers: YXwu “The paper is well-structured and clearly written”, ELHC “The paper is very well written”, Vz3t “The paper is well-written, and the motivation is clear.” The paper received high ratings in soundness (3, 3, 4, 3) and presentation (3, 3, 4, 3). Additional information on related works can be found in Section 2, and we have expanded the “Related Work” section on molecular-phenomic approaches.
Please let us know if this clarifies the accessibility of the paper, or if there is any additional terminology that we can add to the glossary.
Concern #2: Comparisons with a broader range of state-of-the-art methods in … multi-modal representation learning in genomics and proteomics
We’ve restricted the scope of this paper to studying phenomics and molecular modalities. While expanding to other biological modalities such as genomics or proteomics would undoubtedly be very interesting, it requires a significant commitment to curating the data and innovation in biological sequence model space. This is a direction we leave for future exploration.
The reviewer may be interested in a 0-shot evaluation that we perform to investigate MolPhenix’s ability to generalize to genetic knock-out perturbations (Appendix D2). We assess the model’s 0-shot generalization to genetic perturbations that are known to have effects similar to learned molecular perturbations. The evaluation demonstrates that MolPhenix learns the landscape of genetic perturbations, allowing the model to recover known biological pairs.
On the methodological front, we compare to the SOTA in Image-Text multi-modal training, consisting of DCL (2021), CWCL (2023), and SigLIP (2023), recent methods that have demonstrated significant success. In addition, we conduct thorough evaluations benchmarking the recently published CLOOME (November 2023) model.
Concern #3: Significance of the work could be more explicitly articulated and demonstrated
Although the main text of our paper is mostly focused on pheno-molecular retrieval, we have some initial experiments in the appendix assessing the model’s ability to perform other biologically relevant tasks. We perform pheno-activity experiments demonstrating that the learned embedding is predictive of the morphological impact of a (molecule, concentration) tuple (Appendix E1). This opens the door to potential in-silico activity pre-screening and in-silico dose-response curve construction. In addition, we perform the 0-shot biological activity experiments mentioned earlier in Appendix D2.
In the general reviewer response, we include new experiments showcasing the effectiveness of the learned latent space by conducting a KNN evaluation of the MolPhenix latent space. We assess the learned embedding on 35 molecular property prediction tasks across the Polaris and TDC datasets (see Table 4 in the attached PDF). Our findings indicate that MolPhenix, when trained with Fingerprint embeddings, consistently outperforms standalone input fingerprints, effectively clustering molecules according to their molecular properties.
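As a rough illustration of the kind of KNN probe described above (a hypothetical sketch: the embeddings, labels, and `knn_predict` helper are invented for illustration and are not the paper's evaluation code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_predict(query, ref_embs, ref_labels, k=3):
    """Majority vote among the k nearest reference embeddings by cosine
    similarity -- the kind of frozen-latent-space probe used to test whether
    embeddings cluster molecules by property."""
    ranked = sorted(range(len(ref_embs)),
                    key=lambda i: cosine(query, ref_embs[i]),
                    reverse=True)
    votes = [ref_labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy 2-D embeddings standing in for learned molecular representations.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["active", "active", "inactive", "inactive"]
assert knn_predict([0.95, 0.05], embs, labels, k=3) == "active"
```

A probe of this form requires no fine-tuning, so any gain over raw fingerprints is attributable to the learned embedding itself.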
Concern #4: “Combining GNNs and fingerprints”
This is a valuable evaluation for the completeness of our work. Please find the results for these experiments in the general reviewer response (Table 2 and 3 in the attached PDF). We find that combining pre-trained GNN and molecular fingerprints further enhances retrieval performance.
Concern #5: “Report and compare metrics for top-5 and top-10 recall”
We also report top-5% accuracy metrics in supplementary Tables 8, 9, 10, and 11. We choose to report hits in the top K% since we have variable-sized test sets. Top-K metrics do not control for the difficulty of the task unless the test set is artificially subsampled to a pre-determined size. Such subsampling is proportionally equivalent to top-K% evaluation, but has the downside of being stochastic, contingent on the composition of the sampled batch.
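For concreteness, a top-K% recall check can be sketched as follows (an illustrative helper, not the paper's evaluation code; the function name is ours):

```python
def top_k_percent_recall(ranked_ids, true_id, k_percent=1.0):
    """Return 1 if the true molecule appears within the top k% of the ranked
    retrieval list, else 0. The cutoff scales with the retrieval-set size,
    so difficulty stays comparable across datasets of different sizes."""
    cutoff = max(1, round(len(ranked_ids) * k_percent / 100.0))
    return int(true_id in ranked_ids[:cutoff])

# With 1,639 candidates (the held-out-dose setting), top-1% spans 16 molecules.
assert max(1, round(1639 * 1.0 / 100.0)) == 16
assert top_k_percent_recall(list(range(1639)), true_id=15, k_percent=1.0) == 1
assert top_k_percent_recall(list(range(1639)), true_id=16, k_percent=1.0) == 0
```

Averaging this indicator over all queries gives the reported top-1% recall accuracy.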
Kindly consider adjusting our overall score if our response addressed your primary concerns. We would be happy to answer any additional questions you may have in order for you to support acceptance of our work.
This paper introduces MolPhenix, a model designed to learn a joint latent space between molecular structures and microscopy phenomic experiments, addressing the challenge of contrastive phenomolecular retrieval. The authors point out three key challenges in this domain: limited paired data & batch effects, inactive molecular perturbations, and variable concentrations. To address these issues, they propose a set of guidelines: 1) leveraging a pre-trained phenomics foundation model (ph-1), 2) mitigating the impact of inactive molecules through undersampling and a novel soft-weighted sigmoid locked loss (S2L), and 3) encoding molecular concentration both implicitly (within the S2L loss) and explicitly (by parsing dosage concentration).
The primary experiments focus on the task of phenomolecular retrieval. For active molecules, MolPhenix achieves up to 77.33% top-1% retrieval accuracy, representing an 8.1-fold improvement over the baseline. The paper includes necessary ablation studies and evaluations across multiple datasets, demonstrating the effectiveness of their approach in both cumulative and held-out concentration settings.
Strengths
- Novel approach: The paper introduces MolPhenix, a comprehensive framework that addresses several key challenges in contrastive phenomolecular retrieval by combining multiple innovative techniques. This new framework demonstrates notably strong performance.
- Concentration encoding: The study explores both implicit and explicit methods for encoding molecular concentration, enhancing the model's ability to capture dose-dependent effects and generalize across concentrations. The explicit concentration module and inactive molecule undersampling techniques may inspire future multi-molecule research.
- Detailed ablation: The paper provides an in-depth analysis of results, including the impact of various components (e.g., loss functions, concentration encoding methods) on model performance, presented in both the main body and appendix.
Weaknesses
The following are just some minor concerns, listed in descending order of importance:
- Insufficient justification for S2L loss: While the paper introduces the S2L loss as a key contribution, it lacks a thorough theoretical analysis explaining why this loss function is suitable for the phenomolecular retrieval task. A more rigorous mathematical analysis of S2L would strengthen the paper's contribution.
- Inadequate ablation of pretrained components: As a phenomolecular retrieval "framework", the study should include ablation studies using different molecular and phenomic encoders. However, it appears that experiments were conducted only with fixed molecular GNN and phenomic models.
- Oversimplified treatment of batch effects: The paper claims to address batch effects through embedding averaging, but this approach may be too simplistic. A more detailed explanation of why simple averaging can alleviate batch effects, or some minor adjustments to this batch effect removal procedure, would be beneficial.
Questions
Regarding the availability of the involved datasets:
- How would the model's performance vary if only the open-source RxRx3 data or only the private novel data were used?
- Given that many components of MolPhenix are publicly pretrained models, is it feasible to construct a comparable model using solely open-source resources?
Limitations
The authors have properly addressed the limitations in their paper.
We thank the reviewer for providing detailed feedback on the paper.
Concern #1: Additional justification for the S2L loss
In this section we provide additional intuition for the S2L loss and further relate it to previous works. We first assess the conceptual similarities between the InfoNCE and CWCL losses, and justify a similar extrapolation for the relationship between the S2L and SigLIP losses.
InfoNCE can be considered a special case of the CWCL loss, where the weight $w_{ij}$ is set to 0 for all pairs $i$ and $j$ unless $i = j$. Conceptually, this is equivalent to stating that all the negative pairs are equally distant from the reference sample. We will consider the uni-directional CWCL loss for identifying samples of modality $y$ from modality $x$:

$$\mathcal{L}_{\text{CWCL}} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{\sum_{j=1}^{N} w_{ij}} \sum_{j=1}^{N} w_{ij} \log \frac{\exp(s_{ij}/\tau)}{\sum_{k=1}^{N} \exp(s_{ik}/\tau)}$$
If we set $w_{ij} = 1$ when $i = j$ and $w_{ij} = 0$ otherwise, the normalization term $\sum_{j} w_{ij}$ evaluates to 1 and the above expression simplifies to:

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{k=1}^{N} \exp(s_{ik}/\tau)}$$
In the case of CWCL, a non-zero $w_{ij}$ is determined by a within-modality similarity function informed by a pre-trained model.
Similarly, SigLIP can be considered a special case of S2L, obtained when $w_{ij} = 1$ for $i = j$ and $w_{ij} = 0$ for $i \neq j$. This is the formulation of S2L:

$$\mathcal{L}_{\text{S2L}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \Big[ w_{ij} \log \sigma\big(t\, s_{ij} + b\big) + (1 - w_{ij}) \log \sigma\big(-t\, s_{ij} - b\big) \Big]$$

It can be simplified to SigLIP by setting $w_{ij}$ to 1 when $i = j$, which sets the term $(1 - w_{ij})$ to 0; correspondingly, for $i \neq j$ we set $w_{ij}$ to 0, negating the first part of the loss. With labels $z_{ij} = 1$ for $i = j$ and $z_{ij} = -1$ otherwise, this evaluates to:

$$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \log \sigma\big(z_{ij}\,(t\, s_{ij} + b)\big)$$
Having a continuous $w_{ij}$ allows us to inform training by moving from discrete negative labels to continuous ones informed by prior information. This information is given by a pre-trained encoder, in our case $\theta_{\text{Ph-1}}$, but it can be informed by any pre-trained model.
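A small numerical sketch of this reduction, assuming the soft-label sigmoid form of S2L described above (the function names and similarity values are illustrative; this is not the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def s2l(sims, weights, t=1.0, b=0.0):
    """Soft-weighted sigmoid loss over an N x N cross-modal similarity matrix.

    weights[i][j] in [0, 1] is a soft target: 1 for a perfect positive pair,
    0 for a fully dissimilar negative. Hard 0/1 weights recover SigLIP.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sims[i][j] + b
            w = weights[i][j]
            total -= w * math.log(sigmoid(logit)) + (1 - w) * math.log(sigmoid(-logit))
    return total / n

def siglip(sims, t=1.0, b=0.0):
    """SigLIP: pairwise sigmoid loss with hard labels z_ij = +1 on the diagonal."""
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total -= math.log(sigmoid(z * (t * sims[i][j] + b)))
    return total / n

sims = [[0.9, 0.1, -0.3],
        [0.2, 0.8, 0.0],
        [-0.1, 0.3, 0.7]]
hard = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With hard 0/1 weights, the soft loss collapses to SigLIP exactly.
assert abs(s2l(sims, hard) - siglip(sims)) < 1e-12
```

With continuous weights, each pair contributes a blend of the attractive and repulsive terms, which is how inter-sample similarity information enters the loss.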
Concern #2: “Inadequate ablation of pre-trained components”
To investigate the impact of pre-trained encoders, we perform additional experiments evaluating a supervised phenomic image encoder and highlight the ablation study of molecular fingerprints described in Figure 5.
Instead of Ph-1, we trained the MolPhenix framework using AdaBN, a CNN-based supervised phenomic encoder, with an analogous implementation discussed in [1]. We find that the general trends between Ph-1 and AdaBN are consistent, with a slight decrease in overall performance. These findings provide additional support for the generality of the proposed guidelines.
In addition, we leveraged embeddings from Mol-1, an MPNN-based GNN model with 1B parameters and an expressive, capable molecular encoder [2, 3]. We note that combining Mol-1 embeddings with ECFP, MACCS, and Morgan fingerprints can provide MolPhenix with richer information and yields overall higher MolPhenix performance (Tables 2 and 3 of the PDF in the general response). We also performed an ablation study over publicly available fingerprint encoding methods, which are an effective baseline for molecular representations. We demonstrate that MolPhenix achieves strong retrieval performance with the use of fingerprints as an alternative to a pre-trained GNN (Figure 5).
Concern #3: “Oversimplified treatment of batch effects”
The reviewer is accurate in noting that batch effects occupy a smaller portion of the overall paper's contributions, as batch-effect correction is a rich area of research, especially in the biological sciences. Our intention was to highlight the ability to average phenomic encoder embeddings, which is infeasible when working with samples directly in image space. We perform an ablation studying the effect of averaging a random number of embeddings and find a small improvement in retrieval performance (Figure 5).
Concern #4, Q1 & Q2:
The training data employed comprises 1.3M pairs of perturbations; RxRx3, in contrast, is composed of 1,674 known chemical entities at 8 concentrations each and is primarily used as a validation dataset. JUMP-CP [4], an open-source dataset being released in increments, is a promising future phenomic data resource; with its availability it will be possible to train a fully open-source analog of MolPhenix. A beta version of Ph-1 is publicly available and will be linked in the updated paper. In the ablations shown in Figure 5, we demonstrate that molecular fingerprinting methods are a strong baseline and are comparable with Mol-1 performance.
Kindly let us know if our response above addressed your concerns. We will be happy to further discuss and answer any questions you may have.
This work focuses on predicting the molecular impact on cellular functions and investigates the problem of contrastive phenomolecular retrieval. It introduces MolPhenix, a model that leverages a joint latent space between molecular structures and microscopy-based phenomic experiments using contrastive learning. The main contributions include the use of a pre-trained phenomics model, a novel inter-sample similarity-aware loss (S2L), and molecular concentration conditioning, leading to a significant improvement over the previous state-of-the-art in zero-shot molecular retrieval of active molecules.
Strengths
The introduction of the MolPhenix model, which utilizes a pre-trained phenomics model, a novel inter-sample similarity-aware loss (S2L), and molecular concentration conditioning, is original and demonstrates improvements over existing methodologies.
The experimentation of this work is comprehensive and detailed to justify the proposed method.
The paper is well-structured and clearly written with clear explanations of the methodologies.
The problem investigated is important in the field of drug discovery and has a growing interest.
Weaknesses
For the captions of Tables 2–5, it would be better to mention the experimental setting (cumulative vs. held-out concentrations). The current captions are nearly identical (Table 2 vs. Table 4, and Table 3 vs. Table 5), which is confusing.
Questions
For the pretrained GNN, are there any specific advantages to choosing the current one, given that there are many other pretrained GNNs that can be used to extract molecular representations?
Are there any discussions regarding the results of different concentration encoding choices? In a cumulative concentration setting, one-hot performs the best, while in a held-out concentration setting, not using any explicit concentration is the best choice overall.
Limitations
The authors have adequately addressed the limitations.
We thank the reviewer for a thorough and rigorous examination of our paper. Below we aim to address the questions and clarifications to further improve the work.
Concern #1: For the captions of Tables 2–5, it is better to mention the experimental setting.
Thank you for your feedback. We changed the captions to clarify and simplify the experimental setting and the explanation of results in Tables 2, 3, 4, and 5. For example, the updated caption for Table 2 reads: “Evaluation on cumulative concentrations: Top-1% recall accuracy with use of the proposed MolPhenix guidelines evaluating the impact of training loss on retrieval. We omit explicit concentration from this experiment.”
Concern #2: “Specific advantages to choosing the current [GNN].”
The MolPhenix architecture is flexible, allowing the proposed components to be replaced by other pretrained phenomic or molecular models. We chose Mol-1, an MPNN-based GNN with 1B parameters, because it maximizes architectural expressivity while minimizing the risk of overfitting [2, 3]. We also note that combining Mol-1 molecular embeddings with ECFP, MACCS, and Morgan fingerprints provides MolPhenix with richer molecular information and yields overall higher performance in both cumulative and held-out concentration scenarios. Results for active and all-molecule retrieval with MolPhenix trained on these combined molecular embeddings are available in Tables 2 and 3 of our global response (attached PDF). To evaluate the impact of the GNN encoder, we also perform a fingerprint ablation assessing the effect of fingerprint expressivity on retrieval (Figure 5 of the paper).
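The embedding-plus-fingerprint combination described above amounts to simple feature concatenation. A minimal sketch with hypothetical toy vectors (the real Mol-1 embeddings and fingerprint generators such as RDKit's are not reproduced here):

```python
def fuse_molecular_features(gnn_embedding, fingerprints):
    """Concatenate a learned molecular embedding with one or more
    binary fingerprint bit vectors into a single feature vector."""
    fused = list(gnn_embedding)
    for fp in fingerprints:
        fused.extend(float(bit) for bit in fp)
    return fused

emb = [0.3, -1.2, 0.7]   # hypothetical Mol-1-style embedding (toy size)
ecfp = [1, 0, 1, 1]      # toy ECFP-like bit vector
maccs = [0, 1]           # toy MACCS-like bit vector
features = fuse_molecular_features(emb, [ecfp, maccs])
assert len(features) == 3 + 4 + 2
```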
Concern #3: “discussions regarding the results of different concentration encoding choices”
We note in the last sentence of Section 5.2 that one-hot encoding shows significant improvements in the cumulative concentration setting but cannot generalize to unseen dosages. In the held-out concentration evaluation, the model is not required to discriminate between different concentrations of the same molecule, so explicitly providing dosage is not directly useful. In this setting, the sigmoid embedding can be thought of as a continuous way of separating “high” and “low” concentrations, which explains its effectiveness there. We believe the best encoding choice depends on the in-silico application: for example, whether the objective is to identify unseen molecules with the same morphological impact, or to simulate the impact of a molecule at an untested dosage. We thank the reviewer for this feedback and will add this discussion to the final version of the paper.
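To make the contrast concrete, here is a hedged sketch of the two encoding families discussed above. The exact parameterization of the paper's sigmoid embedding is not reproduced; `midpoint` and `scale` are illustrative assumptions:

```python
import math

def one_hot_concentration(dose, dose_levels):
    """One-hot encode a dose drawn from a fixed set of tested levels.
    Fails (by design) for a dose never seen during training."""
    vec = [0.0] * len(dose_levels)
    vec[dose_levels.index(dose)] = 1.0
    return vec

def sigmoid_concentration(dose, midpoint=1.0, scale=1.0):
    """Smooth scalar encoding of log-dose: maps low doses toward 0 and
    high doses toward 1, so unseen intermediate doses interpolate."""
    z = (math.log10(dose) - math.log10(midpoint)) / scale
    return 1.0 / (1.0 + math.exp(-z))

levels = [0.01, 0.1, 1.0, 10.0]
one_hot_concentration(0.1, levels)   # [0.0, 1.0, 0.0, 0.0]
sigmoid_concentration(0.5)           # below sigmoid_concentration(5.0)
```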
Thank you for your positive feedback, and we will be happy to further discuss and answer any questions you may have.
We thank all the reviewers for providing detailed feedback on the paper. We are appreciative of the general support regarding the thoroughness and value of our scientific work:
- “experimental design is robust, demonstrating the framework's superiority in various scenarios and contributing valuable insights to the field” - kfG5
- “significant step forward in the task of phenotype-molecular retrieval” - ELHC
- “The experimentation of this work is comprehensive and detailed to justify the proposed method.” - YXwu
- “comprehensive framework that addresses several key challenges.. demonstrates notably strong performance” - r9h5
- “the proposed methods are logical and well-conceived” - Vz3t
We also appreciate the comments on the clarity of delivery, which is crucial for interdisciplinary works:
- “The paper is well-structured and clearly written with clear explanations of the methodologies.” - YXwu
- “The paper is well-structured and comprehensive” - kfG5
- “The paper is very well written” - ELHC
- “The paper is well-written and the motivation is clear.” - Vz3t
The reviewer feedback has been extremely useful in improving the work in terms of clarity and additional evidence. In our rebuttal, we (1) broaden the scope of our work beyond pheno-molecular retrieval by highlighting and performing additional experiments, (2) demonstrate additional encoders to support the generalizability of our guidelines, and (3) enhance the overall clarity and scientific background.
We expanded our evaluation with additional experiments supporting the utility of MolPhenix beyond retrieval. We conducted experiments evaluating the learned latent space, pheno-activity prediction, and zero-shot biological perturbation matching. In the attached PDF document, reviewers will find a KNN evaluation of the MolPhenix latent space, assessing the learned embeddings on 35 molecular property prediction tasks across the Polaris and TDC datasets (Table 4, attached PDF). We find that MolPhenix trained with fingerprint embeddings consistently outperforms the standalone input fingerprints, demonstrating that the MolPhenix latent space effectively clusters molecules according to their molecular properties. We observed an interesting effect where prediction quality is positively correlated with implied dosage, indicating that MolPhenix learns dosage-specific effects; we aim to expand on this analysis in the appendix of the full paper. Additionally, we point reviewers to pheno-activity experiments demonstrating that MolPhenix can predict dosage-dependent molecular activity, opening opportunities for in-silico prediction on previously unseen molecular structures and dosages. Finally, we performed zero-shot biological perturbation prediction by pairing known biological relationships between molecular structures and gene knockout phenotypes; this analysis provides preliminary evidence that MolPhenix learns underlying biological signals. These experiments are described in Appendices E1 and E2, highlighting the utility of the learned latent space for biological challenges beyond identifying pheno-similar molecules (Supplementary Figures 7 and 8).
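The KNN probe of the latent space can be sketched in a few lines. This is a generic majority-vote k-NN over toy 2-D embeddings, not the exact evaluation protocol used in the attached PDF:

```python
def knn_predict(train_X, train_y, query, k=3):
    """Majority-vote k-NN by squared Euclidean distance: a minimal
    stand-in for probing whether a latent space clusters molecules
    by a held-out property label."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy "latent space": two well-separated property classes.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
knn_predict(X, y, [0.05, 0.1])  # -> 0
knn_predict(X, y, [0.95, 1.0])  # -> 1
```

Higher k-NN accuracy on property labels than the raw input features achieve is the signal that the learned space organizes molecules meaningfully.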
Several reviewers (YXwu, r9h5, ELHC) recommended expanding our evaluation to include additional encoders to demonstrate a broader applicability of the pheno-molecular recall guidelines. To that end, we conducted an additional study evaluating a supervised pre-trained vision encoder trained to predict perturbation identity. The results are available in the included one-page document. In brief, we demonstrate that the proposed guidelines generalize to a supervised CNN encoder used as a phenomic backbone. Additionally, we highlight our ablation study (Figure 5) investigating the impact of publicly available molecular fingerprinting methods as molecular encoders.
Finally, we received valuable feedback on clarifying background terminology, adding pheno-molecular prior works, and improving the clarity of our table legends. We aim for this work to be accessible to scientists across disciplines, and have added a glossary of terms to the appendix and expanded the related works. We also provide additional justification and intuition for the S2L inter-sample similarity-aware loss in individual response to reviewer r9h5.
We thank the reviewers for a careful reading of our paper and their broadly positive feedback. We believe that our changes, guided by your suggestions, strengthen the paper in terms of clarity and contribution for which we are grateful. We hope this work will be of interest to the broader community.
References for all the responses:
- [1] Sypetkowski, Maciej, et al. "Rxrx1: A dataset for evaluating experimental batch correction methods." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- [2] Masters, Dominic, et al. "Gps++: An optimised hybrid mpnn/transformer for molecular property prediction." arXiv preprint arXiv:2212.02229 (2022).
- [3] Sypetkowski, Maciej, et al. "On the Scalability of GNNs for Molecular Graphs." arXiv preprint arXiv:2404.11568 (2024).
- [4] Chandrasekaran, Srinivas Niranj, et al. "JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations." BioRxiv (2023): 2023-03.
The authors present a cross-modal framework that tackles key challenges in contrastive phenomolecular retrieval. The framework shows good performance, particularly in capturing dose-dependent effects through both implicit and explicit concentration encoding methods. The paper also includes a comprehensive ablation study, detailing the contributions of various components such as loss functions and concentration encoding, with results thoroughly presented in both the main text and rebuttal. Overall, all reviewers agree that the paper makes an interesting contribution to phenomic-based drug discovery, addressing a class of practically significant problems. However, I suggest the authors provide a more thorough discussion and refinement in the writing and related work sections, particularly concerning the comments of kfG5 and Vz3t, to better acknowledge relevant contributions in the field.