PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4) · Average confidence: 3.8
Novelty: 2.8 · Quality: 3.3 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Self-supervised Blending Structural Context of Visual Molecules for Robust Drug Interaction Prediction

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

S²VM performs self-supervised learning on large-scale unlabeled drug pairs, achieving state-of-the-art DDI prediction and superior generalization and interpretability on novel and few-shot drugs.

Abstract

Identifying drug-drug interactions (DDIs) is critical for ensuring drug safety and advancing drug development, a topic that has garnered significant research interest. While existing methods have made considerable progress, approaches relying solely on known DDIs face a key challenge when applied to drugs with limited data: insufficient exploration of the space of unlabeled pairwise drugs. To address these issues, we innovatively introduce S$^2$VM, a Self-supervised Visual pretraining framework for pair-wise Molecules, to fully fuse structural representations and explore the space of drug pairs for DDI prediction. S$^2$VM incorporates the explicit structure and correlations of visual molecules, such as the positional relationships and connectivity between functional substructures. Specifically, we blend the visual fragments of drug pairs into a unified input for joint encoding and then recover molecule-specific visual information for each drug individually. This approach integrates fine-grained structural representations from unlabeled drug pair data. By using visual fragments as anchors, S$^2$VM effectively captures the spatial information of local molecular components within visual molecules, resulting in more comprehensive embeddings of drug pairs. Experimental results show that S$^2$VM achieves state-of-the-art performance on widely used benchmarks, with Macro-F1 score improvements of 4.21% and 3.31%, respectively. Further extensive results and theoretical analysis demonstrate the effectiveness of S$^2$VM for both few-shot and novel drugs.
Keywords
Drug Interaction Prediction · Drug Discovery · Molecule Representation Learning

Reviews and Discussion

Review (Rating: 4)

The paper introduces S2VM, a novel self-supervised pretraining framework designed to improve the prediction of drug-drug interactions (DDIs). The core problem it addresses is the poor performance of existing models on new or data-scarce drugs, which stems from an insufficient exploration of unlabeled drug pairs and weak fusion of structural information. To solve this, S2VM employs a "pre-fusion" strategy. It represents pairs of drugs as 2D images, splits them into visual fragments, and "blends" these fragments into a single, unified input. This blended input is processed by a Transformer-based encoder-decoder architecture. The model is pretrained on a massive dataset with a self-supervised objective: to reconstruct the original, separate molecular images from the blended representation. The resulting pretrained encoder is then fine-tuned for DDI prediction, where it demonstrates state-of-the-art performance.

Strengths and Weaknesses

Strengths:

  1. Large-Scale Self-Supervised Learning: By pretraining on 200 million unlabeled drug pairs, the model learns generalizable structural representations that are not confined to the limited scope of known DDIs. This directly addresses the key challenge of data scarcity for novel and few-shot drugs.
  2. Strong Empirical Performance: The paper provides robust evidence of the model's superiority. It achieves significant improvements over state-of-the-art baselines on multiple datasets (Deng's, Ryu's, and TWOSIDES) and across various settings, including common, few-shot, and inductive (new drug) scenarios.

Weaknesses:

  1. High Computational Cost: Pretraining a Vision Transformer model on 200 million pairs of images is extremely resource-intensive. This high computational barrier could make it difficult for researchers with limited resources to reproduce the results or build upon the work.
  2. Simplistic Blending Mechanism: The "blending" is achieved by randomly sampling patches from each of the two drug images. While empirically effective, this random approach might not be optimal. A more informed sampling strategy based on preliminary structural analysis could potentially yield even better results.

Questions

  1. The paper states that 2D images capture spatial relationships of visual molecules. However, functional groups that appear distant in a 2D projection can be very close in 3D space due to molecular folding, which is critical for interactions. How does the model account for this fundamental discrepancy, and what is the justification for accepting this potential loss of critical 3D conformational information compared to using 3D structural data?

  2. Why were 2D images chosen as the input modality over molecular graphs, which are a more direct and common representation of molecular structure? Given that graphs explicitly encode atom connectivity and bond types, while an image-based model must learn these rules implicitly, what are the perceived advantages of the visual approach that outweigh the benefits of graph-based methods for this task?

  3. The architecture uses two distinct linear projection heads (E1 and E2) to reconstruct d_u and d_v, respectively. This design is inherently asymmetric. Consequently, providing the input pair as (d_u, d_v) versus (d_v, d_u) would result in each drug being processed by a different reconstruction head, potentially leading to different loss values and training dynamics. What is the rationale behind this asymmetric design rather than using a single, shared reconstruction head for both molecules?

  4. The structure-level encoding involves a stochastic sampling process to blend the two molecular images. This implies that for a single trained model, the same drug pair can yield different latent representations and potentially different predictions across multiple inferences. How large is the variance brought by sampling?

  5. The model is pretrained on a large dataset of molecules from PubChem before being fine-tuned and tested on benchmark datasets. Was a rigorous data leakage check performed to ensure that all molecules present in the validation and test sets of the downstream tasks were explicitly excluded from the large-scale pretraining dataset? If not, the model's reported performance, especially on "new" drugs, might be artificially inflated.

  6. The paper claims that the joint reconstruction objective enables the model to learn mutual information and structural correlations between the two drugs. To validate this central claim, a crucial control experiment would be to pretrain the model using a standard masked autoencoder objective on single molecules (i.e., masking and reconstructing fragments of only one molecule at a time). If this single-molecule pretraining yields similar or better downstream DDI performance, it would suggest the model's strength derives from learning robust single-molecule features, rather than from learning interaction information between pairs. Was such a control experiment considered?

  7. During the downstream DDI prediction task, the model's input is the blended matrix of visual fragments, meaning the encoder is not provided with the complete structure of either molecule at the moment of prediction. This design appears to intentionally sacrifice complete structural information for the sake of maintaining the fusion-before-encoding architecture. What is the rationale for this choice? Why is making a prediction from this partial, mixed representation considered superior to a more conventional approach for the downstream task, such as encoding each complete molecule separately with the pretrained encoder and then combining their representations for a final prediction?

Of all the issues raised, Questions 5 and 6 are the most critical to me. A convincing response to these two concerns would strongly incline me to give a higher rating.

Limitations

  1. Use of 2D Representations: The authors explicitly acknowledge that using 2D molecular images ignores 3D conformational structures, which are critical for understanding physical interactions between molecules. This represents a fundamental trade-off between computational scalability and representational granularity.

  2. Neglecting other biological entities: This paper focuses on molecular features, neglecting other biological entities involved in drug interaction events, such as proteins, pathways, and diseases, which are crucial for identifying DDIs.

Final Justification

Most of my concerns are addressed, especially Q5 and Q6. I raise my rating to 4.

Formatting Issues

No

Author Response

Thank you for your detailed review and constructive suggestions. We address the concerns and questions below.

Q1: What is the justification for accepting this potential loss of critical 3D conformational information compared to using 3D structural data?
Response: We acknowledge that 3D conformations play a critical role in drug interactions. While S²VM does not aim to model full 3D structures, it leverages blended visual fragments to approximate pseudo-spatial proximity between functional groups. This design allows efficient and scalable modeling of local structural co-occurrence patterns. As shown in our interpretability analysis (Figure 5), S²VM consistently highlights biologically meaningful substructures (e.g., 1,3-benzodioxole), indicating its ability to capture structure-dependent features. We also explicitly recognize in the Limitation section that extending to 3D or multimodal representations is a promising future direction.

Q2: Why were 2D images chosen as the input modality over molecular graphs?
Response: We chose 2D molecular images as the input modality because they provide an intuitive and spatially continuous view of molecular structures, making it easier to capture functional group arrangements [1] (Nature Machine Intelligence, 2022). Compared to graphs, images are often more robust: minor errors or missing bonds in molecular graphs may break connectivity, whereas images still preserve chemical context visually. This visual format also facilitates the use of powerful vision models, offering a natural path toward enhancing drug safety analysis through structure-aware attention. Empirically, S²VM outperforms graph-based methods (e.g., SSI-DDI, MRCGNN) across standard and few-shot settings (Tables 1–3 and Figure 4 in the main paper), validating the effectiveness of visual representations for DDI prediction.

Q3: Why does the model use two separate projection heads for the two drugs instead of a shared one, given the resulting asymmetry?
Response: The use of two distinct projection heads reflects a deliberate choice to model the inherent asymmetry in drug interactions. In real-world DDI scenarios, drug pairs often exhibit asymmetric roles. For example, one drug may alter the metabolism or concentration of the other. This asymmetric nature is well-documented in the biomedical literature and has been emphasized in prior work [2] (Nature Machine Intelligence, 2024), which frames DDI prediction as a direction-sensitive task. An example of the 72nd event type is: "The serum concentration of Drug2 can be increased when combined with Drug1", where Drug1 acts as the perpetrator and Drug2 as the victim. By assigning separate reconstruction heads to d_u and d_v, the model learns role-specific structural representations, which align with the biological reality of DDIs and contribute to more faithful pretraining dynamics. Explicitly modeling role-specific representations thus brings tangible benefits and better matches the asymmetric nature of real-world drug interactions.
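To make the design concrete, here is a minimal sketch of such role-specific reconstruction heads, assuming a ViT embedding dimension of 192 and 16×16 RGB patches (per Appendix B). The names E1/E2 follow the reviewer's notation; the module is a hypothetical illustration, not the paper's exact implementation.

```python
import torch.nn as nn

class RoleSpecificHeads(nn.Module):
    """Hypothetical sketch: two separate linear heads decode the shared
    latent tokens into pixel patches, one head per drug role."""
    def __init__(self, embed_dim=192, patch_dim=16 * 16 * 3):
        super().__init__()
        self.E1 = nn.Linear(embed_dim, patch_dim)  # reconstructs d_u (e.g., perpetrator)
        self.E2 = nn.Linear(embed_dim, patch_dim)  # reconstructs d_v (e.g., victim)

    def forward(self, z):                          # z: (B, N, embed_dim)
        return self.E1(z), self.E2(z)              # two role-specific reconstructions
```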

Q4: Since the model uses stochastic blending during training, does this cause significant variance or inconsistency in predictions for the same drug pair during inference?
Response: We clarify that stochastic blending is applied only during pretraining to encourage robust representation learning. For downstream DDI prediction, the model receives deterministic inputs by concatenating the visual tokens of the two drugs in a fixed order, with no randomness involved during inference. This practice is consistent with standard MAE-style reconstruction pretraining, where masking is used only during training, while downstream tasks use the full input [3,4]. Additionally, in Appendix C.7, we report results across different blending strategies for DDI prediction, showing the model’s robustness to fusion variations.
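For illustration, a minimal sketch of this deterministic inference path, assuming patch-token sequences of shape (batch, patches, dim); `encoder` stands in for the pretrained joint encoder.

```python
import torch

def encode_pair(encoder, tokens_u, tokens_v):
    """Deterministic inference as described above: concatenate the two
    drugs' visual tokens in a fixed order, with no stochastic blending."""
    x = torch.cat([tokens_u, tokens_v], dim=1)  # (B, N_u + N_v, D)
    return encoder(x)                           # joint pair representation
```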

Q5: Was there a rigorous check to prevent data leakage from the PubChem pretraining set into the benchmark test sets?
Response: We confirm that our pretraining process is entirely self-supervised and task-agnostic, with no access to DDI-related labels or annotations. This eliminates any risk of label leakage and follows standard practice in representation learning frameworks such as MAE [3] and MoCo [5], where the pretraining stage is decoupled from downstream tasks.

To assess potential data overlap, we rigorously analyzed the intersection between the 200K molecules used in pretraining and the test sets of the benchmark datasets. We identified no shared drugs in the TWOSIDES dataset under the inductive S1 and S2 settings, and only one shared molecule (DrugBank ID: DB01351) in Deng's and Ryu's datasets (<0.001%; the affected pairs fall under the setting where both drugs are existing). This negligible overlap makes structural memorization extremely unlikely.

To further ensure strict evaluation, we removed all DDI instances involving the shared molecule from the test sets (34 out of 7,474 in Deng’s and 93 out of 38,349 in Ryu’s). As shown in Table 1, the model’s performance remains virtually unchanged, confirming that our results are unaffected by any potential data leakage.

| Method | Deng's ACC. | Deng's Macro-F1 | Ryu's ACC. | Ryu's Macro-F1 |
|---|---|---|---|---|
| S²VM | 0.9105 | 0.8212 | 0.9586 | 0.9207 |
| S²VM (removed) | 0.9117 | 0.8193 | 0.9580 | 0.9203 |

Table 1. Performance comparison after removing the single shared drug from the test sets.

We will clarify this verification and filtering process more explicitly in the revised manuscript.
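For reference, a minimal sketch of this kind of overlap check, assuming molecules are compared by RDKit canonical SMILES (the authors' exact matching procedure is not specified here).

```python
from rdkit import Chem

def canonical(smiles):
    """Canonical SMILES via RDKit; None for unparsable inputs."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def shared_molecules(pretrain_smiles, test_smiles):
    """Intersection of pretraining and test molecules after canonicalization."""
    pre = {canonical(s) for s in pretrain_smiles} - {None}
    test = {canonical(s) for s in test_smiles} - {None}
    return pre & test
```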

Q6: The paper claims that the joint reconstruction objective enables the model to learn structural correlations between the two drugs. Was a single-molecule pretraining control experiment considered to validate this claim?
Response: We conducted the suggested control experiment by pretraining a ViT-based masked autoencoder (MAE) using single-molecule reconstruction only. This baseline uses the same architecture, the same 200K PubChem molecules, and identical training epochs as S²VM, ensuring a fair comparison. During downstream DDI prediction, we adopted a post-fusion strategy where each drug is encoded individually and the resulting representations are concatenated for classification.

As shown in Table 2, the single-molecule MAE baseline consistently underperforms S²VM across all metrics and datasets. This demonstrates that joint reconstruction pretraining yields more expressive and interaction-aware representations, beyond what is achievable through standard single-drug encoding.

| Method | Deng's ACC. | Deng's Macro-F1 | Ryu's ACC. | Ryu's Macro-F1 |
|---|---|---|---|---|
| MAE | 0.8276 | 0.6961 | 0.9215 | 0.8920 |
| S²VM | 0.9105 | 0.8212 | 0.9586 | 0.9207 |

Table 2. Performance comparison between S²VM and single-molecule representation baselines.

We will add this control experiment and detailed discussion in the revised manuscript to further strengthen the theoretical claims.
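A minimal sketch of such a post-fusion baseline, assuming the encoder returns patch tokens of dimension 192 (as in the rebuttal); the mean pooling, classification head, and `num_events` default are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PostFusionDDI(nn.Module):
    """Sketch of the post-fusion control: encode each drug image separately
    with the shared pretrained encoder, then concatenate pooled embeddings."""
    def __init__(self, encoder, embed_dim=192, num_events=65):
        super().__init__()
        self.encoder = encoder                 # MAE-pretrained ViT encoder
        self.head = nn.Linear(2 * embed_dim, num_events)

    def forward(self, img_u, img_v):
        z_u = self.encoder(img_u).mean(dim=1)  # mean-pool patch tokens
        z_v = self.encoder(img_v).mean(dim=1)
        return self.head(torch.cat([z_u, z_v], dim=-1))
```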

Q7: Why does the model use a blended input for DDI prediction, instead of encoding each full molecule separately and combining them later?
Response: The model uses full molecular structures during inference, with stochastic blending applied only in pretraining to enhance representation learning (see response to Q4). Our choice to maintain fusion-before-encoding is based on: 1) Many DDIs arise from localized interactions between functional substructures, not from global structural similarity [2,6]. By blending fragments early, the encoder can directly model cross-molecular spatial dependencies, which are crucial for detecting such interactions. 2) In Appendix C.1 (Table 7), we further compare pre- and post-fusion strategies using molecular fingerprints, again showing the advantage of early fusion. Similarly, in Figure 3, early fusion outperforms a shared-encoder baseline that encodes drugs separately and fuses representations afterward.

These results demonstrate that predicting from fused inputs is not a compromise, but a deliberate and effective design choice that enhances structural interaction modeling. We will make this rationale more explicit in the revised paper.

W1: The method requires high computational resources, which may hinder reproducibility.
Response: The model can be run on machines with ≥16 GB of GPU RAM. To support reproducibility and broader adoption, we provide pretrained checkpoints and code for easy integration and evaluation.

[1] Zeng, Xiangxiang, et al. "Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework." Nature Machine Intelligence (2022).
[2] Zhong, Yi, et al. "Learning motif-based graphs for drug–drug interaction prediction via local–global self-attention." Nature Machine Intelligence (2024).
[3] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." CVPR (2022).
[4] Zhao, Fan, et al. "Self-supervised feature adaption for infrared and visible image fusion." Information Fusion (2021).
[5] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." CVPR (2020).
[6] Nyamabo, A. K., Yu, H., & Shi, J. Y. "SSI-DDI: substructure-substructure interactions for drug–drug interaction prediction." Briefings in Bioinformatics 22(6) (2021).

Thank you once again for your valuable feedback, which has helped us improve our work. We hope these clarifications effectively address your key concerns and provide sufficient justification for a higher evaluation.

Comment

Thanks for your response. Most of my concerns are addressed.

Comment

Thank you for taking the time to review our rebuttal and for acknowledging that most concerns have been addressed. We sincerely hope that our clarifications and additional experiments have effectively addressed your key concerns (e.g., Q5 and Q6), and we would greatly appreciate your positive support in the final evaluation.

Comment

Of course, I will raise my final rating.

Comment

Thank you so much for your support and for deciding to raise your rating. We truly appreciate it and are grateful for your thoughtful review and constructive feedback.

Review (Rating: 4)

The paper proposes S2VM, a method for the problem of drug-drug interaction prediction. The key idea is a self-supervised learning framework that learns representations of drug pairs (two molecules) from a large collection of unlabeled drug pairs. It shows high performance in predicting DDIs on various datasets.

Strengths and Weaknesses

I find the key idea of the paper interesting. The usual method would have two distinct parts: learning representations of individual drugs, and learning the interaction between those representations to predict the label. Given that there are few labeled and many unlabeled pairs, the method here can exploit a much larger dataset. The choice of learning representations of pairs directly, instead of each drug separately, is reasonable, blending both parts of previous methods into one. The paper is clear and easy to understand. Its performance is convincing. I could not check the theory in detail.

Questions

  • It would make sense biologically to consider the effects of drugs on the proteins and pathways that give rise to DDIs. Why does the paper not use this data?

Limitations

yes

Formatting Issues

n/a

Author Response

Thank you for your detailed review and constructive comments. We’re glad you found the idea interesting, the methodology sound, and the experiments robust.

Q1: It would make sense biologically to consider the effects of drugs on the proteins and pathways that give rise to DDIs. Why does the paper not use this data?
Response: We focus on molecular structures from large-scale unlabeled data to address new or under-represented drugs, where protein or pathway data are often missing from biomedical knowledge graphs. Many DDIs, especially those related to drug metabolism such as enzyme inhibition or induction, may be driven by molecular substructures [1] (Nature Machine Intelligence, 2024). This makes structure-level modeling particularly relevant. Our results support this view. While S²VM and MUFFIN perform similarly on common drugs (Table 1), S²VM achieves 22.6% and 32.5% relative gains in rare-event settings on Deng's and Ryu's datasets, respectively. This suggests that knowledge-based models may struggle when entity coverage is low, whereas S²VM generalizes more effectively by learning directly from structure.

We acknowledge that integrating biomedical knowledge into S2^2VM is a valuable future direction and plan to explore it further.

| Method | Deng's Common | Deng's Rare | Ryu's Common | Ryu's Rare |
|---|---|---|---|---|
| MUFFIN | 0.8269 | 0.4597 | 0.9510 | 0.6428 |
| S²VM | 0.9105 | 0.6854 | 0.9586 | 0.9675 |
| impr. (%) | 8.4% | 22.6% | 0.8% | 32.5% |

Table 1. The accuracy of S²VM and MUFFIN on Deng's and Ryu's datasets under different settings.

[1] Zhong, Yi, et al. "Learning motif-based graphs for drug–drug interaction prediction via local–global self-attention." Nature Machine Intelligence 6(9): 1094–1105 (2024).

Thank you once again for your valuable feedback, which has helped us improve our work. We believe these additions will strengthen our model's presentation.

Comment

Thank you for the answers. It all makes sense to me.

Comment

Thank you for your kind reply and for taking the time to review our responses. We sincerely appreciate your positive and thoughtful feedback.

Review (Rating: 4)

The paper presents S2VM, a self-supervised pretraining scheme for drug–drug interaction modeling. Molecules are rendered as 2-D “visual” images; fragments from two drugs are randomly blended into a single composite image, which an encoder must process so that a decoder can reconstruct the original pair. The resulting pair embeddings, when fine-tuned, deliver state-of-the-art DDI prediction without relying on pre-existing interaction networks or biomedical knowledge graphs.

Strengths and Weaknesses

Strengths:

  1. The approach is highly original: representing molecules as images and blending fragments of two drug structure images into one, then training an encoder–decoder to reconstruct the originals from the blend, going beyond traditional single-molecule pretraining.

  2. The authors provide both theoretical analysis and extensive empirical validation. S2VM achieves state-of-the-art DDI prediction performance, with notable Macro-F1 improvements of 4.21% and 3.31% on two benchmark datasets.

Weaknesses:

  1. Images may omit chemical details like atom types or bond orders if not rendered or processed carefully. The paper doesn't discuss how variations in drawing (different orientations or depiction styles) are handled. Are the molecule images generated via a consistent algorithm, and do they ensure invariant features? If the visual representation isn't canonical, the model might learn image-specific features that don't generalize.

  2. The method focuses purely on structural features of drugs. This neglects other known factors in DDI, such as biological context (targets, pathways) or pharmacological effect data. The authors cite knowledge graph approaches in related work, and indeed S2VM purposely avoids using that information to remain self-supervised. It is impressive that S2VM achieves SOTA without explicit biomedical network data, but it might also miss interactions that are pharmacological. The paper does not discuss this trade-off in depth.

  3. Clarity could be improved in describing the blending process: do they simply overlay fragments from two molecule images into one image? How are fragments chosen or positioned? A more detailed algorithm or visual example in the main text would help readers replicate this transformation.

Questions

  1. How are the molecular images generated and processed? Are they 2D structural diagrams with atoms/bonds drawn, and if so, how do we ensure consistency? Clarifying the image generation pipeline and any augmentation used would help assess robustness.

  2. Training on 200M drug pairs is mentioned, which is enormous. What strategies were used to make this feasible? Also, how does model performance scale with the number of pretraining pairs? If possible, please provide insight into the compute time or infrastructure used and the effect of dataset size on performance.

  3. The paper presents a theoretical justification for the effectiveness of S2VM. How can this theory guide practical decisions? For instance, does it inform the choice of blending ratio, or the architecture of the decoder?

Limitations

Yes

Final Justification

I keep my positive review after the rebuttal.

Formatting Issues

No

Author Response

Thank you for your thoughtful and encouraging review. The point-to-point answer is provided below.

Q1: How are the molecular images generated and kept consistent? Are they 2D diagrams with atoms and bonds, and is any augmentation used?
Response: We generate molecular images through a standardized and reproducible pipeline designed to ensure visual consistency and structural fidelity:

  1. Canonicalization of SMILES: All molecules are first canonicalized using RDKit to obtain a unique and deterministic SMILES representation, eliminating variations due to atom ordering or tautomers.
  2. Image Rendering via RDKit: The 2D molecular structures are rendered using RDKit’s MolsToGridImage function, which explicitly depicts atoms and bonds. Each molecule is rendered as a 224×224 pixel image without any stochastic augmentation to ensure deterministic and consistent visual representation across different runs.
  3. Layout Standardization: We fix all layout-related parameters, including sub-image spacing, drawing style, and molecule alignment, to ensure that chemically identical molecules yield identical image representations.

The full image generation pipeline is available in our codebase for reproducibility. We will clarify this process in the revised manuscript and consider exploring augmentation strategies in future work to enhance generalization.
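A minimal sketch of this pipeline, assuming RDKit defaults; the fixed drawing parameters mentioned above live in the authors' codebase.

```python
from rdkit import Chem
from rdkit.Chem import Draw

def render_molecule(smiles, size=(224, 224)):
    """Sketch of the described pipeline: canonicalize the SMILES, then
    draw a deterministic 2D depiction at a fixed size, no augmentation."""
    # Step 1: canonicalization removes atom-ordering variation.
    canon = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    mol = Chem.MolFromSmiles(canon)
    # Step 2: one molecule per grid cell, fixed 224x224 rendering.
    return Draw.MolsToGridImage([mol], molsPerRow=1, subImgSize=size)
```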

Q2: How was training on 200 million drug pairs made feasible, and how does model performance scale with dataset size?
Response: We designed S²VM with scalability in mind.

  • Pretraining efficiency: To ensure feasibility, we use a lightweight ViT encoder (12 layers, embedding dim 192) and patch size of 16×16, which significantly reduces the computational cost while maintaining performance. We also apply efficient data loading and batching strategies without extensive augmentations (Appendix B).
  • Effect of Dataset Size: We conducted systematic experiments to evaluate how model performance scales with pretraining data size. Specifically, we varied the number of base molecules from 50K to 300K, which corresponds to approximately 50M to 300M drug pairs. As shown in Table 1 and Appendix C.4 (Figures 9 and 10 in the main paper), we observe consistent performance improvements with larger datasets, particularly from 50K to 200K molecules. Beyond 200K, the marginal gains begin to plateau, suggesting diminishing returns at very large scales.
| # Base Molecules | # Molecular Pairs | Deng's dataset | Ryu's dataset |
|---|---|---|---|
| 50K | ~50M | 0.7712 | 0.8883 |
| 100K | ~100M | 0.7963 | 0.9021 |
| 200K | ~200M | 0.8212 | 0.9237 |
| 300K | ~300M | 0.8261 | 0.9295 |

Table 1. The performance of S²VM on different pretraining data scales.

We will include further discussion and clarification in the revised manuscript.

Q3: The paper presents a theoretical justification for the effectiveness of S²VM. How can this theory guide practical decisions? For instance, does it inform the choice of blending ratio, or the architecture of the decoder?
Response: Our theoretical analysis, which frames S²VM as maximizing mutual information between masked and blended molecular fragments, provides valuable guidance for several design choices.

  • Blending ratio: The theory supports the idea that mutual information is maximized when both source distributions contribute sufficiently to the reconstruction target. This insight informed our choice of a balanced blending ratio (e.g., 0.5:0.5), and we validated this empirically in Appendix C.1 (i.e., Table 9 in the main paper), where extreme ratios (e.g., 0.7:0.3) yielded lower performance as shown in Table 2.
  • Decoder architecture: The theory emphasizes the need to recover both intra- and inter-molecular dependencies. This motivated us to adopt a lightweight but expressive decoder, sufficient to reconstruct fine-grained structural features from the fused latent space without overpowering the encoder’s role in learning meaningful representations.

We will incorporate a brief discussion of this connection in the revised paper.

| (p_1 : p_2) | Deng's dataset | Ryu's dataset |
|---|---|---|
| 0.7 : 0.3 | 81.59 | 91.76 |
| 0.5 : 0.5 | 82.97 | 92.53 |
| 0.3 : 0.7 | 80.87 | 92.08 |

Table 2. The performance of S²VM with various blending ratios (p_1 : p_2).

W1: The current method focuses solely on structural features, potentially overlooking pharmacological or biological factors involved in DDIs. Please discuss this trade-off.
Response: S²VM focuses on structural learning to support new or under-annotated drugs, where biological context is often missing. While knowledge-enhanced models like MUFFIN [1] perform well on common drugs, they rely on entity coverage and may struggle with rare or unseen cases. Our comparison in Table 3 shows that S²VM achieves comparable results in common settings but significantly outperforms MUFFIN on rare events, with up to 32.5% higher accuracy. This highlights that structure-driven self-supervised learning offers stronger generalization in low-resource scenarios. We will elaborate on this trade-off in the revised manuscript and consider integrating knowledge graphs into S²VM in future work.

| Method | Deng's Common | Deng's Rare | Ryu's Common | Ryu's Rare |
|---|---|---|---|---|
| MUFFIN | 0.8269 | 0.4597 | 0.9510 | 0.6428 |
| S²VM | 0.9105 | 0.6854 | 0.9586 | 0.9675 |
| impr. (%) | 8.4% | 22.6% | 0.8% | 32.5% |

Table 3. The accuracy of S²VM and MUFFIN on Deng's and Ryu's datasets under different settings.

W2: The blending process lacks clarity. How are fragments selected and positioned, and could a visual example help replication?
Response: We adopt an anchor-based replacement strategy: one molecule (e.g., d_u) serves as the structural anchor, and a fixed proportion of its visual fragments (e.g., 30%) is randomly masked. The corresponding fragments from the second molecule d_v are then inserted at the same positions, forming the blended input, as sketched below. This leads to an effective fragment ratio of 0.7:0.3 between d_u and d_v in the final representation. This strategy ensures spatial alignment while introducing meaningful structural variation across training pairs. We will include a visual illustration of this blending process in the revised paper to enhance clarity and reproducibility.
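A minimal sketch of this anchor-based replacement on flattened patch sequences; the shapes and the 30% replacement proportion follow the description above, and the helper itself is hypothetical.

```python
import torch

def anchor_blend(patches_u, patches_v, replace_ratio=0.3, generator=None):
    """Anchor-based replacement sketch: keep d_u as the anchor and swap a
    fixed fraction of its patches for d_v's patches at the same positions.
    patches_u, patches_v: (N, D) flattened patch sequences of equal length."""
    n = patches_u.shape[0]
    k = int(n * replace_ratio)
    # Randomly pick k anchor positions to replace.
    idx = torch.randperm(n, generator=generator)[:k]
    blended = patches_u.clone()
    blended[idx] = patches_v[idx]  # insert d_v's fragments at the same positions
    return blended, idx
```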

[1] Chen, Yujie, et al. "MUFFIN: multi-scale feature fusion for drug–drug interaction prediction." Bioinformatics 37(17): 2651–2658 (2021).

Thank you again for your valuable insights, which will help us further strengthen the presentation and impact of our work.

Comment

Thanks for the rebuttal, I will keep my positive rating.

Review (Rating: 5)

The paper presents a self-supervised model, S²VM, which pretrains a vision transformer on 200 million drug pairs for downstream drug–drug interaction (DDI) prediction.

The model processes pairs of 2D molecular images, rendered from their original SMILES representations, through a masking and blending operation that generates a unified image. The pretraining objective is to reconstruct each molecule in the pair, encouraging the encoder to learn useful representations that are later reused for the DDI task; this is a popular objective in representation learning for other modalities, especially images.

The authors report state-of-the-art performance across multiple DDI benchmarks and argue that the learned latent representations of their self-supervised model capture both the intrinsic structure of individual molecules and their extrinsic relational properties.

This work addresses a critical problem with significant implications for combinatorial therapy, particularly in the treatment of complex diseases.

Strengths and Weaknesses

With self-supervised learning approaches, my main concern often lies with demonstrating that the latent representations learned in pretraining tightly capture the intrinsic properties of the data distribution, in this case, two molecular images. Accuracy can sometimes be a misleading statistic and is often not a robust signal of true generalization. Here are some factors that dictate robust latent representations, which I'll use as a measure of the strengths/weaknesses of the paper:

  1. Transfer learning - out of distribution. Although accuracy can be an ineffective measure of SSL, it is still a valid datapoint:
  • The model was pretrained with data not exposed within the transfer learning tasks, namely the Deng & Ryu datasets. This is a fair evaluation measure and is indicative of a robust latent representation.
  • The authors demonstrate that across benchmarks, S²VM handsomely outperforms other methods. What I found notable is that the authors invested time in implementing other approaches to validate their results; this is a high-fidelity result, and code was similarly provided.
  2. "Mutual information" - how much information is shared between the latent embedding and the corresponding input distribution:
  • The authors provide some theoretical analysis in support of this in Section 4.2. Although this is encouraging, I found the section to be quite thin. Clarity could be improved, potentially by including a short summary of the proof instead of leading the reader to the Appendix. The proof also lacks clarity: I imagine X_1 and X_2 are the images; is Y the latent embedding? All in all, this section could greatly benefit from added clarity.
  3. Geometry of the representations - e.g., do clusters emerge in the representation:
  • The authors provide a t-SNE visualization of the representations that emerge from their SSL model in Figure 1b (although small), and in the Appendix. Compared to other approaches, S²VM provides a clear delineation in its representations across DDI event types. This is a promising sign of robustness.
  4. An understanding of when & where representational collapse lies:
  • Less trivial, but here we would like a measure of the variance of the latent representation, and of the conditions that enable representational collapse in an SSL architecture. I noticed that the model employs random masking on the 2D images; one could also apply a learnable masking operation for effective representation learning, although I can imagine a situation in which this could diminish the variance of the latent representation. This is something for the authors to think about.
  • As a follow-up, I would encourage the authors to consider an experiment measuring performance on the Deng & Ryu transfer-learning tasks as a function of pretraining dataset size (10K, 1M, 10M, 100M, 200M, ...). This would be a good plot demonstrating how important the size of the pretraining dataset is to the downstream performance of the model. As far as I can tell, this is not communicated in the paper. (This is, in my view, the most addressable weakness of the paper.) It would be greatly appreciated if the authors could address this.

State of the art?

The authors should consider moving away from a pixel-level reconstruction of the masked image regions toward a JEPA-inspired approach in which the SSL objective lives in embedding space rather than in pixel space. The literature suggests this will provide stronger representations: https://arxiv.org/abs/2301.08243 .

Unless there is a convincing argument against using a JEPA-inspired approach, it begs the question of whether this model is the best model for 2D DDI prediction. Could the authors share their thoughts on why a JEPA-based approach was not considered, so that I can better understand their architectural choices?

Questions

  • Could the authors share their thoughts on why a JEPA-based approach was not considered, so that I can better understand their architectural choices? I'll consider strengthening my score if I get a better understanding of this, and if the authors add the aforementioned experimental results on scaling pretraining data.

Limitations

I appreciate that this was communicated within the manuscript: 3D conformational effects. A 2D image of a molecule is not as rich as a 3D representation of a molecule, hence a graph-based approach is likely the better method to represent the data accurately. Beyond the 3D structure, one can similarly encode other properties that could not really be expressed in a 2D image. This is a key limitation, and it begs the question of whether this work would be impactful towards advancing our understanding of DDI prediction.

Final Justification

My biggest concern with the manuscript was the lack of comparisons against JEPA-inspired SSL approaches. The authors have run experiments demonstrating that their approach is competitive with JEPA-inspired approaches for DDI prediction.

The authors similarly provided the other plots that I would have liked to see in the manuscript.

Formatting Issues

None.

Author Response

Thank you for your detailed review and constructive comments. We address the concerns and questions below.

Q1: Could the authors share their thoughts on why a JEPA-based approach was not considered to better understand their architectural choices?

Response: We appreciate your insightful suggestion and interest in JEPA-style architectures. Our initial choice of a ViT-based encoder–decoder design was motivated by its demonstrated success in visual molecular representation learning [1] and its compatibility with pixel-level reconstruction, which aligns well with our objective of capturing fine-grained spatial correlations between drug fragments.

That said, we also agree that JEPA is a compelling alternative for structure-level pretraining. To explore this direction, we implemented an image-level JEPA (I-JEPA [2]) variant of S²VM, trained under the same settings and scale (50K base molecules, 50M drug pairs) for a fair comparison. As shown in Table 1, I-JEPA achieves comparable performance on the Ryu dataset and slightly underperforms on the Deng dataset. These results suggest that JEPA-style models can capture complementary information and may serve as a promising backbone for future extensions of S²VM.

We will include these comparative results and release the I-JEPA-based pretrained checkpoints in our revised manuscript. Due to time constraints, we have evaluated I-JEPA only on the small-scale setup, but we plan to further scale up the training to fully assess its potential.

| Method | Deng's dataset | Ryu's dataset |
|---|---|---|
| I-JEPA (ours) | 0.7154 | 0.8878 |
| S²VM (ours) | 0.7712 | 0.8883 |

Table 1. Performance (Macro-F1) comparison between ViT-based and JEPA-based S²VM models under small-scale pretraining.

Q2: I encourage the authors to report downstream performance as a function of pretraining dataset size. This would help quantify the benefit of large-scale pretraining, which seems underexplored in the current paper.

Response: We have conducted additional experiments to evaluate the impact of pretraining dataset size on downstream DDI performance, as detailed in Appendix C.4 (Figures 9 & 10).

As shown in Table 2, we observe that S²VM benefits consistently from scaling up the number of pretraining drug pairs. The Macro-F1 scores increase steadily from 50M to 300M molecular pairs, with diminishing returns beyond 200M. Based on this trend and computational considerations, we selected 200K base molecules (yielding ~200M drug pairs) as our default setting.

| # Base Molecules | # Molecular Pairs | Deng's dataset | Ryu's dataset |
|---|---|---|---|
| 50K | ~50M | 0.7712 | 0.8883 |
| 100K | ~100M | 0.7963 | 0.9021 |
| 200K | ~200M | 0.8212 | 0.9237 |
| 300K | ~300M | 0.8261 | 0.9295 |

Table 2. Macro-F1 scores of S²VM on Deng's and Ryu's datasets across different pretraining data scales.

W1: The theoretical analysis in Section 4.2 lacks clarity. A concise summary in the main text and clearer notation in the proof would improve readability.
Response: We will add a concise summary of the theoretical insight directly in the main text, rather than deferring entirely to the Appendix. Specifically, we will highlight the core idea as follows:

Our training objective maximizes the mutual information between the blended visible fragments and the original unobserved parts of each molecule, thereby encouraging the model to learn both intra- and inter-molecular dependencies critical for DDI prediction.

Meanwhile, we will:

  • Explicitly clarify that X_1, X_2 denote the visible fragments from drug A and drug B (e.g., A_1, B_2) used as input;
  • Y denotes the reconstruction target consisting of the missing parts (e.g., A_2, B_1);
  • The mutual information term I(X_1, X_2; Y) thus quantifies how well the latent representation captures the information necessary to reconstruct the original molecules.

We will align the theoretical notation with the data flow used in the model (e.g., using A_1, A_2, B_1, B_2) and briefly explain how the decomposition supports the design of our fusion-before-encoding framework.
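In this notation, the standard chain rule of mutual information gives a one-line intuition for why jointly encoding both drugs' visible fragments can only add information (an illustration, not the paper's full derivation):

```latex
% X_1, X_2: visible fragments of the two drugs; Y: the missing parts.
% Chain rule, with the conditional term non-negative:
I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1) \;\ge\; I(X_1; Y).
```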

[1] Xiang, Hongxin, et al. "A molecular video-derived foundation model for scientific drug discovery." Nature Communications 15(1): 9696 (2024).
[2] Assran, Mahmoud, et al. "Self-supervised learning from images with a joint-embedding predictive architecture." CVPR (2023), pp. 15619–15629.

We sincerely appreciate your thoughtful suggestions, which will help us further strengthen the impact of our work. We hope our responses have effectively addressed your concerns and provided the necessary evidence to support a more favorable assessment.

Comment

Dear Reviewer yTyP,

Thank you once again for your thoughtful and constructive feedback. As the discussion phase comes to a close, we would be sincerely grateful if you could let us know whether our responses have fully addressed your concerns, particularly the discussion about the JEPA-style architecture.

We truly appreciate your time and consideration.

Best regards, Authors

Comment

Dear Reviewer,

Please engage in the discussion and then acknowledge that you have read the rebuttal by clicking "Mandatory Acknowledgement."

Best,

Your AC

Comment

Hi Authors,

All of my comments have been addressed. I appreciate the experiments on the JEPA model, which demonstrate that S²VM is competitive in performance against the JEPA approach.

Could you provide further details on the i-JEPA variant of S2^2VM? It seems like a vanilla implementation of i-JEPA applied to DDI prediction with 2D molecular images?

I will adjust my score, as my comments have been addressed.

Comment

Dear Reviewer yTyP,

We sincerely thank you for your positive feedback and for acknowledging that our work has addressed your comments. For the I-JEPA variant, we replaced the ViT backbone in S²VM with the I-JEPA architecture and adapted it to handle paired molecular images. Each drug image is masked and encoded separately to obtain its context features, which are then combined to reconstruct the missing regions of each image following the standard I-JEPA procedure. All default I-JEPA hyperparameters were retained, except for those shared with ViT (e.g., embed_dim=192). For fair comparison, we used the same pretraining scale as in our baseline (50K base molecules, ~50M drug pairs). After pretraining, the encoder was used as the feature extractor for paired drugs in the downstream DDI prediction task.

We will expand this description and release the implementation of the I-JEPA variant in the revised manuscript to increase the accessibility and impact of S²VM. We sincerely appreciate your recognition and support of our work.

Best regards, Authors

Final Decision

This paper proposes a self-supervised framework for drug-drug interaction (DDI) prediction. During pretraining, S2VM considers pairs of molecules as 2D images, and jointly encodes the pairwise information using Vision Transformer by reconstructing each molecule’s image from a blended image. The resulting pretrained encoder is fine-tuned for DDI prediction. Extensive experiments demonstrate the effectiveness of the proposed method. The reviewers initially suggested the following strengths and weaknesses:

Strengths:

  • The paper presents an original contribution to the field.
  • Extensive experiments demonstrate the effectiveness of the suggested self-supervised objective trained on large-scale (~200M) unlabeled pairs.

Weaknesses:

  • The method only uses 2D images as input, while neglecting potentially informative features, such as biological context.
  • The justification for the choice of input modality should be provided. S2VM does not use graph-based molecule representation, despite the fact that it is more direct and a common representation of molecular structure.
  • The comparison against JEPA-inspired approach (I-JEPA [1]) is not sufficient.
  • The supporting analysis to validate that the model truly learns mutual information and structural correlation between two drugs is insufficient.

The authors have adequately addressed most of the reviewers’ concerns, though some clarifications and additions remain necessary. In particular, the authors are required to provide the additional experiments presented during the rebuttal, especially the comparisons with MUFFIN [2]. Furthermore, the inference phase of the model should be clarified (Reviewer Kenp Q4) for better understanding. The authors need to include experiments conducted during the rebuttal and revise the content related to the inference process in the revised paper.

Since all reviewers provided positive ratings after the rebuttal, and considering the importance of the work in DDI as well as its powerful empirical performance, I recommend acceptance to NeurIPS.

[1] Assran, Mahmoud, et al. “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.” CVPR, 2023.

[2] Chen, Yujie, et al. "MUFFIN: multi-scale feature fusion for drug–drug interaction prediction." Bioinformatics 37(17): 2651–2658 (2021).