Reinforced Active Learning for Large-Scale Virtual Screening with Learnable Policy Model
We present GLARE, a reinforced active learning framework that improves virtual screening efficiency by dynamically optimizing molecular selection while balancing chemical diversity, biological relevance, and computational constraints.
Abstract
Reviews and Discussion
This paper proposes GLARE, a novel active learning framework for drug virtual screening. Unlike traditional methods that use hand-crafted selection heuristics, GLARE learns a reinforcement learning-based policy to directly select molecules for labeling. By formulating the task as a Markov Decision Process and optimizing a policy using Group Relative Policy Optimization (GRPO), GLARE dynamically balances exploration (diversity) and exploitation (biological relevance). Extensive experiments show that GLARE significantly outperforms existing active learning methods, achieving up to 64.8% higher enrichment factors and enhancing the performance of foundation models like DrugCLIP by up to 8-fold with minimal labeled data.
Strengths and Weaknesses
Strengths
- The overall framework design is well-aligned with the virtual screening task. Reformulating the process as a Markov Decision Process and learning a molecule selection policy is a reasonable and principled approach.
- The proposed method shows consistent improvement over baseline models across several datasets and evaluation metrics.
- The paper is clearly written and easy to follow. The figures, including framework diagrams and UMAP visualizations, are clear and professional.
Weaknesses
- While the approach is effective, the overall idea is relatively straightforward. It applies existing reinforcement learning techniques (specifically GRPO) within an active learning setting. The novelty mainly lies in combining known components. It would be beneficial to evaluate other reinforcement learning algorithms beyond GRPO, or provide stronger justification for its selection.
- The policy model only considers molecular features when making selection decisions. In structure-based virtual screening, incorporating protein or pocket-level features might improve performance.
- The method appears to require training a separate policy model for each target, which limits scalability in applications involving large sets of proteins or binding pockets. It does not support efficient many-to-many matching.
- The baselines in the experiments are somewhat limited. Only TcsAL is used for the LIT-PCBA benchmark, and only PtAL for the Enamine datasets. It would strengthen the empirical evaluation to include a wider range of baseline methods.
Questions
see weakness
Limitations
n/a
Final Justification
This paper uses the RL objective to replace conventional acquisition functions in an active learning based virtual screening setting. I acknowledge that applying an RL objective in an active learning setting is an interesting exploration, and its success is not surprising. However, I still have some reservations about the contributions, as I feel the method is highly dependent on the specific setting. In particular, if RL for active learning is applied only to virtual screening, the chosen baselines (simple acquisition functions) are relatively weak, and it is expected that RL would outperform them. This makes the contribution appear somewhat setting-dependent. That said, from another perspective, introducing a method to a new task where it has not been applied before does offer a degree of novelty. Overall, after careful consideration and reading other reviewers’ comments, I still think the weaknesses outweigh the strengths and will maintain my rating at borderline reject.
Formatting Issues
n/a
We sincerely thank the reviewer for the detailed feedback and constructive suggestions. Below, we summarize the reviewer's questions and provide point-by-point responses.
W1: The Contribution of the Proposed Methodology
R: Thank you for this insightful question. We would like to clarify that, while our approach leverages existing reinforcement learning (RL) techniques, our contribution goes well beyond simply combining known components. Our study specifically addresses the unique challenges of active learning in virtual screening, an area that differs substantially from standard RL tasks.
To our knowledge, this is the first work to integrate GRPO within an active learning framework and apply it to the context of virtual screening. We innovatively formulate the virtual screening process as a Markov Decision Process (MDP) and embed active learning principles, allowing GLARE to adaptively select the most informative molecules for labeling. By coupling the RL agent's selection strategy with docking feedback in an active learning loop, our method, GLARE, efficiently prioritizes the most informative docking computations, offering an efficient and practical solution for real-world virtual screening workflows.
In addition, we have substantially redesigned the framework around GRPO. Specifically, we introduced a new reward formulation tailored to virtual screening, which incorporates an exploration enhancement based on gradient uncertainty, a feature absent from prior methods.
We also experimented with Direct Preference Optimization (DPO), another effective policy optimization algorithm, for optimizing the policy network. As shown in the table below, GRPO consistently outperforms DPO in our setting. We attribute this to GRPO's use of group-wise advantage estimation, which better captures the relative quality of selected molecules within a batch (a property that aligns well with the objectives of virtual screening). This provides a direct motivation for our choice of GRPO.
| Method | Model | Strategy | ALDH1 (Iter 10) | ALDH1 (Iter 16) | PKM2 (Iter 10) | PKM2 (Iter 16) | VDR (Iter 10) | VDR (Iter 16) |
|---|---|---|---|---|---|---|---|---|
| GLARE | MLP | Policy (DPO) | 5.759 | 6.012 | 1.878 | 5.526 | 2.397 | 5.961 |
| GLARE | MLP | Policy (GRPO) | 6.574 | 6.535 | 2.067 | 5.904 | 3.173 | 6.512 |
| GLARE | GNN | Policy (DPO) | 4.892 | 6.743 | 1.927 | 6.491 | 3.618 | 6.635 |
| GLARE | GNN | Policy (GRPO) | 6.179 | 7.067 | 2.480 | 7.146 | 7.535 | 7.104 |
| GLARE | Pre. GNN | Policy (DPO) | 6.387 | 6.834 | 4.166 | 7.394 | 7.293 | 7.590 |
| GLARE | Pre. GNN | Policy (GRPO) | 7.274 | 7.205 | 4.547 | 7.768 | 7.932 | 7.992 |
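For readers less familiar with GRPO, the sketch below illustrates the group-wise advantage estimation referred to above, using the standard GRPO-style normalization; it is an illustration under our own simplifications, not the exact reward or implementation used in GLARE.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: each reward is normalized against its own group.

    `rewards` holds the rewards of the molecules selected in one group/batch, so the
    advantage of every molecule is measured relative to its group mates, which is the
    group-wise estimation that favors the comparatively better picks within a batch.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: three molecules selected in one batch with hypothetical rewards.
print(group_relative_advantages([0.2, 0.8, 0.5]))
```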
We hope this clarifies the novelty and significance of our contribution.
W2: Incorporating Protein or Pocket Features in the Policy Model
R: Thank you for this valuable suggestion.
We would like to clarify that our previous experiments have utilized protein features by incorporating DrugCLIP pretrained representations. However, the policy model did not include a protein encoder to fully exploit these protein or pocket-level features during the selection process. To address this, we have conducted additional experiments by explicitly integrating protein or pocket-level features into the policy model.
| Method | Mol. Model | Protein Model | Strategy | ALDH1 (Iter 10) | ALDH1 (Iter 16) | PKM2 (Iter 10) | PKM2 (Iter 16) | VDR (Iter 10) | VDR (Iter 16) |
|---|---|---|---|---|---|---|---|---|---|
| GLARE | MLP | × | Policy | 6.574 | 6.535 | 2.067 | 5.904 | 3.173 | 6.512 |
| GLARE | MLP | Pre. GNN | Policy | 6.732 | 6.607 | 4.159 | 6.575 | 4.124 | 6.716 |
| GLARE | GNN | × | Policy | 6.179 | 7.067 | 2.480 | 7.146 | 7.535 | 7.104 |
| GLARE | GNN | Pre. GNN | Policy | 6.501 | 7.175 | 4.027 | 7.318 | 6.827 | 7.224 |
| GLARE | Pre. GNN | × | Policy | 7.274 | 7.205 | 4.547 | 7.768 | 7.932 | 7.992 |
| GLARE | Pre. GNN | Pre. GNN | Policy | 7.027 | 7.269 | 4.672 | 7.795 | 7.897 | 8.021 |
These results also highlight both the strong extensibility and superior performance of our approach in virtual screening scenarios. We plan to further explore and design more effective protein feature fusion modules in future work.
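As a rough illustration of this kind of fusion (a minimal sketch under our own assumptions; the module names and dimensions are placeholders, not the architecture used in the experiments), the policy can score each (molecule, pocket) pair from the concatenation of the two embeddings:

```python
import torch
import torch.nn as nn

class JointPolicyHead(nn.Module):
    """Illustrative policy head that fuses molecular and protein/pocket embeddings.

    The molecular and pocket encoders (e.g., a GIN and a pretrained pocket encoder)
    are assumed to be defined elsewhere and to output fixed-size vectors; this head
    turns each (molecule, pocket) pair into a single selection score.
    """
    def __init__(self, mol_dim: int = 300, prot_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(mol_dim + prot_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, mol_emb: torch.Tensor, prot_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two views and return one logit per (molecule, pocket) pair.
        return self.scorer(torch.cat([mol_emb, prot_emb], dim=-1)).squeeze(-1)
```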
W3: Scalability for Applications
R: Thank you for raising this important point.
We would like to clarify that our method does not require training a separate policy model for each target. In our experiments, we utilized DrugCLIP-pretrained representations to encode protein or pocket features, enabling the model to generalize across multiple targets. For example, on the LIT-PCBA benchmark, our model simultaneously handles 15 protein targets and their associated binding pockets within a single framework, demonstrating superior performance across all targets (Table 3 in the manuscript).
To better exploit protein or pocket-level information, we have also conducted additional experiments incorporating these features directly into the policy model via a protein encoder. During training, the policy model takes both the molecular features and the corresponding protein or pocket features as input. This design enables efficient many-to-many matching between molecules and multiple targets or pockets within a single model, significantly improving scalability and practical applicability in large-scale virtual screening scenarios.
| Method | Mol. Model | Protein Model | AUROC(%) | BEDROC(%) | EF0.5% | EF1% | EF5% |
|---|---|---|---|---|---|---|---|
| DrugCLIP | - | - | 57.17 | 6.23 | 8.56 | 5.51 | 2.27 |
| GLARE(1) | Pre. GNN | × | 57.21 | 9.65 | 12.56 | 7.65 | 2.69 |
| GLARE(1) | Pre. GNN | Pre. GNN | 57.96 | 12.08 | 16.17 | 9.11 | 3.10 |
| GLARE(3) | Pre. GNN | × | 61.76 | 18.76 | 23.32 | 13.61 | 4.25 |
| GLARE(3) | Pre. GNN | Pre. GNN | 62.51 | 21.36 | 26.96 | 15.57 | 4.62 |
| GLARE(10) | Pre. GNN | × | 68.33 | 25.17 | 57.93 | 32.31 | 8.21 |
| GLARE(10) | Pre. GNN | Pre. GNN | 69.79 | 27.25 | 61.49 | 34.49 | 8.90 |
| GLARE(15) | Pre. GNN | × | 70.17 | 28.49 | 77.03 | 40.64 | 9.94 |
| GLARE(15) | Pre. GNN | Pre. GNN | 71.72 | 33.72 | 81.69 | 43.80 | 10.66 |
W4: Baseline Methods in Experiments
R: Thank you for your comments.
In active learning research, comparisons are typically organized around both acquisition functions (strategies) and surrogate models. Accordingly, TcsAL and PtAL each represent a series of baselines in our experiments. TcsAL includes baselines with various strategies and has been evaluated with both MLP and GNN surrogate models. PtAL, on the other hand, emphasizes comparisons using different surrogate models, including pretrained models such as MolCLR and MoLFormer, while employing two widely validated strategies (Greedy and UCB).
To further compare active learning methods with traditional virtual screening baselines, and to demonstrate the added value of GLARE, Table 3 presents results of GLARE and several virtual screening baselines (including DrugCLIP) across all 15 LIT-PCBA subsets.
Additionally, we have included comparisons with a classic baseline (ε-Greedy strategy) and a recently proposed baseline (Rank strategy)[1], as shown in the table below. While the Rank strategy performs better than the methods reported in Table 1, it still underperforms compared to GLARE. This is primarily because Rank remains reliant on manually defined selection strategies.
| Model | Strategy | ALDH1 (Iter 10) | ALDH1 (Iter 16) | PKM2 (Iter 10) | PKM2 (Iter 16) | VDR (Iter 10) | VDR (Iter 16) |
|---|---|---|---|---|---|---|---|
| MLP | ε-Greedy | 5.395 | 5.794 | 2.019 | 4.102 | 3.814 | 4.370 |
| MLP | Rank | 5.762 | 6.045 | 2.304 | 4.905 | 4.265 | 4.983 |
| MLP | GLARE | 6.574 | 6.535 | 2.067 | 5.904 | 3.173 | 6.512 |
| GNN | ε-Greedy | 3.861 | 4.267 | 2.135 | 2.914 | 2.911 | 3.327 |
| GNN | Rank | 4.397 | 5.637 | 2.278 | 4.662 | 3.767 | 5.492 |
| GNN | GLARE | 6.179 | 7.067 | 2.480 | 7.146 | 7.535 | 7.104 |
We hope these extensive comparisons address the reviewer's concerns and further validate the robustness and effectiveness of GLARE.
[1] Deng, Xun, et al. Improving the Hit Rates of Virtual Screening by Active Learning from Bioactivity Feedback. Journal of Chemical Theory and Computation, 4640-4651, 2025.
Thanks to the authors for the detailed response and the additional experiments addressing my questions.
I’m still a bit unclear about the exact setting. From my understanding, in the experiments described in Section 4.2, a separate policy model needs to be trained for each new target since there are no protein or pocket features involved. In contrast, in Section 4.3, based on the authors’ response, it seems that with the use of the foundation model DrugCLIP, all targets are screened within a unified framework. Specifically, for each target and at each iteration, a shared policy model selects 64 molecules, receives labels for each (target, molecule) pair, and then uses all the labeled pairs to update the policy model. Is this interpretation correct? Given the complexity of the setting, a more detailed explanation of the methodology would be helpful.
After reading the other reviews and the authors’ rebuttal, I believe this paper presents a very interesting new setting. For the experiment in Section 4.3, there is no doubt that it could significantly improve conventional screening performance under a limited active learning budget. However, such test-time improvements are somewhat expected. For example, a recent paper (DrugTTA [1]) has shown that with a pretrained model like DrugCLIP, test-time adaptation alone, even without any active learning budget, can boost screening performance to a BEDROC of 45.08% on LIT-PCBA. Currently, there are no active learning or test-time adaptation baseline comparisons provided for the results in Section 4.3, which makes it difficult to assess the relative effectiveness of the proposed method.
The active learning setting for virtual screening in Section 4.2 follows the setup from TcsAL. I appreciate the authors’ efforts in providing new baselines. However, I still find the baselines somewhat too simplistic. That said, I acknowledge the contribution of introducing a reinforcement learning objective to enhance the active learning acquisition strategy. The work would be more impactful if this reinforced active learning approach were explored across a broader range of molecular tasks beyond virtual screening and evaluated against more advanced or competitive baselines where possible.
[1] Shen, Ao, et al. "Drug-TTA: Test-Time Adaptation for Drug Virtual Screening via Multi-task Meta-Auxiliary Learning." Forty-second International Conference on Machine Learning.
Experimental Setting
Thank you for your thoughtful comments and for your attention to the experimental setup. Your understanding is correct. We will further clarify the methodology in more detail in the revised manuscript:
- In Section 4.2: As no protein or pocket features are used in this section, we follow the standard baseline protocol for active learning-based virtual screening. Specifically, a separate policy model is trained independently for each target (subset). Leveraging generalizable protein or pocket representations to enable a single policy model for all targets in this section is also a promising direction we will explore in future work.
- In Section 4.3: In this section, we incorporate target (protein) features using DrugCLIP. In the original experiments, we used pretrained protein features without further training the protein encoder. Following your suggestion, we additionally designed and trained a protein encoder during the rebuttal phase, which led to improved performance.
With the incorporation of target features, the setting of Section 4.3 differs from that of Section 4.2. During active learning, we employ a single, unified policy model (GLARE) to screen all targets jointly. At each iteration, the policy model selects molecules from each sub-library, and the corresponding (target, molecule) pairs are labeled and used to update the model. Once the number of labeled active molecules reaches the budget, the active learning process stops and the remaining molecules in each sub-library are screened.
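To make this protocol concrete, the sketch below outlines the unified loop described above; the callables (`select_fn`, `label_fn`, `update_fn`) and the stopping check are placeholders for the components described in the paper, not their actual implementations.

```python
def unified_active_learning(policy, targets, sub_libraries,
                            select_fn, label_fn, update_fn,
                            budget, batch_size=64):
    """Sketch of the shared-policy loop used in the Section 4.3 setting.

    At each iteration the single policy selects `batch_size` molecules per target,
    the resulting (target, molecule) pairs are labeled (e.g., active / inactive),
    and all labeled pairs are used to update the shared policy; the loop stops once
    the number of labeled actives reaches the budget.
    """
    labeled = []  # list of (target, molecule, is_active) tuples
    while sum(1 for _, _, is_active in labeled if is_active) < budget:
        for target in targets:
            picks = select_fn(policy, target, sub_libraries[target], batch_size)
            labeled += label_fn(target, picks)
        policy = update_fn(policy, labeled)  # e.g., a GRPO update on the labeled pairs
    return policy, labeled
```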
We will revise the manuscript to provide a more detailed explanation of these experimental protocols, as suggested.
More Competitive Baselines
Thank you for recognizing that our work presents a very interesting new setting. We also appreciate your valuable suggestion regarding baseline comparisons.
We would like to clarify that DrugTTA [1] was published after our initial submission, so it was not possible to include it as a baseline at that time.
Our work is fundamentally different from DrugTTA: GLARE achieves significant improvements over DrugCLIP with only a small number of actives (e.g., an 8-fold increase in EF0.5% with just 15 actives), while DrugTTA performs test-time adaptation across the entire library without active learning. These two methods operate under different assumptions and settings, and may even be complementary in practice.
In terms of scalability, DrugTTA requires test-time adaptation for every molecule, which is computationally expensive (e.g., 2.1 days for 100 million molecules). In contrast, our method maintains highly efficient inference: screening 200 million molecules takes only 0.5 hours, and 1 billion molecules just 14 hours.
Furthermore, we have added active learning and TTA baselines in Section 4.3. The results are summarized below. We hope these additions address your concerns and further clarify our contributions.
| Model | Strategy | AUROC(%) | BEDROC(%) | EF0.5% | EF1% | EF5% |
|---|---|---|---|---|---|---|
| DrugCLIP | - | 57.17 | 6.23 | 8.56 | 5.51 | 2.27 |
| GLARE(20) | MI | 65.08 | 29.75 | 37.85 | 21.34 | 6.41 |
| GLARE(20) | DPO | 70.36 | 35.03 | 65.53 | 37.63 | 9.62 |
| DrugTTA | - | 71.24 | 45.08 | 74.39 | 42.74 | 10.61 |
| GLARE(15) | GRPO | 71.72 | 33.72 | 81.69 | 43.80 | 10.66 |
| GLARE(20) | GRPO | 79.78 | 46.39 | 83.51 | 47.36 | 12.08 |
Finally, we would like to emphasize that, to the best of our knowledge, this is the first work to integrate GRPO within an active learning framework and apply it to large-scale virtual screening. By reformulating virtual screening as a Markov Decision Process (MDP) and introducing an active learning-driven reward structure, GLARE can adaptively and efficiently select the most informative molecules.
We will further discuss the comparison and differences between GLARE and TTA-based methods (Drug-TTA) in the revised manuscript.
A Broader Range of Molecular Tasks
Thank you for acknowledging our contribution in introducing a reinforcement learning objective to enhance the active learning acquisition strategy.
We would like to emphasize that the GLARE framework is highly flexible and extensible. While our current work focuses on virtual screening, the underlying methodology, framing the acquisition strategy as a reinforcement learning problem, can be readily adapted to a wide range of molecular tasks, such as few-shot molecular property prediction, molecular docking, and molecular lead optimization.
Our core contribution lies in coupling active learning with reinforcement learning, enabling GLARE to adaptively and efficiently identify informative samples in data-scarce molecular scenarios. We believe this unified framework not only demonstrates significant potential for efficient data acquisition across diverse molecular applications, but also brings meaningful value to the broader field of molecular machine learning.
Thank you again for your insightful feedback.
Thanks to the authors for their justification and additional experiments. I acknowledge that applying an RL objective in an active learning setting is an interesting exploration, and its success is not surprising. However, I still have some reservations about the contributions, as I feel that the method is highly dependent on the specific setting. In particular, if RL for active learning is only applied to virtual screening, the chosen baselines (simple acquisition functions) are relatively weak, and it is expected that RL would outperform them. This makes the contribution appear to be somewhat setting-dependent. That said, from another perspective, introducing a method to a new task where it has not been applied before does have some degree of novelty. At present, I remain cautious in my evaluation, but I may update my score upon more careful consideration. I thank the authors again for their efforts and engagement in the rebuttal process.
We greatly appreciate the time and effort you have dedicated to reviewing our work. Your thoughtful and constructive feedback has played a crucial role in enhancing the quality of our manuscript.
Please let us know if you have any further questions or need additional information.
This paper presents GLARE, a novel active learning framework for virtual screening. The authors reformulate the selection of molecules as a Markov Decision Process (MDP) and apply Group Relative Policy Optimization (GRPO) to train a learnable policy model for molecule selection. The proposed framework achieves state-of-the-art performance over baselines along with better exploration and exploitation.
Strengths and Weaknesses
Strengths:
- This paper studies an important scientific problem, i.e., virtual screening, with potential real-world impact
- Reformulating virtual screening as an MDP is innovative
- The writing is overall clear and easy to follow.
Weaknesses:
- The RL part relies heavily on Group Relative Policy Optimization, which limits the novelty of the paper
- The paper does not report walltime or computational cost relative to TcsAL or PtAL.
- The choice of pretrained GNN is not clear. If I understand correctly, GNN corresponds to GIN; what about the pretrained GNN? Similarly, for Table 2, could the authors add another experiment so that PtAL and GLARE use the same pretrained GNNs for a fairer comparison?
Questions
- Could GLARE be extended to support multi-objective optimization? for example, if users want to identify active molecules with optimized druglikeness, solubility, etc. Could the reward function or policy network be extended accordingly?
Limitations
Yes
Final Justification
The authors' responses have addressed most of my concerns and problems, and I am therefore willing to raise my score to 4.
Formatting Issues
NA
We sincerely thank the reviewer for the detailed feedback and constructive suggestions to improve our manuscript. Below, we summarize the reviewer's questions and provide point-by-point responses.
W1: Clarification on Novelty and GRPO
R: Thank you for your valuable comments.
We respectfully disagree with the view that our work is merely a straightforward application of Group Relative Policy Optimization (GRPO). In fact, our study goes significantly beyond simply adopting GRPO. We specifically address the unique and underexplored challenges of active learning in virtual screening, challenges that are fundamentally different from those in standard RL domains. These include extreme class imbalance, label scarcity, and the need for efficient prioritization of candidates from ultra-large chemical spaces.
To our knowledge, this is the first work to integrate GRPO within an active learning framework and apply it to large-scale virtual screening. We not only reformulate the virtual screening process as a Markov Decision Process (MDP) but also introduce active learning-driven reward structures, enabling GLARE to adaptively and efficiently select the most informative molecules. By coupling the RL agent's selection strategy with docking feedback in an active learning loop, our method, GLARE, efficiently prioritizes the most informative docking computations, offering an efficient and practical solution for real-world virtual screening workflows.
In addition, we have substantially redesigned the framework around GRPO. Specifically, we introduced a new reward formulation that incorporates gradient-based uncertainty and is tailored to the active learning context, an aspect not addressed in prior GRPO works.
We also compared GRPO with Direct Preference Optimization (DPO), another effective policy optimization algorithm, for optimizing the policy network. As shown in the table below, GRPO consistently outperforms DPO in our setting. We attribute this to GRPO's use of group-wise advantage estimation, which better captures the relative quality of selected molecules within a batch (a property that aligns well with the objectives of virtual screening). This provides a direct motivation for our choice of GRPO.
| Method | Model | Strategy | ALDH1 (Iter 10) | ALDH1 (Iter 16) | PKM2 (Iter 10) | PKM2 (Iter 16) | VDR (Iter 10) | VDR (Iter 16) |
|---|---|---|---|---|---|---|---|---|
| GLARE | MLP | Policy (DPO) | 5.759 | 6.012 | 1.878 | 5.526 | 2.397 | 5.961 |
| GLARE | MLP | Policy (GRPO) | 6.574 | 6.535 | 2.067 | 5.904 | 3.173 | 6.512 |
| GLARE | GNN | Policy (DPO) | 4.892 | 6.743 | 1.927 | 6.491 | 3.618 | 6.635 |
| GLARE | GNN | Policy (GRPO) | 6.179 | 7.067 | 2.480 | 7.146 | 7.535 | 7.104 |
| GLARE | Pre. GNN | Policy (DPO) | 6.387 | 6.834 | 4.166 | 7.394 | 7.293 | 7.590 |
| GLARE | Pre. GNN | Policy (GRPO) | 7.274 | 7.205 | 4.547 | 7.768 | 7.932 | 7.992 |
We hope this clarifies the novelty and significance of our contribution.
W2: The Computational Cost
R: Thank you for raising this important question. To provide a comprehensive comparison, we evaluated both training and inference runtimes on the ALDH1 dataset, using the same hyperparameters and hardware as TcsAL (see Appendix C.4).
Active learning typically involves training on a small number of labeled samples while making predictions on a much larger pool of unlabeled data. As a result, inference time becomes the dominant computational cost, especially in large-scale virtual screening scenarios. As shown in the table below, while our method requires additional training time due to the computational cost of gradient-based uncertainty in reward calculation, this overhead is confined to the training phase. During inference, reward computation is not required, enabling our method to achieve inference speeds comparable to TcsAL.
| Method | Model | Train Time (Sec. per Epoch) | Infer. Time (Sec.) |
|---|---|---|---|
| TcsAL | MLP | 1.5 | 18.2 |
| TcsAL | GNN | 4.2 | 148.6 |
| GLARE | MLP | 4.7 | 14.5 |
| GLARE | GNN | 8.9 | 152.6 |
This characteristic is particularly important for large-scale datasets. To further demonstrate the scalability of our approach, we tested it on the AmpC dataset (~99.5 million compounds) using the same hyperparameters as PtAL. As dataset size increases, inference time becomes even more significant. Our method maintains inference speed similar to baseline methods while delivering much higher accuracy, making it especially well-suited for ultra-large virtual screening tasks.
| Method | Model | EnamineHTS-0.1 (2m): Train Time (Sec.) | EnamineHTS-0.1 (2m): Infer. Time (Sec.) | AmpC (99.5m): Train Time (Sec.) | AmpC (99.5m): Infer. Time (Sec.) |
|---|---|---|---|---|---|
| PtAL | GNN | 71.4 | 825.5 | 4,112.6 | 40,423.3 |
| PtAL | Pre. GNN | 129.1 | 922.3 | 6,749.7 | 45,374.8 |
| GLARE | GNN | 84.2 | 913.3 | 4,868.2 | 49,662.4 |
| GLARE | Pre. GNN | 155.9 | 1,179.1 | 8,237.6 | 55,127.2 |
W3: The choice of Pretrained GNN.
R: Thank you for highlighting this important point.
For GLARE (with GNN), we use a Graph Isomorphism Network (GIN) as the backbone. For GLARE (with Pre. GNN), we also employ a GIN architecture, but it is initialized with weights from the pretrained GraphMVP or MolCLR model.
From the table below, GLARE (with Pre. GNN) substantially outperforms PtAL (with MolCLR), even though both use pretrained GINs (with different pretraining protocols). Notably, GLARE (with GNN, no pretraining) also surpasses PtAL (with MolCLR) in the final phase (Iter 6), confirming the strength of our learned policy.
To further isolate the effect of the learning strategy, we additionally compared GLARE (with MolCLR) and PtAL (with MolCLR) using the same pretrained GINs. As shown in the table below, GLARE consistently outperforms PtAL under the same backbone, demonstrating that the performance gain is primarily attributable to GLARE's superior learning strategy, rather than the encoder itself.
| Method | Model | Strategy | Enamine50k (Iter 4) | Enamine50k (Iter 6) | EnamineHTS-0.1 (Iter 4) | EnamineHTS-0.1 (Iter 6) | EnamineHTS-0.2 (Iter 4) | EnamineHTS-0.2 (Iter 6) |
|---|---|---|---|---|---|---|---|---|
| PtAL | MolCLR | Greedy | 0.5000 | 0.6708 | 0.5512 | 0.7278 | 0.7574 | 0.8698 |
| PtAL | MolCLR | UCB | 0.4972 | 0.6796 | 0.5384 | 0.7276 | 0.7624 | 0.8844 |
| GLARE | GNN | Policy | 0.4926 | 0.7424 | 0.4385 | 0.7526 | 0.7024 | 0.9032 |
| GLARE | MolCLR | Policy | 0.7695 | 0.8652 | 0.8425 | 0.8814 | 0.9356 | 0.9527 |
| GLARE | Pre. GNN (GraphMVP) | Policy | 0.7765 | 0.8869 | 0.8637 | 0.9181 | 0.9618 | 0.9732 |
Q1: Extension to Multi-Objective Optimization.
R: Thank you for this excellent question. We appreciate the suggestion and will discuss this direction in the Discussion section as future work.
GLARE can indeed be extended to support multi-objective optimization, which is highly relevant for practical drug discovery.
Reward Extension: A straightforward way is to incorporate additional objectives as weighted regularization terms in the reward function. For example, the reward can be augmented with a term that represents a constraint or score for another property, such as druglikeness or similarity to a reference compound. By tuning the corresponding weight, the model can learn to favor molecules that optimize multiple objectives (see the sketch below).
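As a minimal illustration of such an augmented reward (the notation is ours, not the paper's exact formulation): let $r_i$ be the original reward for molecule $m_i$, $g_k(m_i)$ a normalized score for an auxiliary property (e.g., QED for druglikeness), and $\lambda_k$ a tunable weight:
$$ r_i' = r_i + \sum_{k} \lambda_k \, g_k(m_i). $$
Setting all $\lambda_k = 0$ recovers the original single-objective reward.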
Policy Extension: Alternatively, the framework can support multiple policy networks, each focusing on a different objective (such as activity, druglikeness, or solubility). Each policy provides a confidence score for its respective objective, and decisions can be made by aggregating these scores. This can be formulated as a multi-agent optimization framework in which actions from different agents (objectives) are combined to determine the final selection.
Thanks to our flexible reward design and reinforcement learning-based policy, GLARE naturally accommodates such extensions. This makes it well-suited for practical drug discovery scenarios that require simultaneous optimization of multiple molecular properties.
Thank you for your time and effort in reviewing our work. We sincerely appreciate your insightful and constructive feedback, which has been instrumental in improving our manuscript.
Please let us know if you have any further questions or need additional information.
This work proposes GLARE, a reinforcement learning-based active learning (AL) framework designed for large-scale virtual screening under limited annotation budgets. GLARE reformulates molecular selection as a Markov Decision Process (MDP) and replaces traditional hand-crafted acquisition strategies with a learnable policy network trained via Group Relative Policy Optimization (GRPO). The method dynamically balances structural diversity, biological relevance, and computational cost during molecule selection. The framework is compatible with various molecular encoders, including MLPs, GNNs, and pretrained models. Experiments on the LIT-PCBA and Enamine benchmarks show substantial improvements in enrichment factor (EF) and retrieving rate (RR) over competitive baselines, particularly in low-data regimes. GLARE also shows strong synergy with foundation models like DrugCLIP, achieving multi-fold gains with minimal additional annotations. Overall, the authors contribute an adaptive approach for efficient hit identification in early-stage drug discovery.
Strengths and Weaknesses
Strengths:
- The use of a learnable policy model optimized through GRPO distinguishes this work from previous solutions that rely on static and handcrafted heuristics. The methodology is solid as the paper provides a thorough breakdown of the components, including the molecular encoder variants, reward function design, and policy optimization.
- The experimental section is extensive, with evaluations on both standard (LIT-PCBA) and large-scale (Enamine) benchmarks. The framework performs well even in extremely low-data regimes, and it is also able to improve the enrichment factor of foundation models like DrugCLIP with limited additional labels. These promising results (Section 4, Experiments) show a great potential of GLARE in early-stage drug-discovery practice.
- The inclusion of ablation studies, UMAP visualizations, and comparisons across multiple molecular encoders adds clarity and depth to the evaluation.
Weaknesses:
- Although the GRPO-based optimization is well-motivated, the computational cost associated with reinforcement learning in high-dimensional chemical spaces is not clearly quantified (Section 5, Discussion), particularly as the size of the screening library increases to 10⁸–10⁹ compounds.
- The reward function in equation 3 and equation 10 uses uᵢ, a manually constructed discount factor for exploration. This heuristic rule proves effective on specific datasets, but it may not generalize well to other tasks (e.g., different biological targets or hit rate distributions), as it is fundamentally driven by empirical design rather than data.
- The contributions of the pretrained encoder versus the learned policy in GLARE are not clearly distinguished. In Table 1 and Table 2, although the combination of GLARE with pretrained GNN encoders performs best, the experiments do not clarify whether the observed performance gains arise primarily from the pretrained molecular representations or from the optimization of the policy network itself.
Questions
- Can the authors provide a quantitative analysis of GLARE’s computational cost and scalability?
- How sensitive is GLARE’s policy performance to the reward formulation and exploration heuristics?
Limitations
No. The paper presents a well-structured and comprehensive framework, but its discussion of limitations could be made more explicit. The authors briefly mention scalability issues (Section 5, Discussion), but they do not provide any quantitative analysis of the computational cost of policy training and inference, especially in ultra-large chemical libraries. Next, the reward function includes heuristic components such as uncertainty-weighted discount factors, yet the potential sensitivity of these heuristics across diverse targets is not explored. On top of that, the work does not consider the biological interpretability and off-target risks of the retrieved compounds, which could limit the application of the framework in practical drug discovery scenarios. At this point, a more transparent discussion of these aspects would strengthen the paper's reliability and completeness.
Final Justification
The authors have addressed my concerns. Although this work introduces a new way of improving active learning-based virtual screening, I view it as incremental rather than transformative for the field. Since a score of 6 requires groundbreaking impact, as stated in the rating scale, I maintain my original score of 5.
Formatting Issues
No
We sincerely thank the reviewer for their positive comments, especially recognizing the novelty of our work and the extensiveness of our experimental section. We also appreciate the constructive suggestions, which have helped us to further improve the quality of our manuscript. Below we provide detailed point-by-point responses to each of the reviewer's concerns.
W1 & Q1: Computational Cost in Large Chemical Space
R: We appreciate your valuable comment on the computational scalability of GRPO-based reinforcement learning, particularly as the screening library size increases to 10⁸–10⁹ compounds.
To address this concern, we conducted large-scale experiments on the AmpC dataset (~99.5 million compounds) using the same hyperparameters as PtAL. As shown in the table below, our method (GLARE) achieves inference speeds comparable to the baseline PtAL across both medium (EnamineHTS-0.1, 2 million compounds) and ultra-large (AmpC, 99.5 million compounds) datasets. Notably, while our method introduces additional computational overhead during training due to gradient-based reward optimization, this overhead does not affect inference. During inference, reward computation is not required, and the runtime is dominated by model forward passes, as in the baselines.
| Method | Model | EnamineHTS-0.1 (2m): Train Time (Sec.) | EnamineHTS-0.1 (2m): Infer. Time (Sec.) | AmpC (99.5m): Train Time (Sec.) | AmpC (99.5m): Infer. Time (Sec.) |
|---|---|---|---|---|---|
| PtAL | GNN | 71.4 | 825.5 | 4,112.6 | 40,423.3 |
| PtAL | Pre. GNN | 129.1 | 922.3 | 6,749.7 | 45,374.8 |
| GLARE | GNN | 84.2 | 913.3 | 4,868.2 | 49,662.4 |
| GLARE | Pre. GNN | 155.9 | 1,179.1 | 8,237.6 | 55,127.2 |
We also evaluated both training and inference runtimes on the ALDH1 dataset (PKM2 and VDR have similar scales), using the same hyperparameters and hardware as TcsAL.
| Method | Model | Train Time (Sec. per Epoch) | Infer. Time (Sec.) |
|---|---|---|---|
| TcsAL | MLP | 1.5 | 18.2 |
| TcsAL | GNN | 4.2 | 148.6 |
| GLARE | MLP | 4.7 | 14.5 |
| GLARE | GNN | 8.9 | 152.6 |
W2: The Design of the Exploration Discount Factor
R: Thank you for this insightful comment.
We would like to clarify that the exploration discount factor $u_i$ in our reward function is not solely based on heuristic or empirical design, but is fundamentally data-driven. The motivation for introducing $u_i$ is to ensure that even if a molecule is predicted as inactive, it can still be selected if its uncertainty is high (i.e., it may be informative for model training). Thus, we penalize the reward by a factor that captures uncertainty, rather than using a fixed or manually tuned value.
Specifically, inspired by [1], we incorporate Exploration Modification in the reward, where uncertainty is quantified by the L2 norm of the model's gradient. This approach makes the discount factor adaptive to the actual characteristics of each dataset, instead of being a static or hand-crafted parameter. As a result, our method can flexibly adapt to different tasks, targets, and hit rate distributions, thereby enhancing both its generalizability and effectiveness.
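For illustration, the sketch below computes a gradient-norm uncertainty of this general flavor (a BADGE-style estimate restricted to the classifier head, using the model's own prediction as a pseudo-label); it is a simplification under our own assumptions, not the exact quantity used in GLARE.

```python
import torch
import torch.nn.functional as F

def gradient_norm_uncertainty(logits: torch.Tensor, last_hidden: torch.Tensor) -> torch.Tensor:
    """Per-sample uncertainty as the L2 norm of the last-layer loss gradient.

    logits: (B, C) class scores; last_hidden: (B, H) penultimate features.
    With cross-entropy and the model's own hard prediction as pseudo-label, the
    gradient of the head weights is the outer product (probs - onehot) x hidden,
    whose Frobenius norm factorizes into the two vector norms below.
    """
    probs = torch.softmax(logits, dim=-1)                       # (B, C)
    pseudo = F.one_hot(probs.argmax(dim=-1), probs.size(-1))    # model's own prediction
    grad_logits = probs - pseudo                                # d(CE)/d(logits)
    return grad_logits.norm(dim=-1) * last_hidden.norm(dim=-1)  # (B,)
```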
We will revise this part in the manuscript to improve its clarity. Thank you again for your valuable feedback.
[1] Jordan T. Ash et al., "Deep batch active learning by diverse, uncertain gradient lower bounds." arXiv preprint arXiv:1906.03671, 2019.
W3: The Contributions of the Pretrained Encoder and Learned Policy
R: Thank you for highlighting this important point.
To clarify the respective contributions of the pretrained encoder and the learned policy in GLARE, we provide a condensed analysis of the results from Table 1 and Table 2.
As shown in the table below, GLARE (with GNN) significantly outperforms TcsAL (with GNN), regardless of the strategy used. Both employ the same non-pretrained GIN architecture. This demonstrates that our Learnable Policy strategy in GLARE provides a substantial advantage over prior approaches. GLARE (with Pre. GNN) further outperforms GLARE (with GNN), indicating that pretraining effectively addresses the cold-start problem and enhances early-phase retrieval performance.
| Method | Model | Strategy | ALDH1 (Iter 10) | ALDH1 (Iter 16) | PKM2 (Iter 10) | PKM2 (Iter 16) | VDR (Iter 10) | VDR (Iter 16) |
|---|---|---|---|---|---|---|---|---|
| TcsAL | GNN | Random | 1.020 | 0.988 | 0.868 | 1.036 | 1.071 | 0.994 |
| TcsAL | GNN | Similarity | 2.438 | 2.362 | 1.943 | 2.113 | 2.419 | 2.664 |
| TcsAL | GNN | Uncertainty | 0.907 | 0.923 | 0.868 | 0.777 | 0.754 | 0.681 |
| TcsAL | GNN | Greedy | 3.126 | 3.750 | 1.901 | 2.548 | 3.014 | 2.871 |
| TcsAL | GNN | MI | 3.482 | 4.077 | 1.777 | 2.952 | 4.085 | 4.085 |
| GLARE | GNN | Policy | 6.179 | 7.067 | 2.480 | 7.146 | 7.535 | 7.104 |
| GLARE | Pre. GNN (GraphMVP) | Policy | 7.274 | 7.205 | 4.547 | 7.768 | 7.932 | 7.992 |
From the table below, GLARE (with Pre. GNN) also substantially surpasses PtAL (with MolCLR), even though both use pretrained GINs (with different pretraining protocols). GLARE (with GNN, no pretraining) even outperforms PtAL (with MolCLR) at the final phase (Iter 6), confirming the strength of our learned policy.
To further isolate the effect of the learning strategy, we compared GLARE (with MolCLR) and PtAL (with MolCLR) under the same pretrained GINs, as also shown in the table below. GLARE consistently outperforms PtAL, showing that the performance gain is primarily due to GLARE's superior learning strategy rather than the encoder alone.
| Method | Model | Strategy | Enamine50k (Iter 4) | Enamine50k (Iter 6) | EnamineHTS-0.1 (Iter 4) | EnamineHTS-0.1 (Iter 6) | EnamineHTS-0.2 (Iter 4) | EnamineHTS-0.2 (Iter 6) |
|---|---|---|---|---|---|---|---|---|
| PtAL | MolCLR | Greedy | 0.5000 | 0.6708 | 0.5512 | 0.7278 | 0.7574 | 0.8698 |
| PtAL | MolCLR | UCB | 0.4972 | 0.6796 | 0.5384 | 0.7276 | 0.7624 | 0.8844 |
| GLARE | GNN | Policy | 0.4926 | 0.7424 | 0.4385 | 0.7526 | 0.7024 | 0.9032 |
| GLARE | MolCLR | Policy | 0.7695 | 0.8652 | 0.8425 | 0.8814 | 0.9356 | 0.9527 |
| GLARE | Pre. GNN (GraphMVP) | Policy | 0.7765 | 0.8869 | 0.8637 | 0.9181 | 0.9618 | 0.9732 |
These comparisons clearly demonstrate that GLARE's superior performance arises mainly from its learned policy, while the use of a pretrained encoder provides an additional warm-up effect, further enhancing screening efficiency.
Q2: Sensitivity Analysis of GLARE's Policy Performance
R: To evaluate the sensitivity of GLARE's policy to the reward formulation and exploration heuristics, we conducted a series of ablation studies on the ALDH1, PKM2, and VDR datasets using GLARE (with GNN). The results are summarized in the table below.
| Method | Strategy | ALDH1 | PKM2 | VDR |
|---|---|---|---|---|
| TcsAL | Greedy | 3.750 | 2.548 | 2.871 |
| TcsAL | MI | 4.077 | 2.952 | 4.085 |
| GLARE | w/o exploration modification | 6.539 | 6.657 | 6.531 |
| GLARE | constant exploration factor | 6.842 | 6.863 | 6.732 |
| GLARE | modified reward | 6.873 | 6.902 | 6.965 |
| GLARE | Policy (full) | 7.067 | 7.146 | 7.104 |
- Removing the exploration modification entirely results in the lowest performance among GLARE variants, as the model loses the ability to identify uncertain but informative molecules.
- Using a constant exploration factor yields better results than no modification, but is still less effective than dynamically adjusting the factor based on gradient uncertainty.
- When the reward formulation in GLARE is modified, performance drops slightly compared to the original GLARE, but still remains well above all baselines, indicating that GLARE is not highly sensitive to the specific reward formulation.
Overall, these results highlight the robustness of GLARE and the added value of our adaptive exploration strategy. Importantly, all GLARE variants consistently outperform the existing baselines, regardless of how the reward or exploration modification is set. This demonstrates the robustness of the GLARE framework.
L1: Discussion
R: We appreciate the reviewer's valuable feedback and acknowledge several limitations of our current framework. We will incorporate these discussions and additional analyses in the revised manuscript.
While GLARE demonstrates strong performance and reasonable computational efficiency in our experiments, scaling to ultra-large chemical libraries (e.g., 10⁸–10⁹ compounds) remains challenging. To address this, we have included additional experiments on a ~10⁸-scale compound database (AmpC), which demonstrate the method's scalability in large-scale settings.
Although our sensitivity analysis shows that GLARE is robust to changes in reward formulation and exploration heuristics, there may still be cases where performance is affected by unusual or out-of-distribution targets.
Finally, our current framework does not explicitly consider biological interpretability, which is important for real-world drug discovery applications. We recognize these as important directions for future research to further enhance the reliability and practical impact of our approach.
Thank you for your response. The authors have addressed my concerns. However, elevating the paper to a “Strong Accept” would require "groundbreaking impact" as described in the rating selection. While the method is interesting and valuable to the field, I view its contribution as incremental rather than transformative, so I will keep my original score.
Thank you very much for your thoughtful feedback and for recognizing our method as interesting and valuable to the field. While we understand and respect your perspective regarding the level of impact, we would like to highlight that, to our knowledge, this is the first work to integrate GRPO within an active learning framework for large-scale virtual screening. By coupling the RL agent’s selection strategy with docking feedback in an active learning loop, we believe these innovations collectively offer a practical and efficient solution for real-world virtual screening.
We sincerely appreciate your insightful and constructive feedback, which has greatly contributed to improving our work.
This paper introduces the GLARE framework (a GRPO-based Learning framework for Active REinforced screening) for virtual screening in drug discovery. It casts virtual screening as a Markov decision process and integrates it into a policy-based GRPO reinforcement learning framework. GLARE achieves competitive performance across multiple benchmark datasets.
Strengths and Weaknesses
Strengths:
- Models virtual screening as a Markov decision process and leverages GRPO for policy optimization.
- Attains higher enrichment factors and retrieval rates compared to classical active‐learning strategies and state‐of‐the‐art pretrained backbones.
Weaknesses:
- Represents an incremental modification of existing reinforcement learning frameworks.
- Does not report training or inference runtime.
Questions
- What is the computational time required to train these models?
- Are four and six iterations sufficient for benchmarking the enrichment factor in Table 2? Can the authors show that training has converged?
- Could the authors elaborate on the specific strategies applied in the experiments summarized in Table 2?
Limitations
Yes
Final Justification
This work applies GRPO reinforcement learning to virtual screening. The authors address my concern regarding training convergence and report the computational costs. One minor point is the incremental use of an existing GRPO method, as noted by another reviewer as well. Therefore, we maintain our current positive score.
Formatting Issues
None
We appreciate the reviewer's careful review of our paper, positive feedback, and recognition of our work, particularly the effectiveness of our proposed method. Below, we address the concerns point by point:
W1: The Incremental Modification of Existing Reinforcement Learning Frameworks
R: Thank you for your valuable comments. We would like to clarify that our work is not a simple incremental modification of existing RL frameworks. Our study specifically addresses active learning in virtual screening, which poses unique challenges compared to standard RL tasks.
To our knowledge, we are the first to integrate RL (GRPO) with active learning and apply it to virtual screening. We innovatively model the virtual screening scenario as a Markov Decision Process (MDP) and combine it with active learning principles. This enables our method, GLARE, to learn how to select the most informative molecules adaptively. By coupling the RL agent's selection strategy with docking feedback in an active learning loop, GLARE efficiently prioritizes the most informative docking computations, offering an efficient and practical solution for real-world virtual screening workflows.
In addition, based on GRPO, we have substantially redesigned the framework. Specifically, we introduced a new reward function based on gradient uncertainty (Exploration Modification), specifically crafted to balance exploration and exploitation in this context, an aspect not addressed by existing methods. We hope this clarifies the novelty and significance of our contribution.
W2 & Q1: The Runtime of Training or Inference
R: Thank you for raising this important question. To provide a comprehensive comparison, we evaluated both training and inference runtimes on the ALDH1 dataset, using the same hyperparameters and hardware as TcsAL (see Appendix C.4).
Active learning typically involves training on a small number of labeled samples while making predictions on a much larger pool of unlabeled data. As a result, inference time becomes the dominant computational cost, especially in large-scale virtual screening scenarios. As shown in the table below, while our method requires additional training time due to the computational cost of gradient-based uncertainty in reward calculation, this overhead is confined to the training phase. During inference, reward computation is not required, enabling our method to achieve inference speeds comparable to TcsAL.
| Method | Model | Train Time (Sec. per Epoch) | Infer. Time (Sec.) |
|---|---|---|---|
| TcsAL | MLP | 1.5 | 18.2 |
| TcsAL | GNN | 4.2 | 148.6 |
| GLARE | MLP | 4.7 | 14.5 |
| GLARE | GNN | 8.9 | 152.6 |
This characteristic is particularly important for large-scale datasets. To further demonstrate the scalability of our approach, we tested it on the AmpC dataset (~99.5 million compounds) using the same hyperparameters as PtAL. As dataset size increases, inference time becomes even more significant. Our method maintains inference speed similar to baseline methods while delivering much higher accuracy, making it especially well-suited for ultra-large virtual screening tasks.
| Method | Model | EnamineHTS-0.1 (2m): Train Time (Sec.) | EnamineHTS-0.1 (2m): Infer. Time (Sec.) | AmpC (99.5m): Train Time (Sec.) | AmpC (99.5m): Infer. Time (Sec.) |
|---|---|---|---|---|---|
| PtAL | GNN | 71.4 | 825.5 | 4,112.6 | 40,423.3 |
| PtAL | Pre. GNN | 129.1 | 922.3 | 6,749.7 | 45,374.8 |
| GLARE | GNN | 84.2 | 913.3 | 4,868.2 | 49,662.4 |
| GLARE | Pre. GNN | 155.9 | 1,179.1 | 8,237.6 | 55,127.2 |
Q2. The Sufficiency of Iterations and Training Convergence in Table 2
R: Thank you for pointing out this question.
We closely follow the setting of the baseline PtAL, which adopts four and six active learning iterations with a fixed total annotation budget. Specifically, the total budget = (number of iterations) × (budget per iteration), and this total budget is kept the same across all methods for fair comparison. This practice is standard in the active learning literature and is widely considered sufficient for benchmarking enrichment factors. In our experiments, we observed that the model performance stabilizes after several iterations, and additional rounds yield diminishing returns. Thus, using four and six iterations is both representative and adequate for this benchmark.
Regarding training convergence, we ensured sufficient training within each active learning round. Following the TcsAL baseline, we trained each model for 50 epochs per iteration and monitored the training loss. As shown in the table below, the loss curves for MLP, GNN, and Pre. GNN all reach convergence within each iteration, confirming that our models are well-trained.
| Epoch | MLP (Iter 4) | MLP (Iter 6) | GNN (Iter 4) | GNN (Iter 6) | Pre. GNN (Iter 4) | Pre. GNN (Iter 6) |
|---|---|---|---|---|---|---|
| 1 | 0.4328 | 0.4204 | 0.4548 | 0.555 | 0.4858 | 0.5183 |
| 2 | 0.1292 | 0.1828 | 0.1717 | 0.2749 | 0.2854 | 0.1897 |
| 5 | 0.0225 | 0.0181 | 0.0478 | 0.0871 | 0.0351 | 0.0181 |
| 10 | 0.0016 | 0.0097 | 0.0062 | 0.0158 | 0.0039 | 0.0036 |
| 20 | 0.0027 | 0.0023 | 0.0112 | 0.0092 | 0.0185 | 0.0256 |
| 50 | 0.0003 | 0.0009 | 0.0071 | 0.0029 | 0.0006 | 0.0221 |
Q3: Active Learning Strategies in Table 2
R: Thank you for your question.
In the experiments summarized in Table 2, we adopted several active learning acquisition strategies, as detailed in Appendix C.2. Specifically, we used Greedy and Upper Confidence Bound (UCB) methods. The corresponding acquisition functions are as follows:
- Greedy: Selects samples with the highest predicted value.
- Uncertainty: Selects the most uncertain samples.
- Mutual Information (MI): Selects samples with the highest mutual information.
- Upper Confidence Bound (UCB): Selects samples with the highest upper confidence bound.
These strategies are standard in active learning and allow for a fair and comprehensive benchmark as presented in Table 2.
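For completeness, the sketch below shows one simple way such scores could be computed from an ensemble of surrogate predictions (illustrative code, not the benchmarked implementations).

```python
import numpy as np

def acquisition_scores(ensemble_preds: np.ndarray, strategy: str = "ucb", beta: float = 1.0) -> np.ndarray:
    """Score unlabeled molecules given ensemble predictions of shape (n_models, n_molecules).

    Greedy ranks by the mean prediction, Uncertainty by the standard deviation, and
    UCB by the mean plus a beta-weighted standard deviation.
    """
    mu = ensemble_preds.mean(axis=0)
    sigma = ensemble_preds.std(axis=0)
    if strategy == "greedy":
        return mu
    if strategy == "uncertainty":
        return sigma
    return mu + beta * sigma  # UCB

# Example: pick the top-5 candidates under UCB from a random toy ensemble.
preds = np.random.rand(4, 1000)
top5 = np.argsort(-acquisition_scores(preds, "ucb"))[:5]
```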
Thank you for your time and effort in reviewing our work. We sincerely appreciate your insightful and constructive feedback, which has been instrumental in improving our manuscript.
Please share any other concerns you may have at your convenience. We will be glad to hear from you.
Thanks for the rebuttal. I believe the authors will include the additional results in the revised manuscript. I will maintain the current positive score.
We sincerely thank all reviewers for their insightful feedback and recognition of our contributions:
- #7Vgx: "The methodology is solid…”, “The framework performs well even in extremely low-data regimes."
- #jbx2: "Reformulating virtual screening as an MDP is innovative."
- #XNVU: "The overall framework design is well-aligned with the virtual screening task… consistent improvement over baseline models…"
We also appreciate the thoughtful suggestions regarding efficiency, baseline comparisons, and the novelty of GRPO and the GNN backbone.
In summary,
- Methodological Contribution: This is the first work to integrate GRPO into an active learning framework for large-scale virtual screening. Our reformulation as an MDP, combined with a gradient-based uncertainty reward, enables more adaptive and efficient molecule selection. The active learning loop with RL-driven selection and docking feedback allows GLARE to prioritize informative computations, making it practical for real-world use.
- Efficiency: Training incurs extra overhead due to gradient-based uncertainty, but this is limited to training. Inference is as fast as TcsAL, since reward computation is not needed. Compared to DrugTTA, our method is dramatically more efficient at inference: e.g., screening 1 billion molecules in 14 hours versus DrugTTA's 2.1 days for 100 million.
- Component Effectiveness: Ablation studies confirm that GRPO consistently outperforms DPO, thanks to its group-wise advantage estimation, which aligns well with batch selection in virtual screening. These ablation results also show that GLARE is flexible and extensible.
Once again, we thank all reviewers and the area chair for their time and constructive feedback, which have significantly improved our work.
This paper proposes GLARE, a reinforced active learning framework for virtual screening that formulates the problem as a Markov decision process and leverages Group Relative Policy Optimization (GRPO). Reviewers appreciated the clear motivation, strong experimental results, and practical significance of the approach, particularly the demonstrated improvements over state-of-the-art AL methods and the ability to enhance VS foundation models such as DrugCLIP. The framework's design, which integrates uncertainty-aware rewards and operates effectively within limited iterations, was considered well-suited for real-world drug discovery scenarios.
In the reviews and during the AC-reviewer discussion, concerns were raised about the limited evaluation metrics (a focus on enrichment factors without considering optimization of the best binders). Still, the authors' rebuttal was judged thorough and addressed runtime and methodological concerns convincingly. The AC recommends that all rebuttal contributions be incorporated into the final version and, additionally, that the paper's contribution be discussed in the context of the existing literature on RL for molecule design, which was overlooked in the initial submission.
The contribution is significant in bringing reinforcement learning into active learning for virtual screening, a domain where adaptive strategies are highly needed. Overall, while not transformative at the algorithmic level, the work is technically sound, well-executed, and fills an important niche in molecular AI. Given the empirical strength and practical impact, the paper is recommended for acceptance.