6.4

/10

Spotlight4 位审稿人

最低2最高5标准差1.2

4.0

置信度

创新性2.8

质量2.3

清晰度2.5

重要性2.8

NeurIPS 2025

UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Jigang Fan,QuanLin Wu,Shengjie Luo,Liwei Wang

OpenReview PDF

提交: 2025-05-10更新: 2025-10-29

摘要

关键词

Ligand Binding Site DetectionLigand Binding Site PredictionDrug DiscoveryStructure Based Drug Discovery

评审与讨论

审稿意见

评分: 5置信度: 42025-06-26

This paper introduces novel deep learning approaches, new dataset and benchmark and new evaluation metric for the pocket detection task. The authors recognized that only limited pockets are occupied in most PDB data, and then build a new dataset by aggregating different PDB structures of the same protein. Moreover, by introducing previously well-established techniques in object detection, this paper addresses several existing issues of the evaluation of pocket detection methods, and achieved strong performance with an advanced network architecture.

优缺点分析

Strengths:

The paper has provided a useful and valuable new dataset for the pocket detection task, addressing the multi-pocket problem.
The paper has introducing well-established metrics and deep learning approaches of object detection to pocket detection, achieving strong performance.

Weaknesses:

Generalization of the model: Although the authors have tested their method on sequence-similarity dataset splits, they should include more strict ones. For example, 30% similarity cutoff or protein family split;
The definition of the AP metric: As the AP metric is newly introduced to the field of pocket detection, the paper should have a formulation or pseudo code to clearly explain the calculation of AP, but not just describe it with natural languages.

问题

The generalization of the model: How good is the model for low homology proteins or even proteins with different fold? P2Rank and Fpocket rescore seems to be a strong baseline with classic ML approaches. How about their performance on low homology test samples?
Recall vs precision: Unlike the object detection task, personally speaking, recall is more important for pocket detection to locate new biological meaningful pockets. Given a certain IoU threshold, for example 0.9 0.6 and 0.3, how many pockets can be recalled?

局限性

yes

最终评判理由

This paper makes a valuable contribution to the field of pocket detection by introducing a new benchmark dataset that addresses the multi-pocket problem, a carefully adapted evaluation metric (AP), and a high-performing deep learning architecture. The authors thoughtfully incorporate object detection paradigms into protein structure analysis and provide strong empirical results.

After reviewing the rebuttal and follow-up discussions, I find that the authors have satisfactorily addressed my primary concerns:

On model generalization, the authors provided additional evaluations across varying sequence similarity cutoffs, including the challenging <30% category. Although performance does decline under these strict conditions—highlighting a broader limitation of deep learning models in biology—the authors are transparent about this issue and provide context consistent with other leading works in the field.

On AP metric clarity, the inclusion of detailed pseudocode significantly improves the reproducibility and interpretability of the evaluation pipeline, especially given the metric’s novelty in this domain.

On recall vs precision, the detailed breakdown across IoU thresholds and the rationale for balancing recall with ranking precision is compelling. The strong recall of UniSite-3D across all thresholds further strengthens the case for its practical utility.

Overall, this is a technically sound and impactful paper with meaningful improvements in both methodology and benchmarking. The work sets a new standard for pocket detection evaluation and opens promising directions for future research. I maintain my positive evaluation.

格式问题

作者回复

2025-07-31

We thank reviewer for the constructive comments. We provide our feedback as follows. We hope our answers have addressed your questions, and we would greatly appreciate it if you could consider raising your rating.

A1

Q1-1: Weaknesses 1: Generalization of the model: Although the authors have tested their method on sequence-similarity dataset splits, they should include more strict ones. For example, 30% similarity cutoff or protein family split;

Q1-2: Questions 1: The generalization of the model: How good is the model for low homology proteins or even proteins with different fold? P2Rank and Fpocket rescore seems to be a strong baseline with classic ML approaches. How about their performance on low homology test samples?

Thank you very much for your suggestion! The reason our initial manuscript did not include the 30% similarity experiment is that we aimed to keep the sizes of the training and testing sets relatively consistent across different similarity splits. However, the 30% similarity split does not yield enough samples for the test set. Moreover, thresholds of 0.5, 0.7, and 0.9 for sequence similarity are commonly used in protein-related tasks to assess generalization performance.

Previous traditional machine learning methods (such as P2Rank) did not account for generalization testing with sequence similarity splits. To ensure a fair comparison of generalization across different methods, and considering that all methods use HOLO4K as the benchmark testset, we divided the subsets based on different sequence similarity thresholds (0.9, 0.7, 0.5, and 0.3) using the HOLO4K dataset in conjunction with each method's training set.

We compared the performance at different similarity levels of the classic machine learning methods Fpocket-rescore and P2Rank, the deep learning-enhanced version of P2Rank (GrASP), as well as our methods UniSite-1D and UniSite-3D, in terms of $\text{AP}_{0.3}$ . The results are as follows:

Similarity: <1.0 → <0.9 → <0.7 → <0.5 → <0.3
Fpocket-rescore: 0.5900 → 0.5911 → 0.5925 → 0.5867 → 0.5670
P2Rank: 0.6011 → 0.5982 → 0.6211 → 0.5987 → 0.5734
GrASP: 0.6668 → 0.6490 → 0.6729 → 0.6388 → 0.6040
UniSite-1D (ours): 0.6867 → 0.6868 → 0.6817 → 0.6156 → 0.5744
UniSite-3D (ours): 0.7090 → 0.7053 → 0.7177 → 0.6580 → 0.6534

The results show that, compared to traditional machine learning methods, deep learning approaches such as UniSite-1D, UniSite-3D, and GrASP experience a performance decline when dealing with proteins of very low sequence similarity. This could be due to deep learning methods' greater sensitivity to the distribution of the training data.

A2

Q2: Weaknesses 2: The definition of the AP metric: As the AP metric is newly introduced to the field of pocket detection, the paper should have a formulation or pseudo code to clearly explain the calculation of AP, but not just describe it with natural languages.

Thank you very much for your suggestion! We have added a detailed pseudocode to explain the calculation of the AP metric, which will be included in the Appendix of the revision. The pseudocode is as follows:

Function calculate_AveragePrecision(prediction_list, ground_truth_list, iou_threshold)

Inputs:

prediction_list: A list, where each element represents the predictions for one protein. Each element is of the form $\lbrace (s_i, m_i)|s_i\in \mathbb{R} ,m_i\in\lbrace 0,1\rbrace^L\rbrace_{i=1}^{N}$ , where $m_i$ represents the $i$ -th predicted binding site as a binary mask of length $L$ , and $s_i$ denotes the $i$ -th confidence score.

ground_truth_list: A list, where each element represents the ground truth (gt) binding sites for one protein. Each element is of the form $\lbrace m_i^{gt}|m_i^{gt}\in\lbrace 0,1\rbrace ^L\rbrace_{i=1}^{N_{gt}}$ , where $m_i^{gt}$ represents the $i$ -th ground truth binding site as a binary mask of length $L$ .

iou_threshold: IoU threshold for considering a prediction as True Positive.

Output:

AP: Average Precision under the given IoU threshold.

# Step1: determine TP (True Positive) or FP (False Positive) for each prediction

for predictions_per_protein, ground_truths_per_protein in zip(predicton_list, ground_truth_list):

——Set all ground truths in ground_truths_per_protein unused

——Sort predictions_per_protein $=\lbrace (s_i, m_i)|s_i\in \mathbb{R} ,m_i\in\lbrace 0,1\rbrace^L\rbrace_{i=1}^{N}$ by decreasing confidence score $s_i$

——for $i=1,2,\cdots,N$

———— Find ground_truth_j that has the max residue-level IoU, IoU( $m_j^{gt}$ , $m_i$ ), with prediction_i

———— if IoU( $m_j^{gt}$ , $m_i$ ) > iou_threshold and ground_truth_j is unused:

—————— Mark prediction_i as TP

—————— Set ground_truth_j as used

————else:

—————— Mark prediction_i as FP

# Step 2: Sort all predictions across all proteins by decreasing confidence scores

all_predictions = FLATTEN_ALL(prediction_list)

all_ground_truths = FLATTEN_ALL(ground_truth_list)

Sort all_predictions by decreasing confidence scores

# Step 3: calculate the precision-recall curve

Set cum_TP, cum_FP as 0, precision_list and recall_list as empty

for prediction_i in all_predictions:

—— if prediction_i is marked as TP:

———— cum_TP += 1

——else:

————cum_FP += 1

—— precision_list.append(cum_TP / (cum_TP + cum_FP))

—— recall_list.append(cum_TP / LEN(all_ground_truths))

# Step 4

Calculate AP as the area under the precision-recall curve

return AP

A3

Q3: Questions 2: Recall vs precision: Unlike the object detection task, personally speaking, recall is more important for pocket detection to locate new biological meaningful pockets. Given a certain IoU threshold, for example 0.9 0.6 and 0.3, how many pockets can be recalled?

Thank you very much for your suggestion! We calculated the Recall of different methods at various IoU thresholds on the widely used HOLO4K benchmark dataset, and the results are as follows:

Method	$\text{AP}_{0.3}$ ↑	$\text{AP}_{0.5}$ ↑	$\text{Recall}_{0.3}$ ↑	$\text{Recall}_{0.5}$ ↑	$\text{Recall}_{0.7}$ ↑	$\text{Recall}_{0.9}$ ↑
Fpocket	0.2711	0.1488	0.8361	0.5922	0.2130	0.0253
Fpocket-rescore	0.5899	0.2847	0.8361	0.5922	0.2130	0.0253
P2Rank	0.6011	0.2625	0.7814	0.5337	0.1868	0.0089
DeepPocket	0.5415	0.2891	0.7514	0.5824	0.2584	0.022
GrASP	0.6668	0.4126	0.7186	0.5374	0.2537	0.0159
VN-EGNN	0.2606	0.1346	0.7289	0.4874	0.0566	0
UniSite-1D (ours)	0.6867	0.4595	0.8212	0.6199	0.3535	0.0824
UniSite-3D (ours)	0.7091	0.5446	0.8469	0.6901	0.4106	0.1039

The results indicate that:

UniSite-3D achieves state-of-the-art (SOTA) Recall across all IoU thresholds.
Notably, the high Recall score of Fpocket is due to its tendency to output nearly all potential cavities in a protein, rather than accurately identifying the useful pockets. In fact, Fpocket gives lower scores for true pockets, making it difficult for biologists to identify meaningful pockets. This is also the reason why many researchers have tried to develop rescore methods for Fpocket.
While Recall indicates a method’s potential to identify new biologically meaningful pockets, it doesn’t take into account the score of each predicted pocket. The AP metric addresses this issue, and therefore, we use it as a fair metric to evaluate different methods.

评论- Reviwer response

2025-08-01

Thank you for the thoughtful and comprehensive rebuttal. I appreciate the authors’ efforts to address the concerns in detail, and I provide the following feedback on the specific points raised:

A1 – Generalization on Low Homology Proteins

I appreciate the added experiments with stricter sequence similarity cutoffs, including the 30% threshold. The detailed comparison across traditional machine learning methods, deep learning baselines, and your UniSite variants is informative. While the performance of UniSite-3D remains strong overall, I note the performance decline for very low homology proteins (<30% similarity), which is especially relevant in realistic deployment scenarios. I understand the difficulty of maintaining test set sizes in such splits, but the generalization gap still limits confidence in robustness across highly divergent proteins. That said, I’m generally satisfied with your response and the transparency in addressing this limitation.

A2 – AP Metric Definition

Thank you for adding detailed pseudocode for the Average Precision (AP) calculation. This improves clarity and reproducibility, particularly given that AP is a new metric in the pocket detection domain. Your revision is well-handled and satisfies my concern.

A3 – Recall vs Precision Trade-off

The additional recall comparisons across a range of IoU thresholds are very helpful. The results highlight UniSite-3D’s strong recall, even at stricter IoU thresholds. I also appreciate the nuanced discussion on the limitations of high-recall methods like Fpocket that may lack practical utility due to poor ranking of relevant pockets. Your justification for using AP as a more comprehensive metric is reasonable and well-argued.

Overall: I am generally satisfied with the responses and the additional experimental results. However, considering I had already given a relatively high score of 5 (Accept), I will not increase my rating further. The observed performance drop on low-homology proteins remains a concern for applications in more diverse or novel protein settings. Nonetheless, the clarifications and improvements have strengthened the work, and I look forward to seeing the revised version.

2025-08-01

Thank you very much for your timely response and detailed feedback! We sincerely appreciate your recognition of our work.

Regarding the similarity issue, we would like to note that the limited generalization ability of deep learning models is a common challenge in the AI for Protein field. For example, in the Science paper introducing RoseTTAFold All-Atom by David Baker et al. [1], the results in Fig. 2F show a performance drop from 35% to 24% in low-similarity scenarios. One possible contributing factor is the scarcity of structural data (<0.1% of known proteins).

Once again, we sincerely appreciate your thoughtful and constructive feedback! Your suggestions have significantly improved the quality of our work.

[1] Science 384.6693 (2024): eadl2528. DOI:10.1126/science.adl2528.

评论- Re

2025-08-01

Yeah. I also noticed similar results in my research projects. Considering the good generalization of Fpocket and P2Rank, maybe RAG is a good strategy in the future.

2025-08-01

We agree that RAG may be a promising direction for future exploration. Thank you very much for the insightful comment!

审稿意见

评分: 4置信度: 42025-06-30

This paper introduces an end-to-end discontinuous framework UniSite for protein ligand binding site detection. Besides, they construct and present UniSite-DS, a novel dataset that is "UniProt-centric" rather than "PDB-centric". For evaluation, beyond traditional evaluation metrics DCC/DCA, the authors advocate for using Average Precision (AP) based on Intersection over Union (IoU) of binding site residues. Experiments are conducted on HOLO4K, COACH420, and UniSite-DS.

优缺点分析

Strengths

The paper is generally well-written, clearly structured, and easy to follow
The proposed UniSite model is an interesting end-to-end deep learning architectures to the binding site detection problem.
The conceptualization and construction of the UniSite-DS dataset is an interesting contribution. The shift from a PDB-centric to a UniProt-centric view is a well-motivated and important step forward for the field.

Weaknesses

The central argument for UniSite's superiority rests heavily on the new IoU-based AP metric. However, this evaluation framework may be inherently biased towards the proposed method. UniSite is explicitly trained to predict residue-level masks, optimizing a loss function that includes IoU-like components (Dice loss). In contrast, many baseline methods were not designed for this specific output format. The authors' decision to define its binding sites as residues within a 9Å radius is an ad-hoc post-processing step, and the paper notes this radius was chosen because it yielded the best AP performance (Appendix E). This constitutes tuning the baseline for the new metric, and it is unclear if this is a fair or optimal representation for that method. Consequently, the superior performance of UniSite on the AP metric might reflect an alignment between the model's architecture and the evaluation metric, rather than a fundamental superiority in identifying binding sites.
When evaluated on traditional, widely-accepted metrics like DCC and DCA, the performance of UniSite is less convincing, as shown in Table 2.
Result in Table 3 reveals a significant performance drop as the sequence similarity between the training and test sets decreases. This suggests a concern about overfitting on the training dataset. To make a convincing case for its robustness, the paper needs to show how this performance degradation compares to that of the baseline methods under the same conditions. It is possible that simpler, feature-based methods like P2Rank are more robust to such distributional shifts.

问题

Suggestions/Questions:

Providing a comparsion with baselines on proteins with low similarity.
The definition of pocket residue of different baselines should be detailed in the paper for better understanding.
Providing a detailed ablation study of the key module, such as ESM embedding,

局限性

Yes

最终评判理由

All of my concerns have been solved.

格式问题

None

作者回复

2025-07-30

We thank reviewer for the constructive comments. We provide our feedback as follows. We hope our answers have addressed your concerns, and we would greatly appreciate it if you could consider raising your rating.

A1

Q1: Weaknesses 1: The central argument for UniSite's superiority rests heavily on the new IoU-based AP metric … rather than a fundamental superiority in identifying binding sites.

Binding site residues is the standard output for ligand binding site detection task and is critical for downstream applications. Binding site residues are the primary output of most baselines (details are provided in A4). In Appendix A, we have discussed the impact for docking tasks, and reviewer cfsU also strongly agrees with this perspective:

reviewer cfsU: Binding site detection is an important and generally understudied problem, compared to binder design/hit discovery, which relies entirely on accurate binding site prediction.
IoU-based AP is a widely accepted and fair metric for detection tasks like Object Detection.
Most learning-based baselines are supervised by mask loss. As discussed in Section 3, most learning-based methods first predict a site score for each residue/atom, and this process is commonly supervised by a mask loss: P2Rank, DeepPocket and GrASP utilize BCE loss and VN-EGNN employs dice loss. To align with the training objective of baselines, we calculate the Semantic-IoU metric: On the HOLO4K benchmark, we first merge all ground truth binding sites. For each method, we select the score corresponding to Recall=0.7 at an IoU threshold of 0.3 as the cutoff threshold and aggregate its predictions accordingly. Finally, we compute the IoU between the merged predicted mask and the merged ground truth mask. |Method| $\text{AP}_{0.3}$ ↑| $\text{AP}_{0.5}$ ↑|Semantic-IoU↑| |-|-|-|-| |Fpocket|0.2711|0.1488|0.1799| |Fpocket-rescore|0.5899|0.2847|0.4732| |P2Rank|0.6011|0.2625|0.4576| |DeepPocket|0.5415|0.2891|0.4368| |GrASP|0.6668|0.4126|0.5536| |VN-EGNN|0.2606|0.1346|0.4344| |UniSite-1D (ours)|0.6867|0.4595|0.5360| |UniSite-3D (ours)|0.7091|0.5446|0.5876|

As shown above, our method demonstrates superiority even under this baseline-training-loss-aligned evaluation framework.

The transition from a discontinuous workflow to an end-to-end architecture for ligand binding site detection constitutes one of our key contributions.

A2

Q2: Weakness 2: When evaluated on traditional, widely-accepted metrics like DCC and DCA, the performance of UniSite is less convincing, as shown in Table 2.

While UniSite shows slightly inferior performance on traditional metrics DCC and DCA, this primarily stems from the fundamental flaws in these metrics (see Section 4). Reviewer cfsU agreed with the flaws of DCC and DCA:

reviewer cfsU: The illustrations of failure modes in commonly used metrics and the proposal to adopt IoU-based AP for more comprehensive evaluation are clearly presented and well-justified. The failure cases in Fig 4 provide particularly compelling evidence.

Specifically, DCC and DCA suffer from two critical limitations:

Limitation 1. Predictions may be double-counted due to the absence of proper matching criteria (Fig 4A). We quantified the proportion of proteins affected by double counting during evaluation on the widely-used HOLO4K benchmark.

Table. Double Counting (DC) Rate in Metrics in HOLO4K-sc.

Method	DC of $\text{DCC}_{\text{top-}n}$	DC of $\text{DCA}_{\text{top-}n}$
Fpocket	18.80%	18.31%
Fpocket-rescore	13.23%	13.66%
P2Rank	18.98%	18.31%
DeepPocket	13.53%	14.21%
GrASP	16.29%	14.88%
VN-EGNN	12.80%	11.70%
UniSite-1D (ours)	8.88%	8.94%
UniSite-3D (ours)	9.86%	10.04%

The results reveal that DCC and DCA metrics suffer from widespread double counting artifacts, which significantly distort model performance assessment.

Limitation 2. DCC/DCA only evaluate the center of binding sites and are ligand-dependent, which leads to evaluation failures in certain scenarios (Fig 4B-D).
- For HOLO4K, we calculated the DCC and DCA metrics of the centroid of ground truth binding residues for each protein. The results show that: the mean ground truth DCC is 2.15 Å (92.65% < 4 Å), and the mean ground truth DCA is 1.57 Å (98.88% < 4 Å). However, in principle, both DCC and DCA should ideally be 0 when evaluated using ground truth binding residues, indicating these metrics inherently contain systematic bias.
- To address some of DCC's inherent limitations, we defined a corrected metric, DCC-residue, which uses the center of ground truth binding residues rather than the ligand's center for calculation. This modification resolves failure cases caused by ligand diversity in traditional DCC evaluation.

Method	$\text{AP}_{0.3}$ ↑	$\text{DCC-residue}_{\text{top-}n}$ ↑	$\text{DCC}_{\text{top-}n}$ ↑
Fpocket	0.2711	0.2982	0.3076
Fpocket-rescore	0.5899	0.5005	0.5183
P2Rank	0.6011	0.4972	0.5300
DeepPocket	0.5415	0.4902	0.4925
GrASP	0.6668	0.5379	0.5131
VN-EGNN	0.2606	0.5997	0.5861
UniSite-1D (ours)	0.6867	0.6400	0.5538
UniSite-3D (ours)	0.7091	0.6264	0.5716

Our evaluation on the HOLO4K benchmark demonstrates that the corrected DCC-residue metric shows improved consistency with IoU-based AP in ranking different methods' performance.

Existing benchmark datasets suffer from significant annotation omissions. These data quality issues greatly affect the reliable evaluation of ligand binding site detection performance, as they overlook numerous ground truth binding sites.
- A representative example is UniProt ID: P28482 (MAPK1), a protein critically involved in cell proliferation, differentiation, survival, and apoptosis. The HOLO4K dataset annotates only one ligand binding site for this protein (derived from PDB: 1pmeA, SB2 binding site). In contrast, UniSite-DS documents 15 ligand binding sites (from PDB entries: 6qa1_A, 6qal_A, 4fv7_A, 6gjd_A, etc.)
Both traditional metrics and benchmark datasets exhibit substantial limitations, failing to accurately reflect the true performance of different methods. In fact, UniSite demonstrates significantly superior performance when assessed using either the IoU-based AP metric or the corrected DCC-residue measure.

A3

Q3-1: Weakness 3: Result in Table 3 reveals a significant performance drop as the sequence similarity between the training and test sets decreases … feature-based methods like P2Rank are more robust to such distributional shifts.

Q3-2: Question 1: Providing a comparsion with baselines on proteins with low similarity.

Previous traditional machine learning methods (such as P2Rank) did not account for robustness testing with sequence similarity splits. To ensure a fair comparison of robustness across different methods, and considering that all methods use HOLO4K as the benchmark testset, we divided the subsets based on different sequence similarity thresholds (0.9, 0.7, 0.5, and 0.3) using the HOLO4K dataset in conjunction with each method's training set.

We compared the performance at different similarity levels of the classic machine learning methods Fpocket-rescore and P2Rank, the deep learning-enhanced version of P2Rank (GrASP), as well as our methods, UniSite-1D and UniSite-3D, in terms of $\text{AP}_{0.3}$ . The results are as follows:

Similarity: <1.0 → <0.9 → <0.7 → <0.5 → <0.3
Fpocket-rescore: 0.5900 → 0.5911 → 0.5925 → 0.5867 → 0.5670
P2Rank: 0.6011 → 0.5982 → 0.6211 → 0.5987 → 0.5734
GrASP: 0.6668 → 0.6490 → 0.6729 → 0.6388 → 0.6040
UniSite-1D (ours): 0.6867 → 0.6868 → 0.6817 → 0.6156 → 0.5744
UniSite-3D (ours): 0.7090 → 0.7053 → 0.7177 → 0.6580 → 0.6534

The results show that, compared to traditional machine learning methods, deep learning approaches such as UniSite-1D/3D and GrASP experience a performance decline when dealing with proteins of very low sequence similarity. This could be due to deep learning methods' greater sensitivity to the distribution of the training data.

A4

Q4: Question 2: The definition of pocket residue of different baselines should be detailed in the paper for better understanding.

In fact, pocket (binding site) residues are the primary output of most baselines:

(1) Fpocket outputs a "pocket{index}_atm.pdb" for each predicted binding site, which records the atoms of binding site in PDB format.

(2) P2Rank generates a "{name}.pdb_prediction.csv" file for each protein. The "residue_ids" column contains the chain IDs and the residue IDs of binding site residues.

(3) DeepPocket rescores and refines the Fpocket results and keeps the same output format.

(4) GrASP predicts a "{name}_probs.pdb" file for each protein, in which the bfactor column contains the predicted scores of the heavy atoms. We filtered and clustered these atoms into different sites following the method in original paper.

In contrast, VN-EGNN outputs a "prediction.csv" file for each protein, which only contains four columns: "x", "y", "z" and "rank".

A5

Q5: Question3: Providing a detailed ablation study of the key module, such as ESM embedding.

We conducted an ablation study on the ESM embedding, and the results show that for UniSite-3D, removing the ESM embedding leads to a decrease of approximately 0.03 in the model's AP performance. For UniSite-1D, since it does not incorporate any structural information, removing the ESM embedding from the sequence encoder causes the training to fail to converge.

	$\text{AP}_{0.3}$ ↑	$\text{AP}_{0.5}$ ↑
UniSite-3D w/o ESM	0.5359	0.3501
UniSite-3D w/ ESM	0.5603	0.3835

2025-08-05

Dear Reviewer fcYt,

We sincerely appreciate your time and effort in reviewing our manuscript and offering valuable suggestions.

As the author-reviewer discussion phase is wrapping up, we would like to confirm whether our detailed rebuttal has addressed your concerns. If you require further clarification or have any additional questions, we would be more than happy to address them.

If our responses have resolved your concerns, we would be truly grateful if you could consider raising your rating.

Best regards,

The authors of paper 15335

2025-08-05

Thank you for your rebuttal. It has clarified several of my initial questions.

However, my concern regarding the evaluation bias and the advantage of IoU metric still exit for the following reasons:

From my perspective, there is a fundamental difference in how fpocket and UniSite define a binding pocket/sites, also known as Residue-centric versus pocket-centric perspective mentioned in p2rank paper [1].

UniSite follows the residue-centric perspective and it defines a pocket as the set of amino acid residues at the binding site. Consequently, its training is supervised by classifying whether a given residue is part of the pocket.

In contrast, fpocket follows the pocket-centric perspective, and it defines a pocket as the potential volume a ligand might occupy. So, fpocket determines pockets by clustering alpha spheres to delineate this space.

The objectives of these two definitions are fundamentally different. The distinction between the DCC and IoU metrics mirrors this conceptual difference between fpocket and UniSite.

It is therefore unsurprising that UniSite has an inherent advantage when evaluated with IoU, as the metric aligns perfectly with its residue-based definition. This advantage exists regardless of whether IoU is intrinsically a better metric than DCC.

What's more, in the p2rank paper, the author thinks pocket-centric is better by stating: "We believe that pocket-centric point of view better represents a common sense".

Suggestions:

For me, to convincingly validate the effectiveness of UniSite, additional experimental results should be included:

i) Varying the IoU threshold to compute AP0.3 and AP0.5 for fpocket and p2rank, and comparing these results with those of UniSite.

ii) More importantly, conducting molecular docking experiments using the pockets predicted by the different methods and comparing the resulting docking accuracies. As the authors themselves state in the paper, docking is a crucial downstream application for pocket prediction. Such an experiment would provide a direct and practical demonstration of the performance differences among the methods.

iii) Adding baseline DSDP[2].

[1] P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure

[2] DSDP: A Blind Docking Strategy Accelerated by GPUs

2025-08-06

Thank you very much for your feedback! We would like to provide the following clarifications in response to your new questions:

**Regarding the Distinction Between residue-centric and pocket-centric Methods**

Our method is fundamentally different from the residue-centric Ligand Binding Site (LBS) prediction methods described in the P2Rank paper [1]. In the P2Rank paper, residue-centric methods are defined as:

Those residue-centric methods look at the problem of LBS prediction as to the problem of binary classification of solvent exposed residues to binding and non-binding. This is also the way how they are evaluated and compared, usually in terms of standard binary classification metrics: MCC, AUC or F-measure.

In contrast, our method predicts a set of pockets rather than predicting one binary mask to represent all pockets. This fundamental difference means that the statement in the P2Rank paper is not applicable to our approach.
Our dataset and evaluation metrics have already taken into account the potential bias introduced by small ligands in large composite pockets, as claimed in the P2Rank paper.
- During manual inspection of the dataset, we have already considered these cases.
  - We examined all entries with large composite cavities jointly formed by multiple ligands, each occupying only part of the cavity, and excluded these entries from our current dataset. Specific examples can be found in our A2 response to reviewer cfsU.
  - In our data curation pipeline, we considered all ligands bound to the same pocket, and retained the pocket defined by the largest ligand as a priority.
- Our IoU-based AP evaluation metric uses relatively low IoU thresholds (0.3 and 0.5), which helps reduce the bias caused by differences between ligand size and actual pocket volume. This IoU-based AP metric differs from the standard binary classification metrics such as MCC, AUC and F-measure.
Most learning-based baselines are supervised using pockets defined by ligand neighborhoods. They all employ mask loss (as shown in our rebuttal A1 response), which is fair across all learning-based methods. Fpocket is a purely geometry-based method and consistently shows inferior performance across all metrics.
Traditional metrics such as DCC and DCA are merely based on the pocket center and ligand atoms, without considering pocket volume at all. They are affected by errors when the same pocket is associated with different ligands (see Figure 4B in the manuscript).
Using residues as the definition of ligand binding sites is a common practice for most methods, including Fpocket and P2Rank, as discussed in our rebuttal A4 response.

Regarding the Suggestions

We have already provided AP results at IoU thresholds of 0.3 and 0.5 (AP0.3 and AP0.5), as described in Lines 279–280 of the manuscript. The results are reported in Tables 1–2 and Tables S2–S3. Our method demonstrates significant superiority under these metrics.
It is well acknowledged that accurate pocket prediction is critical for downstream tasks such as molecular docking [2, 3, 4]. In our manuscript, we include a case study (Appendix B) that illustrates the impact of different predicted pockets. However, due to inconsistencies in annotations between LBS datasets and docking datasets, conducting a fair and systematic comparison of how different LBS methods influence different docking methods would require tremendous additional work, which goes far beyond the scope of a single LBS paper. In fact, most LBS methodological papers do not include docking experiments. Such evaluations are more suitable as an individual benchmarking study.
As a blind docking method, DSDP outputs a "{name}_surface.txt" file for LBS prediction, which contains a binary classification (0/1) of all protein surface points. This format cannot be used for AP, DCC, or DCA evaluation, and thus DSDP is not comparable under these metrics.

We hope our answers have addressed your concerns, and we would greatly appreciate it if you could consider raising your rating.

[1] P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure

[2] Accurate structure prediction of biomolecular interactions with AlphaFold 3

[3] Posebusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences

[4] Structure prediction of protein-ligand complexes from sequence information with UMol

2025-08-06

Thank your for the detailed rebuttal. All of my concerns have been solved and i'll raise my score to 4.

2025-08-06

We sincerely appreciate your constructive feedback, insightful comments, and the time you have dedicated to reviewing our work! Your suggestions have significantly improved the clarity and quality of our manuscript.

审稿意见

评分: 5置信度: 42025-07-01

In this paper, the authors introduce UniSite-DS, a new UniProt-centric protein ligand binding site dataset. They develop UniSite-1D/3D, methods for binding site detection based on this new dataset, and argue for extending the commonly used DCC and DCA metrics for evaluating binding site predictors to include Average Precision based on residue-level IoU. They identify failure modes where DCC and DCA incorrectly classify binding sites, where AP provides a better metric, and show that existing methods have sub-optimal performance on the new UniSite-DS benchmark.

优缺点分析

Strengths The focus on dataset curation and preparation, exemplified by the statement: “We manually inspected all UniProt IDs with more than ten ligand binding sites, as well as those where a single protein–ligand complex structure contributed three or more binding sites,” is a strong contribution and potentially very useful for the community. Binding site detection is an important and generally understudied problem, compared to binder design / hit discovery, which depends completely on accurate binding site prediction. The illustrations of failure modes of the commonly used metrics and the need to calculate IoU-based AP to more comprehensively evaluate binding site detectors are clear and well-motivated.

Weaknesses The newly curated dataset is central to the work, so it seems to deserve its own figure (perhaps supplemental) describing the dataset curation process, quality control, and manual inspection. In Fig 1 it is maybe unsurprising that UniSite-1D/3D performs best, considering it was trained on UniSite-DS. Results are more mixed for the other benchmark experiments. Section 5.4 discusses some limitations with respect to sequence similarity, but I did not see an explicit Limitations section with clearly laid out limitations and future work.

问题

Can the authors show some example “case studies” of manual inspection and curation of the data? Making the process and the dataset itself easily inspectable could help the community judge the usefulness of the dataset artifact. The failure cases in Fig 4 are helpful to see - how many such cases are there? How often does IoU AP disagree with DCA and DCC? Are there additional, simpler baselines that the authors could consider in addition to the proposed binding site detection method and the methods from literature? In particular to illustrate how a simple approach leveraging the UniSite-DS compares to UniSite-1D/3D

局限性

Yes

最终评判理由

The authors addressed each question and comment during the rebuttal period, and I have raised my score accordingly.

格式问题

作者回复

2025-07-30

A1

Q1: Weaknesses: The newly curated dataset is central to the work, so it seems to deserve its own figure (perhaps supplemental) describing the dataset curation process, quality control, and manual inspection.

We have created an additional figure to illustrate the dataset curation, quality control, and manual inspection process in detail. Due to the official rebuttal constraints set by NeurIPS, we are unable to provide the figure directly at this stage. We would like to walk you through the structure and key elements of the figure as follows.

Curation Process
- Systematic retrieval of all protein–ligand interactions from the PDB database
- Identification of binding site residues within 4.5 Å of each ligand
- Integration of all ligand binding sites across different PDB structures using UniProt identifiers and SIFTS annotations
- Redundancy removal using NMS with IoM ≥ 0.7 and IoU ≥ 0.5 to exclude highly overlapping binding sites
- Additional quality control and manual inspection to ensure high-quality dataset
Quality Control
- Exclude structures with resolution >2.5 Å or determined by non-crystallographic methods
- Remove solvent molecules
- Discard entries with ≤3 binding site residues to eliminate floating ligands
Manual Inspection
- Manual inspection of entries with >10 ligand binding sites or with ≥3 sites contributed by a single protein–ligand complex
  - Entries involving supramolecular assemblies across multiple protein subunits
  - Entries with large composite cavities jointly formed by multiple ligands, each occupying only part of the cavity

If you have any further suggestions regarding the design or content of the figure, we would greatly appreciate it.

A2

Q2: Questions: Can the authors show some example “case studies” of manual inspection and curation of the data? Making the process and the dataset itself easily inspectable could help the community judge the usefulness of the dataset artifact.

Manual Inspection Case Studies

Case 1 (Entries involving supramolecular assemblies across multiple protein subunits)
- The Photosystem I-LHCI Supercomplex is a core component of the photosynthetic machinery, consisting of multiple subunits such as UniProt ID: P05310 (Photosystem I P700 chlorophyll a apoprotein A1, PDB: 7dkz_A), Q41038 (Chlorophyll a-b binding protein, 7dkz_B), and Q32904 (Chlorophyll a-b binding protein 3, 7dkz_C).
- Although each of these proteins binds a large number of Chlorophyll a molecules, they function as part of a tightly integrated supramolecular assembly. Therefore, they are not suitable to be included as independent entries in the UniSite-DS database and have been excluded in the current version.
Case 2 (Entries with large composite cavities jointly formed by multiple ligands, each occupying only part of the cavity)
- Trypanothione reductase (UniProt ID: Q389T8) is a key enzyme in Trypanosoma brucei that specifically catalyzes the reduction of trypanothione, functionally analogous to glutathione reductase in mammals. The ligands from structures PDB: 5s9x_A, 5s9t_A, and 2wov_C together form a large binding cavity. These ligand binding sites could potentially be merged.
- However, since each ligand actually occupies only a portion of the cavity rather than the entire cavity, whether such a composite cavity formed by different ligands should be considered a unified binding site often depends on system-specific definitions in the literature. As a result, entries that require further literature-based validation were excluded from the current version of the UniSite-DS database.

A3

Q3: Questions: The failure cases in Fig 4 are helpful to see - how many such cases are there? How often does IoU AP disagree with DCA and DCC?

As we have discussed in Section 4 (Rethinking the Evaluation Metrics for Binding Site Detection, lines 217-245), DCC and DCA suffer from two critical limitations:

Limitation 1. Predictions may be double-counted due to the absence of proper matching criteria (Fig 4A). We quantified the proportion of proteins affected by double counting during evaluation on the widely-used HOLO4K benchmark.

Table. Double Counting (DC) Rate in Metrics in HOLO4K-sc.

Method	DC of $\text{DCC}_{\text{top-}n}$	DC of $\text{DCA}_{\text{top-}n}$
Fpocket	18.80%	18.31%
Fpocket-rescore	13.23%	13.66%
P2Rank	18.98%	18.31%
DeepPocket	13.53%	14.21%
GrASP	16.29%	14.88%
VN-EGNN	12.80%	11.70%
UniSite-1D (ours)	8.88%	8.94%
UniSite-3D (ours)	9.86%	10.04%

The results reveal that DCC and DCA metrics suffer from widespread double counting artifacts, which significantly distort model performance assessment.

Limitation 2. DCC/DCA only evaluate the center of binding sites and are ligand-dependent, which leads to evaluation failures in certain scenarios (Fig 4B-D).
- For HOLO4K, we calculated the DCC and DCA metrics of the centroid of ground truth binding residues for each protein. The results show that: the mean ground truth DCC is 2.15 Å (92.65% < 4 Å), and the mean ground truth DCA is 1.57 Å (98.88% < 4 Å). However, in principle, both DCC and DCA should ideally be 0 when evaluated using ground truth binding residues, indicating these metrics inherently contain systematic bias.
- To address some of DCC's inherent limitations, we defined a corrected metric, DCC-residue, which uses the center of ground truth binding residues rather than the ligand's center for calculation. This modification resolves failure cases caused by ligand diversity in traditional DCC evaluation.

Method	$\text{AP}_{0.3}$ ↑	$\text{DCC-residue}_{\text{top-}n}$ ↑	$\text{DCC}_{\text{top-}n}$ ↑
Fpocket	0.2711	0.2982	0.3076
Fpocket-rescore	0.5899	0.5005	0.5183
P2Rank	0.6011	0.4972	0.5300
DeepPocket	0.5415	0.4902	0.4925
GrASP	0.6668	0.5379	0.5131
VN-EGNN	0.2606	0.5997	0.5861
UniSite-1D (ours)	0.6867	0.6400	0.5538
UniSite-3D (ours)	0.7091	0.6264	0.5716

Our evaluation on the HOLO4K benchmark demonstrates that the corrected DCC-residue metric shows improved consistency with IoU-based AP in ranking different methods' performance.

A4

Q4: Weaknesses: In Fig 1 it is maybe unsurprising that UniSite-1D/3D performs best, considering it was trained on UniSite-DS. Results are more mixed for the other benchmark experiments.

Existing benchmark datasets suffer from significant annotation omissions. These data quality issues greatly affect the reliable evaluation of ligand binding site detection performance, as they overlook numerous ground truth binding sites.
- A representative example is UniProt ID: P28482 (MAPK1), a protein critically involved in cell proliferation, differentiation, survival, and apoptosis. The HOLO4K dataset annotates only one ligand binding site for this protein (derived from PDB: 1pmeA, SB2 binding site). In contrast, UniSite-DS documents 15 ligand binding sites (from PDB entries: 6qa1_A, 6qal_A, 4fv7_A, 6gjd_A, etc.)
As described in A3, traditional DCC and DCA metrics have significant limitations and cannot faithfully reflect the true performance of different methods. This is exactly the reason why we propose IoU-based AP as a more fair evaluation metric.
In fact, UniSite demonstrates significantly superior performance when assessed using either the IoU-based AP metric or the corrected DCC-residue measure.

A5

Q5: Weaknesses: Section 5.4 discusses some limitations with respect to sequence similarity, but I did not see an explicit Limitations section with clearly laid out limitations and future work.

We will add a Limitations and Future Work section as follows:

Dataset: Currently, we manually inspect and remove unreasonable data entries. Future work could consider further repairing and reintegrating the excluded data into the dataset.
Model: Our current model design aims to demonstrate the effectiveness of the End-to-End ligand binding site learning framework, without incorporating specialized feature engineering. Future work could explore the inclusion of specialized feature engineering to further improve model performance and generalization ability.

A6

Q6: Questions: Are there additional, simpler baselines that the authors could consider in addition to the proposed binding site detection method and the methods from literature? In particular to illustrate how a simple approach leveraging the UniSite-DS compares to UniSite-1D/3D

In fact, the baseline methods cannot be re-trained on the UniSite-DS without significant architectural modifications.

Fpocket is a purely geometric algorithm without trainable parameters.
Learning-based methods like P2Rank and GrASP are supervised to predict one binary mask. This schema is incompatible with UniSite-DS annotations, which contain multiple potentially overlapping ground truth binding site masks.

The transition from a discontinuous, post-processing-based workflow to an end-to-end learning framework for ligand binding site detection constitutes one of our key contributions.

2025-08-05

Dear Reviewer cfsU,

We sincerely appreciate your time and effort in reviewing our manuscript and offering valuable suggestions.

If our responses have resolved your concerns, we would be truly grateful if you could consider raising your rating.

Best regards,

The authors of paper 15335

2025-08-06

The authors have carefully responded to every question and comment raised in the review. I will happily increase my score and recommend acceptance of the paper.

2025-08-06

We are truly grateful for your recognition of our work and for the time and effort you dedicated to providing constructive and insightful feedback! Your suggestions have significantly improved the clarity and quality of our manuscript.

审稿意见

评分: 2置信度: 42025-07-02

The authors presented a dataset of protein ligand binding sites, two methods for protein ligand binding site detection, and a metric for evaluating the binding site predictions. The paper highlights three main issues. First, that existing datasets are PDB (Protein Data Bank)-centric, specifically focusing on individual protein-ligand structure, which introduces a statistical bias in training and evaluation models. Second, current binding site prediction pipelines are fragmented into disjoint steps. Finally, conventional evaluation metrics used to capture the performance of binding site detection are inadequate; these metrics are often concerned only with the center of the binding sites, they are ligand-dependent, and the absence of proper matching criteria between predictions and ground truth can lead to double-counting of predictions. The authors introduced UniSite-DS, a UniProt-centric dataset that consolidates binding information across all available structures of the same protein; two end-to-end predictive models, UniSite-1D (sequence-based) and UniSite-3D (structure-based); and a new evaluation metric, Average Precision (AP) based on Intersection over Union for comprehensive binding site assessment. The authors run extensive experiments using UniSite and widely used benchmarks for binding site prediction, showing the performance of Unisite-1D/3D in comparison with other models across the AP-IoU and other accepted metrics.

优缺点分析

Strengths

UniSite 1D is a sequence-based method for predicting protein binding sites that performs similarly to other structure-based models.
The authors identified problems with the current metrics used to assess binding site prediction. The AP metric doesn’t focus only on comparing the centers of the binding sites. Additionally, it addresses the problem of double-counting predictions.

Weaknesses

While the paper presents an interesting approach, it suffers from several weaknesses that impact its clarity and reproducibility.

The clarity of the paper could be improved — in particular, Section 3.1 lacks a clear explanation of the loss function, and important terms such as Lmask are introduced without definition until much later (Section 5.1).
The notation used throughout Section 3 and beyond is often confusing, requiring multiple readings to go through Section 3 and later sections.
The overall idea of the architecture seems good. However, many important architectural details are missing. For example, the sequence encoder is not described in sufficient detail, making the method non-reproducible. Figure 3 does not help to fill in these details.
The definition of the binding site center in Section 4 is vague and should be clarified.
In Issue 1, the authors state that focusing on individual protein–ligand structures introduces statistical bias, but it does not expand much on this claim. It also does not explicitly describe how PDB-centric structures differ from UniSite data; this could be enhanced by describing the instances of each dataset.
Figure 2 attempts to describe a generalization of the fragmentation in past binding site prediction approaches. However, it is difficult to gain insight into how these pipelines work from the figure and the description in the introduction and Section 3.
The problems of widely used metrics are well explained. However, the introduced metric could be further described. Given a set of ground truth binding sites and predicted binding sites, it is not possible to compute the metric with the explanation given in the main manuscript.

Minor comments

Line 58: The acronyms DCC and DCA are used without being previously defined. All acronyms should be introduced the first time they appear.
Line 113: SIFTS annotations are mentioned and referenced, but the acronym SIFTS is not described.
Section 3: The notation needs refinement. For instance, using the gt in $m^{gt}$ is never explained, and the explanation of this variable can be enhanced to better show that it is binary vector of size $L$ . Improving the notation would help readers understand the content.
Figure 2 (Section 3): This figure does not add relevant insight into understanding the differences between existing methods and the proposed one.
Line 159: The symbol $\sigma$ is defined as the bijective matching between sets $z$ and $z_pad^gt$ , which could be easily confused with the sigmoid function. This makes it harder to interpret the loss function clearly.
Line 164: Lmask is referred to as "binary mask loss", but its formal definition doesn't appear until Section 5.1 (line 262). It should be mentioned earlier that Lmask combines binary cross-entropy and Dice loss.
In line 189, GearNet-Edge is mentioned, and details are provided even though no modifications were made. That’s fine, but I would prefer more implementation details for the sequence encoder instead.
Segmentation module (line 209): It is unclear whether the probabilities are obtained through the MLP. The figure suggests they are, but the text describes a linear classifier followed by a softmax. Additionally, for mask embeddings, the first mask goes through the MLP. This paragraph needs to be clarified.
Typo: Seems like it should be "dot product" instead of "Dot-production."
Equations: None of the equations are numbered. They should be labeled sequentially as (Equation 1), (Equation 2), ..., (Equation N).
DCC/DCA: It is unclear whether this represents a division or something else. The notation should be clarified.
Lines 276 to 277 attempt to justify the absence of DCC and DCA in Table 1. This justification could be enhanced by means of a specific example. The clarification of the differences between past datasets and UniSite-DS may help in this regard.
Section 5.1: It is essential to add a justification for the baseline methods used for comparison.
Line 160: Sentence ends in "by minimizing the following cost:" to then show the "result" of the minimization $\hat{\sigma}$ rather than the loss.

问题

In which way can past datasets induce statistical bias? Do the experiments in the manuscript showcase how this bias is present?
Are the baseline methods also capable of predicting multiple binding sites?
In Table 3, do the baseline methods show variation in other evaluation metrics?
Why were the thresholds 0.5, 0.7, and 0.9 chosen for sequence similarity?
In line 210, the expression $p_i \in \Delta^2$ is shown: Is $p_i$ a scalar? What is $\Delta^2$ defined?

局限性

Yes.

最终评判理由

The authors have proposed a valuable new dataset (UniSite-DS) for binding site detection, as well as proposed new models for binding site detection (UniSite-1D and UniSite-3D) and a new metric to assess binding site predictions (AP of IoU).

My main issue with this manuscript is regarding the description of UniSite-1D and UniSite-3D, I thank the authors for the many clarifications provided. However, Section 3 needs to be rewritten in order to properly explain the architecture of the models to allow for reproducibility. My second concern is the lack of detail about the newly introduced metric. The authors have demonstrated the problems with past metrics very well and empirically shown how the new metric avoids those issues. My problem is that there is no description for the reader to understand what the metric is in the main paper. Finally, my reading of the motivations to build UniSite-DS is that a "protein-centric" dataset is the "solution" for the statistical bias present in past datasets. However, the response for the reviewers indicates that this statistical bias is a result of the lack of annotated binding sites.

格式问题

No concerns.

作者回复

2025-07-31

We thank reviewer for the constructive comments. We provide our feedback as follows. We hope our answers have addressed your confusion, and we would greatly appreciate it if you could consider raising your rating.

A1

Q1-1: Weakness 1: The clarity of the paper could be improved — in particular, Section 3.1 lacks a clear explanation of the loss function …

Q1-2: Minor comment 6: Line 164: Lmask is referred to as "binary mask loss", but its formal definition doesn't appear until Section 5.1 …

Thank you for your suggestion. We will move the implementation details about $L_{mask}$ to Section 3.1, line 164.

A2

Q2-1: Weakness 3: The overall idea of the architecture seems good …

Q2-2: Minor comment 7: In line 189, GearNet-Edge is mentioned …

Given a protein sequence of length L, the input embedder receives three inputs: (1) the protein sequence represented as a 1D tensor of shape (L); (2) the residue indices as a 1D tensor of shape (L); and (3) the ESM2 embedding of protein sequences as a 2D tensor of shape (L,1280). The residue indices are encoded into sinusoidal positional embeddings of shape (L,32). The protein sequence is embedded using 21 learnable embeddings (20 standard amino acids plus an "unknown" category) to produce embeddings of shape (L,64). These three components - the ESM2 embeddings (L,1280), sinusoidal embeddings (L,32), and amino acid embeddings (L,64) - are then concatenated along the feature dimension, and subsequently processed by a 3-layer multilayer perceptron (MLP) to generate the per-residue features of shape (L,256).

We will add these implementation details of the sequence encoder in the Appendix.

A3

Q3: Weakness 4: The definition of the binding site center in Section 4 is vague and should be clarified.

In accordance with established methodologies, we employ the geometric center of the ligand as the ground truth binding site center for the DCC metric. And the predicted centers are outputs of each method itself.

A4

Q4: Weakness 5: In Issue 1, the authors state that focusing on individual protein–ligand structures introduces statistical bias …

We have provided a detailed specific example in Figure 1 of the manuscript to highlight the significant differences between the UniSite-DS and the classic PDB-centric dataset, PDBbind.

Specifically, for the protein Farnesyl diphosphate synthase (UniProt ID: Q8WS26), the upper-left portion of Figure 1 shows that in the PDBbind dataset, only one ligand binding site is recorded, contributed by PDB ID: 1YHM. In contrast, the upper-right portion of Figure 1 shows that UniSite-DS records 17 ligand binding sites for this protein, sourced from 13 different PDB structures. The protein portions of these PDB structures are highly similar, while their ligand binding sites are notably different. PDB-centric datasets overlook the different binding sites of the same protein across multiple PDB structures, missing a large number of ground truth sites.

A5

Q5-1: Weakness 6: Figure 2 attempts to describe a generalization of the fragmentation in past …

Q5-2: Minor comment 4: Figure 2 (Section 3): This figure does not add relevant insight …

Figure 2 (top) illustrates the discontinuous workflow of conventional learning-based binding site prediction methods. Take the baseline GrASP as a specific example: (1) Predict Binary Mask. GrASP first predicts a score for each heavy atom, representing "the likelihood to be a part of a binding site". The heavy atoms with scores below 0.3 are discarded. (2) Clustering. The remaining atoms are clustered into discrete binding sites through average linkage clustering, which makes overlap between predicted sites impossible.

In contrast, Figure 2 (bottom) demonstrates our UniSite approach, which directly generates N potentially overlapping binding sites in an end-to-end manner, as detailed in Sections 3.1 and 3.2.

A6

Q6: Weakness 7: The problems of widely used metrics are well explained …

We have added a detailed pseudocode to explain the calculation of the AP metric, which will be included in the Appendix of the revision. Due to the word limit in the NeurIPS response, please refer to the pseudocode in our A2 response to Reviewer 4hDu.

A7

Q7: Minor comment 1: Line 58: The acronyms DCC and DCA are used without being previously defined. All acronyms should be introduced the first time they appear.

We have already introduced the definitions of both DCC and DCA metrics between the end of line 58 and the beginning of line 61.

A8

Q8: Minor comment 2: Line 113: SIFTS annotations …

SIFTS stands for Structure Integration with Function, Taxonomy and Sequences resource.

A9

Q9: Minor comment 3: Section 3: The notation needs refinement. For instance, using the gt in $m^{gt}$ is never explained …

We have provided a detailed explanation of $m_i^{gt}$ in lines 135-139. Each $m_i^{gt}$ represents a ground truth binding site, where ground truth is well known as gt, and $m_i^{gt}\in\lbrace 0,1\rbrace^L$ is a binary vector of length $L$ . We have further elaborated on the meaning of each bit in the vector in lines 137-139: "Here, $m_{ij}^{gt}=1$ indicates that the $j$ -th residue is part of the $i$ -th site, while $m_{ij}^{gt}=0$ means it is not."

A10

Q10: Minor comment 5: Line 159: The symbol $\sigma$ is defined as the bijective matching …

The use of $\sigma$ to denote permutation or bijective matching is in fact a well-established convention in the literature (e.g., as seen in DETR and MaskFormer). If this notation does make you confused, we would use $\tau$ rather than $\sigma$ .

A11

Q11: Minor comment 8: Segmentation module (line 209) …

The description in Section 3.2 (Segmentation Module) is indeed accurate. The N site queries are processed through a linear classifier followed by softmax activation to generate class probabilities, while the N mask embeddings are transformed by a multilayer perceptron (MLP). We will incorporate these corrections in Figure 2.

A12

Q12: Minor comment 9: Typo: Seems like it should be "dot product" ...

Thank you for your suggestion. We will correct this typo in the revision.

A13

Q13: Minor comment 10: Equations: None of the equations are numbered …

While equation numbering is not required by NeurIPS and we don't cross-reference equations, we acknowledge its value for clarity and will add numbering in revision.

A14

Q14: Minor comment 11: DCC/DCA: It is unclear whether this represents a division or something else. The notation should be clarified.

We have provided detailed explanations of both DCC and DCA as distinct metrics in the manuscript. In this particular context, the “/” was intended to indicate their parallel or alternative applications (i.e., either metric could be used). To prevent any potential ambiguity, we will revise the expression to explicitly state "DCC or DCA" in the revision.

A15

Q15: Minor comment 12: Lines 276 to 277 attempt to justify the absence of DCC and DCA in Table 1 …

As stated in response A5, our dataset includes multiple PDB structure data for each protein entry, where binding sites are contributed by different structures and in different coordinate systems. Traditional DCC and DCA metrics rely on the ligand coordinates, so they are no longer applicable to Table 1.

A16

Q16-1: Question 1: In which way can past datasets induce statistical bias? Do the experiments in the manuscript showcase how this bias is present?

Q16-2: Question 3: In Table 3, do the baseline methods show variation in other evaluation metrics?

Q16-3: Question 4: Why were the thresholds 0.5, 0.7, and 0.9 chosen for sequence similarity?

Due to the word limit in the NeurIPS response, regarding the statistical bias induced by past datasets and the experiments, please refer to Figure 1 of the manuscript, as well as our A4 and A3 responses to Reviewer cfsU. For the selection of similarity measures and the additional similarity experiments, please refer to our A1 response to Reviewer 4hDu.

A17

Q17: Question2: Are the baseline methods also capable of predicting multiple binding sites?

Yes, baseline methods can predict multiple binding sites.

A18

Q18: Question 5: In line 210 … What is $\Delta^2$ defined?

We will correct this sentence to "to generate class probabilities $\lbrace p_i=(p_i^{site},p_i^{\emptyset})|p_i^{site},p_i^{\emptyset}\in[0, 1]\rbrace_{i=1}^N$ ".

A19

Q19: Minor comment 13: Section 5.1: It is essential to add a justification for the baseline methods used for comparison.

We selected these baseline methods as they represent the spectrum of methodological approaches for ligand binding site detection. Fpocket represents traditional geometry-based computational algorithms. P2Rank is the most widely-used machine learning approach. DeepPocket, GrASP and VN-EGNN are all deep-learning-based methods. DeepPocket employs 3D convolutional neural networks while GrASP and VN-EGNN utilize graph neural networks.

Due to the page limit of the main text, we currently illustrate and justify the baselines only in Table 1 and Section 5.2. For the revised version, we will add a dedicated appendix section to justify the baseline methods.

A20

Q20: Minor comment 14 :Line 160: Sentence ends in "by minimizing the following cost:" …

We will correct this sentence to "This bipartite matching is obtained by minimizing matching cost $L_{match}$ ".

A21

Q21: Weakness 2: The notation used throughout Section 3 and beyond is often confusing, requiring multiple readings to go through Section 3 and later sections.

Thank you for your feedback. We have followed your suggestions and made the improvements to the manuscript accordingly.

2025-08-05

Dear Reviewer W7Hr,

We sincerely appreciate your time and effort in reviewing our manuscript and offering valuable suggestions.

If our responses have resolved your concerns, we would be truly grateful if you could consider raising your rating.

Best regards,

The authors of paper 15335

2025-08-07

Dear Reviewer W7Hr,

We sincerely appreciate your time and effort in reviewing our manuscript and offering valuable suggestions.

As the author-reviewer discussion phase is approaching its end in two days, we wanted to kindly follow up to confirm whether our detailed rebuttal has addressed your concerns. If you need further clarification or have any additional questions, we would be more than happy to address them.

If our responses have resolved your concerns, we would be truly grateful if you could consider raising your rating.

Best regards,

The authors of paper 15335

评论- Overall Response by Authors

2025-08-02

Dear all reviewers,

We would like to sincerely thank all reviewers for their constructive feedback and their time taken to review!

We have responded to all individual comments in detail. In addition to the individual responses, we observed that several questions raised by the reviewers center around a few common themes, which we respectfully outline below:

Limitations of Traditional Metrics (DCC and DCA) and Benchmark Dataset
- We have provided additional statistical analysis to further illustrate the issues with traditional metrics DCC and DCA (please see our A3 response to Reviewer cfsU and A2 response to Reviewer fcYt). This is exactly the reason why we propose IoU-based AP as a more fair evaluation metric.
- Moreover, we highlighted significant annotation omissions in current benchmark dataset (please see our A4 response to Reviewer cfsU and A2 response to Reviewer fcYt).
- These additions further support that both traditional metrics and benchmark dataset exhibit substantial limitations, failing to accurately reflect the true performance of different methods. In fact, UniSite demonstrates significantly superior performance when assessed using either the IoU-based AP metric or the corrected DCC-residue measure on traditional benchmark.
Fairness and Implementation Details of the IoU-based AP Metric
- The IoU-based AP metric builds on well-established evaluation practices in Object Detection and is specifically designed to address key limitations of traditional metrics DCC and DCA.
- We have added implementation details regarding the IoU-based AP metric (please see our A2 response to Reviewer 4hDu). To further illustrate the importance and fairness of incorporating the scores of predicted sites, we also calculated and discussed the IoU-based Recall metric (please see our A3 response to Reviewer 4hDu).
Design and Case Studies of UniSite-DS
- We have added additional details on UniSite-DS and provided extra case studies to help the community gain a better understanding of its design and implementation (please see our A1 and A2 responses to Reviewer cfsU). These will be included in the Appendix of the revision.
Generalization under Different Similarity Levels
- We have added experiments on testsets with different sequence similarities (please see our A1 response to Reviewer 4hDu and A3 response to Reviewer fcYt). The results show that, compared to traditional machine learning methods, deep learning approaches such as UniSite-1D, UniSite-3D, and GrASP experience a performance decline when dealing with proteins of very low sequence similarity. This could be due to deep learning methods' greater sensitivity to the distribution of the training data.
- We would like to note that the limited generalization ability of deep learning models is a common challenge in the AI for Protein field. For example, in the Science paper introducing RoseTTAFold All-Atom by David Baker et al. [1], the results in Fig. 2F show a performance drop from 35% to 24% in low-similarity scenarios. One possible contributing factor is the scarcity of structural data (<0.1% of known proteins have experimentally determined structures).

Once again, we sincerely appreciate all reviewers' feedback! The suggestions have significantly improved the quality of our work.

[1] Science 384.6693 (2024): eadl2528. DOI:10.1126/science.adl2528.

最终决定Accept (spotlight)

2025-09-17

This is an exciting and impactful paper that tackles binding site prediction, an important but less studied problem compared to binder design. Existing works typically focus on the single binding site case, ignoring that multiple binding sites may exist across multiple complexes of the same protein.

I, as well as the reviewers, appreciate how the authors took a comprehensive approach to this problem from the perspectives of dataset construction, modeling, and evaluation. The main contributions are: -A novel UniProt centric dataset that incorporates multiple binding sites for a given sequence. -A model (UniSite) that proposes an end-to-end binding detection framework compared to conventional stage-wise pipelines. -An Average Precision based on Intersection over Union (IoU) metric that address shortcomings of existing evaluation metrics for this problem.

The authors have also worked to thoroughly answer reviewer comments and thus I recommend acceptance of this paper.