PaperHub
NeurIPS 2025 · Poster · 4 reviewers
Overall score: 8.2/10 · Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0) · Confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.0 · Significance: 2.8

NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce NeuralPLexer 3, a physics‑guided flow‑based generative model that delivers state‑of‑the‑art, seconds‑fast predictions of diverse biomolecular complex structures.

Abstract

Keywords
Co-folding · flow matching · biomolecules · computational drug discovery

Reviews and Discussion

Official Review
Rating: 5

This paper introduces NeuralPLexer 3 (NP3), a conditional flow-matching model that couples a physics-motivated globular-polymer prior with symmetry-corrected flows and a custom Flash-TriangularAttention kernel. NP3 delivers ligand-to-protein complex structures in ≈30 s on a single L40S GPU, while achieving competitive performance across proteins, nucleic acids, PTMs, and covalent ligands. Two new evaluation suites, NPBench and ConfBench, are released to measure generalisation and ligand-induced conformational change, where NP3 outperforms AlphaFold-Multimer on pocket motions.

Strengths and Weaknesses

Strengths:

  1. By integrating physics priors, multiple molecular modalities (proteins, ligands, nucleic acids, PPIs), and multiple useful information sources (MSAs, PLM embeddings), it directly mitigates the “unphysical hallucination” problem common to diffusion/flow generators. It also reflects several recent advances in structure prediction, making it a valuable reference for future work in this area.

  2. The introduction of two new benchmarks, NPBench and ConfBench, represents a meaningful contribution to the evaluation of molecular and structural modeling methods.

  3. The manuscript is well-written, and the figures and tables are of high quality.

Weaknesses:

  1. There are several minor issues that detract from the polish of the manuscript, such as inconsistent naming between “NP3” and “NPv3,” and formatting problems like the font size in Table 1.

  2. The training data is only described at a high level (e.g., “all PDB entries before September 2020” and synthetic datasets), and the authors note that model weights will not be released due to dual-use concerns. In addition, important implementation details—such as the custom Triton kernel code and hyperparameters—are proprietary, which limits reproducibility and independent validation.

  3. The novelty over NP2 appears to be incremental. A more detailed comparison and clarification of the specific improvements would be helpful to better assess the contribution.

  4. The scope of model comparison is relatively limited. For instance, in the nucleic acid category, only the performance of NP3 is discussed, without comparisons to other relevant baselines.

Questions

There remain some areas of ambiguity that raise concerns regarding the clarity and completeness of the work.

  1. Have the authors systematically evaluated (or at least approximately assessed, given computational constraints) the contribution of each modality, such as MSAs, protein language models, and RNA-specific models, to the final performance?

  2. The claimed speed advantage over AlphaFold3 is compelling, but was the comparison conducted on the same hardware setup? A fair, hardware-consistent comparison would significantly strengthen this claim.

  3. What are the guiding principles behind the selection of sampling steps, algorithmic parameters, and physics-inspired priors? These choices appear somewhat ad hoc and could benefit from further justification or ablation.

  4. The authors state that the method "excels in areas crucial to structure-based drug design," but this claim lacks clear empirical support. Could the authors provide more quantitative evidence to substantiate this assertion?

Limitations

Yes.

Final Justification

After considering the paper and the author rebuttal, I confirm a final rating of 5, which I believe best reflects the current contribution.

Formatting Concerns

No.

Author Response

We appreciate the reviewer’s thoughtful evaluation and constructive feedback. Each weakness and question raised by the reviewer is reproduced in block quotes, followed by our responses in plain text.

In response to the weaknesses,

Weakness 1: There are several minor issues that detract from the polish of the manuscript, such as inconsistent naming between “NP3” and “NPv3,” and formatting problems like the font size in Table 1.

Thank you for pointing out these formatting issues. We will address these concerns in the revised manuscript.

Weakness 2: The training data is only described at a high level (e.g., “all PDB entries before September 2020” and synthetic datasets), and the authors note that model weights will not be released due to dual-use concerns. In addition, important implementation details—such as the custom Triton kernel code and hyperparameters—are proprietary, which limits reproducibility and independent validation.

For training data, we used all PDB entries before September 2020, which includes a wide range of protein structures. We also generated synthetic datasets to augment the training data. While we understand the importance of reproducibility, we are currently unable to release the model weights due to dual-use concerns. However, we are committed to providing detailed information about our methodology and implementation in the revised manuscript.

Due to the proprietary nature of our model weights and custom Triton kernel code, we are unable to release these details at this time.

Weakness 3: The novelty over NP2 appears to be incremental. A more detailed comparison and clarification of the specific improvements would be helpful to better assess the contribution.

Thank you for the suggestion. We will include a section in the revised manuscript to highlight improvements made between NeuralPLexer2 and NeuralPLexer3. Among all technical differences, the most notable changes are:

  1. NeuralPLexer2 is a diffusion-based model using an equivariant graph-based denoising decoder, based on a continual development of the original NeuralPLexer model (doi:10.1038/s42256-024-00792-z). NeuralPLexer3 instead is a flow matching model with a multimodal conditioning module and a conditional diffusion transformer decoder.

  2. NeuralPLexer2 employs a Gaussian prior while NeuralPLexer3 employs a more physical globular polymer prior (described in Algorithm S3).

  3. Compared to NeuralPLexer2, NeuralPLexer3 is trained with greatly expanded training data and model size, and shows substantially improved performance.

Weakness 4: The scope of model comparison is relatively limited. For instance, in the nucleic acid category, only the performance of NP3 is discussed, without comparisons to other relevant baselines.

A direct comparison for the nucleic acid category was challenging, and we decided to omit an extensive comparison to baselines within the scope of this work. At the time of submission, the only work reporting accuracy for low-homology RNA-including structures is AlphaFold 3, and we do not have a license to run their code in order to reproduce their results. The best non-deep-learning baseline, Alchemy-RNA2, requires additional human expert inputs.

To facilitate fair evaluation of structure prediction models on all molecular modalities in this field, we will make all benchmarking entries and model evaluation code public.

We find that NeuralPLexer3's RNA and protein-RNA interaction prediction performance is overall far from experimental accuracy, which is unsurprising given that (1) the diversity of solved RNA structures in the PDB is limited and (2) we did not include domain-specific features such as RNA secondary structure.

In response to the reviewer’s questions,

Question 1: Have the authors systematically evaluated (or at least approximately assessed, given computational constraints) the contribution of each modality, such as MSAs, protein language models, and RNA-specific models, to the final performance?

These conditioning signals are critical for model performance. The PDB is a comparatively small dataset (~250,000 structures). Cofolding models can be interpreted as learning the relationship between evolutionary signal (MSA, PLM) and structure.

Removal of the protein MSA resulted in a drop in PoseBusters pass rate from ~78% (Figure 1B) to ~27%. Use of a more approximate MSA (MMSeqs2 on the ColabFold database) resulted in a pass rate of ~74%. We did not train NeuralPLexer3 to be robust to the removal of MSA.

PoseBench (arXiv:2405.14108v5) examines the role of MSA in AlphaFold 3 and Chai-1, and reports similar accuracy loss upon the removal of MSA. See Figure 2 and note that “single-seq” is the no-MSA condition.

NeuralPLexer3 has limited accuracy for RNA structure prediction, which remains an ongoing challenge in the field. In this work, we show that our model, trained on RNA language model features, is similar in accuracy to existing models trained on RNA MSAs.

Question 2: The claimed speed advantage over AlphaFold3 is compelling, but was the comparison conducted on the same hardware setup? A fair, hardware-consistent comparison would significantly strengthen this claim.

We do not have a license to run AlphaFold 3, so we cannot provide a perfectly controlled comparison. To address this, we ran NeuralPLexer3 on the comparatively weaker L40S GPU and compared against AlphaFold 3's reported timings on the A100 GPU.

NeuralPLexer3 is around 20x faster on an L40S than is AlphaFold 3 on A100s (as reported in doi:10.1038/s41586-024-07487-w) for a 1024-residue protein. The majority of the compute is done in 16-bit precision, for which the maximum throughput is 300 TFlops for A100 and 180 TFlops for L40S. So even though AlphaFold 3 had a hardware advantage in this comparison, NeuralPLexer3 still exhibits a large speed advantage.
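As a back-of-envelope check using the throughput figures quoted above (our arithmetic, not a number reported by either paper, and assuming both models are compute-bound in 16-bit precision), normalizing the observed wall-clock advantage by peak throughput suggests an even larger architecture-level speedup:

$$
20 \times \frac{300\ \mathrm{TFLOPS}\ (\text{A100})}{180\ \mathrm{TFLOPS}\ (\text{L40S})} \approx 33\times .
$$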

Question 3: What are the guiding principles behind the selection of sampling steps, algorithmic parameters, and physics-inspired priors? These choices appear somewhat ad hoc and could benefit from further justification or ablation.

Training of the NeuralPLexer3 model was expensive compared to our computational resources, requiring the use of our cluster for a full month. While we tested many parameters and architectures at small scale, we found that these results often did not translate to the full-scale model. (For example, we found that an equivariant decoder was preferred at small scale, but not at full scale.) As a result, the development of the model was indeed ad hoc in many places.

In the manuscript, we present two cases where we were able to confidently extrapolate conclusions from small scale to full scale.

  1. In Figure 2B, we provide a sequential set of improvements to the model and report the performance of each added improvement.

  2. In Figure S4, we report scaling curves for the model, varying the size of the encoder and decoder as well as the number of decoder replicas during training.

We hope that these cases will be helpful and transferable in future model design.

We note that better hyperparameters are possible, and their determination is an interesting future direction. Two important areas that we have not exhaustively investigated:

  1. Physics-inspired priors: our current choice of prior is certainly important to break symmetry e.g. in the case of protein homodimers. In our small-scale experiments, it was also found to substantially improve training stability when the FAPE loss is included. However, we have not examined the relationship between the quality of the prior and model accuracy.

  2. Number of sampling steps and noise schedule: we have anecdotally found our current number of sampling steps to be more than sufficient. We have not carefully examined the tradeoff between sampler parameters and downstream accuracy.

Question 4: The authors state that the method "excels in areas crucial to structure-based drug design," but this claim lacks clear empirical support. Could the authors provide more quantitative evidence to substantiate this assertion?

We would point to the PoseBusters results presented in Figure 1B. PoseBusters (doi:10.1039/D3SC04185A) represents the current standard benchmark for structure-based drug design. In Figure 1 of that paper, the authors report that the traditional docking methods GOLD and Vina have success rates of 55% and 58% respectively, while requiring the ground-truth receptor structure as input. NeuralPLexer3 has a 78% success rate on this benchmark. NeuralPLexer3 is also substantially faster than similar co-folding methods (AlphaFold 3, Chai-1, Boltz-2), which gives it greater screening power and enables interactive use.

Comment

Thank you for the response. My concerns were addressed to a reasonable extent. I acknowledge the contribution, but my current score already reflects it, and it is hard to justify a higher rating.

Official Review
Rating: 5

This work proposes NeuralPLexer3, a generative model for biomolecular complex structure prediction. Many innovations are adopted to improve the accuracy and efficiency of the model over its predecessor, including a physics-informed prior, a flow-based objective with OT permutation, a customized triangular attention kernel, etc. The authors also propose new benchmark suites, NPBench and ConfBench, to better evaluate the performance of structure-prediction models. Overall, this is a comprehensive and solid paper.

Strengths and Weaknesses

Strengths

  1. NeuralPLexer3 is a structure prediction model with strong performance and faster sampling speed.
  2. New comprehensive benchmarks are highly useful for the entire field if made publicly available.
  3. Some interesting innovations such as the physics-informed prior are proposed.
  4. Customized triangle attention can be valuable if also open-sourced.

Weaknesses

  1. Some details could be better elaborated (see Questions), and more ablation studies could be added to demonstrate the effectiveness of some innovations.
  2. Lack of comparison to other strong open-sourced baselines such as the Boltz and Chai model families.

Questions

Below are some questions and comments I have:

  1. Mis-predicted chirality is indeed a problem in AF models. The authors show that the NeuralPLexer3 is doing better than AF3 on this front. Which component in NP3 contributes the most to alleviating the chirality problem?
  2. The physics-informed prior from running Langevin dynamics is very interesting. The critical components of the drift are the bond, entity, and residue matrices $S$. Can the authors elaborate on how they are constructed?
  3. From the results, it seems that the PoseBusters benchmark evaluates the ligand RMSD? It would be nice to also evaluate the receptor RMSD, since NP3 and some other baselines are co-folding models.
  4. Ablation study: There is no ablation study of the novel prior. I am very curious about how the prediction accuracy would change if a simple Gaussian prior were used.
  5. Releasing the kernel for triangular attention can be highly valuable to the field.
  6. One important concern is that the manuscript does not compare NP3 to many other strong open-sourced baselines such as the Boltz and Chai families. Admittedly, NP3 scores very high on many evaluation tasks, but the authors can surely also evaluate those baselines on the new benchmarks proposed in this work.

Limitations

Yes

Final Justification

Great paper with substantial experiments. I suggest acceptance.

Formatting Concerns

No

Author Response

We appreciate the reviewer’s thoughtful evaluation and constructive feedback. Each weakness and question raised by the reviewer is reproduced in block quotes, followed by our responses in plain text.

In response to the reviewer’s questions,

Question 1: Mis-predicted chirality is indeed a problem in AF models. The authors show that the NeuralPLexer3 is doing better than AF3 on this front. Which component in NP3 contributes the most to alleviating the chirality problem?

Within the scope of this work, we do not have definitive evidence isolating any one factor. We believe that the primary drivers are:

  1. the use of flow matching with symmetry correction described on lines 82-86 of the main text, lines 519-524 of the SI, and in Algorithm S1;

  2. the inclusion of smoothed frame-aligned point error during training (Algorithm S2, L11), which assigns a greater penalty to violations of local molecular geometry;

  3. the selection of sampled poses from all rollouts at inference time based on an approximate energy model (Figure 2B, Section S.5 of the SI).

Question 2: The physics-informed prior from running Langevin dynamics is very interesting. The critical components of the drift are the bond, entity, and residue matrices $S$. Can the authors elaborate on how they are constructed?

The drift terms are constructed based on an idealized Gaussian polymer model. The values are selected such that the standard deviation of atom positions within each component is approximately 2Å for each bonded atom pair, 5Å for each residue, and 8Å for each biopolymer entity.

To understand the construction of the adjacency matrices $S$, note that NeuralPLexer3 operates at two scales: anchors and atoms. Starting with atoms, $S_{\mathrm{bond}}$ is a connectivity matrix for chemical bonds. Its entries are 1 and 0, corresponding to the presence and absence of bonds, respectively. For small-molecule ligands, bond connectivity is provided as input. For amino acids, nucleotides, etc., bond connectivity is looked up in the Chemical Component Dictionary.

Anchors correspond to amino acid residues, nucleotides, and small-molecule heavy atoms. In addition, there are anchors for other miscellaneous monomers such as non-natural amino acids. We refer to all of these objects as “residues”. The $S_{\mathrm{residue}}$ matrix has entries of 1 for consecutive anchors in the same biopolymer and 0 otherwise.

Finally, “entities” refer to biopolymer chains, and the $S_{\mathrm{entity}}$ matrix introduces a spring between each atom and the centroid coordinate of the entity it belongs to.

The result of this process is illustrated in the upper left of Figure 1A. The physics-informed prior looks like globular clouds for each chain, with some incomplete separation between chains.
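To make the construction concrete, below is a minimal NumPy sketch of sampling such a globular-polymer prior with overdamped Langevin dynamics under the harmonic coupling terms described above. The function name, spring constants, and integrator settings are illustrative assumptions, not the paper's Algorithm S3; the entity term is modeled directly from per-atom chain labels rather than an explicit matrix.

```python
import numpy as np

def sample_polymer_prior(S_bond, S_residue, entity_of, n_steps=500,
                         dt=1e-2, kT=1.0, rng=None):
    """Overdamped Langevin sampling from a globular-polymer-like prior.

    S_bond, S_residue : (n, n) 0/1 coupling matrices as described above
    entity_of         : (n,) biopolymer-chain index per atom
    Spring constants are illustrative, chosen so position spreads are
    roughly 2 A per bond, 5 A per residue, and 8 A per entity; global
    spherical confinement is omitted for brevity.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    entity_of = np.asarray(entity_of)
    n = S_bond.shape[0]
    x = rng.normal(scale=8.0, size=(n, 3))          # broad initial cloud
    k_bond, k_res, k_ent = 2.0 ** -2, 5.0 ** -2, 8.0 ** -2
    for _ in range(n_steps):
        drift = np.zeros_like(x)
        # harmonic couplings along the bond and residue graphs,
        # written in graph-Laplacian form: -k * (deg_i * x_i - sum_j S_ij x_j)
        for S, k in ((S_bond, k_bond), (S_residue, k_res)):
            drift -= k * (S.sum(axis=1, keepdims=True) * x - S @ x)
        # spring from each atom to the centroid of its entity
        for e in np.unique(entity_of):
            m = entity_of == e
            drift[m] -= k_ent * (x[m] - x[m].mean(axis=0))
        # Euler-Maruyama step with thermal noise
        x += dt * drift + np.sqrt(2.0 * kT * dt) * rng.normal(size=x.shape)
    return x
```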

Question 3: From the results, it seems that the PoseBusters benchmark evaluates the ligand RMSD? It would be nice to also evaluate the receptor RMSD, since NP3 and some other baselines are co-folding models.

PoseBusters evaluates the ligand RMSD after aligning the test and reference structures on the pocket. So, it implicitly tests for correct positioning of the ligand in the protein pocket in addition to explicitly measuring internal ligand structure.

We find that the PoseBusters (pocket-aligned ligand) RMSD is correlated with pocket RMSD, with a correlation coefficient R=0.7. The pocket RMSD is less than 2Å in 90% of PoseBusters structures.
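For readers less familiar with the metric, here is a minimal sketch of pocket-aligned ligand RMSD. This is a simplified reading of the protocol, not PoseBusters source code; symmetry-aware ligand atom matching is deliberately omitted.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t such that R @ p + t best matches Q."""
    Pm, Qm = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - Pm).T @ (Q - Qm))
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, Qm - R @ Pm

def pocket_aligned_ligand_rmsd(pred_pocket, ref_pocket, pred_lig, ref_lig):
    """Align the predicted pocket onto the reference pocket, then score
    the ligand in that frame, so both ligand internal geometry and its
    placement in the pocket contribute to the error."""
    R, t = kabsch(pred_pocket, ref_pocket)
    moved = pred_lig @ R.T + t
    return float(np.sqrt(((moved - ref_lig) ** 2).sum(axis=-1).mean()))
```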

Question 4: Ablation study: There is no ablation study about the novel prior. I am very curious about how the prediction accuracy would change if simple Gaussian prior is used.

This is a good point, and we did not carefully test a simple Gaussian prior. As this requires expensive model retraining, we reserve this exact experiment for future work.

Meanwhile, because we choose to not include the hand-crafted entity positional encoding from AlphaFold 3 to break symmetry, we anticipate that an isotropic Gaussian prior will lead to degenerate and qualitatively wrong behavior for symmetric assemblies such as protein homodimers. This behavior has been observed in Boltz-1 (doi:10.1101/2024.11.19.624167, Figure 7).

Question 5: Releasing the kernel for triangular attention can be highly valuable to the field.

We fully agree and appreciate the reviewer’s recognition of the potential impact of our Flash-TriangularAttention implementation. However, we choose not to release the code at this time.

We are encouraged by the community’s recent development of efficient triangular kernels, including Trifast (doi:10.1101/2024.11.19.624167), triangle_attention and triangle_multiplicative_update in Nvidia’s cuEquivariance library, and the GPU MODE triangular multiplicative update competition.

Question 6: One important concern is that the manuscript is not comparing NP3 to many other open-sourced strong baselines such as the family of Boltz and Chai. Admittedly, NP3 scores very high on many evaluation tasks, but the authors can surely also evaluate those baselines on new benchmarks proposed in this work.

We are currently working on comparison to Boltz-2 and Chai-1. There are challenges: this is computationally expensive, we want to ensure fair comparison, and sometimes these models permute atom order, requiring careful checking and postprocessing of outputs.

With the submitted work, we have published the code and reference structures corresponding to our benchmarks to facilitate comparison with future models. We have also published corresponding NeuralPLexer3 predictions. (The structure predictions are not included in the submission packet as they are hosted on Zenodo with author-identifying information.)

Further, we would point the reviewer to PoseBench (arXiv:2405.14108v5), which makes comparisons between AlphaFold 3 and Chai-1 among other methods. (Please note that “NeuralPLexer” refers to the model described in doi:10.1038/s42256-024-00792-z and not to NeuralPLexer3 as described in this work.)

Official Review
Rating: 5

The paper tackles the challenge of predicting three-dimensional structures of biomolecular complexes, such as protein–ligand and protein–protein assemblies, with high physical realism and speed. Experimental methods deliver accurate structures but are slow and resource-intensive. To address this, the authors present NeuralPLexer3 (NP3), a flow-based generative model. NP3 uses optimal-transport-driven training with a symmetry-correction module to enforce realistic bond geometries, and an anchor-based encoder–decoder that first refines key anchor atoms before expanding to all heavy atoms via geometry-aware attention. Hardware-aware optimizations enable inference in seconds on a single GPU. Benchmarks show NP3 outperforms AlphaFold3 and earlier NeuralPLexer versions on PoseBusters and ConfBench, achieving higher docking success and conformational accuracy.

Strengths and Weaknesses

  1. Clear descriptions of the problem and architecture.
  2. Table 1 could be presented more clearly. Could the authors bold the best-performing models?
  3. The figures are clear and informative.
  4. The choice of the y-axis range in Figure 1B is misleading.
  5. The reviewer does not have enough domain expertise to judge the significance of this work. How significant are the minor (e.g., Table 1 PoseBusters) improvements in accuracy? Do they correspond to a significant improvement on downstream tasks?

Questions

  1. The authors emphasise that alternatives such as AF3 produce unrealistic torsion angles or bond lengths. Often, AF3 structures are post-processed and optimised using force fields, which are fast and readily available. Could the authors compare their model's performance against this common AF3 pipeline?
  2. Would it be possible to also compare to Boltz-2?

Limitations

Final Justification

The authors have responded to all comments and clarified some points. The score has been raised in light of the discussion on the other rebuttals. The difference between NP2 and NP3 is now clearer.

Formatting Concerns

None

Author Response

We appreciate the reviewer’s thoughtful evaluation and constructive feedback. Each weakness and question raised by the reviewer is reproduced in block quotes, followed by our responses in plain text.

In response to the strengths and weaknesses,

Weakness 2: Table 1 could be presented more clearly. Could the authors bold the best-performing models?

Thank you for this suggestion. In the revised version, we will indicate the best performing model for each evaluation metric and task category in boldface.

Weakness 4: The choice of the y-axis range in Figure 1B is misleading.

We agree and intend to introduce an axis break to improve clarity in future versions of the manuscript. We feel that it is important to remain zoomed-in on this scale in the second panel of Figure 1B. Stereochemistry must be preserved for these models to be useful in drug discovery. In our experience, structure-based drug design scientists demand 100% correct stereochemistry.

Weakness 5: The reviewer does not have enough domain expertise to judge the significance of this work. How significant are the minor (e.g., Table 1 PoseBusters) improvements in accuracy? Do they correspond to a significant improvement on downstream tasks?

Improvement of PoseBusters accuracy is itself one downstream task. Co-folding models are intended to reduce the need for structural experiments, and the read-off of these experiments is the protein-ligand structure. The key metric "% RMSD < 2 Å and PB-valid" is particularly important for downstream tasks because it requires structures to be both geometrically accurate AND physically reasonable (PB-valid). As shown in Figure 1B, NeuralPLexer3 achieves a 5.6% absolute improvement (77.9% vs 72.3%) compared to AlphaFold3. The combination of improved accuracy AND physical realism makes these gains particularly valuable for practical drug discovery applications where both geometric precision and chemical plausibility are essential.

For other downstream tasks, PoseBusters accuracy is at least well correlated with performance. For example, we find that the ability of the model to recapitulate protein-ligand interactions (hydrogen bonds, salt bridges, pi-pi stacking, etc.) is linearly correlated with the logarithm of the ligand RMSD measured in PoseBusters.

As for the question of significance, we believe that the results we report are a meaningful improvement over AlphaFold 3. An even greater advancement compared to AlphaFold 3 is the far faster inference speed shown in the third panel of Figure 1B. Our faster model can screen more chemical space and enables semi-interactive design.

Regarding the reviewer’s questions,

Question 1: The authors emphasise that alternatives such as AF3 produce unrealistic torsion angles or bond lengths. Often, AF3 structures are post-processed and optimised using force fields, which are fast and readily available. Could the authors compare their model's performance against this common AF3 pipeline?

We agree that a force-field pipeline can sometimes be used to clean up structures. However, these cleanup steps are expensive compared to the inference cost of NeuralPLexer3, which is about 30 seconds on an L40S GPU. Force-field cleanup can also be brittle, as it requires force fields that are parameterized for the system of interest. Finally, force fields are sensitive to unphysical initial structures, and their minimization can diverge, especially in cases where bond lengths are too short.

PoseBench (arXiv:2405.14108v5) benchmarked the effect of force field relaxation on PoseBusters physical validity for multiple models including AlphaFold 3. Their results are mixed. In some tests (Figs 2, 4, G15, and G21), force field relaxation of AlphaFold 3 predictions offers small improvements to physical validity. In other tests (Figs 3, G17, and G25), force field relaxation results in worse physical validity. (Please note that “NeuralPLexer” refers to the model described in doi:10.1038/s42256-024-00792-z and not to NeuralPLexer3 as described in this work.)

Question 2: Would it be possible to also compare to Boltz-2?

At the time of submission, Boltz-2 had not been announced.

Boltz-2's training set included all PoseBusters samples, and its training cutoff date for PDB structures was June 1, 2023. Moreover, the Boltz-2 authors have not released the evaluation code corresponding to the custom benchmark reported in their manuscript, so it is not yet possible to make a fair comparison.

To facilitate comparisons with current and future models, we have published all of our benchmark code (included in the SI) as well as our corresponding structure predictions. (The structure predictions are not included in the submission packet as they are hosted on Zenodo with author-identifying information.)

Comment

Thank you for responding to the rebuttal with all the additional detail.

Official Review
Rating: 5

This paper presents NeuralPLexer3, an update to the family of biomolecular structure prediction models. Algorithmic components introduced include physical priors to constrain initial conformational embedding, optimal transport mechanisms to simplify training flows, and compute-optimal attention schemes to induce inference speedups. Performance for structure prediction in the PoseBusters benchmark is included and compared to AlphaFold 3. New benchmarks are proposed that assess performance in specialized modalities such as protein-peptide interfaces and post-translational modifications. An additional benchmark study assesses model performance for capturing conformational changes induced by ligand binding, measuring model accuracy for apo- and holo-state structure prediction.

Strengths and Weaknesses

Strengths:

  1. The algorithmic components applying physical constraints described in S.3 are nice, such as bonded-connectivity, entity-level and residue-level, and global spherical confinement constraints to ensure consistent clustering and compact arrangement at the polymer level
  2. The architecture (Figure S1) and arrangement of conditioning modules, encoder blocks, and flow blocks is nice
  3. The Flash-TriangularAttention implementation is innovative and the memory and speed improvements are significant
  4. The results in model development Figure 2B are nice and showcase a complex system
  5. PB performance (Figure 1B and Table 1) is impressive and significant
  6. New benchmarks with specialized modalities and performance therein (Table 1) is nice
  7. Calibrated uncertainties (results in Figure 3) are nice. Not perfect, but pretty good classifiers as indicated by dashed gridlines in Fig 3A & 3C
  8. Introducing ConfBench is nice and an important measurement of model fidelity with deep practical significance within the drug-discovery process.

Weaknesses:

  1. The paper is not very detailed on the math or algorithms in the main text. E.g., details for “incorporating optimal transport principles” by adding “a simulation-free symmetry correction module to straighten the conditional flow trajectories that connect the prior samples and ground truth structures” (L82-86). This is a significant component that could be an impactful contribution to the field; it should be described with equations, as should training losses in much greater detail than is included in S.2 Flow matching (eqn 2). This would greatly increase the clarity, quality, and significance of the work and overall rating
  2. Related to 1., the organization and presentation of the paper is lacking. Direct comparison to AF3 architectural, algorithmic, and training components in the main text would be helpful to increase understanding of the novelty of what is proposed. An outline that presents such advancements in a stepwise fashion and points to the results that demonstrate their significance (with equations) would increase the significance and originality of the work
  3. Related to 1. and 2., the overall clarity of the work is lacking. Pointers to specific data and descriptions of exact experiments that tested hypotheses and describe exactly which data was assessed and why would increase clarity. For example, the ConfBench experiments are difficult to understand. There is very little detail on what subset of the benchmark is used to create the data in Figure 4. Specifically, what set does the, e.g., “n=90” refer to in the bar charts? Further, I don’t understand how the measurement of “Apo Proteins” in Figure 4D is conducted. Are these just the raw prediction results for the proteins in the curated dataset without bound ligands (i.e., in their apo state)? If so, why are the results substantially lower than the “Proteins.Monomers” results in Table 1, which would refer to the same type of experiment? Finally and given this, was there any analysis done to understand the distribution shift from training data of NP3 to the ConfBench set, either apo or holo? Is the majority of the ConfBench set actually a subset of training data?
  4. Qualitative results are lacking. Including visualizations of predicted and true complexes (protein-ligand, PPIs, etc.) generated by NP3 in comparison to AF3 or AF-M would increase the significance of the work and give confidence that the quantitative PB results that suggest NP3 complexes are accurate and physically plausible bear out in practice with interesting and significant interactions visible in predicted complexes

Questions

  1. See Weaknesses 3. for several questions
  2. How are symmetry-correction and optimal rigid structure alignment modules applied at test time, i.e., without ground-truth structures?
  3. Where is the data that the percentages quoted in lines 237-244 refer to? I don't see those numbers in Figure 4

Limitations

Yes

Final Justification

The authors' responses to the weaknesses noted and the questions raised resulted in increased clarity and understanding of the originality of the work. Each score was increased accordingly, and the overall rating increased to "Accept"

Formatting Concerns

NA

Author Response

We appreciate the reviewer’s thoughtful evaluation and constructive feedback. Each weakness and question raised by the reviewer is reproduced in block quotes, followed by our responses in plain text.

Weakness 1: The paper is not very detailed on the math or algorithms in the main text. E.g., details for “incorporating optimal transport principles” by adding “a simulation-free symmetry correction module to straighten the conditional flow trajectories that connect the prior samples and ground truth structures” (L82-86).

We agree and will clarify this point. In the revised manuscript, we will augment the relevant section (L82–86) with supporting information. Specifically, we incorporate a simulation-free symmetry correction module designed to align the conditional flow trajectories with displacement interpolation paths that approximate the OT geodesics between the prior and target distributions. The details are as follows:

| Symbol | Meaning |
| --- | --- |
| $x_T$ | Noisy prior sample drawn at diffusion timestep $T$ |
| $\hat{x}_0$ | One-step denoised coordinates predicted from $x_T$ |
| $x$ | Ground-truth (noise-free) label structure |
| $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_m\}$ | Set of entity groups; each $\mathcal{D}_k$ contains indices of chemically identical (indistinguishable) chains or ligands |
| $\mathrm{RMSD}(a, b)$ | Root-mean-square deviation between coordinate sets $a, b$ |
| $\mathbf{R}, \mathbf{t}$ | Rotation matrix and translation vector returned by Kabsch alignment |
| $\mathbf{P}$ | Block-diagonal permutation matrix acting inside each $\mathcal{D}_k$ |

  1. Denoise the prior with pre-computed conditioning: $\hat{x}_0 = f(x_T, T)$.

  2. Rigidly align each indistinguishable entity. For every entity group $\mathcal{D}_k$:

    1. For each index pair $(i,j) \in \mathcal{D}_k \times \mathcal{D}_k$ (e.g., $(0,0), (0,1), \dots, (2,1)$ when entities $0, 1, 2$ are identical):

      • Compute $(R_{ij}, t_{ij}) = \mathrm{Kabsch}(\hat{x}_0^{(i)}, x^{(j)})$.
      • Obtain an aligned label copy $x^{(j)*} = R_{ij}\,x^{(j)} + t_{ij}$.

    2. Greedy in-group permutation search (c.f. doi:10.1101/2021.10.04.463034, section 7.3): for each entity group $\mathcal{D}_l$, choose the permutation $\pi_l^\star$ that minimises

      $\sum_{i \in \mathcal{D}_l} \mathrm{RMSD}\bigl(x^{(\pi_l(i))*}, \hat{x}_0^{(i)}\bigr)$.

  3. Global objective: for each global alignment $(R_{ij}, t_{ij})$, compute

    $\mathrm{RMSD}_{\mathrm{tot}} = \mathrm{RMSD}\bigl(\mathbf{P}\,x^{*}, \hat{x}_0\bigr)$.

    Retain the permutation/alignment pair $(\mathbf{P}^\star, \{R_{ij}^\star, t_{ij}^\star\})$ with the lowest $\mathrm{RMSD}_{\mathrm{tot}}$.

  4. Apply the optimal permutation to the label: $\tilde{x} = \mathbf{P}^\star x$.

  5. Final alignment applied to the prior: align the permuted label $\tilde{x}$ to the original noisy sample $x_T$ (optional when using SE(3)-invariant losses):

    $(\mathbf{R}_T, \mathbf{t}_T) = \mathrm{Kabsch}(\tilde{x}, x_T), \qquad \tilde{x}' = \mathbf{R}_T\,\tilde{x} + \mathbf{t}_T$.

The resulting $\tilde{x}'$ (or $\tilde{x}$ in an invariant setting) is used as the target in the flow-matching loss, reducing the transport cost between the prior and target distributions and maintaining a consistent frame across the sampling iterations.

In the revised manuscript, we will also include an additional algorithm subsection to clarify the implementation of this module.
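For concreteness, here is a minimal NumPy sketch of steps 2-4 above. It is our simplification, not the authors' implementation: a brute-force permutation search stands in for the greedy algorithm of doi:10.1101/2021.10.04.463034, the Kabsch helper is inlined, and the per-chain data layout is an assumption.

```python
import numpy as np
from itertools import permutations

def kabsch(P, Q):
    """Rotation R and translation t such that R @ p + t best matches Q."""
    Pm, Qm = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - Pm).T @ (Q - Qm))
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, Qm - R @ Pm

def rmsd(a, b):
    return float(np.sqrt(((a - b) ** 2).sum(axis=-1).mean()))

def symmetry_correct(pred, label, groups):
    """pred, label: lists of (n_c, 3) per-chain coordinates in the same layout;
    groups: lists of chain indices that are chemically identical.
    Seeds a global frame from every (label chain, predicted chain) pair
    within each group, then reassigns chains inside each group to minimise
    the total RMSD. Brute force only suits small groups."""
    best_score, best_label = np.inf, None
    for g in groups:
        for i in g:
            for j in g:
                R, t = kabsch(label[j], pred[i])      # step 2.1: seed alignment
                moved = [c @ R.T + t for c in label]
                relabeled = list(moved)
                for h in groups:                      # step 2.2: in-group matching
                    p_best = min(
                        permutations(h),
                        key=lambda p: sum(rmsd(moved[p[a]], pred[h[a]])
                                          for a in range(len(h))))
                    for a, src in enumerate(p_best):
                        relabeled[h[a]] = moved[src]
                score = rmsd(np.concatenate(relabeled),  # step 3: global objective
                             np.concatenate(pred))
                if score < best_score:                   # step 4: keep best label
                    best_score, best_label = score, relabeled
    return best_label, best_score
```

As noted in the response to Question 2 below, the same matching is applied between consecutive rollout steps at inference time, with the previous step playing the role of the label.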

Weakness 2: Related to 1., the organization and presentation of the paper is lacking. Direct comparison to AF3 architectural, algorithmic, and training components in the main text would be helpful to increase understanding of the novelty of what is proposed.

We agree that more explicit comparisons with AlphaFold 3 would improve the clarity and impact of our presentation. Our architectural, algorithmic, and training improvements over AlphaFold 3 are as follows:

  • Architecture: MSA and Pairformer modules are used by both models. We additionally condition on protein and RNA language model embeddings. From scaling studies (Figure S4), we identified a different Pareto-optimal choice for encoder/decoder size. We also use sliding-window attention instead of ad-hoc block-based operations for token-to-atom communication.

  • Algorithms: We employ flow matching to allow sampling from any prior distribution, and introduce a novel physics-based prior (Algorithm S3). We do not use AlphaFold 3’s hand-crafted positional encoding among entities and atoms.

  • Conditioning: NeuralPLexer3 does not rely on reference RDKit conformers, allowing it to encode molecules of all chemistries.

  • Training components: No training-time mini-rollout is needed because of the prior alignment step, which dramatically reduces the complexity of the training pipeline.

  • Evaluation: We introduce better benchmarks for physical plausibility and conformational change. Unlike AlphaFold 3, we have made the code and structures for these benchmarks available.

  • Optimization: We introduce a fused triangular attention kernel absent in AlphaFold 3.

Weakness 3: Related to 1. and 2., the overall clarity of the work is lacking. Pointers to specific data and descriptions of exact experiments that tested hypotheses and describe exactly which data was assessed and why would increase clarity. For example, the ConfBench experiments are difficult to understand. There is very little detail on what subset of the benchmark is used to create the data in Figure 4. Specifically, what set does the, e.g., “n=90” refer to in the bar charts? Further, I don’t understand how the measurement of “Apo Proteins” in Figure 4D is conducted. Are these just the raw prediction results for the proteins in the curated dataset without bound ligands (i.e., in their apo state)? If so, why are the results substantially lower than the “Proteins.Monomers” results in Table 1, which would refer to the same type of experiment? Finally and given this, was there any analysis done to understand the distribution shift from training data of NP3 to the ConfBench set, either apo or holo? Is the majority of the ConfBench set actually a subset of training data?

We appreciate the reviewer’s comments. In Figure 4, n=90 refers to the number of apo–holo structure pairs that had AlphaFold 2 and NeuralPLexer3 structures available and that met the ConfBench criterion of a global or pocket RMSD change >1.5 Å. These represent a subset of PLINDER-derived complexes curated to measure meaningful ligand-induced conformational changes.

Unlike Table 1, which reports standard fold similarity metrics (e.g., TM-score, LDDT) on diverse targets regardless of conformational state, ConfBench assesses the relative accuracy of choosing the correct state (apo vs. holo) using a normalized, symmetric score insensitive to rigid-body shifts. “Apo Proteins” in Figure 4D shows ConfBench scores for NeuralPLexer3 predictions on apo structures, benchmarked against both apo and holo references; the lower values compared to Table 1 reflect the greater difficulty of this task and the stricter evaluation.

Importantly, ConfBench ensures at least one structure in each apo/holo pair is held out from training, mitigating data leakage.

Weakness 4: Qualitative results are lacking. Including visualizations of predicted and true complexes (protein-ligand, PPIs, etc.) generated by NP3 in comparison to AF3 or AF-M would increase the significance of the work and give confidence that the quantitative PB results that suggest NP3 complexes are accurate and physically plausible bear out in practice with interesting and significant interactions visible in predicted complexes

Thank you for the suggestion. In the revised manuscript, we plan to include (1) examples from PoseBusters where AlphaFold 3 predicted the incorrect chirality while NeuralPLexer3 predicted the correct chiral center, and (2) examples where AlphaFold 2-Multimer and AlphaFold 3 predicted the wrong (apo vs. holo) conformation while NeuralPLexer3 correctly predicted the conformational difference. In compliance with the rebuttal guidelines this year, we omit images from this author response.

In addition, NeuralPLexer3 predictions corresponding to all structures in Figure 1, Table 1, and Figure 4 will be made available on Zenodo.

Question 1: See Weaknesses 3. for several questions

Please see our response to weakness 3 above.

Question 2: How are symmetry-correction and optimal rigid structure alignment modules applied at test time, i.e., without ground-truth structures?

At test time, these modules are only applied between subsequent steps of the roll-out trajectory. That is, we attempt to keep each step geometrically and permutationally matched to the previous step. This results in straighter trajectories that may otherwise have been confounded by identical polymer chains and/or permutational symmetry within chains.

Question 3: Where is the data that the percentages quoted in lines 237-244 refer to? I don't see those numbers in Figure 4

We thank the reviewer for pointing out this typo on lines 237-244. The correct values are listed in Figure 4D. These typos will be fixed in the revised manuscript.

Final Decision

This submission presents NeuralPLexer3, an all-atom structure prediction model. The reviewers appreciated the architectural advances, the implementation of Flash-TriangularAttention, the strong performance, and the thorough benchmarks, some of which are novel and significant in themselves. Reviewer concerns largely centered on clarity, organization, presentation, and comparisons to AF3. Discussions mainly focused on resolving these clarity concerns, with two reviewers raising their scores and two others maintaining a positive rating. Altogether, the reviewer consensus is a clear accept.