PaperHub
Overall: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 5 (min 3, max 5, std 0.7)
Confidence: 3.3
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.5 · Significance: 3.0
NeurIPS 2025

Accurately Predicting Protein Mutational Effects via a Hierarchical Many-Body Attention Network

OpenReview · PDF
Submitted: 2025-04-18 · Updated: 2025-10-29
TL;DR

We present H3-DDG, a model that predicts binding free energy changes by modeling higher-order many-body interactions, achieving state-of-the-art performance in protein mutational effect prediction.

Abstract

Predicting changes in binding free energy ($\Delta\Delta G$) is essential for understanding protein-protein interactions, which are critical in drug design and protein engineering. However, existing methods often rely on pre-trained knowledge and heuristic features, limiting their ability to accurately model complex mutation effects, particularly higher-order and many-body interactions. To address these challenges, we propose H3-DDG, a Hypergraph-driven Hierarchical network to capture Higher-order many-body interactions across multiple scales. By introducing a hierarchical communication mechanism, H3-DDG effectively models both local and global mutational effects. Experimental results demonstrate state-of-the-art performance on multiple benchmarks. On the SKEMPI v2 dataset, H3-DDG achieves a Pearson correlation of 0.75, improving multi-point mutations prediction by 12.10%. On the challenging BindingGYM dataset, it outperforms Prompt-DDG and BA-DDG by 62.61% and 34.26%, respectively. Ablation and efficiency analyses demonstrate its robustness and scalability, while a case study on SARS-CoV-2 antibodies highlights its practical value in improving binding affinity for therapeutic design.
Keywords

Protein Mutational Effects · Protein-Protein Interactions · Binding Free Energy · Hierarchical Many-Body Attention

Reviews and Discussion

Review
Rating: 4

This paper presents H3-DDG, a hypergraph-driven hierarchical network designed to capture higher-order many-body interactions for predicting binding free energy changes ($\Delta\Delta G$) in protein-protein interactions. The method constructs three hierarchical graph representations: residue-level graphs, mutation-centered hypergraphs, and fine-grained mutation subgraphs, employing 2-body, 3-body, and 4-body attention mechanisms to model complex interactions across multiple scales. H3-DDG achieves state-of-the-art performance on the SKEMPI v2 and BindingGYM datasets, with particularly strong results in challenging multi-point mutation scenarios, demonstrating a 12.10% improvement over BA-DDG in multi-point mutations on SKEMPI v2.

Strengths and Weaknesses

Strengths


S1. Predicting $\Delta\Delta G$ for protein-protein interactions is crucial for drug design and protein engineering. The authors clearly analyze the limitations of existing methods in handling higher-order and many-body interactions, particularly for multi-point mutations where the exponential mutation space poses significant challenges.

S2. The paper is generally well-written, with a clear presentation of the hierarchical architecture. The model design is simple yet effective, with well-motivated many-body attention mechanisms that capture both local and global interactions, and the hierarchical design balances computational complexity with biological interpretability.

S3. Extensive experiments across multiple benchmarks (SKEMPI v2, BindingGYM) demonstrate strong performance improvements. Thorough ablation studies convincingly show the effectiveness of each component, and practical validation on SARS-CoV-2 antibody optimization demonstrates real-world applicability.

S4. The method achieves significant improvements in the challenging multi-point mutation scenarios, addressing a key limitation of existing approaches that struggle with complex interdependencies between mutation sites.

Weaknesses


W1. While the authors mention "manageable computational overhead," detailed runtime comparisons with baseline methods are missing from the main text. Given the hierarchical nature and higher-order attention mechanisms, comprehensive computational cost analysis is crucial for practical adoption.

W2. The repeated residual connections in equations (5) and (6) are unconventional and may preserve more of the original features, but this design choice lacks detailed justification.

W3. The paper lacks comparison with very recent methods such as Light-DDG [1], which reportedly achieves 89.7× inference acceleration and a 15.45% performance gain over previous SOTA methods.


Reference [1] Wu et al. A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer. ICLR 2025.

Questions

Q1. How does H3-DDG compare with the recently proposed Light-DDG method in terms of both performance and computational efficiency?

Q2. Can the hierarchical embeddings and many-body interaction representations learned by H3-DDG be effectively transferred to other protein engineering tasks?

Q3. How sensitive is the method's performance to the choice of clustering parameters?

Q4. What is the relative contribution of each hierarchical level (2-body, 3-body, 4-body) to the final performance, and how does this vary across different types of mutations?

Q5. Given the computational requirements of the hierarchical many-body attention, what are the practical limitations for real-time protein design applications?

Limitations

Yes.

Final Justification

The authors have addressed my main concerns and I will keep my current positive scores.

Formatting Issues

I did not notice any issues.

Author Response

We sincerely thank the reviewer for the positive feedback on our method's novelty, effectiveness, and practical impact, as well as for the detailed review and insightful comments. Below, we provide point-by-point responses.

W1: Computational Cost Analysis

R: Thank you for highlighting the need for detailed runtime comparisons. We have supplemented the main text with comprehensive computational efficiency results. As shown below, incorporating higher-order attention significantly improves performance (Pearson from 0.7118 to 0.7501), while training speed remains within an acceptable range:

| Method | Attn. Between Hyperedges | Attn. Around Mut. Sites (Subgraph) | Pearson | Training Speed (its/sec) |
|---|---|---|---|---|
| BA-DDG | -- | -- | 0.7118 | 6.25 |
| H3-DDG | 3-body | -- | 0.7317 | 5.65 |
| H3-DDG | 3-body | 3-body | 0.7352 | 4.50 |
| H3-DDG | 3-body | 4-body | 0.7501 | 4.34 |

These results demonstrate that the performance gains from many-body attention come at a manageable computational overhead.

W2: Residual Connections in Equations (5) and (6)

R: Thank you for your feedback. The repeated residual connections in equations (5) and (6) are inspired by the original ProteinMPNN design. In our multi-layer interaction modeling, such residual connections help preserve original features in deeper networks, improving stability and information flow. This design choice ensures the model can benefit from complex higher-order representations without losing essential input information.

W3 & Q1: Comparison with Light-DDG

R: We appreciate the suggestion to include comparisons with recent methods. We have added a comparison with Light-DDG [1] using the per-structure Pearson and Spearman results reported in their paper:

| Method | Pearson | Spearman |
|---|---|---|
| Light-DDG | 0.5440 | 0.5004 |
| H3-DDG | 0.5686 | 0.5281 |

Regarding training efficiency, Light-DDG requires data augmentation for each fold, i.e., training on all other folds to generate augmented data, leading to multiple training runs per split and thus lower overall efficiency. In contrast, H3-DDG is trained directly on the original data, making it more efficient in terms of total training time.

We acknowledge that Light-DDG achieves faster inference speed, but our method offers a better trade-off between accuracy and overall training efficiency.

[1] Wu et al. A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer. ICLR 2025.

Q2: Applicability to Related Tasks

R: Thank you for the question. H3-DDG is designed to be highly extensible to related tasks such as protein-ligand binding affinity prediction and enzyme design. The core mechanism, hypergraph-based k-body attention, is modular and can be easily adapted to different molecular contexts.

For protein-ligand binding, the higher-order attention (e.g., 4-body) can be applied to the protein-ligand interface instead of the mutation subgraph, capturing cooperative interactions between residues and ligand atoms. Similarly, for enzyme design, the method can focus $k$-order attention on the enzyme active site and its interaction with substrates or cofactors.

In all cases, the main adaptation is to redefine the region of interest for higher-order attention based on the task (e.g., mutation site, interface, or active site). The underlying model architecture and workflow remain unchanged, demonstrating H3-DDG's flexibility and potential for broad application in molecular modeling tasks.

Q3 & Q4: Analysis of Clustering Parameters and Ablation

R1: We thank the reviewer for highlighting the importance of hyperparameter analysis. We have conducted comprehensive ablation studies to investigate the impact of different choices in hyperedge construction, attention order, and the number of hyperedges. The main findings are summarized below:

We compared our H3-DDG hyperedge construction with DiffPool and MinCutPool (see table below). Our hyperedge construction achieves the best performance. This is primarily because DiffPool and MinCutPool, which rely on learned assignment matrices, cannot explicitly ensure that the mutation site serves as the central node in each cluster. Consequently, they may group the mutation site with irrelevant residues, weakening the model's ability to capture the local effect of mutations. In contrast, H3-DDG always centers the mutation site in constructed hyperedges, enabling more precise modeling of local many-body interactions and leading to superior predictive accuracy.

| Hyperedge Construction | Pearson | Spearman | RMSE | MAE |
|---|---|---|---|---|
| DiffPool | 0.7261 | 0.6407 | 1.4209 | 1.0079 |
| MinCutPool | 0.7275 | 0.6456 | 1.4178 | 1.0027 |
| H3-DDG (ours) | 0.7317 | 0.6485 | 1.4085 | 0.9923 |

We further investigated the impact of the number of hyperedges, where $L$ is the number of nodes; $L/10$ hyperedges therefore means there are on average 10 nodes per hyperedge. Increasing the number of hyperedges leads to notable improvements in prediction accuracy, with only moderate additional computational cost.

| Num of Hyperedges | Pearson | Training Speed (its/sec) ↑ |
|---|---|---|
| L/10 | 0.7418 | 5.11 |
| L/6 | 0.7482 | 4.68 |
| L/4 | 0.7501 | 4.34 |

We performed ablation on the order of attention (2-body, 3-body, and 4-body):

Adding 3-body attention markedly improves performance (Pearson from 0.7111 to 0.7317). Incorporating 4-body attention around mutation sites yields a further gain (Pearson to 0.7501). These results highlight the value of explicit higher-order modeling, especially for complex scenarios involving multiple mutations.

| 2-Body Message Passing | 3-Body Attn. Between Hyperedges | 4-Body Attn. Around Mut. Sites (Subgraph) | Mutations | Pea. | Spear. | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| ✓ | -- | -- | all | 0.7111 | 0.6351 | 1.4530 | 1.0304 |
| ✓ | -- | -- | single | 0.7199 | 0.6107 | 1.2071 | 0.8558 |
| ✓ | -- | -- | multiple | 0.6792 | 0.6651 | 1.9804 | 1.5034 |
| ✓ | ✓ | -- | all | 0.7317 | 0.6485 | 1.4085 | 0.9923 |
| ✓ | ✓ | -- | single | 0.7383 | 0.6273 | 1.1728 | 0.8242 |
| ✓ | ✓ | -- | multiple | 0.7040 | 0.6676 | 1.9161 | 1.4528 |
| ✓ | ✓ | ✓ | all | 0.7501 | 0.6604 | 1.3665 | 0.9612 |
| ✓ | ✓ | ✓ | single | 0.7471 | 0.6374 | 1.1560 | 0.8080 |
| ✓ | ✓ | ✓ | multiple | 0.7341 | 0.6913 | 1.8320 | 1.3880 |

Our ablation studies demonstrate that the hypergraph-based construction and many-body attention mechanisms of H3-DDG are crucial to its superior performance. The chosen hyperparameters are well-justified by empirical results and offer strong predictive power while maintaining reasonable computational efficiency.

Q5: Large-scale and Real-time Applications

R: Thank you for raising the important issue of scalability to real-time protein design applications.

The key innovation of our proposed method H3-DDG, hypergraph-based many-body attention, was designed with scalability and efficiency in mind and provides several advantages over traditional approaches such as BA-DDG.

  1. Hierarchical Attention with Adaptive Complexity

We apply different orders of attention (e.g., 2-body, 3-body, 4-body) to different structural hierarchies, allowing computational resources to be focused where most needed without excessive overhead.

  2. Controllable Hypergraph Construction

By leveraging hypergraphs, we can efficiently capture long-range and many-body interactions. Importantly, the number of hyperedges and the scope of each hyperedge are controllable, allowing us to balance modeling power and computational cost.

  3. Tunable Hyperparameters for Efficiency

The hyperparameters such as the number of edges in 4-body attention and the cutoff radius for mutation subgraphs are all user-controllable. This flexibility allows users to tailor the method for large-scale applications while keeping computational complexity manageable.

In practice, we have found that even for very large proteins (e.g., the largest proteins in the SKEMPI v2 dataset with over 3,000 residues), H3-DDG can be efficiently trained and evaluated on a single 4090 GPU (24GB memory). This demonstrates the method's suitability for large-scale and high-throughput applications.

Comment

Thank you very much for your consideration and positive evaluation of our work. We appreciate your feedback and will make sure to include the supplementary results in the revised manuscript as requested. Please let us know if you have any additional comments or questions.

Comment

Thank you for the rebuttal. I am inclined to maintain my current positive score. Please include supplementary results in the revised paper.

Review
Rating: 4

This paper introduces H3-DDG, a hypergraph-driven hierarchical network to predict higher-order many-body interactions across multiple scales. The authors compared the performance of their method with that of others and demonstrated improved performance, especially in cases with three or more mutations.

Strengths and Weaknesses

Strengths: The topic is undoubtedly critical to assessing the implications of protein mutations. The methods are clearly described with detailed comparison results.

Weaknesses: The proposed methods are ad hoc, although the proposals are all sensible. It is unclear what the practical implications of the improvement in prediction are, e.g., how these will translate into better drug targets. Additionally, a discrepancy exists between the performance improvement reported in the abstract and that reported in the tables, which lowers confidence in the results presented in the paper. We therefore have several questions following the weaknesses mentioned above:

  1. The results are not reliable, and the authors should include more discussion of the ablation studies and hyper-parameter tuning. Did the authors also adjust the hyper-parameters of the other baselines to make a fair comparison?

  2. The authors should mention the criteria for selecting the k-body attention design; there seems to be no unified rule for selecting a suitable k across different tasks, so the authors should devote more discussion to the effect of k.

  3. For the antibody optimization task, it seems that H3-DDG did not perform well for RH103M, while other baseline methods such as MSM-Mut show stronger performance. Could the authors explain the performance difference and the preferred use cases of the proposed method?

Questions

Can the authors discuss how the improvement translates into better performance in downstream tasks? It seems that the overall improvement over BA-DDG is relatively minor.

Limitations

The authors discussed the limitations.

Final Justification

I think this paper is good for publication.

Formatting Issues

None.

Author Response

We sincerely thank the reviewer for recognizing the significance of our work and the clarity of our methodology and results. Briefly, the reviewer has two main concerns: (1) the reliability and fairness of the experimental comparisons, including ablation studies and hyper-parameter tuning, and (2) the rationale for design choices (such as the $k$-body attention) and the interpretation of performance differences across tasks. Below, we summarize the reviewer's questions and provide point-by-point responses and clarifications.

W0 (1): The practical implications of the improvement in prediction.

R: Our proposed method advances protein representation learning by explicitly modeling many-body interactions at multiple scales. This approach not only improves the prediction of protein mutation effects, but also has the potential to strengthen downstream applications that depend on high-quality protein representations, such as protein function prediction, drug target interaction analysis, and protein engineering.

To illustrate the practical utility for drug discovery, we have evaluated our method on an antibody dataset for $\Delta\Delta G$ prediction, a key task in therapeutic antibody optimization, as accurate $\Delta\Delta G$ prediction facilitates the identification of affinity-enhancing mutations. Specifically, on the SARS-CoV-2 antibody benchmark, our method achieves an average rank of 9.67% for identifying favorable mutations, as measured by the proportion of top-ranked beneficial mutations (see Table 4), outperforming all baseline approaches.

Notably, our method is the only approach to achieve an average rank below 10% across all tested mutations, highlighting its strong generalization capability and significant practical relevance for antibody design and drug development.

W0 (2): The discrepancy between the performance improvement reported in the abstract and that reported in the tables.

R: We have rigorously checked all reported values to ensure accuracy and consistency. The improvements stated in the abstract are calculated as follows:

  • SKEMPI v2.0 dataset: The 12.10% improvement refers to the relative gain of H3-DDG over BA-DDG in the per-structure Pearson correlation metric (see Table 1).
  • BindingGYM dataset: The 62.61% and 34.26% improvements refer to H3-DDG's gain over Prompt-DDG and BA-DDG, respectively, in the Pearson correlation metric (see Table 2).

All results in the paper have been cross-validated to avoid inconsistencies.

W1: The reliability of the experimental results and more discussions for ablation studies and hyper-parameter tuning.

R: All experimental setups strictly follow established protocols from prior works (BA-DDG, Prompt-DDG), ensuring fair and reliable comparisons. We also set the hyper-parameters of all baselines to their published optimal settings.

We evaluated H3-DDG's hyperedge construction against DiffPool and MinCutPool, achieving the best performance (see table below). This is primarily because DiffPool and MinCutPool, which are based on learned assignment matrices, cannot explicitly set the mutation site as a central node within each cluster. As a result, they may group the mutation site with unrelated residues, potentially weakening the effect of the mutation. In contrast, the H3-DDG approach ensures that the mutation site is always central in the constructed hyperedges, enabling more precise modeling of local many-body interactions and thus leading to superior predictive performance.

| Hyperedge Construction | Pearson | Spearman | RMSE | MAE |
|---|---|---|---|---|
| DiffPool | 0.7261 | 0.6407 | 1.4209 | 1.0079 |
| MinCutPool | 0.7275 | 0.6456 | 1.4178 | 1.0027 |
| H3-DDG (ours) | 0.7317 | 0.6485 | 1.4085 | 0.9923 |

We further investigated the impact of increasing hyperedge count and edge density in the mutation subgraph, finding that higher connectivity with 4-body attention notably improves prediction accuracy at a moderate computational cost, detailed in the following tables. Specifically, denser hyperedge and subgraph constructions enable the model to better capture complex local interactions, resulting in consistently higher Pearson correlations and lower error metrics, while maintaining efficient training speeds. This demonstrates that richer local connectivity is beneficial for modeling protein mutations, as long as computational complexity remains manageable.

| Num of Hyperedges | Pearson | Training Speed (its/sec) ↑ |
|---|---|---|
| L/10 | 0.7418 | 5.11 |
| L/6 | 0.7482 | 4.68 |
| L/4 | 0.7501 | 4.34 |

| Num of Edges in Mut. Subgraph | Complexity | Pearson | Spear. | RMSE | MAE | Training Speed (its/sec) ↑ |
|---|---|---|---|---|---|---|
| N | O(N³) | 0.7352 | 0.6509 | 1.4007 | 0.9805 | 4.50 |
| 2N | O(2·N³) | 0.7461 | 0.6570 | 1.3760 | 0.9719 | 4.38 |
| 3N | O(3·N³) | 0.7501 | 0.6604 | 1.3665 | 0.9612 | 4.34 |

Finally, we evaluated the impact of different cutoff radii for extracting the mutation subgraph. As the cutoff radius increases, prediction performance improves due to the inclusion of more relevant structural context. However, beyond 8 Å, the performance gains become marginal while computational cost continues to rise.

| Cutoff of Mut. Subgraph (Å) | Pearson | Spear. | RMSE | MAE | Training Speed (its/sec) ↑ |
|---|---|---|---|---|---|
| 5 | 0.7378 | 0.6538 | 1.3951 | 0.9834 | 4.52 |
| 8 | 0.7501 | 0.6604 | 1.3665 | 0.9612 | 4.34 |
| 12 | 0.7511 | 0.6518 | 1.3644 | 0.9602 | 4.03 |
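As an illustration of the cutoff-based extraction described above, here is a minimal sketch (our own hypothetical `mutation_subgraph` helper, not the authors' code) that selects all residues within a given Cα-Cα radius of any mutation site:

```python
import numpy as np

def mutation_subgraph(coords, mutation_sites, cutoff=8.0):
    """Select residues within `cutoff` Å (Ca-Ca distance) of any mutation site
    to form the fine-grained mutation subgraph. Illustrative sketch only;
    8 Å is the setting favored in the ablation above."""
    keep = set(mutation_sites)
    for m in mutation_sites:
        d = np.linalg.norm(coords - coords[m], axis=1)  # distances to site m
        keep |= set(np.where(d <= cutoff)[0].tolist())
    return sorted(keep)

# toy example: four residues on a line, one mutation at the origin
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0], [20.0, 0, 0]])
print(mutation_subgraph(coords, [0], cutoff=8.0))  # -> [0, 1]
```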

W2: The criteria for selecting the k-body attention design.

R: We thank the reviewer for raising this point. The choice of $k$ in H3-DDG is made to balance expressive power and computational cost. Many-body effects are well established as essential for modeling protein energetics (as also noted by reviewers #z7aD and #zCJg), but computational complexity increases exponentially with $k$. Therefore, we employ higher-order attention (4-body) only within the mutation subgraph, where modeling complex local interactions is most critical, while utilizing 3-body attention in the hypergraph and standard 2-body attention elsewhere.

Ablation studies demonstrate that introducing higher-order attention layers leads to substantial improvements in predictive accuracy, particularly for multi-point mutations where many-body interactions are most pronounced. These results confirm that integrating both 3-body and 4-body attention is crucial for accurately modeling protein mutation effects, with the largest gains observed when both are combined.

| 2-Body Message Passing | 3-Body Attn. Between Hyperedges | 4-Body Attn. Around Mut. Sites (Subgraph) | Mutations | Pea. | Spear. | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| ✓ | -- | -- | all | 0.7111 | 0.6351 | 1.4530 | 1.0304 |
| ✓ | -- | -- | single | 0.7199 | 0.6107 | 1.2071 | 0.8558 |
| ✓ | -- | -- | multiple | 0.6792 | 0.6651 | 1.9804 | 1.5034 |
| ✓ | ✓ | -- | all | 0.7317 | 0.6485 | 1.4085 | 0.9923 |
| ✓ | ✓ | -- | single | 0.7383 | 0.6273 | 1.1728 | 0.8242 |
| ✓ | ✓ | -- | multiple | 0.7040 | 0.6676 | 1.9161 | 1.4528 |
| ✓ | ✓ | ✓ | all | 0.7501 | 0.6604 | 1.3665 | 0.9612 |
| ✓ | ✓ | ✓ | single | 0.7471 | 0.6374 | 1.1560 | 0.8080 |
| ✓ | ✓ | ✓ | multiple | 0.7341 | 0.6913 | 1.8320 | 1.3880 |
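To make the exponential cost of higher k concrete, a small back-of-the-envelope calculation (the sizes 300 and 20 are purely illustrative, not from the paper) of how the number of candidate k-tuples grows:

```python
from math import comb

# The number of candidate k-body interaction terms among n residues grows as
# C(n, k), which is why 4-body attention is affordable only on the small
# mutation subgraph while 3-body attention runs between hyperedges and
# standard 2-body attention runs everywhere else.
n_full, n_sub = 300, 20   # illustrative sizes: whole complex vs. mutation subgraph
print(comb(n_full, 4))    # -> 330791175 quadruples on the full graph (infeasible)
print(comb(n_sub, 4))     # -> 4845 quadruples on the subgraph (tractable)
```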

W3: The performance difference on the antibody optimization task.

R: Thank you for this insightful question. Although H3-DDG achieves the best average performance across all mutation sites, we acknowledge that its performance on RH103M is slightly lower than that of MSM-Mut.

Based on additional structural analysis, we believe this difference may be attributed to the unique and complex local environment at the RH103M site. This discrepancy could arise from limited coverage of similar structural cases in the training set or the presence of an atypical local energy landscape that presents particular challenges for predictive modeling. On the other hand, the modeling assumptions inherent to MSM-Mut may, by chance, align more closely with the specific structural or energetic characteristics of RH103M, resulting in slightly better performance for this particular mutation.

Nevertheless, H3-DDG achieves the best average performance (9.67%, lower is better) across all mutations and demonstrates superior generalization and robustness across diverse mutation sites, whereas MSM-Mut (average 14.41%) shows advantages only in a small number of specific cases, such as RH103M. These findings not only underscore the need for continued advances in modeling complex mutation environments, but also highlight the strength of our approach: the higher-order attention and explicit many-body interaction modeling in H3-DDG provide a principled and scalable framework.

Q1(2): The improvement over BA-DDG.

R: Compared to BA-DDG, H3-DDG introduces substantial advances in both model design and capability. As reviewer #z7aD noted, our use of hypergraph representations combined with many-body attention mechanisms (up to 4-body) is well-motivated. Conventional approaches like BA-DDG primarily capture pairwise interactions and thus overlook important higher-order effects such as hydrogen bond networks, $\pi$-$\pi$ stacking, or cooperative salt bridges, which are critical for accurate protein representation.

H3-DDG addresses this limitation by explicitly modeling many-body and long-range dependencies through hierarchical communication and higher-order attention on hypergraphs. By leveraging different $k$-order attentions for various structural hierarchies, H3-DDG is able to capture both local and global interaction patterns that are inaccessible to pairwise models.

This enhanced representational power enables H3-DDG to more accurately model the complex, cooperative effects underlying protein stability, especially for challenging scenarios like multi-point mutations. As a result, downstream predictive performance is significantly improved. Specifically, as reported in the results section, H3-DDG achieves a 12.10% improvement over BA-DDG in Pearson correlation, demonstrating both the practical utility and the effectiveness of our modeling innovations.

Comment

Dear reviewer,

Please engage in the discussion with the authors. The discussion period will end in a few days.

Comment

Thank you for your experiments. I have increased my scores.

Comment

Thank you for your thoughtful review, detailed feedback, and kind guidance on our work. We sincerely appreciate the time and effort you devoted to evaluating our manuscript. Your comments and instructions were instrumental in helping us strengthen our work. Please let us know if there are any further questions.

Review
Rating: 3

The paper proposes a neural network architecture to estimate free energy differences based on the output probabilities of an inverse folding model. The main contribution is a many-body attention mechanism operating on various graph and hypergraph representations.

Strengths and Weaknesses

While the introduction starts rather clearly, the methods lack rigor and clarity. Many elements are not defined or detailed (e.g., which 25 atom pair distances exactly in L118? How is K defined in Eq. 2? How are positional encodings computed, and how is chain identity factored in? What is the precise definition of the hypergraph and its hyperedges? How are hyperedges constructed, by merging any two adjacent edges? Are they directed? What are the dimensions of the weight matrices W in the attention? How are hyperedge features $f_{ij}$ projected, are they flattened? What are the hyperparameters of the clustering? How were they chosen? What is the metric used in Table 4? etc.). Unfortunately this lack of clarity makes it hard to follow their (rather complex) model setup, and prevents reproducibility.

The motivation is sensible and the addressed problem is very relevant. However, the claim that the proposed network learns hydrogen bond networks or $\pi$-$\pi$ stacking remains unsubstantiated.

While I appreciate the presence of ablation studies, I have difficulties understanding what was done in the pooling ablation. Which part of the model was replaced, what were the baselines? A bit more context would be good here. I also wonder whether the observed performance gains might stem from the added graph representations and not the attention mechanisms. An ablation of the attention while keeping the full graph, the clustered graph, and the mutation subgraph (modeling them with ProteinMPNN) seems to be a relevant experiment here.

I further wonder about the effect of hyperparameters. How does performance depend on, e.g., the hypergraph construction?

That said, the experimental results are good and add value to the field. But given the issues of clarity and reproducibility I unfortunately cannot recommend to accept at this stage. Since there is no revision, I leave it to the discretion of the AC whether to trust that these issues will be addressed.

问题

See above, further:

  • What is the rationale for choosing 3-body interaction on the pooled graph and 4-body interaction on the mutation subgraph? Why not both the same, or the other way around?
  • Can you quantify the added value of using a clustered graph instead of the full graph? How much is runtime reduced by this? Would it be feasible to run a comparison of the model on an unclustered graph?
  • L186: How does 4-body attention only increase complexity by a constant factor?
  • Is using a pretrained ProteinMPNN necessary? Do they add performance? Can the model be trained from scratch? How do you justify using the pretrained weights for a model that has significantly different architecture?

Limitations

The discussion of limitations is very brief with one sentence in the conclusion. I would recommend to expand on this, and also discuss runtime complexity and scaling with various hyperparameters.

Final Justification

I appreciate the authors' efforts in answering my questions and improving clarity, and acknowledge it by increasing my score to a borderline reject. Since I cannot verify that my main concerns about clarity and reproducibility have been sufficiently addressed in the revision, I leave it to the AC to trust whether these issues will be resolved.

Formatting Issues

Author Response

We sincerely thank the reviewer for their detailed and constructive feedback, and we apologize for the insufficient details regarding ProteinMPNN in the Methods section of our manuscript. We have carefully addressed each comment, and will ensure that all relevant clarifications and improvements, particularly the additional details on methods, are thoroughly incorporated into the revised manuscript. The reviewer's suggestions have been invaluable in improving the quality and clarity of our work.

W1: The clarity of Methods

We thank the reviewer for highlighting these important points regarding the clarity and rigor of our Methods section. We apologize for the lack of detail, particularly regarding ProteinMPNN, and will ensure that all relevant information is clearly described in the revised manuscript. We wish to emphasize that these details refer to module settings that are exactly the same as those used in the previous work ProteinMPNN.

  1. Which 25 atom pair distances in L118? How is K defined in Eq. 2?

The 25 atom pair distances follow the setting in ProteinMPNN: all possible pairs between the backbone atoms (N, Ca, C, O) and the Cb atom of each residue (e.g., the Ca-Cb pair represents the distance between the Ca and Cb atoms). The value of K is set to 16, which is the number of RBFs used to encode atomic pair distances into continuous feature vectors, consistent with ProteinMPNN.
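A minimal sketch of this RBF distance featurization (our own illustration; the centers spanning 2-22 Å are an assumption carried over from the ProteinMPNN convention, not stated in the response):

```python
import numpy as np

def rbf_encode(d, K=16, d_min=2.0, d_max=22.0):
    """Encode a pairwise distance d (in Å) into K radial basis function values.
    Centers are evenly spaced in [d_min, d_max] and widths equal the spacing,
    following the ProteinMPNN convention (assumed here)."""
    centers = np.linspace(d_min, d_max, K)
    sigma = (d_max - d_min) / K
    return np.exp(-((d - centers) / sigma) ** 2)

# Each residue pair yields 25 distances ((N, Ca, C, O, Cb) x (N, Ca, C, O, Cb)),
# each expanded into K = 16 RBF features -> a 25 * 16 = 400-dim edge feature.
feat = rbf_encode(5.0)
assert feat.shape == (16,)
```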

  2. How are positional encodings exactly computed? How is chain identity factored in?

The positional encoding is computed by calculating the relative position offset between residues and applying one-hot encoding, followed by a linear projection. Chain identity is encoded as a binary indicator (1 if the two residues are on the same chain, 0 otherwise), which is concatenated to the positional offset and input into the positional encoding module.
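A minimal sketch of this positional encoding (our own illustration; the ±32 clipping window is an assumption borrowed from ProteinMPNN, and the final linear projection is omitted):

```python
import numpy as np

def relative_positional_encoding(i, j, same_chain, max_offset=32):
    """One-hot encoding of the clipped residue-index offset (i - j), concatenated
    with a same-chain indicator bit; a linear projection would follow in the
    actual module. max_offset = 32 is an assumed window (ProteinMPNN-style)."""
    offset = int(np.clip(i - j, -max_offset, max_offset)) + max_offset  # in [0, 64]
    one_hot = np.zeros(2 * max_offset + 1)
    one_hot[offset] = 1.0
    return np.concatenate([one_hot, [1.0 if same_chain else 0.0]])

enc = relative_positional_encoding(10, 7, same_chain=True)
assert enc.shape == (66,) and enc[35] == 1.0 and enc[-1] == 1.0
```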

  3. What is the precise definition of the hypergraph and its hyperedges? How are hyperedges constructed?

We use the standard definition of a hypergraph: a hypergraph is an ordered pair $H = (V, E)$, where $V$ is the set of nodes and $E$ is a set of subsets of $V$, each subset being a hyperedge. In H3-DDG, nodes correspond to amino acids, and each hyperedge consists of one or more amino acids that are grouped together based on spatial proximity using distance-based clustering. Hyperedges are not constructed by merging adjacent edges, nor are they directed.
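A minimal sketch of mutation-centered, distance-based hyperedge construction (our own illustration of the idea; the `build_hyperedges` helper and its greedy farthest-point center selection are hypothetical, as the exact clustering procedure is not specified here):

```python
import numpy as np

def build_hyperedges(coords, mutation_sites, n_hyperedges):
    """Group residues into hyperedges by spatial proximity, always placing each
    mutation site at the center of its own hyperedge.

    coords: (L, 3) Ca coordinates; mutation_sites: residue indices;
    n_hyperedges controls granularity (e.g., L/4 in the ablations)."""
    centers = list(mutation_sites)
    # fill remaining centers greedily with residues far from existing centers
    while len(centers) < n_hyperedges:
        dists = np.min([np.linalg.norm(coords - coords[c], axis=1) for c in centers], axis=0)
        centers.append(int(np.argmax(dists)))
    # assign every residue to its nearest center -> one hyperedge per center
    assignment = np.argmin([np.linalg.norm(coords - coords[c], axis=1) for c in centers], axis=0)
    return [set(np.where(assignment == k)[0]) | {centers[k]} for k in range(len(centers))]
```

Because mutation sites are seeded as centers, each always ends up inside (and central to) its own hyperedge, which is the property DiffPool and MinCutPool cannot guarantee.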

  4. What are the dimensions of the weight matrices $W$ in the attention? How are hyperedge features $f_{ij}$ projected?

The projection weight matrices $W$ in the attention module have dimensions [hidden_dim, hidden_dim] (typically hidden_dim = 128). Hyperedge features $f_{ij}$ are not flattened; they are directly projected using a learnable weight matrix while preserving their original tensor structure.

We appreciate the reviewer's attention to these details and will revise the Methods section to ensure full transparency and reproducibility.

W1: The metric used in Table 4?

R: We thank the reviewer for the question. The experimental setup and evaluation metric in Table 4 strictly follow the protocols established by MSM-Mut, Prompt-DDG, and other baselines. Specifically, the metric shown is the percentile rank of each beneficial mutation among all possible mutations for the protein (494 sites in total). For each method, we predict $\Delta\Delta G$ for all possible mutations and rank them from lowest to highest. The percentage shown in the table indicates the rank position of each beneficial mutation. For H3-DDG, the TH31W mutation is ranked within the top 3.44% of all possible mutations.
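In code, the percentile-rank metric described above amounts to the following (a plain-Python illustration, not the authors' evaluation script; a lower predicted ΔΔG ranks a mutation as more beneficial):

```python
def percentile_rank(pred_ddg, target_idx):
    """Percentile rank (in %) of one mutation among all candidates,
    ranking predicted ddG from lowest (most beneficial) to highest."""
    rank = 1 + sum(s < pred_ddg[target_idx] for s in pred_ddg)  # 1-based
    return 100.0 * rank / len(pred_ddg)

# The mutation at index 0 is the 2nd lowest of 10 predictions -> top 20%.
pr = percentile_rank([-1.2, -2.0, 0.3, 0.5, 0.8, 1.0, 1.1, 1.4, 2.0, 2.2], 0)
```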

W1 & W3 & W4: The ablation study and hyperparameters of clustering.

R: We thank the reviewer for the insightful comments regarding the ablation studies and clustering hyperparameters. We address both points below and will ensure to clarify these details in the revised manuscript.

Hypergraph Ablation Details: We compared our hyperedge construction with DiffPool and MinCutPool. Our hyperedge construction achieves the best performance. H3-DDG always centers the mutation site in constructed hyperedges, enabling more precise modeling of local many-body interactions and leading to superior predictive accuracy.

| Hyperedge Construction | Pearson | RMSE |
| --- | --- | --- |
| DiffPool | 0.7261 | 1.4209 |
| MinCutPool | 0.7275 | 1.4178 |
| H3-DDG (ours) | 0.7317 | 1.4085 |

Ablation of Attention Mechanisms: In response to the reviewer's suggestion, we further implemented versions where we retained the enhanced graph representations, namely the full graph, the clustered graph, and the mutation subgraph (using ProteinMPNN), but used only 2-body attention mechanisms. The ablation results show that simply enhancing the graph structure without higher-order attention does not achieve the same level of performance improvement.

| Full Graph | Between Hyperedges | Around Mut. Sites (Subgraph) | Pearson | RMSE |
| --- | --- | --- | --- | --- |
| 2-body | × | × | 0.7111 | 1.4530 |
| 2-body | 2-body | × | 0.7205 | 1.4332 |
| 2-body | 2-body | 2-body | 0.7248 | 1.4239 |
| 2-body | 3-body | × | 0.7317 | 1.4085 |
| 2-body | 3-body | 4-body | 0.7501 | 1.3665 |

Clustering Hyperparameters: Hyperedges are constructed by clustering amino acids based on spatial proximity. We investigated the impact of the number of hyperedges, where $L$ is the number of nodes; $L/10$ hyperedges means there are on average 10 nodes per hyperedge. Increasing the number of hyperedges leads to notable improvements in prediction accuracy, with only moderate additional computational cost.

| Num. of Hyperedges | Pearson | Training Speed (its/sec) ↑ |
| --- | --- | --- |
| L/10 | 0.7418 | 5.11 |
| L/6 | 0.7482 | 4.68 |
| L/4 | 0.7501 | 4.34 |

All ablation results and details of hyperparameter selection will be explicitly described in the revised manuscript to ensure clarity and reproducibility.

W2: The Motivation of this work

R: We thank the reviewer for recognizing the motivation and significance of our work. As noted by Reviewer #zCJg, our model achieves significant improvements in challenging multi-point mutation scenarios, where the exponential mutation space is especially demanding. Our method directly tackles the shortcoming of previous methods that focus mainly on pairwise interactions, often overlooking crucial higher-order effects such as hydrogen-bond networks and π-π stacking.

To further support our claims, we will include visualization analyses in the revised manuscript, illustrating cases where our model successfully captures known higher-order interactions.

Q1: Rationale for Using 3-Body and 4-Body Interactions

R: We thank the reviewer for raising this important question. Using 4-body attention in the mutation subgraph and 3-body attention on the pooled graph balances expressive power and computational efficiency. Higher-order (4-body) interactions are most critical near mutation sites, where local many-body effects matter most and the subgraph size is manageable. For the broader graph, 3-body attention captures important non-pairwise effects globally without excessive computational cost.

Our ablation studies further support this design choice: adding higher-order attention layers leads to significant improvements in predictive accuracy.

| Around Mut. Sites (Subgraph) | Pearson | RMSE | Training Speed (its/sec) ↑ |
| --- | --- | --- | --- |
| 3-body | 0.7352 | 1.4007 | 4.50 |
| 4-body | 0.7501 | 1.3665 | 4.34 |

Q2: Clustered Graph vs. Full Graph

R: We thank the reviewer for this insightful question. To quantify the added value of the clustered graph over the full graph, we attempted to run our model on unclustered (full) graphs. However, for proteins with more than approximately 700 amino acids, memory consumption exceeded 40 GB of GPU memory; for proteins larger than 1,000 residues, it could surpass 80 GB.

As shown below, for proteins in SKEMPI with fewer than 500 amino acids, the clustered graph reduced runtime by approximately 1.8× compared to the full graph. This makes our approach scalable to realistic protein sizes while maintaining or even improving predictive performance.

| Method | Training Speed (its/sec) ↑ |
| --- | --- |
| w/o Hypergraph | 3.19 |
| with Hypergraph | 5.76 |

Q3: The constant factor on the 4-body attention complexity

R: The 4-body (node-node-edge) attention increases complexity by only a constant factor compared to 3-body attention: both are $O(|\mathcal{N}|^3)$ overall, but 4-body adds a multiplicative constant $k$ (the edge-to-node ratio), giving $O(k \cdot |\mathcal{N}|^3)$. By controlling $k$ through sparsity in the graph, the increase remains a constant factor.
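The counting argument can be made explicit with a toy calculation (illustrative only; the names and numbers are ours, not the paper's):

```python
def attention_op_counts(num_nodes, edges_per_node):
    """Back-of-envelope operation counts for the two attention orders.

    3-body attention scores every (query, key1, key2) node triplet:
    N^3 terms. 4-body (node-node-edge) attention replaces one node slot
    with an incident edge; with k edges per node this gives k * N^3
    terms, i.e. the same cubic growth times a constant factor k.
    """
    three_body = num_nodes ** 3
    four_body = edges_per_node * num_nodes ** 3
    return three_body, four_body

t3, t4 = attention_op_counts(num_nodes=100, edges_per_node=3)
# The ratio t4 / t3 equals k and is independent of N.
```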

Q4: Necessity of Using Pretrained ProteinMPNN Weights

R: We thank the reviewer for this important question. Using pretrained ProteinMPNN weights is indeed beneficial in our setting. We observed that initializing with pretrained weights leads to consistently better performance and faster training compared to training the model from scratch.

To further validate this point, we conducted additional experiments comparing models trained with and without pretrained initialization. As shown below, without pretraining, the Pearson correlation for our method drops by 13%, while BA-DDG shows a larger decrease of 17%.

| Method | Pearson | RMSE | AUROC |
| --- | --- | --- | --- |
| BA-DDG | 0.7118 | 1.4516 | 0.7726 |
| BA-DDG (w/o pretrain) | 0.6046 | 1.6463 | 0.7387 |
| H3-DDG | 0.7501 | 1.3665 | 0.7920 |
| H3-DDG (w/o pretrain) | 0.6622 | 1.5487 | 0.7435 |

We will clarify these points and include the comparative results in the revised manuscript.

L1: Discussion

R: Thank you for the suggestion. We agree and will expand the discussion of limitations in the revised manuscript, including more detail on runtime complexity, scaling behavior, and memory requirements. The detailed results and analysis will be moved to the Discussion section.

Comment

Dear reviewer,

Please engage in the discussion with the authors. The discussion period will end in a few days.

Comment

I thank the authors for addressing my questions.

Comment

Thank you for taking the time to consider our rebuttal and for your continued feedback on our work. We truly appreciate your insights and constructive advice. We will incorporate these modifications and discussions in the revised manuscript.

Please let us know if there are any further questions.

Review
5

This paper introduces H3-DDG, a novel hierarchical network designed to predict binding free energy changes (ΔΔG) in PPI mutants. The proposed method captures higher-order interactions that are crucial for understanding mutational effects. The authors demonstrate SOTA performance on the SKEMPI v2 and BindingGYM benchmarks, and present a case study on SARS-CoV-2 antibody design.

Strengths and Weaknesses

Strengths

  • The use of hypergraph representations with many-body attention mechanisms (up to 4-body) is well-motivated. This directly addresses the limitation of existing methods that primarily model pairwise interactions, missing crucial higher-order effects like hydrogen-bond networks and π-π stacking.

  • The performance improvements are substantial: 12.10% better than BA-DDG on multi-point mutations in SKEMPI v2, and a 34.26% improvement on BindingGYM.

  • The hierarchical graph construction with mutation-centered hypergraphs cleverly reduces computational complexity from $O(N^3)$ or $O(N^4)$ to manageable levels while preserving the critical interaction information needed for model performance.

Weaknesses

  • The paper lacks theoretical analysis of why the specific choices (e.g., the 8 Å radius for mutation subgraphs, the particular hyperedge construction) are optimal. Some ablation on these hyperparameters would strengthen the work.
  • Table 8 shows that H3-DDG is still ~30% slower than BA-DDG in training. For very large protein complexes or proteome-wide screens, this could be limiting.
  • The evaluation is limited to SKEMPI v2 and BindingGYM. Testing on additional datasets (AbDesign, etc.) would better demonstrate the generalizability claims. The paper also does not compare against some recent methods such as ThermoMPNN or the latest co-folding-based approaches for stability prediction.

Questions

  • Could this architecture be adapted for related tasks like protein-ligand binding affinity prediction or enzyme design?
  • Given the success of AF3 and similar methods (Chai-1/Boltz), how does H3-DDG perform on predicted structures versus experimental structures?
  • How sensitive is the method to the clustering approach for hypergraph construction?

Limitations

  • The model is trained on SKEMPI v2, which has known biases toward certain protein families and mutation types. How does this affect performance on protein classes underrepresented in the training dataset (e.g., membrane proteins)?

  • The 4-body attention is restricted to local mutation subgraphs, potentially missing long range allosteric effects.

  • The model assumes standard conditions, but ΔΔG is highly dependent on temperature, pH, and ionic strength.

  • The paper doesn't discuss GPU memory requirements. The 4-body attention on mutation subgraphs likely requires substantial memory, limiting application to large complexes.

Justification for Final Rating

The rebuttal is quite clear and answers all of my questions to my satisfaction. However, I'm keeping my scores the same.

Formatting Concerns

NA

Author Response

We appreciate the reviewer's careful review of our paper, positive feedback, and recognition of our work, particularly the novelty of our approach and the meaningfulness of our experiments. Below, we address the concerns point by point:

W1: The Analysis of Hyper-parameters

R: We thank the reviewer for highlighting the importance of hyperparameter analysis. We have conducted comprehensive ablation studies to investigate the impact of different choices in hyperedge construction, attention order, and mutation subgraph cutoff radius.

We compared our hyperedge construction with DiffPool and MinCutPool. Our hyperedge construction achieves the best performance. This is primarily because DiffPool and MinCutPool, which rely on learned assignment matrices, cannot explicitly ensure that the mutation site serves as the central node in each cluster. H3-DDG always centers the mutation site in constructed hyperedges, enabling more precise modeling of local many-body interactions and leading to superior predictive accuracy.

| Hyperedge Construction | Pearson | RMSE |
| --- | --- | --- |
| DiffPool | 0.7261 | 1.4209 |
| MinCutPool | 0.7275 | 1.4178 |
| H3-DDG | 0.7317 | 1.4085 |

We further performed ablation on the order of attention (2-body, 3-body, and 4-body). These results highlight the value of explicit higher-order modeling, especially for complex scenarios involving multiple mutations.

| 2-Body Message Passing | 3-Body Attn. Between Hyperedges | 4-Body Attn. Around Mut. Sites (Subgraph) | Mutations | Pearson | Spearman | RMSE | MAE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | - | - | all | 0.7111 | 0.6351 | 1.4530 | 1.0304 |
| ✓ | ✓ | - | all | 0.7317 | 0.6485 | 1.4085 | 0.9923 |
| ✓ | ✓ | ✓ | all | 0.7501 | 0.6604 | 1.3665 | 0.9612 |

We also conducted a hyperparameter analysis of edge density in the 4-body attention mechanism, showing that increasing the edge density in the mutation subgraph (with 4-body attention) improves prediction accuracy with only moderate additional computational cost.

| Num. of Edges in Mut. Subgraph | Complexity | Pearson | RMSE | Training Speed (its/sec) ↑ |
| --- | --- | --- | --- | --- |
| N | $O(|\mathcal{N}|^3)$ | 0.7352 | 1.4007 | 4.50 |
| 2N | $O(2 \cdot |\mathcal{N}|^3)$ | 0.7461 | 1.3760 | 4.38 |
| 3N | $O(3 \cdot |\mathcal{N}|^3)$ | 0.7501 | 1.3665 | 4.34 |

Finally, we evaluated different cutoff radii for mutation subgraph extraction. As the cutoff radius increases, performance improves due to the inclusion of more relevant structural context; beyond 8 Å, however, the gain becomes marginal while computational cost continues to rise.

| Cutoff of Mut. Subgraph (Å) | Pearson | MAE | Training Speed (its/sec) ↑ |
| --- | --- | --- | --- |
| 5 | 0.7378 | 0.9834 | 4.52 |
| 8 | 0.7501 | 0.9612 | 4.34 |
| 12 | 0.7511 | 0.9602 | 4.03 |

W2: Large-scale Applications on larger complex or proteome-wide screens

R: Thank you for raising the important issue of scalability to large protein complexes and proteome-wide screens. The key innovations of H3-DDG lie in its scalable, hypergraph-based many-body attention, providing several advantages: (1) Hierarchical Attention: Different orders of attention (2-body, 3-body, 4-body) are applied at appropriate structural levels, focusing computation where it matters most. (2) Controllable Hypergraph Construction: Hypergraphs efficiently capture both long-range and many-body interactions, with user-controllable hyperedge number and scope to balance accuracy and cost. (3) Tunable Efficiency: Key parameters (e.g., number of 4-body edges, mutation subgraph radius) are adjustable, allowing efficient scaling to large systems.

In practice, we have found that even for very large proteins (e.g., the largest proteins in the SKEMPI v2 dataset with over 3,000 residues), H3-DDG can be efficiently trained and evaluated on a single 4090 GPU (24GB memory). This demonstrates the method's suitability for large-scale and high-throughput applications.

W3: Limited Evaluation Scope

R: Thank you for raising this important point regarding the evaluation scope.

In this work, we focused on SKEMPI v2, BindingGYM, and SARS-CoV-2 antibody datasets because they are widely recognized benchmarks for protein–protein binding affinity and mutation effect prediction. We acknowledge that a more comprehensive evaluation, including AbDesign and recent baselines, would further strengthen our claims. As shown below, we evaluated H3-DDG on the S669 benchmark dataset used in ThermoMPNN under the same experimental setup, and H3-DDG achieves state-of-the-art performance. We will include these results in the revised manuscript.

| Method | Pearson | RMSE |
| --- | --- | --- |
| ProteinMPNN | 0.26 | 3.32 |
| ThermoMPNN | 0.43 | 1.52 |
| H3-DDG | 0.56 | 1.48 |

Q1: Applicability to Related Tasks

R: Thank you for the question. H3-DDG is readily extensible to related tasks such as protein–ligand binding affinity prediction and enzyme design. Its hypergraph-based k-body attention can be adapted to different molecular contexts by redefining the region of interest. For example, in protein–ligand binding affinity prediction, the model can focus higher-order attention on the interface between the protein and ligand, capturing cooperative interactions between multiple residues and ligand atoms. In enzyme design, the attention mechanism can be centered on the enzyme active site and its interactions with substrates or cofactors, enabling a precise modeling of the functional environment. The core model architecture remains unchanged, demonstrating H3-DDG's flexibility and broad applicability in molecular modeling.

Q2: Performance on Predicted vs. Experimental Structures

R: Thank you for this insightful question. To evaluate robustness, we used AlphaFold 3 (AF3) to predict the complex structures for SKEMPI v2 and applied H3-DDG for $\Delta\Delta G$ prediction. While AF3 achieves high-quality structure prediction, its accuracy may still be lower than that of experimental structures. Importantly, H3-DDG showed only a slight decrease in performance on AF3-predicted structures, with a drop of 5.4% in Pearson correlation relative to experimental structures, significantly less than the drop observed for BA-DDG (10.7%). H3-DDG thus maintains state-of-the-art accuracy even on predicted structures, demonstrating strong robustness and practical applicability.

| Method | Structure | Pearson | RMSE | AUROC |
| --- | --- | --- | --- | --- |
| BA-DDG | Experimental | 0.7118 | 1.4516 | 0.7726 |
| BA-DDG | AF3 | 0.6356 | 1.5655 | 0.7226 |
| H3-DDG | Experimental | 0.7501 | 1.3665 | 0.7920 |
| H3-DDG | AF3 | 0.7117 | 1.4518 | 0.7755 |

Q3: Sensitivity to Clustering in Hypergraph Construction

R: The hyperedge construction method is described in our response to W1. Here we further investigated the impact of the number of hyperedges, where $L$ is the number of nodes; $L/10$ hyperedges means there are on average 10 nodes per hyperedge. Increasing the number of hyperedges leads to notable improvements in prediction accuracy, with only moderate additional computational cost.

| Num. of Hyperedges | Pearson | Training Speed (its/sec) ↑ |
| --- | --- | --- |
| L/10 | 0.7418 | 5.11 |
| L/6 | 0.7482 | 4.68 |
| L/4 | 0.7501 | 4.34 |

L1: Dataset Bias and Generalization

R: Thank you for highlighting this important point.

We acknowledge that SKEMPI v2 is biased toward certain protein families and mutation types, which may limit generalization to underrepresented classes such as membrane proteins. To address this, we have supplemented our evaluation with the BindingGYM benchmark, which offers greater structural and functional diversity, including more challenging and less-represented protein classes. H3-DDG achieves strong performance on BindingGYM, indicating good generalizability beyond SKEMPI v2 biases. Additionally, we performed transfer learning and independent validation on the SARS-CoV-2 benchmark, as well as additional experiments on protein thermostability, further demonstrating the robustness and adaptability of H3-DDG to novel protein families and mutation scenarios.

L2: Long-Range Interaction Modeling

R: Thank you for raising this concern. A key innovation of H3-DDG is its hierarchical architecture, which enables efficient modeling of both local and long-range interactions. While 4-body attention focuses on local mutation subgraphs, H3-DDG integrates 2-body message passing across the entire graph and 3-body attention across the hypergraph. This allows information to flow between distant sites, effectively capturing long-range allosteric effects. Compared to traditional GNNs that rely on deep stacking for long-range propagation, our hierarchical approach is more direct and effective.

L3: Missing experimental conditions for $\Delta\Delta G$

R: We acknowledge this limitation. The current model assumes standard experimental conditions and does not explicitly account for factors such as temperature, pH, or ionic strength, all of which can significantly influence $\Delta\Delta G$ values. We will include this limitation in the discussion section and consider condition-specific modeling as an important direction for future work.

L4: GPU Memory and Scalability

R: H3-DDG leverages a hypergraph representation to efficiently model higher-order interactions while keeping GPU memory usage within a practical range. Although 4-body attention is applied only to local mutation subgraphs, the overall design ensures that increases in modeling capability do not come with prohibitive memory costs.

The table below compares GPU memory usage for different attention strategies at the same batch size. As shown, H3-DDG delivers clear performance gains with only a modest increase in memory consumption. This demonstrates that the hypergraph-based approach achieves an effective balance between modeling power and computational efficiency, making it feasible for large protein complexes.

| Method | Attn. Between Hyperedges | Attn. Around Mut. Sites (Subgraph) | Pearson | Memory Used |
| --- | --- | --- | --- | --- |
| BA-DDG | - | - | 0.7118 | 13872 MB |
| H3-DDG | 3-body | - | 0.7317 | 14864 MB |
| H3-DDG | 3-body | 3-body | 0.7352 | 15390 MB |
| H3-DDG | 3-body | 4-body | 0.7501 | 15692 MB |
Comment

Dear reviewer,

Please engage in the discussion with the authors. The discussion period will end in a few days.

Comment

Dear Reviewer,

please respond to the authors and acknowledge. This is a Neurips requirement, regardless of your positive evaluation of the paper.

Thanks,

the AC

Comment

We thank all reviewers for their insightful comments and suggestions. Many reviewers recognized our work as well-motivated and innovative, with strong improvements. For example, #z7aD commented: “This paper introduces H3-DDG, a novel hierarchical network designed to predict binding free energy changes ($\Delta\Delta G$) in PPI mutants. The method is well-motivated and directly addresses the limitation of existing methods.” #zCJg also noted: “The authors clearly analyze the limitations of existing methods in handling higher-order and many-body interactions. The model design is simple yet effective, with well-motivated many-body attention mechanisms.” Several reviewers (#z7aD, #tCDj, #GsHY) also acknowledged the strong performance improvements and the clarity of our experimental comparisons.

We are grateful for the constructive suggestions that have helped us further improve the manuscript. For example, following the recommendations of #GsHY, #zCJg, and #tCDj, we conducted additional ablation studies and hyper-parameter tuning on hyperedge construction, k-body attention design, and hypergraph hyper-parameters. These enhancements further demonstrate how our method advances protein representation learning by explicitly modeling many-body interactions at multiple scales. In response to the concerns about clarity of Methods raised by #tCDj, we have addressed these issues in detail in our responses and will clarify them further in the revised manuscript. We would like to emphasize that these details correspond to module settings consistent with those used in the pre-trained backbone, ProteinMPNN.

Finally, we would like to once again thank all reviewers and the area chair for their time, effort, and constructive feedback, which have significantly strengthened our work.

Final Decision

The paper introduces a hypergraph-driven hierarchical architecture to predict binding free energy changes in PPI mutants. The core contribution is a many-body attention mechanism that operates on multiple scales, leveraging graph and hypergraph representations.

The paper has been praised by the reviewers for addressing a known limitation of existing methods, for its technical novelty, and especially for its empirical performance. The rebuttal resolved almost all concerns, with only a single reviewer leaning toward rejection, although it appears that all of their concerns were addressed.

I recommend acceptance, subject to implementing all the changes and revisions suggested by the reviewers.