DualEquiNet: A Dual-Space Hierarchical Equivariant Network for Large Biomolecules
Abstract
Reviews and Discussion
In this paper, the authors propose the DualEquiNet model to combine invariant and equivariant representations. The key idea is to develop a dual-space hierarchical equivariant network that constructs complementary representations in both Euclidean and spherical harmonics spaces to capture both invariant and equivariant properties. More importantly, the model employs a bidirectional cross-space message passing and cross-space interaction pooling mechanism to hierarchically aggregate atomic features. The model has been validated on several existing benchmarks and shows promising results.
Strengths and Weaknesses
Strength: The developed dual-space learning architecture is novel and promising.
Weaknesses: The message passing for spherical harmonic features is not explained clearly. The definition of the neighborhood is very confusing! The importance of the two different components is unclear!
Questions
-
The definition of the SH neighborhood is confusing! From the definition, it is not the distance between two atoms that matters; instead, if they have similar local structures (with similar orientation), they will be treated as neighbors. This is very weird. For instance, if we have two benzene rings in the structure (with similar orientations), they will be regarded as neighbors. However, if two benzene rings are oriented differently, they are no longer neighbors. Why do the authors choose to do it this way? Why not just use the standard SH-based neighborhood definition? Is there any ablation study to show that the specially designed neighborhood yields better performance?
-
The ablation study part is not clear. Normally, SH features contain more information than standard invariant-based graph models. Stated differently, SH-based models, such as E3NN, will have better performance than general GNNs, as they are more complicated and consider equivariant information. For the ablation study, the authors should consider using only an SH-based model (w/o CSIP/Cross/Dual) instead of removing the SH-based components.
-
The benchmarks are not very widely used. More interesting datasets from MolecularNet should be considered. In particular, the prediction of SASA and torsion angles does not seem appropriate for this case. Note that the model input contains all the geometric information, and SASA and torsion angles are just special properties that can be calculated analytically.
Limitations
N.A.
Justification for Final Rating
All my concerns are well addressed. I keep my score 4!
Formatting Issues
NA
RQ1 & RQ2: About the difference between our SH neighborhood and spherical tensor-based baselines.
Thank you for raising this important question. We believe there may be some misunderstandings, and we clarify the differences below.
(1) There are two distinct classes of methods that utilize spherical harmonics:
Type 1: Methods based on spherical tensor products using Clebsch–Gordan coefficients and Wigner-D matrices. These methods, such as TFN, SE(3)-Transformer, and SEGNN, perform full tensor contractions, which lead to a high computational cost of up to $\mathcal{O}(L^6)$ [4], where $L$ is the maximum tensor degree. This is because they compute tensor products across all combinations of degrees up to $L$. A detailed explanation can be found in Section 5.3 of [2]. While e3nn is a Python library and not a specific method, the above baselines rely on it for implementation.
Type 2: Even though spherical tensor products are expensive to compute, high-order spherical tensors are proven to contain more information [2]. To address the complexity issue, some recent baselines still use spherical tensors but replace the tensor product with a vector inner product (the inner product itself is $\mathcal{O}(L)$). In this way, we incorporate the spherical tensor information while avoiding the $\mathcal{O}(L^6)$ complexity.
This class of methods includes HEGNN [5], GotenNet [6], and our DualEquiNet. Similar discussions can be found in Sections 1 and 2 of HEGNN [5] and Sections 1 and 2 of GotenNet [6].
Because these two categories differ fundamentally in their computation and architecture, it is not feasible to use a Type 1 method as an ablation for our Type 2-based approach.
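As a rough, back-of-envelope illustration of why the two types scale so differently (this is a counting sketch, not our implementation), one can count the valid Clebsch–Gordan coupling paths $(l_1, l_2) \to l_3$ up to degree $L$ in a Type 1 method against the single inner product needed per degree in a Type 2 method:

```python
def cg_paths(L):
    """Count valid Clebsch-Gordan coupling paths (l1, l2) -> l3
    with l1, l2, l3 <= L and the triangle inequality satisfied."""
    return sum(
        1
        for l1 in range(L + 1)
        for l2 in range(L + 1)
        for l3 in range(abs(l1 - l2), min(L, l1 + l2) + 1)
    )


def inner_products(L):
    """Type 2 methods need only one inner product per degree l = 0..L."""
    return L + 1


for L in (1, 2, 4, 8):
    print(L, cg_paths(L), inner_products(L))
```

The path count grows roughly cubically in $L$, and each path contracts components whose sizes also grow with $L$, which is how the overall $\mathcal{O}(L^6)$ cost arises; the inner-product count stays linear in the number of degrees.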
(2) On our SH-based neighborhood definition:
The SH-based neighborhood is a core contribution of our method and is not present in any existing baseline. It enables long-range, symmetry-aware interactions by linking nodes with similar local geometric environments, even if they are spatially distant—complementing standard Euclidean neighborhoods.
Regarding the reviewer’s concern: SH similarity is invariant to global rotations due to our equivariant design (see Appendix C.1 and B.2). Therefore, two benzene rings with similar local structures will still be linked regardless of their global orientation. Additionally, in both Type 1 and Type 2 methods, spherical tensors are computed based on directional vectors (i.e., relative positions), and are independent of absolute distances.
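A minimal sketch of this rotation invariance, using the fact that the degree-1 real spherical harmonics of a unit vector are, up to constant factors, just its Cartesian components. The feature construction below is a simplified stand-in for the actual SH embedding (function names and the aggregation are illustrative, not the paper's code):

```python
import numpy as np

def sh_degree1_feature(center, neighbors):
    # Degree-1 real SH of a direction u is proportional to (x, y, z) of u,
    # so we aggregate unit relative directions as a stand-in SH feature.
    dirs = neighbors - center
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    return dirs.sum(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
pos = rng.normal(size=(10, 3))
f_i = sh_degree1_feature(pos[0], pos[1:4])
f_j = sh_degree1_feature(pos[5], pos[6:9])
sim = cosine(f_i, f_j)

# Apply a random proper rotation R to the whole structure.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))  # force det(R) = +1
pos_rot = pos @ R.T
f_i_r = sh_degree1_feature(pos_rot[0], pos_rot[1:4])
f_j_r = sh_degree1_feature(pos_rot[5], pos_rot[6:9])

# The SH features rotate with the structure, so their cosine
# similarity (and hence the SH neighborhood) is unchanged.
assert np.isclose(sim, cosine(f_i_r, f_j_r))
```

Because the features rotate equivariantly, every pairwise similarity, and therefore the SH neighborhood graph itself, is independent of the global orientation.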
We will revise the paper to make these distinctions and design choices more explicit.
Q3: The benchmarks are not very widely used. More interesting datasets from MolecularNet should be considered. In particular, the prediction of SASA and torsion angles does not seem appropriate for this case. Note that the model input contains all the geometric information, and SASA and torsion angles are just special properties that can be calculated analytically.
RQ3: Since no reference was attached, we guess you mean "MoleculeNet" [1] instead of "MolecularNet". The datasets in MoleculeNet, such as QM9, concern small molecules and have been widely used in the baselines' papers. However, our DualEquiNet aims to handle long-range dependencies in large biomolecules, and tasks in MoleculeNet are not well-suited for evaluating this capability. Therefore, instead of using those small-molecule datasets, we developed new datasets of large biomolecules such as proteins and RNA. We view this as a contribution rather than a weakness of our paper.
Indeed, SASA and torsion angles can be calculated analytically if ground-truth 3D structures are available. However, in real-world scenarios, we do not have access to ground-truth 3D structures; for RNA in particular, the available 3D structures are highly noisy. Our point is therefore: how accurately can a model infer SASA and torsion angles given only noisy 3D structures?
To this end, we ensure that ground-truth structures are used only to generate labels, while all models, including ours and the baselines, are provided only with noisy 3D structures derived from sequence-based prediction methods. This better simulates real-world conditions and tests a model’s capabilities.
We appreciate the reviewer's comment and will include the proposed changes in the camera-ready version. If we have addressed your questions, we kindly request that you consider updating your score. Thank you very much for your time and thoughtful review!
Reference:
[1] MoleculeNet: A Benchmark for Molecular Machine Learning.
[2] A Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems.
[3] On the Expressive Power of Geometric Graph Neural Networks.
[4] From Peptides to Nanostructures: A Euclidean Transformer for Fast and Stable Machine Learned Force Fields.
[5] Are High-Degree Representations Really Unnecessary in Equivariant Graph Neural Networks?
[6] GotenNet: Rethinking Efficient 3D Equivariant Graph Neural Networks.
All my concerns are well addressed. I will keep my score!
Thank you for your positive feedback and for recognizing our responses! We truly appreciate your time and effort throughout the review process and we will incorporate the discussion in the camera-ready version of our paper.
The authors present DualEquiNet, a novel geometric graph neural network operating in joint Euclidean-spherical harmonics spaces for biomolecular representation learning. The authors' methodology is interesting, and their experimental results are promising. Nonetheless, a few concerns remain regarding the evaluation of their proposed methodology.
Strengths and Weaknesses
Points of strength:
- The authors' methodology is well-motivated, sound, and interesting. I particularly appreciated the authors' depth and clarity in discussing the importance of local-to-global (and vice versa) representation learning on biomolecular graphs. Using spherical harmonics to bridge this gap is a compelling idea.
- The authors evaluate their new method on a variety of protein and RNA datasets, and in all cases, their results are positive.
- The authors' writing style is clear and consistent, making it easy for readers to grasp the main takeaways from each section of the manuscript.
Points for improvement:
- My main concern is that the authors' empirical results with DualEquiNet are preliminary. Tables 1-2 are compelling experiments demonstrating the method's expressiveness. However, Tables 3-4 struck me as a bit unrealistic from the perspective of how this method might be applied in a drug discovery setting. To thoroughly demonstrate how DualEquiNet compares to leading e.g., protein representation learning methods, I would highly recommend the authors report the results of their method for standardized protein and RNA representation learning benchmarks such as ProteinWorkshop [1] and rnaglib [2] in their supplementary materials.
- In theory, it's great to see the authors' method working well in Tables 2-4, but after considering the authors' ablation studies in Table 5, I'm left with the feeling that DualEquiNet is outperforming the baselines for reasons beyond its key architectural components, as ablating several of them doesn't change the method's average RNA performance much. I'd like the authors to discuss why the components aren't as necessary as readers may initially be led to believe and what other reasons may be leading DualEquiNet to achieve top performance for RNA property prediction (in the ablations).
- The authors highlight the importance of E(3) equivariance for 3D biomolecules, but they don't discuss the relationship between E(3) and SE(3) equivariance for modeling molecular chirality. As such, it's currently unclear if the authors' proposed method is truly E(3)-equivariant and therefore can't detect chirality or whether it is actually SE(3)-equivariant and can detect chirality. A discussion of this point would be helpful.
References:
[1] Jamasb, A. R., Morehead, A., Joshi, C. K., Zhang, Z., Didi, K., Mathis, S. V., ... & Blundell, T. L. Evaluating Representation Learning on the Protein Structure Universe. In The Twelfth International Conference on Learning Representations.
[2] Wyss, L., Mallet, V., Karroucha, W., Borgwardt, K., & Oliver, C. (2025). A Comprehensive Benchmark for RNA 3D Structure-Function Modeling. arXiv preprint arXiv:2503.21681.
Questions
- What do the authors think about simpler approaches to modeling 3D biomolecules that have emerged over the last 12 months, such as vanilla Transformer encoders trained with rotation data augmentation (e.g., AlphaFold 3)? Do they think such simpler methods may hold some promise in future work as generic biomolecular representation learners, or do the authors think there are fundamental limitations to such (generic) approaches that methods such as DualEquiNet address?
Limitations
The authors haven't thoroughly discussed the computational complexity (time and memory) of their proposed method, so it's currently unclear how scalable it truly is.
Justification for Final Rating
The authors have addressed my concerns regarding RNA benchmarking, ablations, and methodological novelty.
Formatting Issues
I didn't find any formatting concerns.
We thank the reviewer for the valuable feedback, and we address the concern below.
W1: Significance of the evaluation datasets in Tables 3 and 4, and new results on RNAGlib.
RW1:
Significance of Tables 3 and 4: Tables 3 and 4 focus on torsion angle prediction and solvent-accessible surface area (SASA) estimation. These tasks were chosen specifically because they capture fine-grained 3D geometric features that are highly relevant in drug discovery. Torsion angles play a central role in determining local conformations of biomolecules, directly influencing active site geometry and ligand binding modes. Accurate torsion angle prediction is critical for modeling flexible regions such as RNA loops and protein side chains, which are often key determinants of druggability. Experimental techniques such as NMR, X-ray crystallography, and cryo-EM can provide this information but are costly and time-consuming, motivating accurate learning-based approaches.
SASA is a well-established property used in structure-based drug design to identify hydrophobic and hydrophilic regions, estimate ligand accessibility, and correlate with binding affinities. Models capable of reliably predicting SASA provide valuable insight for guiding ligand optimization and binding site characterization.
These two tasks are widely used in molecular modeling literature [1][2][3] to benchmark a model’s ability to learn detailed structural properties in drug discovery applications.
We thank the reviewer for this valuable suggestion. In response, we have conducted additional experiments on the RNAGo and RNAProt datasets from RNAGlib and will include the results in the camera-ready version. DualEquiNet achieves state-of-the-art performance on these as well, further validating its effectiveness (see table below). We also note that the RNA property prediction benchmarks reported in the main paper (e.g., Ribonanza, COVID, and Tc-Ribo) are themselves widely used and established in prior work [4][5][6], ensuring that our evaluation is based on community-accepted standards and datasets. Due to time constraints, experiments on the ProteinWorkshop datasets, which are much larger, are not feasible at this time; we plan to include those results in the camera-ready version. We will update all references accordingly.
| Model | RNAGo (Jaccard Index) | RNAProt (Accuracy) |
|---|---|---|
| SchNet | 0.613±0.020 | 0.709±0.005 |
| EGNN | 0.646±0.060 | 0.720±0.009 |
| GVPGNN | 0.632±0.048 | 0.710±0.005 |
| TFN | 0.651±0.185 | 0.697±0.005 |
| FastEGNN | 0.527±0.159 | 0.718±0.010 |
| HEGNN | 0.661±0.038 | 0.735±0.016 |
| GotenNet | 0.555±0.170 | 0.709±0.006 |
| DualEquiNet | 0.681±0.047 | 0.749±0.015 |
W2: Significance of architectural components through ablation.
RW2:
(1) Ablating components in DualEquiNet indeed changes the performance.
The absolute RMSE differences in Table 5 translate to consistent percentage improvements and are data-dependent. For example, removing CSIP increases error by ~8% on Tc-Ribo and ~7% on Ribonanza, while disabling cross-space message passing further increases error by ~9% on Tc-Ribo. The combined removal of all dual-space components yields the largest degradation, with ~11% higher error on Tc-Ribo and ~7% on Ribonanza. Overall, these results indicate that DualEquiNet’s architectural contributions—particularly contextual pooling and cross-space learning—are synergistic and yield up to ~10% improvements on different tasks.
(2) Explanation of results in Table 5.
DualEquiNet w/o CSIP/Cross/Dual in ablation Table 5 is not equivalent to standard Euclidean GNN baselines. Even in this ablation, we preserve the biological organization of large molecules (e.g., atom-to-nucleotide for RNA, atom-to-amino-acid for proteins) using hierarchical mean pooling. We also mentioned this in Line 346 of our submission. Specifically, "w/o CSIP" replaces cross-space interaction pooling with standard mean pooling, but the pooling is still applied over biologically defined hierarchical units rather than only individual atoms. This explicit incorporation of biologically meaningful structure is one of our key contributions; it ensures that even our Euclidean-only variant captures context beyond the atom-level interactions used by standard atom-level Euclidean GNNs, which explains why it already outperforms those baselines. We will make this more explicit in the camera-ready version.
Originally, we did not apply hierarchical pooling to the baselines to respect their original designs, and some are not directly compatible with such pooling. However, we have now run selected Euclidean baselines augmented with hierarchical mean pooling. Their performance improves over their original versions (reported in Table 1) and matches our Euclidean-only variant but still lags significantly behind the full DualEquiNet with cross-space interaction pooling (CSIP) and dual-space message passing (e.g., 4.6%, 8.7%, and 9.0% gains over best EGNN Pooling on COVID, Ribonanza, and Tc-Ribo datasets respectively).
| Model | COVID Reactivity | COVID pH10 | COVID Mg pH10 | COVID MCRMSE | Ribonanza DMS | Ribonanza 2A3 | Ribonanza MCRMSE | Tc-Ribo Factor |
|---|---|---|---|---|---|---|---|---|
| EGNN + Pooling | 0.280±0.006 | 0.469±0.006 | 0.362±0.010 | 0.370±0.009 | 0.688±0.028 | 0.534±0.014 | 0.611±0.010 | 0.699±0.016 |
| SchNet + Pooling | 0.302±0.006 | 0.495±0.010 | 0.384±0.010 | 0.394±0.010 | 0.721±0.019 | 0.642±0.004 | 0.681±0.011 | 0.702±0.007 |
| FastEGNN + Pooling | 0.379±0.002 | 0.566±0.009 | 0.454±0.008 | 0.467±0.007 | 0.773±0.024 | 0.819±0.006 | 0.796±0.013 | 0.721±0.013 |
| DualEquiNet | 0.272±0.004 | 0.448±0.010 | 0.340±0.005 | 0.353±0.006 | 0.611±0.025 | 0.505±0.024 | 0.558±0.012 | 0.636±0.049 |
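The hierarchical mean pooling used by the Euclidean-only variant and the augmented baselines above can be sketched as a scatter-mean over residue assignments (a simplified stand-in with hypothetical names, not the released code):

```python
import numpy as np

def hierarchical_mean_pool(atom_feats, residue_ids):
    """Mean-pool atom features into per-residue features.

    atom_feats:  (n_atoms, d) array of atom embeddings
    residue_ids: (n_atoms,) integer id of the nucleotide / amino acid
                 each atom belongs to
    """
    n_res = residue_ids.max() + 1
    sums = np.zeros((n_res, atom_feats.shape[1]))
    counts = np.zeros(n_res)
    np.add.at(sums, residue_ids, atom_feats)   # scatter-add features
    np.add.at(counts, residue_ids, 1)          # count atoms per residue
    return sums / counts[:, None]

# Toy example: 4 atoms grouped into 2 residues.
atoms = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
res = np.array([0, 0, 1, 1])
pooled = hierarchical_mean_pool(atoms, res)
assert np.allclose(pooled, [[2.0, 0.0], [0.0, 3.0]])
```

The point of the ablation discussion above is that even this simple pooling already injects biological hierarchy that flat atom-level GNNs lack.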
W3: The authors highlight the importance of E(3) equivariance for 3D biomolecules, but they don't discuss the relationship between E(3) and SE(3) equivariance for modeling molecular chirality. As such, it's currently unclear if the authors' proposed method is truly E(3)-equivariant and therefore can't detect chirality or whether it is actually SE(3)-equivariant and can detect chirality. A discussion of this point would be helpful.
RW3: SE(3)-equivariant models are equivariant to translations and rotations but not to reflections. This allows them to distinguish chirality because mirror images are treated as distinct configurations. Most fully SE(3)-equivariant models, such as Tensor Field Networks (TFN), SE(3)-Transformer, and SEGNN, achieve this by using Clebsch–Gordan tensor products over spherical harmonics, which provide full orientation-sensitive tensor features but introduce significant computational overhead (up to $\mathcal{O}(L^6)$ for maximum tensor degree $L$). To avoid this cost, our DualEquiNet, like the recent HEGNN and GotenNet models, uses spherical harmonic inner products instead of full tensor products. This design enables rich geometric reasoning at scale while remaining computationally efficient, but it also makes the model reflection-equivariant (E(3) rather than SE(3)). Extending DualEquiNet to an explicitly SE(3)-equivariant variant that handles chirality more directly is an exciting direction for future work.
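The reflection argument can be made concrete with a small sketch: under a mirror reflection, a degree-$l$ SH feature picks up parity $(-1)^l$, which cancels in any inner product of two same-degree features. The degree-1 stand-in features below are illustrative, not the model's actual embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(size=(6, 3))

def deg1_feats(p):
    # Stand-in degree-1 SH features: unit directions from the centroid.
    d = p - p.mean(axis=0)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

F = deg1_feats(pos)
mirror = pos * np.array([1.0, 1.0, -1.0])  # reflect through the xy-plane
Fm = deg1_feats(mirror)

# Each degree-1 feature flips its z component, but every inner product
# f_i . f_j is unchanged, so the two enantiomers look identical to any
# model built purely from such inner products.
assert np.allclose(F @ F.T, Fm @ Fm.T)
```

This is why inner-product-based architectures are E(3)- rather than SE(3)-equivariant: the scalar quantities they consume are blind to reflections.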
Q1: Thoughts on potential of simpler transformer-Based 3D Biomolecular Models.
RQ1: We thank the reviewer for raising this timely question. We agree that recent developments, such as AlphaFold3, have shown that large-scale Transformer-based architectures with rotation data augmentation can implicitly learn geometric symmetries and achieve excellent results. However, as highlighted in [7], explicit equivariance can provide greater data efficiency and generalization in many applications such as molecular property prediction and dynamics simulation. Similarly, [8] shows that explicit equivariance offers substantially improved data efficiency and better compute efficiency following power-law scaling.
DualEquiNet is designed precisely for such settings: it combines explicit geometric inductive biases with hierarchical biomolecular organization to achieve strong generalization even when labeled data is limited, while also supporting dual-space reasoning required for cross-scale biomolecular interactions. We view Transformer-based approaches trained with augmentation and DualEquiNet as complementary: Transformers excel for large-scale foundation models, while architectures like DualEquiNet remain crucial for data-limited and physics-sensitive problems.
Please let us know if there are any remaining concerns. If we have addressed your questions, we kindly request you to consider updating your score. Thank you very much for your time and thoughtful review!
Reference:
[1] Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders.
[2] DeepRNA-Twist: Language Model Guided RNA Torsion Angle Prediction with Attention-Inception Network.
[3] RNA-TorsionBERT: Leveraging Language Models for RNA 3D Torsion Angles Prediction.
[4] Deep Learning Models for Predicting RNA Degradation via Dual Crowdsourcing.
[5] Ribonanza: Deep Learning of RNA Structure through Dual Crowdsourcing.
[6] Tuning the Performance of Synthetic Riboswitches Using Machine Learning.
[7] Equivariance is dead, long live equivariance? (a blog post on Molecular Modelling).
[8] Does equivariance matter at scale?
My concerns have been partially addressed (though I would like to see results for ProteinWorkshop in the camera-ready version). I have raised my score accordingly.
Thank you for your positive feedback and for recognizing our responses! We truly appreciate your time and effort throughout the review process. We are running experiments on ProteinWorkshop and will present the results in the camera-ready version.
Dear Reviewer, as the review update deadline is approaching, please let us know if you have any further questions. We’d be happy to clarify. Otherwise, if your concerns have been addressed, we’d greatly appreciate it if you could consider raising the rating.
This submission presents DualEquiNet, a graph neural network (GNN) for the simulation of macromolecules such as RNA and proteins. This submission proposes a dual-space representation method that concurrently functions in both Euclidean space (for local atomic interactions) and Spherical Harmonics space (for global structural relationships). The authors suggest bidirectional cross-space message transfer between these complementary representations and a Cross-Space Interaction Pooling (CSIP) method that facilitates hierarchical aggregation from atoms to residues. These improvements enable DualEquiNet to proficiently capture both intricate atomic details and extensive connections essential for biomolecular modeling. The authors assess their methodology across several RNA and protein property prediction tasks, using two newly established 3D structural benchmarks (SASA and torsion angle prediction), demonstrating consistent improvements over current geometric GNN baselines.
Strengths and Weaknesses
Strengths:
- The submission shows a cool way to make complementary representations in both Euclidean and Spherical Harmonics spaces. This fixes a problem with the current GNNs that use geometry. With two spaces, this design lets the model see both atomic interactions in a small area and structural patterns in a large area at the same time.
- The Cross-Space Interaction Pooling mechanism that was suggested is a great way to combine features at the atomic level into representations, without losing the natural hierarchical structure of biomolecules.
- Extensive experiments are conducted to verify the effectiveness of the proposed DualEquiNet. The authors also perform ablation studies to provide further insight into the method designed.
Weaknesses:
-
The submission states that the model works well, but it doesn't go into much detail about how hard it is to compute in the main text, instead putting it in the appendix. There needs to be more upfront discussion about the resources it needs compared to baselines, especially for the "large biomolecules" it targets.
-
The main point of this submission is that it can better model long-range dependencies. But most of the experimental comparisons are only against other equivariant GNNs. To support the claims about long-range modeling, it would be more convincing to compare against models that are designed for long-range tasks and also use a dual-space representation [1,2,3]. This kind of comparison would help isolate the performance gains and provide stronger evidence of the effectiveness of the proposed method.
-
The proposed architecture adds several new modules (dual-neighborhood construction, bidirectional cross-space attention, CSIP projections) and new hyperparameters without clear instructions on how to set them. Without design heuristics or simpler versions, it is hard for practitioners to adapt, tune, or extend the method to new datasets. This could make it harder to reproduce and use it more widely.
References:
[1] UniIF: Unified Molecule Inverse Folding. In NeurIPS, 2024.
[2] Neural atoms: Propagating long-range interaction in molecular graphs through efficient communication channel. In ICLR, 2024.
[3] Ewald-based Long-Range Message Passing for Molecular Graphs. In ICML, 2023.
Questions
-
How would DualEquiNet work with more complex biomolecular systems, like protein-RNA interactions?
-
The SH-space neighborhood is implemented by a threshold and the cosine similarity between points. Could you provide some insights on how to implement an adaptive threshold mechanism to prevent tuning hyperparameters?
-
Could you provide empirical results or discussion with models specifically designed for long-range tasks, such as [1,2,3], as they also share a similar design idea?
References:
[1] UniIF: Unified Molecule Inverse Folding. In NeurIPS, 2024.
[2] Neural atoms: Propagating long-range interaction in molecular graphs through efficient communication channel. In ICLR, 2024.
[3] Ewald-based Long-Range Message Passing for Molecular Graphs. In ICML, 2023.
Limitations
yes
Justification for Final Rating
Based on the rebuttal comments and the paper quality, I propose to accept this submission.
Formatting Issues
No
W1: The submission states that the model works well, but it doesn't go into much detail about how hard it is to compute in the main text, instead putting it in the appendix. There needs to be more upfront discussion about the resources it needs compared to baselines, especially for the "large biomolecules" it targets.
RW1:
Following the reviewer's recommendation, we will move this discussion from the Appendix to the main text in the camera-ready version. To emphasize, our DualEquiNet has substantially lower compute than tensor-product-based models such as TFN or SE(3)-Transformer (complexity $\mathcal{O}(L^6)$), because DualEquiNet relies only on inner products over spherical harmonic features ($\mathcal{O}(L)$). In practice, on the COVID dataset with identical hyperparameter settings, DualEquiNet achieves 3–5× faster training on large biomolecules compared to these tensor-product-based models, while maintaining strong predictive performance (see also the table below for time per epoch and GPU memory requirements).
| Model | Time (s/epoch) | Memory (MB) |
|---|---|---|
| SchNet | 4.361 | 628 |
| EGNN | 6.240 | 608 |
| GVPGNN | 11.172 | 10160 |
| TFN | 56.235 | 43970 |
| FastEGNN | 11.755 | 1954 |
| HEGNN | 6.320 | 1232 |
| GotenNet | 10.332 | 5894 |
| DualEquiNet | 10.990 | 2436 |
Additionally, for all baselines and our method, we constrained the number of parameters to at most 4 million and ensured that all models fit within the memory of a single NVIDIA A6000 GPU. Hyperparameters for each model were optimized using Optuna under the same computational budget, ensuring fair comparisons.
W2: Comparison with other long-range baselines.
RW2: We appreciate the reviewer’s suggestion to include additional baseline comparisons. Our original submission already compared against a recent long-range model (FastEGNN). FastEGNN passes global information by introducing a small set of virtual nodes that connect to all real nodes and each other, enabling global message passing on graphs.
Following your recommendation, we further considered three additional baselines (UniIF, Neural Atoms, and Ewald). Unfortunately, to the best of our knowledge, the implementation of UniIF is not publicly available; therefore, we conducted experiments with Neural Atoms and Ewald. Specifically, we employed SchNet and DimeNet enhanced with Ewald-based message passing and Neural Atom interactions, respectively, as reported in their original works. The results, summarized below, show that our proposed DualEquiNet consistently outperforms these strong baselines across all evaluated datasets (RMSE reported below).
| Model | COVID Reactivity | COVID pH10 | COVID Mg pH10 | COVID Avg. | Ribonanza DMS | Ribonanza 2A3 | Ribonanza Avg. | Tc-Ribo Factor |
|---|---|---|---|---|---|---|---|---|
| SchNet + Ewald | 0.392±0.004 | 0.583±0.010 | 0.459±0.008 | 0.478±0.009 | 0.845±0.020 | 0.884±0.002 | 0.865±0.014 | 0.736±0.004 |
| SchNet + Neural Atom | 0.409±0.004 | 0.596±0.009 | 0.481±0.006 | 0.495±0.008 | 0.854±0.015 | 0.919±0.009 | 0.886±0.012 | 0.726±0.002 |
| DimeNet + Ewald | 0.403±0.005 | 0.594±0.006 | 0.480±0.004 | 0.492±0.005 | 0.828±0.011 | 0.896±0.011 | 0.862±0.003 | 0.676±0.044 |
| DimeNet + Neural Atom | 0.411±0.009 | 0.602±0.016 | 0.489±0.013 | 0.501±0.015 | 0.825±0.010 | 0.892±0.013 | 0.859±0.005 | 0.679±0.006 |
| DualEquiNet | 0.272±0.004 | 0.448±0.010 | 0.340±0.005 | 0.353±0.006 | 0.611±0.025 | 0.505±0.024 | 0.558±0.012 | 0.636±0.049 |
W3: The proposed architecture adds several new modules (dual-neighborhood construction, bidirectional cross-space attention, CSIP projections) and new hyperparameters without clear instructions on how to set them. Without design heuristics or simpler versions, it is hard for practitioners to adapt, tune, or extend the method to new datasets. This could make it harder to reproduce and use it more widely.
RW3: Thank you for pointing this out. We argue that we did not introduce many new hyperparameters. The only new hyperparameter we added is the threshold $d_{SH}$ used to construct SH-based neighborhoods. In other modules, such as CSIP, we make the coefficients learnable by design, specifically to reduce manual tuning and avoid introducing additional hyperparameters, as the reviewer noted.
For $d_{SH}$, we acknowledge it is a unique component of our method and have discussed possible adaptive thresholding mechanisms in our response to RQ2 below in this thread. Additionally, if one prefers not to search over $d_{SH}$, a practical heuristic can be applied:
We can compute the distribution of SH similarities (i.e., cosine similarities between SH vectors) across all node pairs in the dataset, and then select a threshold such that the number of SH-based edges is approximately equal to the number of Euclidean edges. This balances the neighborhood density between the Euclidean and SH spaces and provides a data-driven way to set $d_{SH}$ without tuning.
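This heuristic can be sketched in a few lines (illustrative code under the stated assumption that SH features are given as vectors; names are hypothetical): pick the threshold as the similarity quantile that keeps roughly as many SH edges as there are Euclidean edges.

```python
import numpy as np

def pick_sh_threshold(sh_feats, n_euclidean_edges):
    """Choose a threshold so the SH graph has about as many (directed)
    edges as the Euclidean graph: take the matching quantile of all
    pairwise cosine similarities."""
    f = sh_feats / np.linalg.norm(sh_feats, axis=1, keepdims=True)
    sims = f @ f.T
    n = len(f)
    off = sims[~np.eye(n, dtype=bool)]          # off-diagonal similarities
    frac = min(1.0, n_euclidean_edges / off.size)
    return np.quantile(off, 1.0 - frac)         # keep the top-`frac` pairs

rng = np.random.default_rng(2)
feats = rng.normal(size=(50, 8))
d_sh = pick_sh_threshold(feats, n_euclidean_edges=200)

# Verify the resulting SH edge count roughly matches the target.
f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sims = f @ f.T
n_edges = int((sims[~np.eye(50, dtype=bool)] >= d_sh).sum())
assert abs(n_edges - 200) <= 30
```

Because the threshold is read off the empirical similarity distribution, it transfers across datasets with different similarity scales without a hyperparameter search.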
Additionally, DualEquiNet has two depth-related hyperparameters, which specify the number of layers in the atom- and residue-level modules. These are analogous to layer counts in standard GNNs, which are typically shallow. We searched over a small space {1, 2, 3, 4}, making this tuning lightweight and practical, so we do not count these as new hyperparameters.
Overall, we have deliberately designed the architecture to minimize new hyperparameters and make it accessible for adaptation to new datasets. We will clarify this in the revised version.
Q1: How would DualEquiNet work with more complex biomolecular systems, like protein-RNA interactions?
RQ1: Following the reviewer’s suggestion, we evaluated DualEquiNet on the RNAProt dataset from RNAGlib [1], which focuses on predicting protein–RNA interactions. The results (shown below) demonstrate that DualEquiNet achieves the highest accuracy among existing baselines, thereby validating its effectiveness for protein–RNA interaction tasks.
| Model | RNAProt Accuracy (↑) |
|---|---|
| SchNet | 0.709±0.005 |
| EGNN | 0.720±0.009 |
| GVPGNN | 0.710±0.005 |
| TFN | 0.697±0.005 |
| FastEGNN | 0.718±0.010 |
| HEGNN | 0.735±0.016 |
| GotenNet | 0.709±0.006 |
| DualEquiNet | 0.749±0.015 |
Q2: The SH-space neighborhood is implemented by a threshold and the cosine similarity between points. Could you provide some insights on how to implement an adaptive threshold mechanism to prevent tuning hyperparameters?
RQ2: We appreciate the reviewer's insightful suggestion. To reduce reliance on a fixed threshold $d_{SH}$, we can consider the following alternatives in future work. (1) Percentile-based thresholding: each node $i$ selects neighbors $j$ whose SH cosine similarity $s_{ij}$ falls within the top-$p$ percentile of all pairwise similarities to that node: $\mathcal{N}_{SH}(i) = \{\, j : s_{ij} \ge \mathrm{Perc}_p(\{ s_{ik} \}_{k \ne i}) \,\}$. While $p$ remains a hyperparameter, it is scale-invariant and often easier to generalize across datasets than an absolute similarity threshold. It also adapts the number of neighbors per node to the local similarity distribution.
(2) Learnable neighborhood gating:
We can further eliminate manual thresholding by introducing a learnable, soft connectivity mechanism. Specifically, for each node pair (i, j), we define w_{ij} = σ(a · s_{ij} + b), where σ is the sigmoid function, s_{ij} is the SH cosine similarity between nodes i and j, and a and b are learnable parameters. The weights w_{ij} can be used directly as edge weights or to probabilistically construct neighborhoods. This approach is fully adaptive and differentiable, enabling end-to-end learning of neighborhood structure in SH space.
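As a sketch, such a gate could be implemented in PyTorch as follows, assuming scalar gate parameters (which we call a and b here for illustration) acting on pairwise SH cosine similarities:

```python
import torch
import torch.nn as nn

class SHNeighborGate(nn.Module):
    """Soft, learnable SH-space connectivity: sigmoid(a * cos_sim + b).

    `a` and `b` are scalar learnable parameters (illustrative names);
    the output is a dense matrix of soft edge weights in [0, 1].
    """

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(5.0))   # similarity scale
        self.b = nn.Parameter(torch.tensor(-2.5))  # threshold offset

    def forward(self, sh_feats: torch.Tensor) -> torch.Tensor:
        # sh_feats: (N, D) flattened SH projection vectors (assumption).
        unit = torch.nn.functional.normalize(sh_feats, dim=1)
        sim = unit @ unit.T                        # cosine similarities
        w = torch.sigmoid(self.a * sim + self.b)   # soft edge weights
        return w * (1 - torch.eye(sim.size(0)))    # zero out self-edges
```

Since the gate is differentiable, the effective neighborhood structure is learned end-to-end along with the rest of the network.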
These strategies offer practical and principled alternatives to manual threshold tuning. We plan to explore and benchmark them in future work.
Q3: Could you provide empirical results or discussion with models specifically designed for long-range tasks, such as [1,2,3], as they also share a similar design idea?
RQ3: We have now provided the comparison to the other long-range methods that you recommended. Please see RW2 earlier in this thread for results demonstrating that DualEquiNet outperforms these long-range methods.
Please let us know if there are any remaining concerns. Thanks a lot!
Reference:
[1] A Comprehensive Benchmark for RNA 3D Structure-Function Modeling.
Thank you for providing a detailed discussion and additional experiment results. All my concerns are well-addressed.
Thank you for your positive feedback and for recognizing our responses! We truly appreciate your time and effort throughout the review process and we will incorporate the rebuttal content in the camera-ready version of our paper.
DualEquiNet integrates both Euclidean and spherical harmonic features into a message-passing network and proposes a novel pooling mechanism to couple the Euclidean and spherical harmonic spaces.
Strengths and Weaknesses
STRENGTHS
The paper is well structured and easy to follow. The authors clearly explain how they define the Euclidean and spherical harmonic neighborhoods, the node feature and coordinate updates that utilize both Euclidean and spherical harmonic features, and the novel cross-space interaction pooling (CSIP) mechanism, which mixes the Euclidean and spherical spaces through projection functions. They run experiments on a synthetic N-chain task, an RNA property prediction task on three real datasets, and a surface area prediction task, where they show significant improvements over many current baselines. They also demonstrate performance comparable to GotenNet on an angle prediction task.
WEAKNESSES
The paper runs an ablation study on the RNA property datasets and reports the performance of the model when removing various components. A current outstanding flaw is that it is not clear where the substantial improvements over current baselines are coming from. For context, the proposed DualEquiNet achieves , , on the COVID, Ribonanza, Tc-Ribo datasets respectively, whereas the single-space Euclidean model achieves , , and . By integrating both Euclidean and spherical harmonic features there is a mean performance improvement of , , ; normalizing by the DualEquiNet standard deviations on the datasets we get , , which shows some improvement over the fully Euclidean model. (These numbers are calculated with 5 seeds; it would be great if the authors could post ablation results with more runs if computationally feasible.)
However, to put these numbers in context, the fully Euclidean model outperforms all the currently reported baselines by a much larger delta (with the exception of Tc-Ribo; but even GotenNet performs worse than the fully Euclidean model on Tc-Ribo). The fully Euclidean model gets , , and whereas the next best benchmarked baseline gets (GotenNet), (FastEGNN), (TFN), leading to deltas of , , . It is unclear why the fully Euclidean model would outperform the other baselines to this extent, and without further clarification it suggests that the lion's share of the model's improvement is due to factors beyond the integration of spherical harmonic and Euclidean features.
While it is not explicit what the fully Euclidean model entails (it would be helpful if the authors could clarify this), looking at the various sections we assume that it means a message-passing framework where
• The neighborhoods are defined by a threshold
• The message has the typical definition
• The update calculation uses a network which aggregates the calculation of K heads, where each head has a different message function and moreover utilizes attention to weight the messages and similarly for the coordinate update
Compared with traditional message-passing networks, it seems that the major differences are that attention is additionally calculated to weight the various messages and that there are K-many heads, each defining a message network.
Given this, we do not see a clear reason why this Euclidean-only message-passing network would lead to a substantial performance improvement over current baselines (e.g., a simple graph network like EGNN) beyond the additional attention calculation and the K-many heads.
(Thus, it would help if the authors could report the sizes of the different models, both ablated versions and baselines, to see if the comparison is fair in terms of model size. It also would be helpful to have the ablation experiments run for the SASA and torsion angle prediction tasks to see if there is similar behavior.)
Without further clarification, it is difficult to measure how much the integration of spherical harmonic and Euclidean features adds to the existing body of work. Currently, the recommendation stands at 3 (borderline reject), but this rating can and will change if the authors provide evidence that their integration of Euclidean and spherical harmonic features leads to substantive improvement over existing work.
Questions
-
Could you explicitly define the architecture and design choices of the “fully euclidean” model used in your ablations? Specifically, does it include multi-head attention and the same network depth/width as DualEquiNet? Additionally, how do its parameter count and training hyperparameters compare to the other baseline models such as EGNN, TFN, and GotenNet?
-
Given that the fully euclidean variant already significantly outperforms established baselines, could you disentangle the performance gains attributable solely to the spherical harmonic integration? For example, can you provide ablations across tasks (especially SASA and torsion angle prediction) that isolate the effect of the cross-space interaction pooling (CISP) and the spherical message passing components?
-
Have you run the ablations with a larger number of seeds to confirm the statistical significance of these improvements? If so, could you provide those results, including confidence intervals or standard errors, across all relevant tasks?
Limitations
yes
Final Justification
Based on the new data and clarification, I have increased the score.
Formatting Concerns
None
We thank the reviewer for their valuable feedback, and we address the concern below.
Q1: Ablation explanations and difference of our euclidean only message passing version with other euclidean baselines (such as EGNN) :
We appreciate the reviewer’s observation and clarify a key source of confusion: our fully Euclidean model (DualEquiNet w/o CSIP/Cross/Dual in Ablation Table 5) is not equivalent to standard Euclidean GNN baselines. Even in this ablation, we preserve the biological organization of large molecules (e.g., atom-to-nucleotide for RNA, atom-to-amino-acid for proteins) using hierarchical mean pooling. Specifically, “w/o CSIP” replaces cross-space interaction pooling with standard mean pooling, but the pooling is still applied over biologically defined hierarchical units rather than over individual atoms alone. This explicit incorporation of biologically meaningful structure is one of our key contributions: even our Euclidean-only variant captures context beyond the atom-level interactions used by standard Euclidean GNNs, which explains why it already outperforms those baselines. We will make this more explicit in the camera-ready version.
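To make the hierarchical mean pooling concrete, here is a minimal PyTorch sketch of atom-to-residue pooling, assuming a flat atom feature tensor and an integer residue id per atom (an illustration of the idea, not our exact implementation):

```python
import torch

def residue_mean_pool(atom_feats: torch.Tensor,
                      residue_idx: torch.Tensor) -> torch.Tensor:
    """Mean-pool atom-level features into residue-level features.

    atom_feats:  (N_atoms, D) float tensor of atom features.
    residue_idx: (N_atoms,) long tensor mapping each atom to its
                 nucleotide / amino acid (ids in [0, n_res)).
    Returns a (n_res, D) tensor of residue-level features.
    """
    n_res = int(residue_idx.max().item()) + 1
    # Scatter-sum atom features into their residue slots.
    sums = torch.zeros(n_res, atom_feats.size(1))
    sums.index_add_(0, residue_idx, atom_feats)
    # Count atoms per residue, then divide to obtain the mean.
    counts = torch.zeros(n_res).index_add_(
        0, residue_idx, torch.ones_like(residue_idx, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)
```

The point of the ablation discussion above is that even this simple mean pooling, applied over biologically defined units, injects hierarchical context that flat atom-level GNN baselines lack.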
Originally, we did not apply hierarchical pooling to the baselines to respect their original designs, and some are not directly compatible with such pooling. However, we have now run selected Euclidean baselines augmented with hierarchical mean pooling. Their performance improves over their original versions (reported in Table 1) and matches our Euclidean-only variant but still lags significantly behind the full DualEquiNet with cross-space interaction pooling (CSIP) and dual-space message passing (e.g., 4.6%, 8.7%, and 9.0% gains over best EGNN Pooling on COVID, Ribonanza, and Tc-Ribo datasets respectively).
| Model | COVID Reactivity | COVID pH10 | COVID Mg pH10 | COVID MCRMSE | Ribonanza DMS | Ribonanza 2A3 | Ribonanza MCRMSE | Tc-Ribo Factor |
|---|---|---|---|---|---|---|---|---|
| EGNN + Pooling | 0.280±0.006 | 0.469±0.006 | 0.362±0.010 | 0.370±0.009 | 0.688±0.028 | 0.534±0.014 | 0.611±0.010 | 0.699±0.016 |
| SchNet + Pooling | 0.302±0.006 | 0.495±0.010 | 0.384±0.010 | 0.394±0.010 | 0.721±0.019 | 0.642±0.004 | 0.681±0.011 | 0.702±0.007 |
| FastEGNN + Pooling | 0.379±0.002 | 0.566±0.009 | 0.454±0.008 | 0.467±0.007 | 0.773±0.024 | 0.819±0.006 | 0.796±0.013 | 0.721±0.013 |
| DualEquiNet | 0.272±0.004 | 0.448±0.010 | 0.340±0.005 | 0.353±0.006 | 0.611±0.025 | 0.505±0.024 | 0.558±0.012 | 0.636±0.049 |
Integration of Euclidean and spherical harmonic features leads to substantive improvement over existing work: Beyond the explanation provided above, integrating the Euclidean and spherical harmonic features indeed provides consistent percentage improvements across datasets. For example, removing CSIP increases error by ~8% on Tc-Ribo and ~7% on Ribonanza, while disabling cross-space message passing further increases error by ~9% on Tc-Ribo. The combined removal of all dual-space components yields the largest degradation, with ~11% higher error on Tc-Ribo and ~7% on Ribonanza. Overall, these results indicate that DualEquiNet’s architectural contributions, particularly contextual pooling and cross-space learning, are synergistic and yield up to ~10% improvements on different tasks.
Q2. Architecture and design choices of our “fully euclidean” model/parameter count and training hyperparameters compared to other baselines:
We believe this question likely arises from the confusion addressed in the point above, but we provide further details for completeness. DualEquiNet w/o CSIP/Cross/Dual uses the same design as the full DualEquiNet, apart from the ablated component; i.e., it uses the same multi-head attention and the same network depth and width. For DualEquiNet and all baselines, we constrained the number of parameters to at most 4 million and ensured that all models fit within the memory of a single NVIDIA A6000 GPU. Within this resource budget, we used Optuna to perform hyperparameter optimization and select the best configuration for each model. This ensures that our comparisons are fair and conducted under consistent computational and hyperparameter constraints for all models and baselines.
Q3. Ablation study with 10 random seeds:
As suggested by the reviewer, we have run ablations with 10 random seeds. The results are consistent with those reported in Table 5 of the paper, with similar performance and comparable standard deviations. Due to rebuttal time constraints, we were able to run it on four tasks (see table below). We found a typo in the reported standard deviation for DualEquiNet on the Ribonanza dataset in Table 5: it should be “0.558 ± 0.012” instead of “0.558 ± 0.123”. The correct value is also reflected in Table 1 of the paper and only the ablation Table 5 has this typo. With this correction, the normalized improvement on Ribonanza becomes 0.038/0.012, which is approximately 3.17x standard deviations using 10 seeds, or 0.034/0.012, which is about 2.83x standard deviations using 5 seeds. We will correct this typo in the camera-ready version.
| Model | COVID ↓ | Ribonanza ↓ | TcRibo ↓ | TorsionAngle ↓ |
|---|---|---|---|---|
| DualEquiNet | 0.353±0.006 | 0.558±0.012 | 0.636±0.049 | 15.872±2.032 |
| w/o CSIP | 0.364±0.012 | 0.585±0.013 | 0.687±0.020 | 16.921±1.375 |
| w/o CSIP/Cross | 0.363±0.012 | 0.588±0.020 | 0.694±0.015 | 17.413±1.404 |
| w/o CSIP/Cross/Dual | 0.368±0.011 | 0.596±0.028 | 0.696±0.008 | 17.459±1.331 |
Please let us know if there are any remaining concerns. If we have addressed your questions, we kindly request you to consider updating your score. Thank you very much for your time and thoughtful review!
Thank you for the clarification and the additional data.
Thank you for your positive feedback and for recognizing our responses! We truly appreciate your time and effort throughout the review process and we will incorporate the discussion in the camera-ready version of our paper.
Thank you, reviewers and authors, for your active and productive engagement during the Author+Reviewer discussion period. It seems from the comments that there are no unresolved questions at this time that the authors need to address, however should that not be the case -- please add your comments ASAP before the end of today (August 8th AoE) and the start of AC+Reviewer discussions.
Summary: This paper introduces DualEquiNet, which combines Euclidean distance–based neighborhoods with neighborhoods defined by cosine similarity of local environment spherical harmonic (SH) projections, and pools atom-level features into residue-level representations. The approach is novel and delivers strong empirical results on RNA/protein benchmarks, including new datasets introduced by the authors. The rebuttal added helpful comparisons, ablations, and significance tests.
Strengths:
- Strong empirical gains across RNA/protein tasks.
- Sensible architectural design: hierarchical pooling and dual neighborhood definitions that incorporate biomolecular structure.
- Careful additional experiments during the rebuttal (e.g., pooling baselines, long-range baselines, new datasets).
- Practical complexity advantages compared to tensor-product equivariant networks.
Concerns:
- Terminology: The term “dual space” has well-established meanings in mathematics (dual vector space) and physics (real vs. reciprocal space duality, defined only after fixing a group). One could in principle frame Euclidean vs. SH neighborhoods as translation vs. rotation representations of the Euclidean group, but the paper does not present it this way. Instead, “dual space” is used simply to mean “two neighborhood definitions,” which risks misleading readers.
- Local environment SH neighborhoods: The cutoff based on cosine similarity of SH projection vectors couples motif similarity with relative orientation similarity. This is distinct from invariant approaches (power spectrum, bispectrum) that capture geometric similarity independent of orientation. The paper should make this trade-off explicit and note invariant cutoffs as a natural future direction.
- Hierarchy and pooling: The Euclidean-only ablation of the authors’ model does include atom→residue pooling, but external baselines (e.g., SE(3)-Transformer, TorchMD-Net) were left atom-only. Thus, part of the observed gains may reflect hierarchical pooling in addition to the dual-neighborhood design. This asymmetry should be clearly acknowledged.
Decision: On balance, the ideas and results justify acceptance. The work would be significantly strengthened by clearer framing, especially regarding terminology, the SH neighborhood design, and the role of pooling in comparisons. The camera-ready must include explicit clarifications on these points. I note that with a stronger conceptual framing, this contribution could rise from a solid poster to a spotlight or even oral-level paper.