Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression
An autoregressive model that simultaneously learns both the conformations and dynamics of proteins from molecular dynamics data
Abstract
Reviews and Discussion
The paper introduces a new method for generative modelling of protein MD trajectories. An MD trajectory is considered as a sequence of 3D structures that are temporally related. Thus, the proposed model autoregressively generates/samples a trajectory frame-by-frame. The model is benchmarked for 3 sampling problems compared to other existing models for the same tasks, and tends to perform favorably.
Strengths and Weaknesses
Strengths:
- Novel idea: The idea of framing generative modelling of molecular dynamics trajectories as autoregression is exciting and new, to the best of my knowledge.
- Good results on standard benchmarks: Empirical results look strong relative to baselines, overall. The experimental setup largely follows past published papers based on my understanding (except S.4.3 highlighted below). I will rely on other reviewers' assessment on the rigour of the experiments.
- Figure 1 does a good job showing the pipeline and main contributions.
Weaknesses:
- No standard deviations: For almost all the tables, there are no standard deviations or variations reported. Only an averaged metric is reported. This is esp. important for several of the claims of outperforming other models when the results are very close (eg. Table 3). It may be the case that several models are within an error range of one-another, making it harder to make an outright claim of SOTA performance.
  - I realise that it may be conventional to train only one big model and report the best mean metric. However, it can still be useful to readers to additionally report some appropriate measure of variance in performance over your test set.
  - I will adjust my assessment and increase my score if the authors can address this concern in the rebuttal.
- Missing inference time metrics: I would be interested to see some metrics on the inference time for the proposed model as well as the previous models being compared to, as well as how that compares to actually running an MD simulation. I think this is important, b/c the goal of this line of work broadly is to accelerate and augment MD simulations. So are current models actually faster than running MD?
  - As an example, in the conformation interpolation task: Say I am a user and I know the two conformational states of my protein. Now, I can either run an MD simulation or I can use your AI method (which is slightly less accurate than MD). Thus, I will only ever run your AI method if it is much faster than MD. Would you agree with this assessment?
- (Minor) Conformation interpolation results and associated claim (S.4.3): The model requires finetuning to perform this task. However, when I read the Abstract/Intro/Conclusion, I got the impression that the model can perform this task out of the box (e.g. statements like "CONFROVER supports three tasks in a unified manner"). I think this claim needs better support/more nuance.
Subjective note on presentation: As a reader who works in closely related fields, but not deeply familiar with all aspects of this paper's field, I found that the paper's presentation of the technical content (method, experimental setup) was a bit hard to grasp. Esp. the methods were presented in what I felt was a very dense/condensed form, and required jumping back and forth to references in the appendix. I would suggest trying to improve the presentation of the methods.
The Intro/Conclusion/etc. reads very well.
Questions
See questions within weaknesses.
Limitations
Yes, I think some limitations are discussed.
It's worth asking: Why are proteins dynamic, and why do they undergo conformational changes? There is some amount of inherent dynamics of atomic systems involved, and this paper tackles the problem well. However, another major reason proteins undergo conformational changes is to interact with other partner molecules. It may be worth mentioning that the current method is limited in modelling interactions (or at least, can only implicitly model interactions at present).
It may also be interesting to discuss relative trade-offs vs. emerging methods in machine learning interatomic potentials for organic systems including proteins. Examples include MACE OFF (https://pubs.acs.org/doi/10.1021/jacs.4c07099) and the Orb family of models, which were applied to simulating the dynamics of a couple of enzyme proteins.
Final Justification
The response addresses my concerns satisfactorily. I also read the other reviews' discussions, and have updated my assessment as a result.
I think that, if possible, authors should try to run repeats of AlphaFlow, too, for completing the revised tables with replicates.
Formatting Concerns
No formatting issues.
Minor typos:
- Line 125: learning to direct sample
- Fig.3 caption: denoise a noisy conformations
We thank the reviewer for acknowledging that this work contains exciting, novel ideas with strong empirical results, and that the evaluation generally follows standard practice. We appreciate the constructive feedback on the current manuscript and provide our responses below:
Weakness 1: Missing variance estimate on model performance
We thank the reviewer for the suggestion and agree that providing variance estimates on metrics would enhance the interpretability of the results beyond the standard mean metrics. We repeated the inference for our model and the baseline models five times with different random seeds. From these runs, we report the mean and standard deviation for each experiment as follows:
- Pearson correlations of conformation changes between sample and reference on multi-start (Table 1):
| Type | Model | Traj. | Frame | ∆Frame |
|---|---|---|---|---|
| CA | MDGen | 0.56±0.03 | 0.47±0.03 | 0.41±0.02 |
| CA | ConfRover | 0.75±0.01 | 0.63±0.01 | 0.53±0.01 |
| PCA | MDGen | 0.18±0.01 | 0.15±0.01 | 0.10±0.01 |
| PCA | ConfRover | 0.73±0.01 | 0.50±0.01 | 0.43±0.00 |
- State recovery in 100ns simulation (Table 2). MD reference does not contain repeats.
| exp | Avg JS-Dist | Avg Recall | Avg F1 |
|---|---|---|---|
| MDref | 0.31 | 0.67 | 0.79 |
| MDGen | 0.56±0.01 | 0.29±0.01 | 0.42±0.01 |
| ConfRover | 0.51±0.01 | 0.42±0.00 | 0.58±0.00 |
- Pearson correlation on main dynamic modes between sample and reference trajectory (Figure 5A shown as table). MD reference does not contain repeats.
| Components | Model | lag=1 | lag=5 | lag=10 | lag=20 |
|---|---|---|---|---|---|
| PC 1 | MD 100ns | 0.13 | 0.17 | 0.18 | 0.22 |
| PC 1 | MDGen | 0.10±0.01 | 0.11±0.02 | 0.12±0.01 | 0.13±0.01 |
| PC 1 | ConfRover | 0.16±0.02 | 0.18±0.01 | 0.19±0.01 | 0.20±0.01 |
| PC 2 | MD 100ns | 0.17 | 0.16 | 0.18 | 0.18 |
| PC 2 | MDGen | 0.11±0.01 | 0.11±0.01 | 0.11±0.01 | 0.12±0.01 |
| PC 2 | ConfRover | 0.18±0.01 | 0.17±0.00 | 0.18±0.01 | 0.19±0.01 |
- Time-independent conformation sampling (Table 3). We were not able to get repeated results from AlphaFlow due to long inference time and limited resources.
| exp | Pairwise RMSD r | Per target RMSF r | RMWD | MD PCA W2 | Joint PCA W2 | Weak contacts J | Transient contacts J | Exposed residue J |
|---|---|---|---|---|---|---|---|---|
| AlphaFlow | 0.53 | 0.85 | 2.64 | 1.55 | 2.29 | 0.62 | 0.41 | 0.69 |
| ConfDiff | 0.54±0.00 | 0.85±0.00 | 2.70±0.01 | 1.44±0.00 | 2.22±0.04 | 0.64±0.00 | 0.40±0.00 | 0.67±0.00 |
| ConfRover | 0.51±0.01 | 0.85±0.00 | 2.66±0.02 | 1.47±0.03 | 2.23±0.04 | 0.62±0.01 | 0.37±0.01 | 0.66±0.01 |
| MDGen | 0.47±0.04 | 0.72±0.02 | 2.78±0.04 | 1.86±0.03 | 2.44±0.04 | 0.51±0.01 | 0.28±0.01 | 0.57±0.01 |
With the updated results, we confirm the following: (1) In forward simulation, ConfRover outperforms MDGen in predicting dynamic levels, capturing major motions, and recovering conformational states. (2) For time-independent conformation generation, ConfRover performs on par with specialized models such as AlphaFlow and ConfDiff. Its performance on metrics including 'per-target RMSF r', 'RMWD', 'MD PCA W2', 'Joint PCA W2', 'Weak contacts J', and 'Exposed residues J' is generally within one standard deviation or better, compared to ConfDiff and AlphaFlow.
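For readers who want to reproduce this kind of aggregation, the following is a minimal sketch of how one metric evaluated under several random seeds could be collapsed into a `mean±std` table entry. The per-seed values are made up for illustration, the helper name `format_mean_std` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption: the rebuttal does not state which estimator was used.

```python
import numpy as np

def format_mean_std(values, ddof=1):
    """Collapse per-seed metric values into a 'mean±std' string.

    ddof=1 gives the sample standard deviation; this choice is an
    assumption, not something stated in the rebuttal.
    """
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():.2f}±{arr.std(ddof=ddof):.2f}"

# Hypothetical per-seed Pearson correlations from five inference runs.
per_seed = [0.74, 0.75, 0.76, 0.75, 0.75]
print(format_mean_std(per_seed))
```

With only five seeds the standard deviation is itself a noisy estimate, so small differences between models with overlapping ranges should be read cautiously.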
Weakness 2: Inference time metrics comparisons with MD simulation.
While we briefly discussed inference time in Appendix A.2, we recognize the importance of providing more detailed comparisons with typical MD simulation runs. Below, we include a detailed runtime analysis comparing ConfRover and MD simulations, using the same hardware (NVIDIA H100-80G). Specifically, we measured the wall-clock time required to generate 100 ns ATLAS trajectories (80 frames) for proteins of varying sizes and report the average inference time per size bucket. For comparison, we also selected a representative protein from each bucket and estimated the time required to simulate 100 ns using OpenMM with implicit solvent.
| seqlen | (0,150) | [150,300) | [300,450) | [450,600) | [600, 724] |
|---|---|---|---|---|---|
| ConfRover | 6.99 | 7.53 | 10.92 | 15.88 | 20.83 |
| MD | 104.54 | 207.92 | 386.69 | 651.29 | 1099.13 |
| Speedup | 14.95 | 27.61 | 35.41 | 41.01 | 52.77 |
As shown in the table, ConfRover provides clear speedup for 100ns simulation, with even more pronounced acceleration for larger proteins. Beyond faster simulation with larger strides, ConfRover also supports time-independent sampling for generating independent conformations in parallel, and interpolation between terminal conformations, capabilities that are not easily achievable with classical unbiased MD simulations. While targeted MD simulations could in principle be used for transition pathway sampling, implementing and benchmarking such setups is beyond the scope of this short rebuttal period. Nonetheless, we believe the reported unbiased MD simulation speeds offer a concrete and fair basis for comparing inference efficiency, and they highlight the practical advantages of ConfRover.
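The speedup row can be re-derived from the two timing rows as a sanity check. The sketch below simply divides the MD time by the ConfRover time per size bucket, using the numbers from the table above; the time unit is not stated in the table, but it cancels in the ratio. The variable names are illustrative only.

```python
# Timings copied from the runtime table (unit not stated; it cancels in the ratio).
buckets = ["(0,150)", "[150,300)", "[300,450)", "[450,600)", "[600,724]"]
confrover_time = [6.99, 7.53, 10.92, 15.88, 20.83]
md_time = [104.54, 207.92, 386.69, 651.29, 1099.13]
reported_speedup = [14.95, 27.61, 35.41, 41.01, 52.77]

# Speedup = MD wall-clock time / ConfRover wall-clock time.
speedup = [md / cr for md, cr in zip(md_time, confrover_time)]

for name, s, r in zip(buckets, speedup, reported_speedup):
    print(f"{name:>10}: recomputed {s:.2f}x, reported {r:.2f}x")
```

The recomputed ratios agree with the reported row to within 0.01, which is consistent with the reported values having been computed from unrounded timings.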
Weakness 3: unification of tasks
Thank you for bringing this up. By "unification" we refer to a shared modeling framework that can learn from multiple tasks, rather than a single model that performs all tasks equally well out of the box. While training a single model for all three tasks is possible, different training strategies often yield models with varying strengths. In our early experiments, we trained a single model jointly on three tasks, but observed slightly degraded performance on time-independent sampling (e.g., on contact prediction). As a result, we adopted a two-stage training strategy, continuing training for interpolation to better balance task performance.
Other comments
-
Methods part being too dense. We are sorry that the reviewer found the Methods section difficult to follow. Due to page limitations, we had to condense some technical content and move certain model details (e.g., SE(3) diffusion) to the supplementary materials. With the additional space available in the final version, we aim to provide a smoother reading experience for readers from diverse backgrounds. We would greatly appreciate any specific suggestions to help us strengthen the revised manuscript.
-
Limitations on modeling complexes and interactions. We appreciate the reviewer for pointing out this important limitation. While the ConfRover framework is architecture agnostic and can, in principle, incorporate more advanced components such as AlphaFold3-style encoder-decoders for modeling complex interactions, the current implementation does not explicitly model such interactions. Additionally, a key challenge in this direction is the scarcity of high-quality datasets capturing interaction dynamics. We have identified promising datasets such as MISATO [1], which include protein-ligand interaction trajectories. We will acknowledge this limitation in the revised manuscript and highlight it as an important avenue for future work.
-
Comparison with MLFF. Thanks for the suggestion. ConfRover differs from machine learning force fields (MLFFs) in terms of the trade-off between accuracy and efficiency, as well as the timescales targeted. MLFFs, such as MACE-OFF and Orb, are designed to integrate high-level quantum accuracy into classical molecular dynamics (MD) simulations, aiming to improve accuracy while retaining scalability. However, these models still rely on MD-like sequential sampling with small integration steps (fs level) and are often more computationally expensive than classical MD, hindering their ability to efficiently capture long-timescale conformational changes. In contrast, dynamic generative models like ConfRover are tailored for capturing slower dynamics on longer timescales, such as ps to ns. In addition, ConfRover enables non-sequential sampling strategies, including time-independent conformation generation and interpolation between terminal states, making it well-suited for exploring broad conformational landscapes. We will add this to the discussion.
[1] Siebenmorgen T, Menezes F, Benassou S, et al. MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery. Nat Comput Sci. 2024;4(5):367-378.
Dear reviewer,
Thank you again for taking the time to review our paper and for your insightful feedback and constructive suggestions. We hope our responses have adequately addressed your concerns. Specifically:
- We have repeated the experiments five times and reported the mean and standard deviations across the benchmarks, enabling more robust comparisons.
- We provided a detailed analysis of inference time across proteins of varying sizes, along with comparisons to typical MD simulations on the same hardware. The results show that ConfRover offers significant speedups, particularly for larger proteins.
- We added extended discussion on several key points you raised, including unified framework, writing clarity, current limitations on modeling complexes and interactions, and comparison with state-of-the-art MLFF models.
We sincerely hope these updates and clarifications address your questions. If so, we would be grateful if you would consider updating your score based on the response.
The authors' rebuttal addresses my concerns well. I have updated my scores and am in favour of accepting the paper.
I think that, if possible, authors should try to run repeats of AlphaFlow, too, for completing the revised tables with replicates.
We appreciate the reviewer’s time and valuable feedback, and we are glad our previous responses addressed your concerns.
Regarding the repeated AlphaFlow experiments, the inference is still in progress. Despite the long runtime and some job scheduling issues, we have finished approximately 80% of the ATLAS test proteins (across five replicates). We will update the table with the full set of repeated results once the remaining runs are finished and verified.
We appreciate the reviewer's suggestions and would like to provide a quick update regarding the AlphaFlow repeats. While we had hoped to complete all inferences during the discussion period, the AlphaFlow model has been taking much longer than expected, particularly for longer proteins (also with some out-of-memory issues). As of now, we have finished 3 out of 5 replicates for all 82 proteins, with the remaining 2 experiments still running on the longest protein (6LRD-A, 705 residues), which may still require an additional 8~10 hours to complete. Nevertheless, we would like to share the preliminary results obtained from the three completed experiments, which align well with the original single experiment.
| exp | Pairwise RMSD r | Per target RMSF r | RMWD | MD PCA W2 | Joint PCA W2 | Weak contacts J | Transient contacts J | Exposed residue J |
|---|---|---|---|---|---|---|---|---|
| AlphaFlow (N=3) | 0.55±0.05 | 0.84±0.01 | 2.62±0.01 | 1.48±0.03 | 2.25±0.03 | 0.62±0.01 | 0.41±0.01 | 0.69±0.00 |
We appreciate your patience and will include the full set of repeats in the final manuscript.
The paper presents a method that holistically takes into account both time-independent and time-sequence protein conformations. This is done by a representation where the encoder is at the level of individual frames, and the sequences are modeled using masked auto-regression. The same model is used for conformation dynamics, time-independent conformations and conformation interpolation.
Strengths and Weaknesses
Strengths:
- The paper is very well written and pleasant to read (though I advise thoroughly proof-reading during rebuttal for typos, see my minor suggestions below). The authors have carefully thought about which concepts to explain, and I appreciate the space given to practical insights and discussions, and motivations for the experimental setup, while leaving prior (relatively well-known) concepts, such as the SE(3) diffusion model, for the appendix or relevant literature.
- Experimental validation yields convincing results and the setup seems reasonable and is well motivated.
- I appreciate the clear and thorough Limitations section.
Weaknesses:
- The method is relatively simple, from a theoretical perspective, though this is not necessarily a weakness.
- Validation is a bit tailored to the problem, but this seems common in this relatively new research area, where standard benchmarks are not yet widely established (as the authors point out themselves). I've only recently switched to this application domain, so other reviewers might be better positioned to gauge this.
Questions
Q1: From the perspective of molecular dynamics, the concept of (machine learning based) coarse grained dynamics is related to your work, I suspect. Could you please comment on this? See e.g.:
- Arts et al., "Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics".
Q2: You start the paper with Langevin dynamics, which is a second order differential equation, thus the full state would not only be the positions of the atoms, but also their velocities. As far as I understand, you don't model velocities, and don't take them into account in any way. Do you have any comments or thoughts on this?
Q3: You explicitly do not take any temporal information into account in the FrameEncoder, because you want to model both time-independent frames, and frame sequences. Specifically for the task of frame sequences, do you think this causes a decrease in performance?
Minor remarks/suggestions:
- Fig. 1: iii): the second block has a black outline. Should this not be the last block, corresponding to ?
- line 89: insights in the
- line 129: A similar'
- line 140: sequence
- line 148: Since both the encoder ..
- line 151: Details .. are
- line 183: is invariant .. and is ..
- line 219 Baselines. Can you please provide citations here for MDGen, AlphaFlow and ConfDiff?
- line 248: compared
Limitations
yes
Final Justification
Thank you for the extra evaluations and the clear answers to my questions. I retain my score.
Formatting Concerns
None
We thank the reviewer for affirming our manuscript as well-written, well-organized, and pleasant to read, and for recognizing the value of our insights and discussions, as well as the soundness of our results.
Below are our responses to the questions and weaknesses raised:
Q1: Relations and differences with two-for-one
We thank the reviewer for bringing the "Two-for-One" paper to our attention. While both Two-for-One and ConfRover explore the relationship between conformational distributions and dynamics using diffusion-based generative models, their focuses and methodologies differ. Two-for-One does not explicitly learn a temporal generative process; instead, it uses a score function learned via diffusion as an approximate coarse-grained force field and requires simulating dynamics through MD. In contrast, our model directly captures both the static distribution and temporal dynamics within the generative framework itself, allowing us to generate both time-correlated and uncorrelated samples via diffusion.
Nonetheless, we find the Two-for-One paper insightful in unifying distribution learning and dynamics generation from a physics perspective, and we will include it appropriately in the revised manuscript.
Q2: ConfRover does not explicitly consider velocity
It is correct that velocity is not explicitly encoded in ConfRover, as it is absent in ATLAS. As a result, the model relies solely on temporal relationships between coordinate frames. However, we expect the attention-based temporal module to infer higher-order motion information, akin to velocity, by reasoning over coordinate differences and time intervals between frames. Unlike velocity, which captures local changes over short time scales, frame-to-frame differences across multiple scales may provide richer dynamic information. Yet, explicit velocity data could offer additional guidance on motion direction, and we would be happy to explore its potential benefits when such data becomes available in future datasets.
Q3: FrameEncoder does not contain temporal information
In ConfRover, the FrameEncoder encodes only structural information, while temporal information is incorporated separately in the temporal module using rotary position encoding (RoPE) [1]. Specifically, we encode time as discrete snapshot indices (e.g., 128, 256, 384, ...) corresponding to each frame (0, 1, 2, ...). We use RoPE for temporal encoding based on two key considerations: (1) It enables the model to learn dynamics at real temporal scales, matching the snapshot interval resolution (e.g., 10 ps in ATLAS). (2) Unlike hard-coding time into frame-level embeddings, RoPE encodes relative time rather than absolute time, maintaining invariance under global time shifts.
While we intentionally avoid injecting absolute time into the FrameEncoder, it may be beneficial to encode the trajectory "stride" to enhance multi-scale learning. We leave this direction for future investigation.
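The shift invariance mentioned in point (2) can be checked numerically. The sketch below is a generic NumPy implementation of rotary encoding in the split-halves layout, not ConfRover's actual code; the function name `rope`, the feature size of 8, and the frequency base of 10000 are assumptions made for illustration. It shows that the query-key score depends only on the difference between the two time indices.

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Rotate feature pairs of vector x by angles proportional to time index t."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one rotation frequency per pair
    cos, sin = np.cos(t * freqs), np.sin(t * freqs)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def score(t_query, t_key):
    """Attention logit between a query at t_query and a key at t_key."""
    return float(rope(q, t_query) @ rope(k, t_key))

# Snapshot indices follow the example in the text (128, 256, 384, ...).
original = score(256, 128)
shifted = score(256 + 1000, 128 + 1000)  # same gap, global time shift
print(abs(original - shifted))           # numerically negligible
```

Changing the gap between the two indices, for example `score(384, 128)`, generally changes the score, which is how relative timing can enter the attention weights.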
Weakness 1: The method is relatively simple
We consider the simplicity of our model a merit rather than a weakness. Despite its simple form, the core idea of learning different perspectives of the protein conformational landscape via conditional probability is carefully designed so that a modern causal language model architecture enables a unified and flexible approach across multiple protein conformation tasks.
Weakness 2: the evaluation is not standard
This is an inherent challenge in developing models for protein dynamics, and we see it as a potential contribution of our work. Standardized evaluation protocols are lacking for this problem, especially for large proteins where the MD trajectories are not fully equilibrated. We designed metrics to capture aspects of dynamic behavior as comprehensively as we can. Our evaluation assesses how models respond to different initial conformations and time scales and, for long-time simulation, whether they capture main dynamics and recover conformational states.
During the rebuttal, we also added structural quality and energy-based evaluations using standard pipelines to complement our analysis, particularly on whether generated conformations are plausible. Specifically, we used MolProbity [2] to assess geometric accuracy and MadraX [3] for heavy-atom energy evaluation. We evaluated the 38 cases tested in the forward simulation and conformation interpolation tasks, and compared across different models. Results are summarized below (all values reported as mean ± std; energy values as 95% percentile ranges). The results show that ConfRover (forward simulation) and ConfRover-interp have similar structural quality, comparable to MD references in backbone dihedrals (Ramachandran), bond geometry, and overall energy. More clashes are detected than in the MD references, but the clash score also depends on ad hoc hydrogen addition, which may bias the evaluation of heavy-atom-only models. Overall, ConfRover consistently outperforms MDGen across all metrics.
| exp | Ramachandran outliers % | Ramachandran favored % | Rotamer outliers | Clashscore | RMS(bonds) | RMS(angles) | MolProbity score | MadraX Energy |
|---|---|---|---|---|---|---|---|---|
| MD Reference | 0.38±0.49 | 97.41±1.49 | 1.02±0.89 | 0.04±0.16 | 0.01±0.00 | 1.88±0.05 | 0.72±0.18 | -519.3 (-1793.0, -53.4) |
| MDGen | 0.93±0.86 | 94.98±2.04 | 2.86±1.59 | 16.14±20.05 | 0.02±0.02 | 2.13±0.30 | 2.24±0.40 | -314.7 (-1483.8, 263.6) |
| ConfRover | 0.58±0.63 | 96.93±1.45 | 1.98±1.48 | 7.81±6.52 | 0.01±0.01 | 1.88±0.25 | 1.72±0.38 | -522.2 (-1858.9, -53.4) |
| ConfRover-interp | 0.71±0.94 | 96.90±1.95 | 1.86±1.46 | 7.25±8.74 | 0.02±0.01 | 1.91±0.32 | 1.61±0.51 | -469.7 (-1712.3, -42.8) |
We hope these efforts provide a comprehensive understanding of ConfRover and inform future benchmarks.
Minor points:
- Clarification of Figure 1A (iii): This figure illustrates how we condition on both the first and last frames in the causal sequence model. The second box represents frame 0, the original first frame, and the first box represents the last frame (arrows indicate moving it to the front of the sequence). Both boxes have black outlines to mark them as conditioning frames. We will clarify this further in the revised manuscript.
- We thank the reviewer for pointing out typos and will correct them in the revision.
[1] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
[2] Williams CJ, Headd JJ, Moriarty NW, et al. MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 2018;27(1):293-315.
[3] Orlando G, Serrano L, Schymkowitz J, Rousseau F. Integrating physics in deep learning algorithms: a force field as a PyTorch module. Bioinformatics. 2024;40(4):btae160.
Thank you for the extra evaluations and the clear answers to my questions. I retain my score
Thank you for your time and effort throughout the review process. We are grateful that you found ConfRover to be a valuable contribution to generative modeling for protein dynamics, and that our responses have addressed your initial questions and concerns.
Best,
The authors
The authors introduce ConfRover, an autoregressive framework that couples a transformer “trajectory module” with an SE(3) diffusion decoder so that one model can (i) roll out protein trajectories, (ii) draw single, time-independent conformations, and (iii) interpolate between two end states. Training is performed on the ATLAS MD data set with a hybrid loss that mixes trajectory, single-frame and (optionally) interpolation objectives. On held-out proteins the authors report better dynamical metrics than MDGen (but worse than MD) and competitive equilibrium-ensemble metrics with AlphaFlow and ConfDiff.
Strengths and Weaknesses
Strengths:
- The motivation and objective for the paper are clear, the task is important, and the modular design of the architecture is sensible. The observation that frame autoregression subsumes the three target tasks is very sensible.
- The joint training for the different tasks makes sense and improves equilibrium sampling.
- Empirically, the paper demonstrates reasonable (albeit still not excellent) performances.
- Overall, the paper can be easily followed.
Weakness:
- This paper is primarily empirical, with a few previous studies exploring similar ideas (e.g., autoregressive + diffusion has been explored in MarDini for image/video, and AlphaFolding). The architecture also reuses existing components and fuses them. That said, this should not result in the rejection of the paper.
- The evaluation is primarily based on RMSD, PC-L2, JSD, etc. These statistical metrics have no physical meaning. In my opinion, kinetics (e.g. implied timescale)/energies/Rama plots should be evaluated to show that the model is physically grounded. Specifically, since the model breaks peptide bonds, I suspect it is not very good at distinguishing which dihedrals are more rigid and which ones are more flexible. The interpolation task should also have physical validation (e.g. minimum free energy paths; especially since there are no baselines).
  2a. I note that the empirical performance of the paper, while better than MDGen (it is unclear how good it is compared to AlphaFolding or other similar methods due to closed-source code), is still very far from MD itself. The model often breaks peptide bonds (which should never happen) and has some C-alpha steric clashes. I am open to raising my score if evaluation comparisons can be made with AlphaFolding or more SOTA methods.
- There's a potential data leakage problem. Presumably, the frozen OpenFold has been trained on (some) PDB structures of the ATLAS test proteins.
Questions
- Given deterministic encoder & time modules, how diverse are the multiple samples from identical conditioning frames?
- Have the authors checked energy conservation or compared implied timescales against MD?
- How is CA clash/peptide bond break calculated exactly? How about other clashes/bond length distribution?
- There are many things for which ML based MD sampling like AlphaFlow does not work, e.g. https://www.biorxiv.org/content/10.1101/2024.08.31.610605v1, does ConfRover work better?
Limitations
yes
Final Justification
The paper has shown an interesting technique that can draw single conformations but also generate protein dynamics trajectories. While the empirical results are yet to match MD, this work does mark a step forward in modelling protein dynamic trajectories using generative methods.
Formatting Concerns
no
We appreciate the reviewer's recognition that our work addresses important tasks and that the proposed autoregressive framework for multiple tasks is sensible. We also thank the reviewer for their valuable input and constructive feedback. We organize our response as follows:
Weakness 1: originality and relation to similar ideas
The key contribution of this work is the introduction of a typical autoregressive framework into the modeling of MD trajectories. We demonstrate that this formulation naturally aligns with several important tasks in protein conformation sampling by managing sequential dependencies across frames. By leveraging modern causal transformers, ConfRover offers efficient training and can be readily extended to generate longer trajectories.
- vs MarDini: While ConfRover shares the general idea of combining autoregressive modeling with diffusion, there are key differences. First, the two methods target fundamentally different application domains. MarDini focuses on video data, whereas ConfRover is designed for molecular dynamics simulations. Second, ConfRover emphasizes a broader multi-task setting, including time-independent sampling and interpolation tasks, which were not explored in MarDini.
- vs AlphaFolding: While AlphaFolding extends MD trajectories by iteratively generating fixed-size blocks conditioned on motion and reference frames, our causal autoregressive model differs in key ways: (1) ConfRover generates long trajectories frame-by-frame using KV-cache, unlike AlphaFolding’s block-wise extension that discards historical context. (2) ConfRover operates at the frame level with clear causal dependencies, whereas AlphaFolding’s block-level autoregression lacks explicitly defined intra-block causality. (3) ConfRover supports forward simulation, time-independent sampling, and interpolation - capabilities not directly supported by AlphaFolding.
- Reuse of existing components: This work focuses on proposing a general autoregressive framework. To demonstrate this idea, we intentionally use well-established architectural components. Architectural improvements are beyond the scope of this paper but represent a promising direction for future work.
Weakness 2: conformation evaluation and baseline comparison
1. Physically grounded evaluation. Thank you for the suggestion. We agree that demonstrating the physical plausibility of generated conformations is important. To this end, we introduced standard structural and energy evaluation pipelines using MolProbity [1] for geometric quality and MadraX [2] for heavy-atom-based energy assessment. We compared conformations from reference MD, MDGen, and ConfRover (in both forward simulation and interpolation settings). All structures were evaluated using the default settings of MolProbity and MadraX. We found that energy and structural metrics can be sensitive to minor deviations, so local relaxation is necessary even for MD reference conformations. We applied the standard relaxation protocol from OpenFold to all conformations to ensure a fair comparison and reliable energy estimates. Results are summarized in the following table (values are reported as mean ± std; energy values as 95% percentile ranges):
| exp | Ramachandran outliers % | Ramachandran favored % | Rotamer outliers | Clashscore | RMS(bonds) | RMS(angles) | MolProbity score | MadraX Energy |
|---|---|---|---|---|---|---|---|---|
| MD Reference | 0.38±0.49 | 97.41±1.49 | 1.02±0.89 | 0.04±0.16 | 0.01±0.00 | 1.88±0.05 | 0.72±0.18 | -519.3 (-1793.0, -53.4) |
| MDGen | 0.93±0.86 | 94.98±2.04 | 2.86±1.59 | 16.14±20.05 | 0.02±0.02 | 2.13±0.30 | 2.24±0.40 | -314.7 (-1483.8, 263.6) |
| ConfRover-fwd | 0.58±0.63 | 96.93±1.45 | 1.98±1.48 | 7.81±6.52 | 0.01±0.01 | 1.88±0.25 | 1.72±0.38 | -522.2 (-1858.9, -53.4) |
| ConfRover-interp | 0.71±0.94 | 96.90±1.95 | 1.86±1.46 | 7.25±8.74 | 0.02±0.01 | 1.91±0.32 | 1.61±0.51 | -469.7 (-1712.3, -42.8) |
The results show that ConfRover generates high-quality conformations, matching MD references in backbone dihedrals, bond geometry, and energy, while outperforming MDGen across all metrics. Interpolation quality: ConfRover-interp and ConfRover-fwd show similar structural quality, indicating that conformations generated by interpolation remain physically plausible.
2. Dihedral flexibility analysis. While the results above suggest the conformations are physically plausible, we additionally performed a dihedral flexibility analysis as suggested by the reviewer. We extracted dihedral angles from the 100 ns simulations and computed their circular variance. Pearson correlation with the MD reference was used to assess how well the model captures ground-truth flexibility. The results below show reasonable agreement between ConfRover-generated samples and the MD reference.
| | phi | psi |
|---|---|---|
| ConfRover | 0.480 ± 0.158 | 0.455 ± 0.163 |
3. Comparison with other models like AlphaFolding. The lack of baseline models is a common challenge in this field. We contacted the authors of AlphaFolding to request their model for comparison, but the weights are not available at this time. However, we identified code from a related open source project and are currently working on retraining their model for evaluation on the ATLAS benchmark. We are happy to update our results if preliminary comparisons become available during the discussion phase.
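The dihedral flexibility analysis in item 2 above can be illustrated with a minimal sketch: per-residue circular variance of a dihedral angle over a trajectory, followed by a Pearson correlation between two trajectories. The function names, array shapes, and synthetic data are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np

def circular_variance(angles_rad, axis=0):
    """Circular variance 1 - R, where R is the mean resultant length.

    0 means all angles are identical; 1 means they are spread evenly
    around the circle.
    """
    c = np.cos(angles_rad).mean(axis=axis)
    s = np.sin(angles_rad).mean(axis=axis)
    return 1.0 - np.hypot(c, s)

def flexibility_correlation(ref_angles, gen_angles):
    """Pearson r between per-residue circular variances.

    Both inputs are assumed to have shape (n_frames, n_residues), in radians.
    """
    v_ref = circular_variance(ref_angles)
    v_gen = circular_variance(gen_angles)
    return float(np.corrcoef(v_ref, v_gen)[0, 1])

# Synthetic stand-in for phi angles: 200 frames, 50 residues, with a
# hypothetical per-residue angular spread shared by both trajectories.
rng = np.random.default_rng(0)
spread = np.linspace(0.05, 1.0, 50)
ref_phi = rng.normal(0.0, spread, size=(200, 50))
gen_phi = rng.normal(0.0, spread, size=(200, 50))
r_phi = flexibility_correlation(ref_phi, gen_phi)
```

Circular variance is used rather than the ordinary variance because dihedral angles wrap around at ±180°, so two angles near the wrap point should count as close.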
Weakness 3: potential data leakage from OpenFold model
Our objective is to model the distribution and dynamics of protein conformations, not to predict a single folded structure. Accordingly, our setup and evaluation metrics focus on structural variation and temporal behavior rather than alignment to a fixed reference. Since OpenFold is trained on static PDB structures and lacks such dynamic information, we do not believe data leakage is a concern under the current setting.
Questions:
-
Data diversity conditioned on the same conditioning frames. We are currently investigating the diversity of frame-conditioned generation and will share updated results during the discussion phase.
-
Regarding energy conservation or implied timescales against MD. Implied timescales are commonly used in Markov State Models (MSMs) to characterize the dynamics of equilibrated systems. However, the MD trajectories in ATLAS are not well equilibrated, making standard implied timescale analysis unreliable in this context. To address this limitation, we instead analyze the dominant tICA component across varying lag times (Figure 5A) as a proxy. We are currently investigating energy conservation during trajectory generation using the newly integrated MadraX and will update the relevant results during the discussion phase.
-
CA clash and peptide bond quality metrics. We calculated the CA clash/break rate and peptide bond break rate following the evaluation code of ConfDiff [3]. Specifically, a CA clash is counted when the distance between two alpha carbon atoms is within 2 * 1.7 - 0.4 = 3.0 Å, and a CA break is counted when the distance between the CA atoms of two consecutive residues is greater than 3.8 + 0.4 = 4.2 Å. A peptide bond is considered broken when its bond length is greater than 1.4 Å. As discussed in the response to Weakness 2, we have also included a standard set of protein conformation quality evaluations using MolProbity.
-
Conformation sampling for proteins with stable alternative locations. We thank the reviewer for bringing this work to our attention. The referenced paper presents challenging protein cases where conformational changes are difficult to sample using state-of-the-art generative models. As the benchmark includes experimental data that differs from the MD-based benchmark we use, we are currently investigating these cases and are willing to share results if obtained during the discussion period.
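The CA clash, CA break, and peptide bond thresholds quoted in the quality-metrics answer above can be sketched as follows. Only the three cutoffs come from the response; the function names and the way each rate is normalised (per residue for clashes, per consecutive pair for breaks) are assumptions and may differ from the ConfDiff evaluation code.

```python
import numpy as np

CA_CLASH_CUTOFF = 2 * 1.7 - 0.4   # 3.0 Å, as stated in the response
CA_BREAK_CUTOFF = 3.8 + 0.4       # 4.2 Å, as stated in the response
PEPTIDE_BREAK_CUTOFF = 1.4        # Å, C(i)-N(i+1) bond length

def ca_clash_rate(ca, cutoff=CA_CLASH_CUTOFF):
    """Fraction of residues whose CA lies within `cutoff` of another CA."""
    ca = np.asarray(ca, dtype=float)
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    return float((d < cutoff).any(axis=1).mean())

def ca_break_rate(ca, cutoff=CA_BREAK_CUTOFF):
    """Fraction of consecutive CA-CA pairs farther apart than `cutoff`."""
    ca = np.asarray(ca, dtype=float)
    d = np.linalg.norm(ca[1:] - ca[:-1], axis=-1)
    return float((d > cutoff).mean())

def peptide_break_rate(c, n, cutoff=PEPTIDE_BREAK_CUTOFF):
    """Fraction of C(i)-N(i+1) bonds longer than `cutoff`."""
    c = np.asarray(c, dtype=float)
    n = np.asarray(n, dtype=float)
    d = np.linalg.norm(n[1:] - c[:-1], axis=-1)
    return float((d > cutoff).mean())

# Toy example: an idealised straight chain with 3.8 Å CA spacing.
chain = np.zeros((5, 3))
chain[:, 0] = 3.8 * np.arange(5)
clash_rate = ca_clash_rate(chain)   # no CA pair is closer than 3.0 Å
break_rate = ca_break_rate(chain)   # no consecutive pair exceeds 4.2 Å
```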
[1] Williams CJ, Headd JJ, Moriarty NW, et al. MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 2018;27(1):293-315.
[2] Orlando G, Serrano L, Schymkowitz J, Rousseau F. Integrating physics in deep learning algorithms: a force field as a PyTorch module. Bioinformatics. 2024;40(4):btae160.
[3] https://github.com/bytedance/ConfDiff
Reproducing AlphaFolding
We thank the reviewer for their patience. It took some time to set up the training and evaluation pipeline, but we successfully reproduced AlphaFolding in a setup closely aligned with the original settings: we set the motion token count to 2 and the generation horizon to 16 frames. To match MDGen, we increased the stride from the default 1 to 40.
We encountered out-of-memory errors when training on an A100-80GB and therefore kept the authors’ protein length filter of at most 256 residues. Similarly, we excluded the five largest proteins from the ATLAS test set during inference. We trained the model for 65K steps until convergence. For evaluation, we iteratively extended the 16-frame outputs to generate 256 frames and retained the first 250 frames to match our 100 ns simulation setting.
We are happy to share the preliminary results from our reproduction:
Autoregressive models (AlphaFolding and ConfRover) perform better than MDGen in capturing the main dynamics (e.g., tICA plots), though AlphaFolding slightly lags behind ConfRover (Table 1). However, AlphaFolding shows lower structural quality than both MDGen and ConfRover (Table 2). This is likely due to more prominent error accumulation from its iterative block-wise extension, which disregards earlier high-quality frames (Table 3) when generating future frames. In contrast, MDGen uses non-autoregressive attention across all frames, and ConfRover maintains the full attention history via KV-cache.
Given the structural noise in AlphaFolding's outputs, its high recall in state recovery is unsurprising (Table 4), as noisy structures may artificially increase coverage in state space. To cross-verify its ensemble accuracy, we compared time-independent sampling metrics (Table 5) and found that AlphaFolding underperforms MDGen in several cases (e.g., RMWD, Joint PCA W2), while both fall short of ConfRover, whose parallel sampling approximates the conformation distribution more efficiently.
Finally, AlphaFolding is currently restricted to smaller proteins during training and inference due to memory limitations.
-
Table 1: Pearson's correlation of main tICA components
| Component | Model | 1 | 5 | 10 | 20 |
|:----|---|-----:|---:|---:|---:|
| PC 1 | MDGen | 0.11 | 0.12 | 0.10 | 0.12 |
| PC 1 | AlphaFolding | 0.15 | 0.15 | 0.16 | 0.18 |
| PC 1 | ConfRover | 0.19 | 0.17 | 0.19 | 0.19 |
| PC 2 | MDGen | 0.12 | 0.10 | 0.10 | 0.11 |
| PC 2 | AlphaFolding | 0.18 | 0.15 | 0.16 | 0.20 |
| PC 2 | ConfRover | 0.19 | 0.17 | 0.18 | 0.17 |
-
Table 2: Structural quality from MolProbity, all conformations are unrelaxed.
| exp | Ramachandran outliers % ↓ | Ramachandran favored % | Rotamer outliers ↓ | Clashscore ↓ | RMS(bonds) ↓ | RMS(angles) ↓ | MolProbity score ↓ |
|:---|------:|------:|-----:|------:|-------:|-------:|---------:|
| AlphaFolding | 2.91 | 91.91 | 15.37 | 151.35 | 0.07 | 4.76 | 3.94 |
| MDGen | 1.87 | 93.12 | 2.85 | 128.08 | 0.04 | 3.14 | 3.28 |
| ConfRover | 1.75 | 94.62 | 3.30 | 76.89 | 0.05 | 3.69 | 3.00 |
-
Table 3: Average backbone dihedral Ramachandran outliers % in each frame range
| Frame range | (0, 15] | (15, 31] | (31, 47] | (47, 63] | (63, 79] |
|:-------------|:-----------------|:---------------|:---------------|:---------------|:---------------|
| AlphaFolding | 1.77 | 2.78 | 3.21 | 3.34 | 3.57 |
| MDGen | 1.59 | 1.85 | 1.96 | 1.95 | 2.05 |
| ConfRover | 1.34 | 1.61 | 1.77 | 1.93 | 2.08 |
-
Table 4: Conformational state recovery in 100 ns simulation
| exp | JS-Dist | Recall | F1 |
|:-------------|--------------:|-------------:|---------:|
| MDGen | 0.55 | 0.29 | 0.43 |
| AlphaFolding | 0.47 | 0.51 | 0.65 |
| ConfRover | 0.51 | 0.42 | 0.58 |
-
Table 5: Conformation ensemble metrics from time-independent sampling.
| exp | Pairwise RMSD r | Global RMSF r | RMWD ↓ | MD PCA W2 ↓ | Joint PCA W2 ↓ | Weak contacts J | Transient contacts J | Exposed residue J |
|:-------------|------------------:|----------------:|-------:|------------:|---------------:|------------------:|-----------------------:|--------------------:|
| MDGen | 0.51 | 0.50 | 2.74 | 1.85 | 2.39 | 0.50 | 0.29 | 0.57 |
| AlphaFolding | 0.63 | 0.63 | 4.03 | 1.74 | 3.88 | 0.47 | 0.20 | 0.50 |
| ConfRover (time-indep) | 0.50 | 0.62 | 2.67 | 1.47 | 2.24 | 0.62 | 0.36 | 0.66 |
Q1: Trajectory diversity from identical conditioning frames
Although the encoder and temporal modules are deterministic, the diffusion decoder enables diverse conformation generation. To verify this, we randomly selected 100 starting conditions and generated 5 trajectories of 9 frames for each condition, at strides from 128 to 1024.
We report the mean pairwise RMSD (Å) at each frame time as mean ± std across all 100 cases. ConfRover generates diverse samples from identical conditioning frames, with diversity increasing alongside trajectory length and stride size, consistent with expected variability from longer autoregressive sampling and stride-conditioned dynamics.
| stride | frame=0 | frame=2 | frame=4 | frame=6 | frame=8 |
|---|---|---|---|---|---|
| 128 | 0.0 | 2.17±1.07 | 2.33±1.12 | 2.47±1.23 | 2.52±1.37 |
| 256 | 0.0 | 2.36±1.18 | 2.54±1.24 | 2.71±1.39 | 2.74±1.46 |
| 512 | 0.0 | 2.57±1.32 | 2.79±1.43 | 2.9±1.53 | 2.96±1.60 |
| 1024 | 0.0 | 2.93±1.65 | 3.07±1.62 | 3.15±1.78 | 3.22±1.87 |
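As a rough illustration of how a per-frame diversity number of this kind can be computed, the sketch below takes several trajectories generated from the same starting frame and reports the mean pairwise RMSD at each frame index after optimal superposition (Kabsch algorithm). The array layout, function names, and the random-walk test data are assumptions; the authors' exact atom selection and alignment protocol are not specified in the response.

```python
import numpy as np
from itertools import combinations

def kabsch_rmsd(p, q):
    """RMSD between two (n_atoms, 3) coordinate sets after optimal superposition."""
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ q)
    # Flip the smallest singular value if the best orthogonal map is a reflection.
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    # Minimal squared deviation: |p|^2 + |q|^2 - 2 * sum of (signed) singular values.
    msd = (np.sum(p ** 2) + np.sum(q ** 2) - 2.0 * s.sum()) / len(p)
    return float(np.sqrt(max(msd, 0.0)))

def mean_pairwise_rmsd(trajs):
    """Per-frame mean pairwise RMSD for trajs of shape (n_traj, n_frames, n_atoms, 3)."""
    n_traj, n_frames = trajs.shape[:2]
    pairs = list(combinations(range(n_traj), 2))
    return np.array([
        np.mean([kabsch_rmsd(trajs[i, t], trajs[j, t]) for i, j in pairs])
        for t in range(n_frames)
    ])

# Synthetic stand-in: 5 random-walk trajectories of 9 frames from one start.
rng = np.random.default_rng(0)
start = rng.normal(size=(30, 3)) * 5.0
steps = rng.normal(scale=0.3, size=(5, 9, 30, 3))
steps[:, 0] = 0.0                       # frame 0 is the shared conditioning frame
trajs = start + np.cumsum(steps, axis=1)
diversity = mean_pairwise_rmsd(trajs)   # diversity[0] is zero up to round-off
```

Averaging `diversity` over many starting conditions, and taking the standard deviation across them, would give entries of the same form as the table above.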
Q2: Energy conservation.
In MD and machine learning force fields, directly predicting forces, rather than deriving them from predicted energies, does not guarantee energy conservation and may lead to energy drift or structural collapse over long trajectories. Similarly, models that directly generate future coordinates also lack strict energy conservation.
That said, ConfRover leverages generative models such as ConfDiff that are pretrained on large structural datasets and may therefore incorporate strong priors that help maintain physical plausibility, and the full attention history retained through the KV-cache may mitigate error accumulation. We selected 16 proteins, generated 100 ns trajectories, and evaluated the energies using MadraX to assess whether there is severe energy drift. While energy increases were observed in three cases (6L3R-E, 6JWH-A, 5ZNJ-A), for most proteins (13/16) the potential energy fluctuates but generally remains within a reasonable range.
| Frame range | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 |
|---|---|---|---|---|---|---|---|---|
| MadraX Energy | -598.6 | -607.2 | -609.6 | -582.9 | -579.9 | -596.3 | -595.8 | -573.1 |
Importantly, trajectory generative models primarily aim to capture relevant dynamics or explore conformational space, where strict energy conservation may be less critical as long as the generated conformations are reasonable surrogates of physical states. That said, reliably generating long, physically coherent trajectories might benefit from mechanisms that improve energy conservation and reduce error accumulation: for example, self-correcting deviations using the noise augmentation commonly used in long trajectory generation [1], integrating energy/force-guided sampling [2], or adding energy-based objectives during training [3]. These directions may be valuable for future exploration but are beyond the scope of this work, which focuses on proposing the general ConfRover framework.
[1] Valevski, Dani, et al. "Diffusion models are real-time game engines." arXiv preprint arXiv:2408.14837 (2024).
[2] Wang, Yan, et al. "Protein conformation generation via force-guided SE(3) diffusion models." arXiv preprint arXiv:2403.14088 (2024).
[3] Lu, Jiarui, et al. "Aligning Protein Conformation Ensemble Generation with Physical Feedback." arXiv preprint arXiv:2505.24203 (2025).
Q4: Case studies on altloc conformation changes from Rosenberg et al (2024)
In [4], the authors identify stable alternative conformations from PDB altloc records that are challenging to sample via MD due to steric barriers. These cases were also found to be difficult for existing generative models. To test them, we focus on two cases, 4OLE‑B and 3AZY‑A, which contain substantial (~10 residues) conformational changes involving secondary structure transitions and coil-fold switching.
Leveraging ConfRover’s flexibility, we were able to evaluate two settings: (1) time‑independent sampling from sequence, and (2) starting from a generated conformation near the difficult state and proceeding with sequential forward simulation. ConfRover performed similarly to other generative methods, primarily reproducing the easy-to-sample state. This unsurprising result reflects the use of ConfDiff, which is pretrained on PDB datasets that lack altloc diversity. Interestingly, when starting near the target conformation, trajectories for both cases reverted to the easier conformation, suggesting that ConfRover might implicitly learn the steric energy barrier that separates these states, mirroring MD behavior rather than generating unrealistic transitions across barriers.
We appreciate the reviewer highlighting these cases, as they underscore the need to incorporate altloc information in PDB data as potential augmentation to better sample such conformation states.
[4] Rosenberg, Aviv, et al. "Seeing Double: Molecular dynamics simulations reveal the stability of certain alternate protein conformations in crystal structures" bioRxiv 2024.08.31.610605 (2024)
We sincerely thank the reviewer for their constructive feedback, which helped us improve our evaluation, incorporate comparisons with additional state-of-the-art methods, and conduct important analyses to better understand ConfRover’s performance. We believe these results further strengthen the case for ConfRover’s strong performance and demonstrate its flexibility as a general learning framework for MD trajectory data.
We respectfully ask the reviewer to consider raising the score if our responses have addressed your questions and concerns. Thank you again for your thoughtful review.
I thank the authors for the very detailed responses. Most of the concerns are somewhat addressed (although the empirical evidence does suggest ConfRover is not tremendously better than other existing methods, and dihedral sampling is still far from MD). I will increase my score accordingly.
Thank you for the thoughtful questions, which have helped improve our analysis greatly. We are glad to hear that the new results have addressed your concerns and that you will increase the score.
Dear Reviewer tPch,
Thank you again for your constructive feedback. Just a gentle heads-up in case there is any system malfunction: we received your acknowledgement, but our console still shows that the score has not been adjusted. If you plan to adjust it later, please feel free to disregard this message.
Best regards, The authors
I apologize for the delay; the final score has been updated.
In this paper, the authors propose a new model for the simulation of protein conformations that can perform multiple tasks: time-independent sampling, trajectory simulation and interpolation between two conformations. This is achieved by defining all tasks as conditioned generation, where the conditioning is generated by an encoding layer based on sequence only in the case of sampling or on one or multiple frames for the other tasks. The support for multiple frames is ensured by a specialized trajectory module working on a sequence of frame embeddings with a causal mask. The model is trained on ATLAS, a dataset with a large number of proteins but limited MD time sampling, and is shown to achieve better results than the open-source state of the art on each of the tasks.
Strengths and Weaknesses
Strengths:
- The model is based on validated components in its modular design, e.g. openfold2, confdiff. These models have been shown to work well for their respective tasks.
- The model can generate structures in multiple settings, unifying conformation generation and molecular dynamics trajectory generation, although it is unclear how important it is to unify these.
- The model performs well on the ATLAS dataset, appearing to simulate convincing MD trajectories for some of the proteins outside of the training set according to visual inspection; however, numerical validation is incomplete.
- Introduction of a modern causal transformer for capturing the dependency between frames is a significant improvement for the practitioners in the field.
Weaknesses:
- Although the modular design is great, the modules are computationally very expensive: the encoding layer of the model is based on a pre-trained OpenFold model, which may be limiting for large-protein or multimer use cases. The trajectory module is also based on the pairformer architecture, which relies on triangle operations that are notoriously expensive. This leads to a rather expensive model, making it potentially infeasible for the simulation of larger proteins. Although there are some math libraries coming up for such operations, the cost still should be discussed.
- While the results shown indicate the ability of the model to produce more realistic trajectories than a similar approach, a more rigorous analysis of the trajectories would be needed to convince the reader of the statistical validity of the sampled states. Authors may follow analysis from the papers they cite, e.g. TimeWarp or EquiJump. Although I fully agree that the lack of open-source baselines limits the reproduction of former work and 1-to-1 comparison, there needs to be more analysis on the statistical validity of the MD trajectories. The unfiltered results in Figure 12 in the Appendix appear somewhat lacking in alignment with MD.
- Not a weakness but a comment that authors may want to address in review stage for the general authorship that might share the same opinion as this reviewer - If I understand correctly, there is a strong remark about the unification of non-time-aligned vs time-aligned sampling, i.e. conformer generation vs trajectory generation, being an important contribution, and this being the first model to do so. I am not sure this is as significant as the paper puts it to be. Non-autoregressive trajectory generation models cited in the paper can be easily modified to be conformer generation models as well, since they are based on simple conditioning using an interpolant scheme. If they are trained with interpolation from noise to data frame, rather than previous frame to next frame, these models would function as conformer generators. To repeat, I don’t believe a unified framework is a weakness, but it just may not be the strength it is emphasised to be, especially if the conformer generation model becomes too computationally heavy in order to use the same framework as the trajectory generation model.
Minor:
- Line 187: updateing -> updating
- Line 188: StructualUpdate -> StructuralUpdate
- Line 189: StructralUpdate -> StructuralUpdate
- Line 306: trajecotry -> trajectory
Questions
- It is not very clear to me how the “MD timestep” is inserted in the model. Is it an explicit input to the model, or is the model left to infer it from the previous frames? And in this second case: are a few frames sufficient to univocally infer the desired timeframe? In either case, it would be interesting to see the trajectories generated for different timesteps and compare their statistics.
- What is the exact setting of the multi-frames task? I.e. how many frames are presented to the model and how many frames are generated?
- This may be a miss on my part in appendix but from the main description of the model, it does not seem to be equivariant under rotations of the protein. Can the authors comment on this point? Is any augmentation used on the training dataset?
- For the interpolation task, the trajectories are shown to interpolate “smoothly” between start and end point. However, from the PCA plots, several trajectories seem to pass through potentially unphysical paths. Would it be possible to perform some rough energetic analysis to verify that the intermediate states retain some physical plausibility?
- While some information on model cost is presented in appendix, it would be nice to see this section slightly expanded, especially covering a more direct comparison for specific proteins on the same hardware, since the scaling of this model to larger systems is a concern.
Limitations
yes
Formatting Issues
N/A
We thank the reviewer for acknowledging the significance of our overall design improvements, the strong performance across tasks in transferable settings, and the significant value of introducing the causal transformer for researchers in the field. Here we include our point-by-point response to the questions raised:
Q1: Encoding of MD timesteps.
The MD timestep is encoded in the Llama layer of the TemporalModule using rotary position encoding (RoPE) [1]. Specifically, we represent time as discrete snapshot indices (e.g., 128, 256, 384, ...) corresponding to frame indices (0, 1, 2, ...). RoPE allows the model to learn dynamics across multiple temporal scales and, by encoding relative rather than absolute time, ensures invariance to global time shifts. In our multi-start experiments, we observe that the diversity of generated trajectories increases with larger stride values (see below), highlighting the model's ability to capture scale-dependent dynamics:
| Stride | Pairwise RMSD (Å) |
|---|---|
| 128 | 1.63 |
| 256 | 1.78 |
| 512 | 1.89 |
| 1024 | 2.04 |
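The invariance to global time shifts mentioned above follows from how rotary embeddings enter the attention score. The sketch below uses a generic RoPE formulation in NumPy (interleaved feature pairs, base 10000) and is not the Llama layer used in ConfRover; the choice of snapshot indices as positions is taken from the description above, while the function name and dimensions are assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (last dim even) by angles pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)   # query vector for a later frame (hypothetical)
k = rng.normal(size=8)   # key vector for an earlier frame (hypothetical)

# Attention logit between snapshot 256 (query) and snapshot 128 (key).
score = rope(q, 256) @ rope(k, 128)
# Shifting both snapshots by the same offset leaves the logit unchanged,
# because the rotations compose to a function of (256 - 128) only.
score_shifted = rope(q, 256 + 1000) @ rope(k, 128 + 1000)
```

Changing the stride changes the position difference between neighbouring frames, which is how the same layer can distinguish trajectories sampled at strides of 128, 256, 512, or 1024.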
Q2: Frame settings for training and evaluation.
We train models on sub-trajectories of 8 frames sampled from the original MD trajectories. For evaluation, the number of frames depends on the task: 9 frames for multi-start and conformation interpolation, 80 frames for the ATLAS 100 ns simulation, and one frame for time-independent sampling.
Q3: Model equivariance.
We did not require data augmentation thanks to the use of invariant sequence and structure embeddings, SE(3)-diffusion, and equivariant denoising models. Specifically, the encoding layer is invariant to global translations and rotations: it encodes input structures via pairwise distances between pseudo-beta atoms, and the resulting single and pair embeddings are invariant, enabling invariant reasoning in the TemporalModule. For the decoder, we adopt Invariant Point Attention from AlphaFold, which predicts residue rigid-body updates relative to the input structure, making the denoising network equivariant to global rotations.
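The invariance of a distance-based encoding can be checked numerically. The sketch below applies a random proper rotation and translation to a set of points standing in for pseudo-beta atoms and confirms that the pairwise distance matrix is unchanged; the helper names and the random coordinates are assumptions for illustration only.

```python
import numpy as np

def pairwise_distances(x):
    """All-pairs Euclidean distance matrix for coordinates of shape (n, 3)."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def random_rotation(rng):
    """Random proper rotation matrix (det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of QR
    if np.linalg.det(q) < 0:         # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
coords = rng.normal(size=(40, 3)) * 10.0    # stand-in for pseudo-beta atoms
rot = random_rotation(rng)
shift = rng.normal(size=3) * 50.0
moved = coords @ rot.T + shift              # same structure, new global pose

d_original = pairwise_distances(coords)
d_moved = pairwise_distances(moved)
max_diff = float(np.abs(d_original - d_moved).max())  # zero up to round-off
```

Because every feature derived from `d_original` is identical for any global pose, a network consuming only such features sees the same input regardless of how the structure is oriented, which is why rotation augmentation is unnecessary on the encoder side.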
Q4: Additional quality metrics.
Thank you for the suggestion. We have added detailed structural quality and energy evaluation of generated intermediate states. Specifically, we used MolProbity [2] to assess geometric accuracy and MadraX [3] for heavy-atom-only energy evaluation. To verify whether interpolation tasks still generate physically plausible conformations, we evaluated the intermediate conformations (excluding the first and last frames), labeled ConfRover-interp, and compared them to conformations from forward simulations (without terminal constraints, labeled ConfRover-fwd) and reference MD simulations (oracle). All structures were processed using constrained local relaxation and evaluated with the default settings of MolProbity and MadraX. We found such relaxation necessary to obtain accurate energy evaluations, even for MD references. Results are summarized in the following table (all values reported as mean ± std; energy values as 95% percentile ranges):
- Comparing ConfRover-interp to ConfRover-fwd, we observe similar structural quality, suggesting that interpolated conformations are physically realistic. Slightly higher energy in interpolated samples may reflect the presence of regions with energy barriers.
- Other comparisons show that ConfRover generates high-quality conformations. ConfRover achieves comparable performance to MD references in backbone dihedral (Ramachandran), bond geometry, and overall energy. The largest discrepancy lies in clash scores, which depend on ad hoc hydrogen addition and may bias evaluation of heavy-atom-only models. ConfRover consistently outperforms MDGen across all metrics.

| exp | Ramachandran outliers % | Ramachandran favored % | Rotamer outliers | Clashscore | RMS(bonds) | RMS(angles) | MolProbity score | MadraX Energy |
|---|---|---|---|---|---|---|---|---|
| MD Reference | 0.38±0.49 | 97.41±1.49 | 1.02±0.89 | 0.04±0.16 | 0.01±0.00 | 1.88±0.05 | 0.72±0.18 | -519.3 (-1793.0, -53.4) |
| MDGen | 0.93±0.86 | 94.98±2.04 | 2.86±1.59 | 16.14±20.05 | 0.02±0.02 | 2.13±0.30 | 2.24±0.40 | -314.7 (-1483.8, 263.6) |
| ConfRover-fwd | 0.58±0.63 | 96.93±1.45 | 1.98±1.48 | 7.81±6.52 | 0.01±0.01 | 1.88±0.25 | 1.72±0.38 | -522.2 (-1858.9, -53.4) |
| ConfRover-interp | 0.71±0.94 | 96.90±1.95 | 1.86±1.46 | 7.25±8.74 | 0.02±0.01 | 1.91±0.32 | 1.61±0.51 | -469.7 (-1712.3, -42.8) |
Q5: inference time across different protein sizes.
We measured the wall-clock time (minutes) required to generate 100 ns ATLAS trajectories (80 frames) for proteins of varying sizes and report the average inference time per size bucket. For comparison, we also selected a representative protein from each bucket and estimated the time required to simulate 100 ns using OpenMM with implicit solvent. As shown in the table below, ConfRover achieves at least a 10x speedup over implicit-solvent MD simulations. The acceleration becomes more pronounced as protein size increases, suggesting ConfRover might be more advantageous for larger proteins. Notably, the cost of generating additional frames remains nearly constant due to our autoregressive generation with KV-caching, which eliminates redundant computation of attention activations.
| seqlen | (0,150) | [150,300) | [300,450) | [450,600) | [600, 724] |
|---|---|---|---|---|---|
| ConfRover | 6.99 | 7.53 | 10.92 | 15.88 | 20.83 |
| MD | 104.54 | 207.92 | 386.69 | 651.29 | 1099.13 |
| Speedup | 14.95 | 27.61 | 35.41 | 41.01 | 52.77 |
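To make the KV-caching argument concrete, here is a minimal single-head sketch in NumPy, not ConfRover's temporal module. It compares full causal attention over all frames with incremental generation that stores keys and values from earlier frames; the weight matrices, dimensions, and class name are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, wq, wk, wv):
    """Full causal self-attention over all frames, recomputed from scratch."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # frames cannot see the future
    return softmax(scores) @ v

class KVCache:
    """Stores keys and values of past frames so they are projected only once."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, wq, wk, wv):
        # Only the new frame is projected; earlier frames are read from the cache.
        q = x_t @ wq
        self.keys.append(x_t @ wk)
        self.values.append(x_t @ wv)
        k, v = np.stack(self.keys), np.stack(self.values)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return weights @ v

rng = np.random.default_rng(0)
n_frames, dim = 6, 4
frames = rng.normal(size=(n_frames, dim))      # stand-in frame embeddings
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

full = causal_attention(frames, wq, wk, wv)
cache = KVCache()
incremental = np.stack([cache.step(f, wq, wk, wv) for f in frames])
outputs_match = bool(np.allclose(full, incremental))
```

In this sketch each new frame needs one set of projections plus a dot product against the cached keys, so no work on earlier frames is repeated, whereas recomputing from scratch would reprocess the whole history at every step.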
Weakness 1: Heavy architecture.
The detailed computational cost of applying ConfRover to larger proteins is summarized in our response to Q5. We agree that standard architectures in the field, such as Pairformer, are computationally intensive. However, designing efficient neural architectures for protein structure modeling remains a broad challenge in the field. In this work, our focus is on proposing a framework for learning protein dynamics, rather than advancing neural architectural innovations. Therefore, we adopt proven architectures to build ConfRover, which demonstrates competitive performance, scalability to proteins with over 700 amino acids, and reasonable acceleration compared to MD. As the reviewer noted, recent acceleration libraries such as cuEquivariance [4] could be beneficial for further speeding up our model, and we will consider integrating them in future work.
Weakness 2: Lack of statistical analysis of dynamics.
Unlike Timewarp or EquiJump, which focus on small systems with well-equilibrated dynamics, our work specifically targets modeling dynamics for large proteins in transferable settings. As noted in MDGen, trajectories in ATLAS are not fully equilibrated, making it difficult to take the Markov state models (MSMs) approach for evaluating long-term dynamics.
Due to this limitation, we additionally designed experiments and metrics to assess whether the model captures dynamic changes across different time scales and initial conditions (multi-start benchmark) and whether major dynamic modes are correctly reflected (as in the tICA analysis in Figure 5). While these metrics are not perfect, we believe they provide statistical insights into the dynamic behavior of the models and are effective in revealing some limitations of current state-of-the-art approaches.
Weakness 3: Significance of a unified framework.
A unified framework for protein conformational dynamics is appealing for several reasons. First, tasks like forward simulation, interpolation, and time-independent sampling share similarities and are therefore natural to address in a single framework. Second, both time-dependent and time-independent conformations stem from the same underlying distribution, as also noted in prior work [5], supporting joint learning. Third, our framework’s ability to leverage both structures and trajectories enables flexible training and paves the way for foundation models in protein conformations.
While alternative approaches like stochastic interpolation are possible, our early experiments with them were unsuccessful. In contrast, our autoregressive LLM + diffusion model performs robustly across tasks and offers a strong basis for learning conformational dynamics.
Finally, we clarify that ConfRover adds minimal overhead for structure generation: the temporal module is lightweight, and for time-independent tasks, it reduces to a simple MLP. This design enables dynamic modeling with little additional cost over the conformation diffusion model.
[1] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
[2] Williams CJ, Headd JJ, Moriarty NW, et al. MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 2018;27(1):293-315.
[3] Orlando G, Serrano L, Schymkowitz J, Rousseau F. Integrating physics in deep learning algorithms: a force field as a PyTorch module. Bioinformatics. 2024;40(4):btae160.
[4] cuEquivariance: https://docs.nvidia.com/cuda/cuequivariance/
[5] Arts M, Satorras VG, Huang CW, et al. Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics. http://arxiv.org/abs/2302.00600
Dear Reviewer,
Thank you again for your time and valuable feedback. We hope our responses have clarified key points and provided further evidence that:
- ConfRover properly encodes temporal information and can generate trajectories on different stride scales.
- ConfRover generates high quality conformations in both forward simulation and interpolation.
- ConfRover offers clear accelerations over MD simulation, particularly for larger proteins.
Please let us know if there is any other information we could provide to further strengthen the work. If you feel that our responses have resolved your original concerns, we would sincerely appreciate it if you could consider updating your scores and rating based on the new results.
Best regards,
The authors
(5,5,5,5) This paper introduces ConfRover, a generative model for protein conformations learned from MD data. The model explicitly captures temporal dependencies via an autoregressive framework that unifies time-independent sampling, forward simulation, and conformational interpolation. All reviewers were positive, noting strong performance on the ATLAS benchmark (100 ns MD simulations), where ConfRover outperforms other learning-based approaches and achieves substantial speedups over classical MD, though still falling short of full physical fidelity. The rebuttal added thorough structural/energy evaluations, runtime analysis, and new baselines.