Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
Abstract
Reviews and Discussion
This paper proposes LD-FPG, a latent diffusion model for full-atom protein generation. Specifically, LD-FPG uses a Chebyshev graph neural network to obtain low-dimensional latent embeddings of protein conformations. A diffusion model is then trained on those latent representations to generate new samples that can be decoded back to Cartesian coordinates. The model reproduces the reference ensemble with high structural accuracy for both backbone and sidechains.
Strengths and Weaknesses
Strengths:
- This work is novel in its use of a latent diffusion model to generate all-atom protein conformational ensembles.
- It designs a graph-based autoencoder along with three decoder pooling strategies.
- LD-FPG can generate high-fidelity protein conformational ensembles.
Weaknesses:
- It is not very clear how the protein length is guaranteed to be equal for each generated conformation for a specific protein.
- The model architecture is ChebNet, which is a relatively old and simple network. Authors should elaborate more on why choosing ChebNet as the backbone model instead of more recent geometric neural networks.
- The model is not SE(3) equivariant so it doesn’t satisfy physical constraints when learning the protein distribution. For example, how to avoid generating proteins with left-handed alpha helix?
- Evaluation only uses one dataset for one type of protein. It would be better to incorporate more benchmarks to comprehensively evaluate the effectiveness of the method.
Questions
- Please refer to the weaknesses above.
Limitations
Limitations are discussed in the paper.
Final Justification
Most of my concerns are addressed. The only minor concern is about the model choice. This paper uses ChebNet and does not have a comparison with other geometric neural networks. However, the authors provide some discussion about the reason for choosing ChebNet, and they also committed to adding a direct comparison with other geometric neural networks in the final manuscript. So I increased my score to borderline accept.
Formatting Issues
No formatting issue.
Dear Reviewer pf7X,
Thank you for your thoughtful feedback. Below we address your main points.
1. Ensuring Protein Length Consistency
This is a very perceptive question. For the work presented in this paper, which focuses on a single, specific protein, the length is inherently fixed. The model is trained on the MD trajectory of one protein with a constant number of heavy atoms (N), and all architectural components are sized accordingly. The key mechanism is the decoder's conditioning on the reference structure's latent embedding, which has a fixed dimension of N, ensuring that every generated structure has the exact same number of atoms.
Your question correctly points toward the broader challenge of generalizing to proteins of variable lengths. While beyond the scope of this paper, we have developed advanced capabilities to handle this, which we will detail in an appendix. We have successfully created multi-receptor models using two main strategies (For your reference, the code for these advanced models is available in our repository now under the conditional_generation and multi_receptor folders):
- A Shared Decoder: This version trains a single decoder across multiple receptors. It handles variable lengths by using a fixed-size pooled latent vector for the diffusion model, while the decoder is informed by a variable-length conditioner whose size matches the specific protein being generated.
- Conditional Diffusion: This approach guides the diffusion process with a fixed-size protein "fingerprint". To create this, a 1D-CNN maps each variable-length reference embedding to a compact, fixed-size vector that summarizes the protein's reference structure.
Training these advanced models effectively required a robust data augmentation strategy to regularize the potentially sparse, high-dimensional latent space. Our approach makes the learned manifold smoother and more robust. We use two techniques: first, Principled Structural Augmentation, where we use the DSSP algorithm to identify flexible loops and then delete short segments to create new variants with different lengths. Second, Conformational Augmentation, where we add calibrated Gaussian noise to the reference coordinates to enrich the dataset with near-native structural variations.
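For concreteness, the conformational-augmentation step can be sketched as follows. This is a minimal illustration, not our exact implementation: the function name and the noise scale `sigma` are hypothetical, and the calibrated value used in our experiments is not reproduced here.

```python
import numpy as np

def conformational_augment(coords, sigma=0.02, n_copies=4, seed=0):
    """Enrich a reference conformation with near-native variants by
    adding calibrated Gaussian noise to heavy-atom coordinates.

    coords : (N, 3) array of reference coordinates (nm).
    sigma  : per-coordinate noise scale (nm); an illustrative value.
    Returns an (n_copies, N, 3) array of perturbed conformations.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_copies,) + coords.shape)
    return coords[None, :, :] + noise
```

In practice the noise scale would be calibrated so that the perturbed structures remain within the thermal fluctuation range of the training ensemble.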
2. Why ChebNet
We chose ChebNet as our encoder backbone precisely because its spectral, multi-hop convolution inherently captures the long-range dependencies critical for modeling proteins with complex allosteric pathways. Below, we elaborate on why ChebNet remains well-suited and in some respects preferable to more recent Message-Passing Neural Networks (MPNNs) in our setting.
Intrinsic Multi-Hop Aggregation vs. Local Message Passing: ChebNet’s spectral filters are implemented via Chebyshev polynomial approximations of the graph Laplacian of order K. In a single layer, a ChebNet filter of order K aggregates information from neighborhoods up to K hops away, enabling it to transmit signals across distant residues without stacking many layers. In contrast, standard MPNNs propagate information only through immediate neighbors per layer. To reach K-hop neighbors, they must stack multiple layers, which empirically leads to over-smoothing (node representations become indistinguishable) and over-squashing (long-range signals are compressed into fixed-size messages). These bottlenecks are especially problematic in proteins, where allosteric effects require faithful transmission of subtle conformational changes.
Recent Evidence of ChebNet’s Long-Range Strength: Recent work introduces principled metrics for measuring distant-node influence and confirms that models like ChebNet exhibit higher long-range dependency scores compared to local message-passing methods [1].
Pragmatism and Empirical Validation: For systems like GPCRs, the most biologically relevant motions are internal conformational changes, not global rigid-body movements. Our "align-then-learn" strategy, using a non-equivariant but powerful GNN like ChebNet, allows the model to focus on these critical dynamics while avoiding the computational complexity of strictly SE(3)-equivariant models. Our empirical results validate this, as the raw ChebNet embeddings achieve extremely low reconstruction errors, confirming that both local and global features are captured.
In summary, while ChebNet is a foundational GNN, its multi-hop spectral convolution remains uniquely suited to protein systems where long-range interactions are crucial, bypassing the message-passing bottlenecks that can limit more recent MPNN architectures.
[1] Liang et al. (2025). Towards Quantifying Long-Range Interactions in Graph ML. arXiv preprint.
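To make the multi-hop property concrete, here is a minimal NumPy sketch of a single Chebyshev spectral filter. It is illustrative only (our actual encoder is a learned, multi-layer ChebNet): a single layer of order K mixes features across K-hop neighborhoods via the recurrence T_0 = I, T_1 = L̃, T_k = 2 L̃ T_{k-1} − T_{k-2}.

```python
import numpy as np

def scaled_laplacian(A):
    """Scaled normalized Laplacian L~ = 2L/lmax - I used by ChebNet."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    lmax = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lmax - np.eye(len(A))

def cheb_conv(A, X, thetas):
    """One ChebNet layer: sum_k theta_k * T_k(L~) X.
    A filter of order K = len(thetas) - 1 reaches K-hop neighbors."""
    Lt = scaled_laplacian(A)
    Tk_prev, Tk = X, Lt @ X          # T_0 X and T_1 X
    out = thetas[0] * Tk_prev
    for k in range(1, len(thetas)):
        out = out + thetas[k] * Tk
        Tk_prev, Tk = Tk, 2.0 * Lt @ Tk - Tk_prev  # Chebyshev recurrence
    return out
```

On a 5-node path graph, an order-1 filter cannot propagate a feature from node 4 to node 0, while an order-4 filter can, which is the multi-hop behavior discussed above.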
3. On SE(3) Equivariance and Physical Constraints
You are correct that our model is not strictly SE(3) equivariant, which was a deliberate design choice. Our "align-then-learn" approach is highly effective for this problem and ensures the generation of physically realistic structures through two mechanisms:
- Isolating the Signal of Interest: By aligning all MD frames, we remove global rotation and translation, allowing the GNN to focus exclusively on learning the landscape of internal deformations (e.g., helix movements), which are the motions directly relevant to biological function.
- Implicit Physical Priors from Conditioning: Our framework inherently respects physical constraints like chirality because the decoder learns to predict conformations by applying a small, learned deformation to the latent embedding of a physically-correct reference structure. To address your example, it is highly unlikely for our model to generate a left-handed alpha helix, as this would require overcoming a massive energy barrier not present in the training data. The model learns the distribution of plausible, low-energy dynamics, and catastrophic, chirality-inverting changes are not part of that distribution. Across the 10 k generated structures we observed zero Ramachandran or chirality violations.
We will add a paragraph to the Methodology section to clarify this design choice and its rationale.
4. On Broader Evaluation and Benchmarking
This is an excellent point. Our original manuscript focused on an in-depth model development for the challenging D2R system. However, we agree that broader evaluation is crucial to contextualize our contributions. In response to the reviews, we have expanded our evaluation in two key ways. To be clear, all analyses presented here are based on MD simulations completed prior to the submission deadline; our work during the rebuttal period focused exclusively on benchmarking and further analysis of this pre-existing data.
First, we benchmarked our model against leading generators, including AlphaFlow, BioEmu, and Boltz-2 (MD-conditioned). This comparison revealed a critical difference between our ensemble-focused approach and the more static predictions of the baselines.
| Model | Backbone JSD (φ, ψ) ↓ | Backbone lDDT ↑ | Backbone TM-score ↑ | Backbone RMSF (nm) ↔ |
|---|---|---|---|---|
| LD-FPG (Ours) | 0.007 | ~0.80 | ~0.96 | 1.22 |
| BioEmu | ~0.022 | 0.999 | 0.925 | 0.09 |
| AlphaFlow | ~0.023 | 0.859 | 0.993 | 0.84 |
| Boltz-2 (MD-cond) | ~0.034 | 0.997 | 0.975 | 0.07 |
| Ground Truth MD | (Ref) | (Ref) | (Ref) | 1.34 |
All baseline metrics were recomputed with the authors’ public code.
Our analysis revealed that the baselines produce high-fidelity but relatively static structures. The combination of near-perfect lDDT scores with extremely low RMSF proves they fail to capture the protein's native flexibility. BioEmu’s rigidity was particularly surprising given its broad training data, highlighting how generalist models can miss system-specific dynamics.
In contrast, LD-FPG excels in every category relevant to ensemble generation. It not only achieves superior local geometric accuracy (with a backbone JSD of 0.007, 3-5x lower than the baselines) but also almost perfectly reproduces the ground-truth flexibility (RMSF of 1.22 nm vs. 1.34 nm).
Crucially, our framework is unique in its all-atom generation capability. This allows us to quantify side-chain accuracy, where we achieve a low summed JS divergence for χ angles (~0.022) and an all-atom RMSF (1.36 nm) that closely matches the ground truth (1.60 nm) — metrics the other frameworks cannot report.
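For transparency about the dihedral JSD metric used throughout these tables, a histogram-based estimator can be sketched as follows. The bin count and log base here are illustrative assumptions and not necessarily the settings used in our evaluation pipeline.

```python
import numpy as np

def dihedral_jsd(angles_p, angles_q, n_bins=60):
    """Jensen-Shannon divergence (base 2) between two samples of
    dihedral angles, estimated from histograms over [-pi, pi).
    Returns a value in [0, 1]; 0 means identical distributions."""
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    p, _ = np.histogram(angles_p, bins=bins)
    q, _ = np.histogram(angles_q, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0          # m >= a/2 > 0 wherever a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A per-residue JSD for φ, ψ (or χ) angles can then be averaged or summed across residues, which is how summed divergences like the ~0.022 figure above are typically aggregated.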
Second, to demonstrate that our framework's principles are robust, we will now include analysis from our broader validation studies, which were performed during the initial development of our model. This includes other Class A GPCRs (D1, A2A, β1-adrenergic) as well as proteins from different structural classes (α-helical, β-sheet, and mixed α/β folds). Across these diverse systems, our framework consistently generated high-fidelity ensembles, achieving all-atom lDDT scores in the range of 0.68–0.78 and low total JS divergences between 0.007 and 0.014. These studies will be presented in a dedicated appendix. This approach allows us to add the vital D2R benchmark to the main text while maintaining the paper's focused narrative.
These new results, which will be incorporated into the revised manuscript, demonstrate that our method is both competitive with state-of-the-art baselines and broadly applicable across different protein architectures.
We believe the detailed rationale above addresses the concern you raised. Given this in-depth information, we were hopeful you might be willing to reconsider your final evaluation. As mentioned in our previous response, we also remain committed to adding a direct comparison to the final manuscript if you feel it remains critical.
Best Regards,
Authors of Submission 28539
Thanks for the rebuttal. Most of my concerns are addressed. However, it is still not convincing that ChebNet is the best choice without an explicit comparison with other geometric neural networks, such as SchNet, PaiNN, or Equiformer. Thus I will maintain my original score.
Dear Reviewer pf7X,
Summary of response: Our ChebNet encoder is already near-lossless (8 × 10⁻⁴ MSE), while the full LD-FPG pipeline’s error after pooling and decoding is ~1 × 10⁻¹ MSE; thus the encoder is not the bottleneck. Swapping to SchNet, PaiNN, or Equiformer would increase cost without closing that two-order-of-magnitude gap.
Detailed response
Thank you for your follow-up and for specifying your remaining concern. We appreciate the opportunity to provide a more detailed rationale for selecting ChebNet and agree that an explicit comparison to other geometric GNNs is a valid point of discussion.
The most critical empirical fact is that the ChebNet encoder is demonstrably not the performance bottleneck in our framework. Our ChebNet-based autoencoder reconstructs every test frame with a backbone coordinate Mean Squared Error (MSE) as low as 8×10⁻⁴ (Appendix I, Table 6), whereas the end-to-end generative MSE of the full LD-FPG pipeline is approximately 1×10⁻¹, two orders of magnitude larger. This shows that the encoder is already near-lossless and that the final error originates downstream, from the deliberate information compression in the pooling step and the inherent stochasticity of the diffusion model. Replacing the already near-perfect encoder would therefore not materially improve the final accuracy of the generative framework.
Secondly, our primary goal was to introduce and validate a novel framework for generating all-atom conformational ensembles via latent diffusion. The main contributions are the overall pipeline and the in-depth analysis of the pooling strategies that make diffusion in this context tractable. Consequently, architectural choices for components like the encoder were driven by their ability to robustly support this main objective, rather than an effort to innovate on GNN architecture itself. We selected a GNN that was efficient, scalable, and powerful enough for the task, allowing us to focus on the novelty of the generative workflow.
Thirdly, ChebNet offers distinct advantages for modeling the long-range, allosteric couplings that govern GPCR function. Its spectral nature allows a single layer with polynomial order K to aggregate information from a K-hop neighborhood, efficiently capturing dependencies between distant residues without the deep stacks required by iterative message-passing networks like SchNet or PaiNN, which can suffer from over-smoothing and signal degradation (Alon & Yahav, 2021). Our "align-then-learn" strategy, which removes global rotations and translations, allows the GNN to focus exclusively on these functionally relevant internal deformations, where ChebNet's ability to model the global graph structure is paramount.
Finally, our choice was guided by the practical demands of modeling a large, all-atom system like the D2R (~2,200 nodes), where the computational profile of the GNN is a major consideration. While models like SchNet and Equiformer are powerful, their application to protein graphs of this scale often involves significant compromises. For instance, message-passing networks are typically applied to smaller systems (<600 atoms), and scaling them requires strict distance cutoffs that can limit the capture of global allostery. The challenge is even more pronounced for equivariant Transformers like Equiformer, whose O(N²) attention mechanism is computationally prohibitive at this scale without aggressive sparsity or coarse-graining. In contrast, ChebNet's complexity, which scales linearly with graph size, offered the best balance, allowing us to model the entire all-atom system efficiently without such approximations.
We hope our detailed explanation has convincingly justified our architectural choice. We fully respect your position, however, that an explicit comparison provides the strongest form of evidence. While we stand by our current analysis, we are committed to improving the paper in any way you feel is necessary. If you believe this comparison remains critical to the paper's contribution even now, we will certainly add this analysis for the final manuscript.
Dear Reviewer pf7X,
We know this is a very busy time as the discussion period ends. We just wanted to gently check if you had a chance to see our detailed, multi-part response regarding the ChebNet architecture that we posted yesterday.
We hope it provided a convincing rationale for our choice.
Thank you again for your time and feedback.
Thank you for the detailed explanation of ChebNet; I am somewhat convinced about the model choice. Also, given that the authors committed to adding direct comparison with other geometric neural networks in the final manuscript, I will adjust my recommendation to accept.
Dear Reviewer pf7X,
Thank you so much for your positive follow-up and for your willingness to reconsider your recommendation. We are delighted to hear that our detailed rationale was helpful.
We also want to explicitly confirm our commitment: the direct comparison with other geometric neural networks will absolutely be included in the final manuscript. We are confident it will make the final version of our paper much stronger.
Thank you once again for your thorough and constructive review.
This paper introduces "Latent Diffusion for Full-Atom Protein Generation" (LD-FPG), a novel framework designed to generate all-atom conformational ensembles for a specific protein, leveraging data from molecular dynamics (MD) trajectories. The core methodology employs a three-stage architecture: (i) encoding all-atom coordinates into a latent space using a spectral GNN; (ii) compressing and creating a tractable input for a diffusion model using one of three distinct pooling strategies; and (iii) training a latent diffusion model for all-atom reconstruction and inference.
The authors' central argument is that the high-fidelity modeling of side-chain dynamics is crucial for applications like GPCR drug discovery, and they effectively demonstrate the trade-offs between global structural accuracy and local side-chain detail that arise from different pooling strategies.
Strengths and Weaknesses
Strengths
- Well-motivated problem: The paper addresses the relevant problem of generating all-atom conformational ensembles, correctly highlighting the importance of side-chain dynamics for functional understanding, particularly in systems like GPCRs.
- Clear Comparisons of Pooling Strategies: The analysis comparing the three pooling strategies is methodical and provides a clear view of the architectural trade-offs between global and local fidelity.
- Discussion of Limitations: The authors are forthcoming about the model's limitations, such as the generation of steric clashes and its system-specific nature, and they propose reasonable directions for future improvements.
- Commitment to Open Science: The plan to release the dataset will undoubtedly benefit the community and serve as a valuable resource that fosters further innovation.
Weaknesses
- Limited Generalizability and Experimental Design: The framework is trained and evaluated on the MD trajectories of a SINGLE protein. While the choice of a GPCR is ambitious, the 90%/10% train/test split on a single continuous trajectory means the model functions more as a sophisticated interpolator/extrapolator than a true ensemble generator. Its ability to sample conformations from regions of the energy landscape that were sparsely populated in the training data remains untested. Assessing the model's performance on genuinely unseen proteins (e.g., train and test on datasets like ATLAS [1] or MISATO [2]) would be necessary to make any claims about the model's applicability. It might also be compelling to demonstrate the model's utility in a few-shot learning paradigm: for instance, training on a small fraction of the trajectory (e.g., 10%) and testing its ability to reproduce the full ensemble, thereby showcasing its potential to accelerate conformational sampling compared to running a full MD simulation.
- Insufficient Baseline Comparisons: The paper does NOT compare LD-FPG against relevant, state-of-the-art ensemble generation methods (e.g., AlphaFlow [3], BioEmu [4], ConfDiff [5]). While the authors' focus is on all-atom generation, a direct comparison against these primarily backbone-focused generators is essential to quantitatively substantiate the claimed benefits of the LD-FPG approach. Moreover, the brief comparison against BioEmu using the A100 activation index is underdeveloped and lacks adequate explanation (see Question 2 below).
[1] ATLAS: protein flexibility description from atomistic molecular dynamics simulations
[2] MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery
[3] AlphaFold Meets Flow Matching for Generating Protein Ensembles
[4] Scalable emulation of protein equilibrium ensembles with generative deep learning
[5] Protein Conformation Generation via Force-Guided SE(3) Diffusion Models
Questions
- High-resolution modeling of membrane proteins is notoriously difficult, and the significant effort involved in generating the MD trajectory for a complex membrane protein like D2R should be appreciated. However, how do you validate that the cryo-EM starting model (subsequently rebuilt with Rosetta) is sufficiently accurate before launching LD-FPG? Could any quality metrics be reported to validate this accuracy? Second, the MD trajectory that seeds the framework is 2 µs long: what considerations determined this length, and do you believe it adequately samples the relevant metastable states for a GPCR of this size? A short discussion anchored by MD studies on comparable membrane proteins would be helpful.
- BioEmu’s original paper does not include D2R in its benchmark set. Did you fine-tune BioEmu on your newly curated D2R trajectory, or is the comparison strictly zero-shot? Otherwise, the statement on Line 327 that “generalist approaches underperform for specific membrane proteins” may be insufficiently supported.
- Could you add backbone and side-chain RMSF (root-mean-square fluctuation) profiles for the generated ensembles? RMSF is widely used to quantify conformational flexibility and would complement the lDDT and dihedral statistics already reported.
- What concrete architecture is used for the coordinate decoder (Line 168)?
- Figure 1 conflates the reconstruction path (autoencoder training) and the generative path (diffusion sampling). It is difficult to discern which arrows correspond to training versus inference.
Limitations
Yes.
Final Justification
Most concerns have been addressed and clarified by the authors, so I have adjusted my final score.
Formatting Issues
No.
Dear Reviewer iANt,
Thank you for your insightful feedback. We address each point below.
Weakness 1: Generalizability and Experimental Design
We agree it's critical to confirm our model genuinely learns a conformational manifold rather than simply interpolating between existing structures (Note: no new simulations were performed; all discussions were part of our original model validation).
1. Beyond Interpolation
To confirm this, we carried out two additional analyses:
- Learning from Sparse Data: Trained on just 40% of the MD trajectory, LD-FPG generated a high-fidelity ensemble. This demonstrates that the model learns a continuous and generalizable latent manifold from sparse data, rather than merely interpolating between dense training points.
- Generating Unseen Transition States: LD-FPG was trained on a metadynamics simulation with sparsely populated transition states between the active and inactive forms. Despite training on this imbalanced data, our model successfully generated accurate conformations along the entire sparsely sampled transition path.
2. Generalization Across Protein Systems
We broadened evaluation to additional proteins:
- Large and Structurally Diverse Folds: We have applied LD-FPG to a set of proteins from the ATLAS database. This includes large, complex systems like cytochrome P450 (~500 residues) and proteins with entirely different architectures, such as primarily α-helical, primarily β-sheet, and mixed folds.
- Different Protein Families and Processes: We extended our validation to other Class A GPCRs (D1, A2A, β1-adrenergic) and even modeled the entire folding trajectory of the TRP-cage miniprotein, demonstrating the framework can capture dynamic processes beyond equilibrium sampling.
- Exploring Generalization: As a preliminary exploration of generalization, we briefly tested an advanced model with a shared decoder. This model was able to generate plausible structures for a sixth, previously unseen receptor, suggesting our framework has strong potential for zero-shot applications.
Across all these diverse and larger systems, our framework consistently generated high-fidelity ensembles, achieving all-atom lDDT scores in the range of 0.68–0.78 and low total JS divergences between 0.007 and 0.014.
This new validation, which will be detailed in the appendix, confirms that the geometric principles of LD-FPG are not limited to a single system but represent a robust and scalable approach to all-atom ensemble generation.
3. Our Core Contribution and Vision
Your review correctly identifies the grand challenge of creating a universal all-atom generator. The primary goal of this paper is to establish a foundational geometric engine for a single, complex system which is a prerequisite for a universal model. We fully agree that the path toward a model that works on any unseen protein will require integrating chemical and sequence information. A promising future direction is to condition our geometric engine on rich embeddings from models like ESMFold or AlphaFold, which capture this amino acid identity.
Weakness 2: Insufficient Baseline Comparisons
This is a crucial point, and we agree that a direct comparison to state-of-the-art methods is essential to contextualize our contributions. In response to the reviews, we have performed an extensive new benchmark comparing LD-FPG against leading models: AlphaFlow, BioEmu, and the all-atom Boltz-2 (with MD conditioning).
| Model | Backbone JSD (φ, ψ) ↓ | Backbone lDDT ↑ | Backbone TM-score ↑ | Backbone RMSF (nm) ↔ |
|---|---|---|---|---|
| LD-FPG (Ours) | 0.007 | ~0.80 | ~0.96 | 1.22 |
| BioEmu | ~0.022 | 0.999 | 0.925 | 0.09 |
| AlphaFlow | ~0.023 | 0.859 | 0.993 | 0.84 |
| Boltz-2 (MD-cond) | ~0.034 | 0.997 | 0.975 | 0.07 |
| Ground Truth MD | (Ref) | (Ref) | (Ref) | 1.34 |
Interpretation of Results
- Superior Local Geometry: Our model achieves a backbone dihedral JSD of 0.007, outperforming all baseline methods by a factor of 3-5x. This demonstrates superior accuracy in capturing the local geometric preferences of the protein.
- Baselines Generate Static Structures, Not Ensembles: The baselines produce high-fidelity but static structures. While they achieve near-perfect lDDT/TM-scores, their Root-Mean-Square Fluctuation (RMSF) is extremely low. This combination of high fidelity and low flexibility proves they fail to capture the protein's native dynamics. For an ensemble generator, an lDDT score near 1.0, when paired with low RMSF, confirms a lack of diversity.
- LD-FPG Correctly Models Dynamic Flexibility: In sharp contrast, our model's average RMSF (1.22 nm) almost perfectly matches the ground truth (1.34 nm). This is the key result: LD-FPG successfully learns and reproduces the correct conformational diversity of the ensemble.
- The All-Atom Advantage: Crucially, our framework is unique in its all-atom generation capability. This allows us to quantify side-chain accuracy, where we achieve a low summed JS divergence for χ angles (~0.022) and an all-atom RMSF (1.36 ± 0.18 nm) that closely matches the ground truth (1.60 ± 0.32 nm), metrics the other frameworks cannot report. This directly substantiates the benefits of our approach.
This new analysis shows that LD-FPG achieves state-of-the-art local geometric accuracy while correctly modeling the entire flexible, all-atom conformational ensemble—a more challenging problem than static prediction.
Q1
Part 1: Validation of the Starting Structure
Our starting structure was rigorously validated to ensure its accuracy before the production simulations:
- High-Quality Template: The model is based on a 2.96 Å cryo-EM structure (PDB 6VMS) with an excellent validation report (MolProbity score: 1.8; Ramachandran favored: 97%).
- Minimal Refinements: Only minor refinements (≤3% of residues in loops) were performed with RosettaRemodel, preserving the transmembrane (TM) core's structure (Cα RMSD of 0.38 Å to the cryo-EM map).
- Orthogonal Check: An independent check against a de novo AlphaFold 3 prediction shows high agreement (TM-core RMSD of 1.0 Å), further validating the starting coordinates.
Part 2: Sufficiency of the 2 µs MD Trajectory
The 2 µs simulation length balances computational cost with the need to sample a rich landscape.
- Sufficient Timescale: The 2 µs trajectory extensively samples fast-to-intermediate dynamics (ns-µs), including the full range of side-chain rotations, loop flexibility, and helix movements within the sampled active-state basin.
- Community Standards: This simulation length is consistent with many high-impact GPCR studies and community resources like GPCRmd. It provides a deep sample of the active-state ensemble, which is our primary goal for training.
- Learning Goal: Our framework is designed to learn the conformational landscape provided by the input data. The 2 µs run provides a rich training set, and our new metadynamics experiment confirms the model can also learn from sparse transition pathways when they are present in the data.
Q2
The comparison was strictly zero-shot, as the public BioEmu release does not include code for fine-tuning on new trajectories. Such fine-tuning would be a massive but interesting research project in its own right. We acknowledge our initial A100 index comparison was brief; it has been superseded by the comprehensive benchmark analysis detailed in our response to Weakness 2. This new data reinforces our central point: generalist models, while powerful, may not capture the specific dynamics of a challenging target like D2R without system-specific training, highlighting the value of our specialized approach.
Q3
This is an excellent suggestion. We have now incorporated a full RMSF analysis into our benchmark table and the corresponding "Interpretation of Results" (under Weakness 2). As detailed there, this analysis confirms that our model correctly reproduces both the backbone and all-atom flexibility of the ground truth MD ensemble—a capability the more rigid, backbone-focused baselines lack.
Q4
The coordinate decoder's architecture consists of two main components:
- A Pooling Module: This module takes the high-dimensional, atom-wise latent embeddings and compresses them into a single, low-dimensional context vector. As detailed in the paper, we explore three distinct pooling architectures: Blind, Sequential, and Residue-based.
- A Multi-Layer Perceptron (MLP): This MLP takes the pooled context vector and the latent embedding of the reference structure as input to predict the final 3D coordinates for every atom.
For example, the Blind Pooling decoder used for our main results uses an MLP with 12 layers and a hidden dimension of 128, with ReLU activations and BatchNorm. Full architectural details and hyperparameters for all three pooling strategies are provided in Appendix C.
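As a shape-level illustration of this decoder, a NumPy forward pass might look like the sketch below. The mean-pooling operator, broadcast-and-concatenate conditioning, and random initialization are simplified assumptions for illustration, and BatchNorm is omitted; Appendix C remains the authoritative specification.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(d_in, hidden, n_layers, d_out, seed=0):
    """Random (W, b) pairs for an n_layers-deep MLP (illustrative)."""
    rng = np.random.default_rng(seed)
    dims = [d_in] + [hidden] * (n_layers - 1) + [d_out]
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def blind_pooling_decoder(z_atoms, z_ref, weights):
    """Pool atom-wise latents into one context vector, condition each
    atom on the reference embedding, and map to 3D coordinates.

    z_atoms : (N, d) atom-wise latent embeddings
    z_ref   : (N, d) latent embedding of the reference structure
    """
    context = z_atoms.mean(axis=0)                   # (d,) pooled context
    h = np.concatenate(
        [np.broadcast_to(context, z_ref.shape), z_ref], axis=1)  # (N, 2d)
    for W, b in weights[:-1]:
        h = relu(h @ W + b)
    W, b = weights[-1]
    return h @ W + b                                 # (N, 3) coordinates
```

With a hidden dimension of 128 and 12 layers, this mirrors the layer count quoted above while staying deliberately minimal.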
Q5
Figure 1 will be fully redrawn to show two clearly labeled paths—(a) autoencoder training (MD → latent → reconstruction) and (b) generative inference (latent sampling → decoder → coordinates).
Thanks for your replies, which have addressed my concerns. I will adjust the score accordingly.
Dear Reviewer iANt,
Thank you for your positive follow-up and for confirming our rebuttal has addressed your concerns. Your feedback was instrumental in significantly improving the paper.
Specifically, your suggestions for baseline comparisons and RMSF analysis prompted transformative additions. The new SOTA benchmark, which we will integrate into the main text, now uses the RMSF data you requested to clearly demonstrate our model's unique ability to capture native protein dynamics—a property the rigid baseline models lacked. Furthermore, the new generalization studies, which will be detailed in the appendix, offer robust evidence of the framework's broader applicability.
As these additions directly resolve the primary weaknesses you identified and substantially elevate the paper's contribution, we were hopeful the final evaluation could reflect this significant increase in quality.
We are very grateful for your thorough review.
LD-FPG (Latent Diffusion for Full Protein Generation) generates diverse, all-atom protein conformational ensembles (e.g., for GPCRs). It uses a ChebGNN to learn atom-wise latent embeddings, pools them (blind/sequential/residue strategies), samples new conformations via DDPM, and decodes them to coordinates. Evaluated on a D2 receptor trajectory, it achieves high-fidelity all-atom structures (lDDT ~0.7) and accurately recovers dihedral distributions (JSD < 0.03). The authors also analyzed the pooling strategies' trade-offs.
Strengths and Weaknesses
Strengths
- Significance: Addresses a key gap by generating diverse, high-resolution all-atom ensembles for dynamic protein modeling, with clear impact for GPCR signaling and drug design.
- Quality: Technically sound pipeline (ChebNet encoder → diffusion → decoder). Rigorously evaluated using structural metrics (lDDT, TM-score), conformational diversity, and clashes. Code/data provided.
- Originality: First latent diffusion model for all-atom ensemble generation, featuring explicit side-chain dynamics modeling.
Weaknesses
- Clarity: While the writing is generally clear, important architectural and loss-function details are deferred to the appendix without being referenced in the main text. Paragraph structuring could be improved for readability.
- Soundness: The experiments focus solely on the D2 receptor (a GPCR). Generalization to other protein systems is not tested, and there is no direct comparison with recent conditional ensemble models (e.g., AlphaFlow) on shared benchmarks.
Questions
- The main text does not clearly define key loss terms or decoder architectures. Please reference the relevant appendix sections explicitly.
- The paper lacks comparison with existing ensemble generation models such as AlphaFlow, MD-Gen, which are also applicable to GPCRs. Including such baselines would strengthen the evaluation.
- The reason for using latent space diffusion instead of coordinate-space modeling is not clearly explained. Please clarify its specific benefits in this context.
- D2R is a GPCR, yet this is not clearly stated. Since GPCR dynamics is a stated motivation, the biological relevance should be emphasized, and structural evaluation of GPCR-specific transitions would be valuable.
- Clash counts are high, especially in blind pooling. What is considered an acceptable threshold? Could a clash-aware or energy-based loss help reduce unrealistic structures?
- “Appendix ??” appears in the Algorithm 1 and should be fixed.
Limitations
Yes
Final Justification
The authors have addressed most of the concerns. I have updated my score accordingly.
Formatting Issues
N/A
Dear Reviewer ZSKy,
Thank you for this valuable feedback. The manuscript was originally structured as a deep dive into the architectural development for the challenging D2R system. We now agree that showcasing the framework's broader applicability is essential. Accordingly, the revised paper will integrate a new baseline comparison for D2R in the main text and present our comprehensive generalization studies in a dedicated appendix.
Below, we address your specific concerns in detail.
Weaknesses
1.1 Architectural/Loss Details & Readability
You are correct that key details were deferred to the appendix without sufficient signposting. We apologize for this oversight. In the revised manuscript, we will reference relevant appendices upon first mention of key concepts. Specifically:
- Section 3.3 (Decoder Architectures) will reference Appendix C for detailed configurations.
- Section 3.5 (Loss Functions) will reference Appendix E for formal mathematical definitions.
Additionally, important details will be incorporated into the main text:
- The primary mathematical equation for our decoder loss will be included in Section 3.5.
- A compact summary table of key hyperparameters for our models will be added to the Methodology section.
We will also revise paragraph structure throughout the Methodology section to enhance readability and logical flow.
1.2. Baselines and Generalization
This is an essential point. To demonstrate our model's performance in context, we have performed a new benchmark for the D2R system and will now include key results from our broader validation studies.
Benchmarking against State-of-the-Art: We compared LD-FPG against leading models (AlphaFlow, BioEmu, and Boltz-2 with MD conditioning).
| Model | Backbone dihedral JSD ↓ | Backbone lDDT ↑ | Backbone TM-score ↑ | Backbone RMSF (nm) ↔ |
|---|---|---|---|---|
| LD-FPG (Ours) | 0.007 | ~0.80 | ~0.96 | 1.22 |
| BioEmu | ~0.022 | 0.999 | 0.925 | 0.09 |
| AlphaFlow | ~0.023 | 0.859 | 0.993 | 0.84 |
| Boltz-2 (MD-cond) | ~0.034 | 0.997 | 0.975 | 0.07 |
| Ground Truth MD | (Ref) | (Ref) | (Ref) | 1.34 |
Our benchmark reveals a critical distinction between our approach and existing methods by evaluating both static fidelity (lDDT) and dynamic diversity (RMSF).
- Baselines Are Overly Rigid: The baselines produce high-fidelity but static structures. While their lDDT scores are nearly perfect, their extremely low Root-Mean-Square Fluctuation (RMSF) confirms they fail to capture the protein's native flexibility.
- LD-FPG Correctly Models Flexibility: In contrast, our model correctly reproduces the dynamic ensemble. Its average RMSF (1.22 nm) almost perfectly matches the ground truth (1.34 nm), demonstrating it captures the correct conformational diversity.
- Superior Geometry without Sacrificing Dynamics: Crucially, LD-FPG achieves this realism while also delivering superior local geometry. Its backbone dihedral JSD of 0.007 is 3-5x better than all baselines, proving our model generates a flexible, all-atom ensemble with state-of-the-art accuracy.
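For readers unfamiliar with the dihedral JSD metric used in this benchmark, it can be estimated by histogramming the angles from each ensemble and comparing the resulting distributions. A minimal sketch (the bin count here is our choice, not necessarily the paper's):

```python
import numpy as np

def dihedral_jsd(angles_gen, angles_ref, n_bins=60):
    """Jensen-Shannon divergence (base 2) between two dihedral-angle
    distributions, estimated from histograms over [-pi, pi)."""
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    p, _ = np.histogram(angles_gen, bins=bins)
    q, _ = np.histogram(angles_ref, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0/x) = 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Sanity checks: identical samples give ~0; disjoint supports give 1.
rng = np.random.default_rng(1)
same = rng.uniform(-np.pi, np.pi, size=10_000)
print(dihedral_jsd(same, same))  # 0.0
lo = rng.uniform(-np.pi, -np.pi / 2, size=10_000)
hi = rng.uniform(np.pi / 2, np.pi, size=10_000)
print(dihedral_jsd(lo, hi))  # 1.0
```

With base-2 logarithms the metric is bounded in [0, 1], which is why values like 0.007 versus ~0.02 represent a meaningful gap in local geometric fidelity.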
Generalization: To address your valid concern about applicability, the appendix will now present results from our broader validation studies. These studies, conducted during our initial model development to confirm the framework's robustness, show that our model consistently generates high-fidelity ensembles across a diverse set of targets, including:
- Other Class A GPCRs (D1, A2A, β1-adrenergic).
- Diverse Folds from ATLAS, including large systems like cytochrome P450 (~500 residues) and different architectures (α-helical, β-sheet, mixed α/β).
- Folding Dynamics, demonstrated by modeling the full folding trajectory of the TRP-cage miniprotein.
All-atom lDDT scores across this set are in the range of 0.68–0.78, with total JS divergences between 0.007 and 0.014. Simulation data for the additional GPCRs is now publicly available via our Zenodo link; a comprehensive list of all validated ATLAS structures will be included in the appendix.
Questions
2.1 Loss/Architecture Definitions
As addressed above (1.1), all key concepts will be clearly defined and cross-referenced within the main text.
2.2 Comparison with Existing Ensemble Generation Models
As we detailed in our response above (Point 1.2), we have performed an extensive new benchmarking study against AlphaFlow, BioEmu, and Boltz-2. Regarding MD-Gen: we chose not to include it in our benchmark since it is designed for trajectory generation and is currently tuned for small peptides (≤50 residues) in solvent, whereas our work focuses on sampling conformations from the equilibrium ensemble. Applying it to a 273-residue membrane protein like D2R is outside its intended domain and would not produce a fair comparison.
2.3 Latent Space Diffusion
Thank you for this question. It highlights a critical design choice. We perform diffusion in latent space for three key reasons:
- Dynamics-Focused Learning: By encoding only the deformations in a pooled latent space (relative to a static reference structure), the diffusion model can focus on learning the protein's internal, functionally relevant motions. The static fold and global orientation are factored out, allowing the model to learn a more targeted and meaningful distribution.
- Numerical Tractability and Stability: The dimensionality is drastically reduced from ~6,600 degrees of freedom in Cartesian space for the D2R protein to ~100 dimensions in our pooled latent space. This makes the optimization stable and efficient. In our ablation studies, an equivalent DDPM trained directly in coordinate space failed to converge at all.
- Implicit Physical Priors: By learning to perturb a physically-correct reference structure, the decoder inherently preserves correct local bond geometry and chirality, improving realism. This is supported by our low side-chain dihedral JSDs.
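The tractability point can be made concrete: the closed-form DDPM forward-noising step on a ~100-dimensional pooled latent is only a few lines of NumPy. The schedule values below are illustrative, not the paper's:

```python
import numpy as np

# Linear beta schedule (illustrative values, not the paper's).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas_bar = np.cumprod(1.0 - betas)

def forward_noise(z0, t, rng):
    """Closed-form DDPM forward process:
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(2)
z0 = rng.normal(size=(100,))           # ~100-dim pooled latent
z_mid = forward_noise(z0, 500, rng)    # partially noised
z_end = forward_noise(z0, T - 1, rng)  # nearly pure Gaussian noise
print(alphas_bar[-1] < 1e-3)           # True: signal is almost gone at t=T
```

The reverse (denoising) network then only has to operate on this 100-dimensional vector, rather than on the full ~6,600 Cartesian degrees of freedom.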
Our approach is consistent with strategies used in recent generative models. Given the challenges of all-atom diffusion, models like AlphaFold 3 [1] and Boltz-2 [2] use all-atom attention for detailed features, but then pool these representations to the residue level before applying diffusion in a lower-dimensional space—much like our residue-based pooling. Similarly, La-Proteina [3] keeps backbone coordinates explicit and diffuses side-chain information in a separate latent space, which resembles our sequential pooling setup. This confirms that our design addresses a widely recognized challenge in the field using practical, effective methods.
[1] Abramson et al., Nature 2024 (AlphaFold 3)
[2] Wohlwend et al., bioRxiv 2025 (Boltz-2)
[3] Geffner et al., NVIDIA Research 2025 (La-Proteina)
2.4 D2R as a GPCR and Biological Relevance
You are right that we should have made the GPCR context and its biological relevance more explicit. To address this directly, we are strengthening the manuscript with new, targeted analyses and revising the text for clarity.
First, to validate that our model captures functionally critical motions, we performed a new analysis on two canonical GPCR activation metrics: the TM6 outward movement and the NPxxY hinge twist. Our results confirm that LD-FPG accurately reproduces the free-energy profiles of these key transitions.
Furthermore, a key part of our original model validation involved testing its ability to learn from complex, non-equilibrium data. We trained LD-FPG on a metadynamics trajectory (~500ns) that explicitly samples the rare intermediate conformations along the D2R activation pathway. The model successfully learned from this sparse data and can generate realistic conformations that trace the entire transition path. This provides compelling evidence that the framework learns the underlying energy landscape, not just the dense regions of an equilibrium simulation.
Both of these new analyses will be detailed in a new Appendix G. Finally, we will revise the Introduction, Related Work, and figure captions to consistently highlight the GPCR context.
2.5 Clash Counts
Your concern is valid. Our response is two-fold:
- The Correct Baseline for Clashes is the MD Ensemble: There is no universal threshold for "acceptable" clashes. Standard validation tools like MolProbity are calibrated for static, minimum-energy crystal structures and are ill-suited for evaluating dynamic MD ensembles, which naturally explore higher-energy states and exhibit transient van der Waals violations. Therefore, the most relevant benchmark is the reference MD simulation itself. As we report in the manuscript (Line 300), the ground truth MD ensemble has a non-zero average clash score of ≈1023. Our best-performing models (e.g., Residue Pooling at 1145 clashes) are remarkably close to this physical reference, indicating a high degree of realism.
- An Area for Future Work: Managing steric clashes is a challenge for all generative models. We acknowledge the moderately higher clash count is a trade-off of our current approach. We are actively exploring physics-informed strategies, similar to those in models like Boltz-1, to further refine physical realism in future work. We will clarify this context in the revised manuscript.
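For reference, clash scores of the kind quoted above can be computed from a simple pairwise-distance criterion. This is a minimal sketch with an assumed flat cutoff and a crude bonded-neighbor exclusion; the manuscript's exact criterion may differ:

```python
import numpy as np

def count_clashes(coords, cutoff=0.2, exclude_bonded=1):
    """Count atom pairs closer than `cutoff` (nm), skipping pairs within
    `exclude_bonded` positions in the atom ordering as a crude stand-in
    for bonded neighbors."""
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    i, j = np.triu_indices(n, k=exclude_bonded + 1)
    return int((dist[i, j] < cutoff).sum())

# Toy chain: three atoms 0.15 nm apart along x; only the non-bonded
# 1-3 pair (0.30 nm) is checked, and it does not clash.
chain = np.array([[0.0, 0, 0], [0.15, 0, 0], [0.30, 0, 0]])
print(count_clashes(chain))  # 0
# Fold the last atom back on top of the first -> one clash.
folded = np.array([[0.0, 0, 0], [0.15, 0, 0], [0.05, 0, 0]])
print(count_clashes(folded))  # 1
```

A production implementation would use element-specific van der Waals radii and a proper bond graph rather than sequence adjacency, but the counting logic is the same.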
2.6 Appendix Reference Typo
The "Appendix ??" typo in Algorithm 1 will be corrected in the final manuscript.
Dear Reviewer ZSKy,
As the discussion period is coming to a close, we wanted to briefly follow up on our rebuttal from July 30th. We were particularly keen to ensure you saw the new analysis and clarifications we provided in response to your thorough review.
Our rebuttal now includes:
- An extensive benchmark against state-of-the-art models like AlphaFlow and BioEmu, which was a primary concern you raised.
- A summary of our broader validation studies to address your valid questions about generalization.
- A detailed rationale for our use of latent space diffusion and further context on the steric clash evaluation.
We believe these additions substantially strengthen the paper by addressing the main weaknesses you identified. We would be very grateful for any feedback, and we are on standby to answer any final questions you may have.
Thank you for your time and consideration.
Thank you for the detailed explanation. The additional clarifications have helped highlight the significance of the work more clearly. I have accordingly raised my score.
Dear Reviewer ZSKy,
Thank you so much for your positive follow-up and for raising your score. We are delighted to hear that our clarifications helped better convey the work's significance.
Your feedback has been invaluable in improving the manuscript, and we truly appreciate your time and reconsideration.
Thank you again.
This work proposes a generative model for generating all-atom protein conformations. The model is based on a latent diffusion framework with a ChebNet encoder and is trained on molecular dynamics (MD) simulation data. Experimental results demonstrate the effectiveness of the proposed approach and provide a performance analysis of the model.
Strengths and Weaknesses
Overall, the work is solid in both methodology and experimentation, presenting a novel approach supported by in-depth results. However, the presentation of the model lacks clarity and could benefit from more detailed explanations. Additionally, the benchmarking results would be more convincing if comparisons with baseline methods were included to better demonstrate the superiority of the proposed model.
问题
The presentation of the model could be significantly improved. A formal mathematical formulation of the problem at the beginning of the Methodology section would help establish a clear foundation for the proposed approach. Additionally, the decoding process depicted in Figure 1 is not sufficiently detailed. It is unclear whether the model generates an entire trajectory of conformations or only a single frame. Furthermore, it should be clarified whether the decoder is conditioned on previous frames during generation, as this has implications for modeling temporal dynamics.
In line 148, the authors mention that the coordinates are aligned to the initial reference structure. The rationale for this alignment should be explained more thoroughly. Why not use the raw, unaligned coordinates produced by MD simulations? Since rotations and translations in MD trajectories may reflect meaningful physical motions, discarding them through alignment might result in the loss of potentially valuable information.
Finally, while the reported results demonstrate the effectiveness of the proposed model, the absence of comparisons with relevant baseline methods makes it difficult to assess the model’s relative performance. Including such baselines would strengthen the evaluation and better support the claim of superiority.
Limitations
yes
Formatting Issues
No
Dear Reviewer go5W,
Thank you for your detailed and helpful review. Your comments on the presentation of our model and the rationale for our design choices are well-taken, and we will address them thoroughly in our revised manuscript to improve its clarity and rigor.
We address your specific points in detail below.
Q1. Model Presentation and Figure 1
We agree that the presentation of our model can be significantly improved. In the revised manuscript, we will make the following changes to enhance clarity:
- Formal Problem Formulation: We will begin the Methodology section with a formal mathematical definition of the problem. This will establish a clear foundation by defining the inputs (an MD ensemble), the output (a distribution over all-atom conformations), and the learning objective before we describe the model's components.
- Revised Figure 1: We will completely redesign Figure 1 and its caption to be a clear, step-by-step guide. The new figure will feature separate, clearly labeled visual pathways for (a) Autoencoder Training and (b) Generative Inference. To directly address your question, we will add an explicit call-out stating that the model generates independent conformational frames, with no temporal conditioning. The decoder block will also be refined to better illustrate its internal logic.
Q2. Clarification on Temporal Dynamics
Your question about modeling temporal dynamics is particularly insightful. In the current work, our goal is to model the static equilibrium ensemble, so each conformation is generated independently. We will make this scope explicit in the revised manuscript.
However, you are correct that extending our framework to generate time-correlated trajectories is an important future direction. Our latent space representation is uniquely suited for this, and we envision two potential paths:
- An autoregressive decoder, where the generation of the latent vector for frame t+1 is conditioned on the latent vector from frame t.
- A more powerful approach would be to learn the score function of the latent space, which would allow simulating Langevin dynamics directly on the learned manifold to produce continuous trajectories.
Both are compelling applications that highlight the flexibility of our framework. We will add a note on these exciting future directions to the discussion section.
Q3. Rationale for Coordinate Alignment
This is an important point that deserves a clearer explanation. You are correct that, in a general sense, global rotations and translations can reflect meaningful physical motions. For the specific system of a GPCR in a membrane, our rationale for aligning trajectories to a reference structure is grounded in the well-understood separation of dynamic scales in these systems. GPCRs experience two distinct types of motion:
First, there is the large-scale, slower global rigid-body motion of the entire receptor within the membrane (e.g., lateral diffusion). While physically real, these movements are not the direct drivers of the signaling mechanism we aim to model.
Second, and central to our work, are the faster, small-amplitude internal conformational deformations—such as the outward swing of transmembrane helices and the repacking of conserved motifs. These internal rearrangements constitute the actual functional switch that enables downstream signaling.
Our goal is to learn the probability distribution of these crucial internal motions. To do so, it is essential to first decouple them from the global, rigid-body drift. Standard practice in molecular dynamics analysis requires this separation, as uncorrected global drift would otherwise dominate key structural metrics. For instance, without alignment, RMSD would be inflated by simple translation, and Principal Component Analysis (PCA) would primarily capture the protein's trivial wandering in the simulation box rather than its biologically relevant internal dynamics.
Therefore, our choice to use Kabsch alignment is a deliberate and standard methodological step to isolate the signal of interest. This "align-then-learn" strategy allows our graph neural network to focus exclusively on learning the rich landscape of internal conformational changes. It is a pragmatic choice that preserves all essential internal structural information while avoiding the significant architectural complexity of fully SE(3)-equivariant networks. We will add a sentence to the manuscript acknowledging this trade-off and clarifying that our model is tailored to learn the internal conformational ensemble.
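The Kabsch step referenced above fits in a few lines of NumPy; here is a minimal sketch, including the standard determinant correction that prevents improper rotations:

```python
import numpy as np

def kabsch_align(mobile, ref):
    """Optimally rotate/translate `mobile` onto `ref` (both (N, 3)),
    removing global rigid-body motion before learning internal dynamics."""
    mc = mobile - mobile.mean(axis=0)
    rc = ref - ref.mean(axis=0)
    # SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(mc.T @ rc)
    # Determinant correction avoids improper rotations (reflections),
    # which would flip chirality (e.g., turn helices left-handed).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return mc @ R + ref.mean(axis=0)

# A rotated, translated copy of a structure aligns back exactly.
rng = np.random.default_rng(3)
ref = rng.normal(size=(50, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
mobile = ref @ Rz.T + np.array([5.0, -2.0, 1.0])
aligned = kabsch_align(mobile, ref)
rmsd = np.sqrt(((aligned - ref) ** 2).sum(-1).mean())
print(rmsd < 1e-8)  # True
```

Applying this per-frame before encoding is exactly the "align-then-learn" strategy: global drift is removed, and only the internal deformations remain for the network to model.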
Q4. Comparison with Baselines
We acknowledge this was a significant omission. In response, we have performed an extensive new benchmark comparing LD-FPG against leading models: AlphaFlow, BioEmu, and the all-atom Boltz-2 (with MD conditioning). This analysis reveals a crucial distinction between our approach and existing methods.
While LD-FPG is an all-atom ensemble generator, the baselines primarily function as static structure predictors. Our new results, summarized below, clearly illustrate this difference by evaluating both static fidelity (lDDT/TM-score) and dynamic diversity (RMSF).
| Model | Backbone dihedral JSD ↓ | Backbone lDDT ↑ | Backbone TM-score ↑ | Backbone RMSF (nm) ↔ |
|---|---|---|---|---|
| LD-FPG (Ours) | 0.007 | ~0.80 | ~0.96 | 1.22 |
| BioEmu | ~0.022 | 0.999 | 0.925 | 0.09 |
| AlphaFlow | ~0.023 | 0.859 | 0.993 | 0.84 |
| Boltz-2 | ~0.034 | 0.997 | 0.975 | 0.07 |
| Ground Truth MD | (Reference) | (Reference) | (Reference) | 1.34 |
Interpretation of Results:
- Superior Local Geometry: Our model achieves a backbone dihedral JSD of 0.007, outperforming all baseline methods by a factor of 3-5x. This demonstrates superior accuracy in capturing the local geometric preferences of the protein.
- Baselines Generate Static Structures, Not Ensembles: The baselines produce high-fidelity but static structures. While they achieve near-perfect lDDT/TM-scores, their Root-Mean-Square Fluctuation (RMSF) is extremely low. AlphaFlow comes closest to reproducing the system's dynamics (RMSF 0.84 nm), but its flexibility is still far below the ground truth. BioEmu (0.09 nm) and Boltz-2 (0.07 nm) remain highly rigid despite their large training sets. This underscores that for an ensemble generator, an lDDT score near 1.0, when paired with low RMSF, confirms a lack of diversity.
- LD-FPG Correctly Models Dynamic Flexibility: In sharp contrast, our model's average RMSF (1.22 nm) is almost identical to the ground truth (1.34 nm). This demonstrates that LD-FPG successfully learns and reproduces the correct conformational diversity of the ensemble.
- The All-Atom Advantage: Crucially, our framework’s all-atom nature allows us to perform vital side-chain analysis—a capability the backbone-focused baselines lack. Our model achieves a low summed JS divergence for side-chain χ angles (~0.022) and an all-atom RMSF (1.36 nm) that closely matches the ground truth (1.60 nm), confirming high fidelity across the entire molecular system.
- The Need for System-Specific Models: This benchmark reinforces a key principle: generalist models, even those trained on vast datasets, will always have specific strengths and limitations. Our analysis is another instance showing that for capturing the nuanced, system-specific dynamics of a target like D2R, a specialized model currently provides a more accurate and physically realistic ensemble.
This analysis highlights that LD-FPG solves a different and more challenging problem. It not only achieves state-of-the-art local geometric accuracy but does so while correctly modeling the entire flexible, all-atom conformational ensemble. We will integrate this new table and a full discussion into the revised manuscript.
All new baseline scores were computed from publicly released code prior to the official review deadline; no post-deadline training was performed.
Thank you once more for your valuable suggestions. They have been helpful in identifying key areas for improvement, and we are confident the resulting manuscript will be significantly clearer and more rigorous.
Sincerely,
The Authors
Thanks for your reply. The reply has addressed my concerns. I will keep the score.
Dear Reviewer go5W,
Thank you for confirming that our rebuttal has addressed your concerns.
The new benchmark against state-of-the-art models was particularly helpful. It not only resolved your request for baselines, but it strengthened the paper by demonstrating our model's unique ability to reproduce the ground-truth distribution of dynamics (RMSF), a property the more rigid baseline models lacked.
Since this evidence provides a much stronger validation of our method's core advantage, we were hopeful this might be reflected in an updated score.
Thank you again for your constructive review and engagement.
This submission introduces LD-FPG (latent diffusion for full protein generation) for generating all-atom protein conformational ensembles by fusing GNNs and latent diffusion. It uses ChebNet to encode protein conformations into low-dimensional latents, employs three pooling strategies (blind, sequential, residue-based) to balance global flexibility and local precision, and trains a diffusion model on these latents to generate new conformations followed by decoding back to Cartesian coordinates with optional dihedral regularization.
Reviewers uniformly recognized LD-FPG's value in addressing a critical gap, generating dynamic, full-atom ensembles (a rarity compared to static or backbone-only models), as well as its extensive and comprehensive experimental design. The authors addressed all critical reviewer concerns during the rebuttal, with most reviewers upgrading their assessments or confirming their concerns were resolved. Given the consensus that LD-FPG's core concerns were resolved, its methodological and empirical strengths, and its impact on protein modeling, I recommend accepting this submission. I encourage the authors to follow through on their commitments in the final manuscript to further strengthen the work.