PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std. dev. 0.5)
Average confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

JAMUN: Bridging Smoothed Molecular Dynamics and Score-Based Learning for Conformational Ensemble Generation

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Molecular dynamics in smoothed noisy spaces to accelerate conformational ensemble generation

Abstract

Keywords
Conformational Ensembles, Molecules, Protein Dynamics, Drug Discovery, Statistical Mechanics, Generative Models

Reviews and Discussion

Review (Rating: 4)

This paper introduces JAMUN, a generative machine learning framework to accelerate molecular dynamics. JAMUN performs Langevin dynamics on a smoothed, noised space by leveraging walk-jump sampling. The model is fast and demonstrates transferability to unseen systems.

Strengths and Weaknesses

Strengths:

  • Empirical results suggest better generalization in terms of mode coverage with respect to baselines.
  • Fast all-atom sampling; only one unconditional score model needs to be learned.

Weaknesses:

  • Comparable sampling time and worse performance in terms of Jensen-Shannon distances with respect to the most modern (though not all-atom) baseline (MDGen).
  • Unclear how to disentangle the influence of walk-jump sampling, because no comparison with standard diffusion models is provided.

Questions

  • The authors claim: "This physical prior enables JAMUN to transfer well – just like force fields for molecular dynamics can – to unseen systems." I don't think this claim is justified. For example, standard diffusion models also run Langevin dynamics over a "latent" space. What exactly is meant here, and what in JAMUN should make it generalize better? Is it just that partially corrupting samples is better than fully corrupting them to a standard Gaussian? If so, why is that a physical prior?
  • The authors claim: "generation with JAMUN yields converged sampling of the conformational ensemble faster than MD with a standard force field, even outperforming several state-of-the-art baselines." I think this claim is misleading, since the authors do not evaluate convergence to the Boltzmann distribution; they only evaluate convergence of torsional projections. To measure real speed-up, post-processing of these samples into energetically meaningful conformations should be accounted for in the total runtime.
  • Could the authors share more details about how the MSMs in Figure 1 were made? How do the authors explain the substantial number of points with very high probabilities in Fig. 1b?
  • How do the authors explain that MDGen performs better in terms of Jensen-Shannon distances but worse in terms of mode coverage on the test molecules?
  • Have the authors made sure that the MDGen model is sampled with long-enough lag times to be comparable with models of the Boltzmann distribution? Please add sampling details and mark the initial structure in the Ramachandran/TICA plots.
  • Do the authors expect this strategy to scale seamlessly to larger systems? Results for larger systems would significantly strengthen the contribution.

Suggestions:

  • Add parameter counts for all baseline models.
  • Add energy evaluations of generated structures.

Minor corrections:

  • Line 94: space after comma.
  • Line 167: implicit solvent.

Limitations

  • It’s unclear to me how to disentangle the influence of walk-jump sampling from that of the model. The advantage of using walk-jump sampling would become clearer if the same architecture you use (with minimal modifications for noise conditioning) were used to train a baseline regular diffusion model.
  • The manuscript would benefit from a thorough linguistic review to enhance clarity and readability.
  • In Figures 1b and 1c, correlation coefficients of the other baselines should be included for comparison.
  • Please mark the best (and second-best) performing models in bold in Table 1.
  • The authors do not show examples of JAMUN “hallucinating” metastable states. I suggest showing such structures, computing their energies, and comparing against the reference simulation.
  • I understand the lack of baselines for the extrapolation exercise, but I am not sure how relevant the comparison with Boltz-1 and BioEmu in Figure 5 is, since the scope of these models is very different. Can the authors provide a better baseline?

Final Justification

JAMUN is a cheaper yet performant method; however, the improvement seems limited and entails some performance loss, and I believe some evaluations could be improved. I recommend acceptance.

Formatting Issues

None.

Author Response

Common Response for All Reviewers

We would like to thank all of our reviewers for their thoughtful feedback, especially the encouraging comments on our "simple but powerful core idea" and "comprehensive experimentation" demonstrating "competitive performance" in recovering Boltzmann distributions for unseen proteins.

The main concerns about the paper shared by the reviewers can be broadly placed in the following categories:

  • A lack of comparison with diffusion models (Reviewers u7xj, E7vk and tyev)
  • A lack of analysis on potential energy and physical validities of JAMUN samples (Reviewers E7vk and tyev)
  • An ablation of the jump step (Reviewer E7vk)
  • A lack of results on larger systems (Reviewers iqm5 and tyev)

We have now performed additional experiments and analyses which we believe considerably improve our work, addressing many of the concerns above. In short, we have added a fair comparison to a diffusion model with the exact same architecture, performed a PoseBusters analysis on JAMUN samples, quantified the effect of the jump step in terms of Jensen-Shannon divergence, and finally, demonstrated that our model can be trained on the fast-folding proteins (approx. 50 AA) and the shortest ATLAS proteins (approx. 180 AA), which are an order of magnitude larger than the original systems studied in the paper. Some of these experiments are preliminary due to the limited amount of time available to us, but we believe that the reviewers will find our results very encouraging.

Comparison with Diffusion Models

To highlight the speedup of walk-jump sampling over diffusion, we train a diffusion model on the Timewarp 2AA dataset, as recommended by Reviewer tyev. Note that our model already contains noise-conditioning blocks, so this requires no architectural changes. We then sample from this model using both diffusion and walk-jump sampling, and compare the Jensen-Shannon divergence (JSD) of the backbone torsions, averaged over the Timewarp 2AA test set, as a function of the number of samples and the number of function evaluations (NFE):

| Sampler | Samples | NFE | JSD (Backbone Torsions) |
| --- | --- | --- | --- |
| Walk-Jump | 3149 | 6298 | 0.150198 |
| Diffusion | 3149 | 399923 | 0.136336 |
| Walk-Jump | 200000 | 400000 | 0.049639 |
| Diffusion | 200000 | 25400000 | 0.046087 |

Essentially, we find that walk-jump sampling is 64× faster with only a minor loss in fidelity. We would like to emphasize that the diffusion results can likely be improved by additional tuning of both training and sampling; for simplicity, we started with a reasonable setup that is known to work well in other settings (an EDM noise schedule of 64 steps from 0.01 Å to 10 Å using the second-order Heun method from Karras et al., 2022). However, even with additional optimization of the diffusion model, walk-jump is likely to remain more efficient, because it works in a partially noised space instead of having to generate every sample from an uninformative Gaussian prior over many steps.
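To make the NFE accounting concrete, below is a minimal sketch of the two samplers (illustrative only, not our actual implementation; `denoise` stands in for the trained network, queried as the Tweedie denoiser x̂(y) = y + σ²∇log p_σ(y), and the schedule follows Karras et al., 2022):

```python
import numpy as np

def karras_sigmas(n_steps=64, sigma_min=0.01, sigma_max=10.0, rho=7.0):
    """EDM noise schedule (Karras et al., 2022), in Angstroms."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)  # final step lands on the data manifold

def heun_sample(denoise, x, sigmas):
    """Heun (2nd-order) sampling: 2 denoiser calls per step, 1 for the last.
    With 64 steps this gives 127 NFE per sample. x starts as pure noise."""
    nfe = 0
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, s)) / s
        nfe += 1
        x_euler = x + (s_next - s) * d
        if s_next > 0:  # second-order correction, skipped on the last step
            d_next = (x_euler - denoise(x_euler, s_next)) / s_next
            nfe += 1
            x = x + (s_next - s) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x, nfe

def walk_jump(denoise, y, sigma=0.4, delta=1e-2, n_steps=1000):
    """Walk-jump sampling: a Langevin 'walk' at fixed sigma, plus a 'jump'
    (Tweedie denoising) from each walk iterate -> 2 NFE per clean sample."""
    samples, nfe = [], 0
    for _ in range(n_steps):
        score = (denoise(y, sigma) - y) / sigma**2  # grad log p_sigma(y)
        nfe += 1
        y = y + delta * score + np.sqrt(2 * delta) * np.random.randn(*y.shape)
        samples.append(denoise(y, sigma))  # jump back to clean X-space
        nfe += 1
    return samples, nfe
```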

We would also like to highlight that the TBG model (Klein et al., 2024) that we compare to in the paper is a flow-matching model with a Gaussian prior distribution, which is functionally equivalent to a diffusion model.

Physical Validity and Energy Analysis of JAMUN Samples

We have performed a physical-validity analysis of the generated JAMUN samples on MDGen 4AA using the popular PoseBusters package (Buttenschoen et al., 2024). We randomly selected 20 unseen test peptides from the MDGen 4AA-Explicit dataset for this analysis.

| PoseBusters Metric | Average Pass Rate |
| --- | --- |
| bond_lengths | 97.0% |
| bond_angles | 99.4% |
| internal_steric_clash | 100.0% |
| internal_energy | 97.5% |
| overall | 94.7% |

We see that the bond lengths are correctly captured with high probability by JAMUN (as requested by Reviewer E7vk). Further, the overall quality of the JAMUN samples is high. For a finer-grained view into the performance across the 20 test peptides, we report the empirical CDFs of the pass rates:

| PoseBusters Metric / Pass Rate | >90% | >92% | >94% | >96% | >98% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| bond_lengths | 20/20 | 18/20 | 17/20 | 14/20 | 10/20 | 5/20 |
| bond_angles | 20/20 | 20/20 | 20/20 | 20/20 | 18/20 | 14/20 |
| internal_steric_clash | 20/20 | 20/20 | 20/20 | 20/20 | 20/20 | 20/20 |
| internal_energy | 20/20 | 19/20 | 19/20 | 17/20 | 10/20 | 5/20 |
| overall | 19/20 | 14/20 | 13/20 | 11/20 | 6/20 | 2/20 |
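For reference, this analysis can be reproduced along the following lines with the posebusters Python package (a minimal sketch; the file path is a placeholder rather than our actual pipeline):

```python
from posebusters import PoseBusters
from rdkit import Chem

# JAMUN samples for one held-out test peptide, exported to SDF (placeholder path).
mols = [m for m in Chem.SDMolSupplier("samples/FHSE.sdf", removeHs=False) if m]

# The "mol" config runs the intramolecular checks reported above:
# bond lengths, bond angles, internal steric clashes, internal energy, etc.
df = PoseBusters(config="mol").bust(mols)

cols = ["bond_lengths", "bond_angles", "internal_steric_clash", "internal_energy"]
print(df[cols].mean())  # per-metric pass rates for this peptide
```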

Next, we compute the force field energies for JAMUN samples, and find that they overlap well with the reference MD:

| Sequence | MDGen 4AA-Explicit Energy (kJ/mol) | JAMUN Energy (kJ/mol) | Difference (kJ/mol) |
| --- | --- | --- | --- |
| FHSE | -675.6 ± 20.1 | -633.2 ± 113.3 | 42.4 |
| FKKL | -699.0 ± 24.0 | -562.5 ± 192.1 | 136.4 |
| FLRH | -1272.7 ± 18.9 | -1198.5 ± 129.3 | 74.2 |
| FSDP | -697.5 ± 23.7 | -687.6 ± 91.5 | 9.9 |
| FSRK | -1333.0 ± 21.9 | -1349.9 ± 0.0 | -16.9 |
| GCIC | -557.5 ± 21.4 | -538.8 ± 56.4 | 18.6 |
| GGHN | -905.3 ± 21.9 | -821.3 ± 129.6 | 84.0 |
| GLIL | -743.3 ± 20.7 | -711.6 ± 71.8 | 31.7 |
| HELI | -794.2 ± 25.5 | -780.6 ± 73.5 | 13.6 |
| HENV | -1156.9 ± 17.3 | -1123.4 ± 148.7 | 33.5 |
| HTIQ | -762.9 ± 17.0 | -726.8 ± 105.8 | 36.1 |
| IAMI | -428.0 ± 15.3 | -426.8 ± 74.8 | 1.2 |
| IDRH | -1416.8 ± 18.1 | -722.8 ± 2608.1 | 694.1 |
| IHNV | -845.4 ± 21.1 | -864.4 ± 48.8 | -19.1 |
| IMRY | -1230.7 ± 23.6 | -1100.6 ± 207.0 | 130.1 |
| INVH | -793.6 ± 22.9 | -745.3 ± 128.6 | 48.3 |
| IPGD | -611.5 ± 15.0 | -582.5 ± 55.2 | 29.1 |
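Such per-frame force-field energies can be evaluated along the following lines with OpenMM (a sketch only; the paths and the implicit-solvent force-field files are placeholders, and the exact protocol should match the reference MD setup):

```python
import numpy as np
import mdtraj as md
import openmm
from openmm import app, unit

traj = md.load("samples/FHSE.xtc", top="structures/FHSE.pdb")  # placeholder paths
pdb = app.PDBFile("structures/FHSE.pdb")

# Implicit solvent for simplicity here; the reference dataset is explicit-solvent.
ff = app.ForceField("amber14-all.xml", "implicit/gbn2.xml")
system = ff.createSystem(pdb.topology, nonbondedMethod=app.NoCutoff)
context = openmm.Context(system, openmm.VerletIntegrator(1.0 * unit.femtoseconds))

energies = []
for xyz in traj.xyz:  # mdtraj stores coordinates in nanometers
    context.setPositions(xyz * unit.nanometer)
    state = context.getState(getEnergy=True)
    energies.append(state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole))
print(f"{np.mean(energies):.1f} ± {np.std(energies):.1f} kJ/mol")
```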

Ablation of the Jump Step

As requested by Reviewer E7vk, we report the JSD of the first two TICA components computed on the noisy Y-samples generated during the walk step, before denoising, and compare it with the JSD of the first two TICA components in X-space. As expected, at a non-trivial noise level of 0.4 Å, the Y-samples are quite diffuse and hence have far higher divergence metrics; in fact, the modes present in X-space are not easy to recover from them. This analysis highlights the necessity of the jump step for accurate sampling.

| Sequence | JSD-TICA of Denoised X | JSD-TICA of Noisy Y |
| --- | --- | --- |
| FGGW | 0.149416 | 0.322338 |
| FKKL | 0.189984 | 0.490095 |
| FSDP | 0.261191 | 0.390349 |
| HENV | 0.232173 | 0.406107 |
| HTIQ | 0.251693 | 0.384356 |
| IAMI | 0.129338 | 0.399676 |
| IMRY | 0.278553 | 0.362170 |
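The divergence itself is computed roughly as follows (an illustrative sketch; it assumes the 2D TICA projections have already been obtained and compares normalized histograms):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_tica(ref_proj, gen_proj, bins=50):
    """JSD between 2D histograms of the first two TICA components
    (reference MD vs. generated samples), binned over the reference range."""
    edges = [np.linspace(ref_proj[:, i].min(), ref_proj[:, i].max(), bins + 1)
             for i in range(2)]
    p, _, _ = np.histogram2d(ref_proj[:, 0], ref_proj[:, 1], bins=edges)
    q, _, _ = np.histogram2d(gen_proj[:, 0], gen_proj[:, 1], bins=edges)
    # scipy returns the Jensen-Shannon *distance* (the square root of the
    # divergence), so square it to obtain the divergence.
    return jensenshannon(p.ravel() / p.sum(), q.ravel() / q.sum()) ** 2
```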

Results on Larger Systems

We recently identified the memory usage of the standard e3nn.FullyConnectedTensorProduct in the E3ConvNet as a major bottleneck when scaling to larger systems. We switched to the SeparableTensorProduct (as in Equiformer) to reduce the number of weights in the tensor product and hence the memory utilization. We found that the MSE denoising loss of the model increased only slightly after this switch, but we can now train and perform inference on much larger systems.

We have trained one model on the fast-folding proteins Trp-Cage (20 AA, 152 atoms) and Protein G (56 AA, 439 atoms) from Majewski et al., 2023. We have also trained one model on the shortest 1000 proteins (up to 183 AA, 2000 atoms) from the ATLAS dataset (Vander Meersche et al.). These models were trained on 4 RTX A100 GPUs for a day.

Importantly, we do not change any of the training or sampling hyperparameters (including the noise level, which we keep at 0.4 Å).

For the 20 AA protein Trp-Cage, which has high helical content, we start a simulation from a fully extended conformation and recover the secondary structure in a sampling run of around 5 minutes of wall-clock time; this would take more than a day of simulation time for a classical MD simulation on the same hardware. For the ATLAS model, we were able to successfully run walk-jump sampling on unseen validation-set proteins of slightly larger lengths than the training-set proteins, indicating transferability even in this setting.

We compute some simple metrics (fluctuations in native contacts and secondary-structure content) and find that we recover the qualitative characteristics of the distributions. The PMFs over the fraction of native contacts look qualitatively similar for all three proteins, though there is a clear bias towards folded states in JAMUN. The correlation plot between MD and JAMUN values for the total fraction of time that a native contact persists has an R² of 0.96, with JAMUN contacts consistently persisting for longer. In summary, more training is definitely required, but we hope that these preliminary experiments show that JAMUN can be scaled up to biologically relevant molecules.
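The fraction-of-native-contacts metric can be computed as in the following sketch (it follows the soft definition of Best, Hummer & Eaton, 2013, via mdtraj; this is not our exact analysis code):

```python
import itertools
import numpy as np
import mdtraj as md

def fraction_native_contacts(traj, native, beta=50.0, lam=1.8, cutoff=0.45):
    """Soft Q(x) of Best, Hummer & Eaton (2013); distances in nm, beta in 1/nm.
    Native contacts: heavy-atom pairs within `cutoff` nm in the native state,
    more than 3 residues apart in sequence."""
    heavy = native.topology.select_atom_indices("heavy")
    pairs = np.array([(i, j) for i, j in itertools.combinations(heavy, 2)
                      if abs(native.topology.atom(i).residue.index
                             - native.topology.atom(j).residue.index) > 3])
    r_native = md.compute_distances(native, pairs)[0]
    native_pairs = pairs[r_native < cutoff]
    r0 = r_native[r_native < cutoff]
    r = md.compute_distances(traj, native_pairs)  # (n_frames, n_contacts)
    return np.mean(1.0 / (1.0 + np.exp(beta * (r - lam * r0))), axis=1)
```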

Response to Reviewer tyev

  1. The idea is that walk-jump sampling shares many parallels with MD, but you are correct that we do not currently identify which specific aspects help the transferability of JAMUN; we would be happy to rephrase this in the paper. In a standard diffusion model, the noise level is continuously reduced during sampling, which makes a physical interpretation difficult.
  2. Our analysis of the force field energy of JAMUN samples should address these concerns; we do not do any post-processing of the JAMUN samples. It is true that there is some bias in the final conformational distributions, which we cannot fully correct for, but our analysis does show that we are able to get quite close.
  3. The MSMs were made with the analysis code from MDGen (Jing et al.), which we have modified and uploaded as part of our supplementary code; a generic sketch of this type of MSM pipeline is shown after this list. We utilize the same MSM hyperparameters (e.g., the TICA lag), but it is likely that those hyperparameters are not optimal for the Timewarp datasets.
  4. We ran the exact commands (and checkpoints) used by the MDGen authors for sampling, according to their README. It is possible that the recommended sampling parameters/model is not ideal for this comparison.
  5. We hope that you will find the results on the fast-folding proteins and ATLAS encouraging.
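As mentioned in point 3 above, a generic sketch of this type of MSM pipeline is shown below (here with the deeptime library; MDGen's actual analysis scripts differ in their details, and the lag time and cluster count are placeholders):

```python
import numpy as np
from deeptime.decomposition import TICA
from deeptime.clustering import KMeans
from deeptime.markov import TransitionCountEstimator
from deeptime.markov.msm import MaximumLikelihoodMSM

# Stand-in featurized trajectory (e.g., sin/cos of backbone torsions per frame).
feats = np.random.randn(50000, 8)

# Project onto the two slowest TICA components at a chosen lag time.
tica = TICA(lagtime=100, dim=2).fit(feats).fetch_model()
proj = tica.transform(feats)

# Discretize the TICA space into microstates.
kmeans = KMeans(n_clusters=100).fit(proj).fetch_model()
dtraj = kmeans.transform(proj)

# Count transitions at the MSM lag time and estimate the transition matrix.
counts = TransitionCountEstimator(lagtime=100, count_mode="sliding").fit(dtraj).fetch_model()
msm = MaximumLikelihoodMSM().fit(counts).fetch_model()
print(msm.stationary_distribution)  # microstate populations
```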

We would be very happy to improve the formatting and the clarity of the paper in the camera-ready version, if accepted.

Comment

I thank the authors for their clarifications and extra results. Some of my concerns have been resolved, but some critical ones still remain. Based on the presented evidence, JAMUN is indeed a cheaper and performant method; however, the improvement seems limited and entails some performance loss. I also remain concerned about some of the analysis: for example, it is unclear to me whether the comparison with MDGen is entirely fair, given the lack of curation of inference parameters. The preliminary scaling results are encouraging, but all in all, I will keep my score for now.

Comment

Thank you for the feedback. We would like to highlight that the lag time in MDGen is fixed during training, since the model diffuses over the entire trajectory at once rather than frame-by-frame. This means that sampling with another lag time would require us to retrain their model from scratch. Furthermore, it is unclear a priori what lag time to choose, especially since the original MDGen authors would have already optimized this hyperparameter.

Note that we utilize the MDGen authors' pretrained model with their recommended inference parameters on the MDGen authors' simulated datasets for all comparisons. The MDGen inference script (https://github.com/bjing2016/mdgen/blob/master/sim_inference.py#L3) clearly shows that the only real inference parameters are the number of rollouts and the number of frames per rollout.

Please do let us know if there are any other comparisons that would help convince you of the utility of our method; we believe our current comparisons are quite fair, and we explicitly call out the fact that MDGen performs better on several of the metrics we use.

Review (Rating: 5)

This paper addresses issues encountered when sampling the Boltzmann distribution using machine learning models, specifically limited sampling speed and transferability. Instead of performing Langevin dynamics directly in molecular conformation space, the authors propose performing Langevin dynamics in a noisy space, where atom coordinates are perturbed by Gaussian noise with a fixed variance. The authors demonstrate that, knowing only the score of the noisy data distribution, both the Langevin dynamics (“walk”) and the projection back to noise-free structures (“jump”) can be performed.

优缺点分析

Overall, I find this to be a nice paper where I like the idea of combining “Walks” via Langevin dynamics and projection via “Jumps” using the same score network. The performed experiments are clear and the paper is nicely written and easy to follow. In its current form, I recommend acceptance with a “borderline accept”. However, I am happy to raise my score if the authors add an ablation for the “Jump” step.

Strengths

  • All-atom representation for point clouds -> not specific to proteins
  • Training the score network requires only samples from the Boltzmann distribution; physical trajectories or force labels are not needed
  • Only a single network must be trained for both “Walk” and “Jump” steps, where the model predicts the score of the noisy data distribution with fixed noise variance. The network is queried only twice to generate a new sample (once for the “Walk” and once for the “Jump” step), which leads to high sampling speed.
  • The proposed method seems to be simple to implement, fast to train with a modest number of parameters, and effective in recovering Boltzmann distributions for unseen proteins.

Weaknesses

  • From Figure 2a it seems like JAMUN needs comparatively many sampling steps in comparison to other ML-based Boltzmann samplers, which in part diminishes the speed advantage of JAMUN.
  • Missing ablations: Especially ablating the “Jump” step (generating TICA plots from trajectories using only the “Walk” and no “Jump”) would greatly enhance the quality of the paper, by demonstrating that both “Walk” and “Jump” steps are indeed necessary. My intuition here is based on the following: large structural dynamics analysed via TICA plots might be captured well even when local details like bond distances or bond angles are not accurately described.
  • The paper doesn’t analyze the physical validity of the generated structures; therefore, it is not clear whether a single “Jump” step from a low noise level can already generate highly accurate structures compared to many consecutive sampling steps in standard generative models. An analysis of fine-detail fidelity would allow better judging this (see “Questions”).

Questions

  • Did the authors consider comparing in more detail to [1]? It seems like the approaches are closely related, since [1] also performs Langevin dynamics using conformations with added noise, where they treat the noise level as a hyperparameter. However, [1] is missing the “Jump” step, projecting back to the data manifold. I would expect that the structures generated using JAMUN are of better fidelity and contain less noise in comparison. [2] show in Figure 4 that structures generated by [1] via Langevin dynamics suffer from unphysical bond lengths due to the added noise. It would be great if the authors of JAMUN could show that their model doesn’t suffer from this problem.
  • The approach has been demonstrated to work for stationary observables only. It remains unclear whether dynamic (time-dependent) observables could also be derived from the generated trajectories.
  • Please consider adding captions to Table 1 for the different datasets.
  • The Jensen-Shannon distance could be briefly explained, as it is one of the main metrics in the paper.
  • Can you specify which settings in Figure 2a correspond to which rows in Table 1? E.g., how many samples did TBG (20x shorter) generate exactly?

References

[1] Arts, Marloes, et al. "Two for one: Diffusion models and force fields for coarse-grained molecular dynamics." Journal of Chemical Theory and Computation 19.18 (2023): 6151-6159.

[2] Plainer, M., Wu, H., Klein, L., Günnemann, S., & Noé, F. (2025). Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models. ArXiv. https://arxiv.org/abs/2506.17139

Limitations

yes

Final Justification

During rebuttal, the authors successfully addressed the concerns raised in my original review. I am thus happy to increase my score and recommend acceptance.

Formatting Issues

no

Author Response

Common Response for All Reviewers (identical to the common response posted in the first review thread above).

Response to Reviewer E7vk

  1. Our detailed PoseBusters analysis above shows that JAMUN captures bond lengths correctly; we hope you will find this sufficient, but we are happy to perform further analyses.
  2. In the current framework, no: we indeed lose kinetic information and cannot map the dynamics in Y to those in X.
  3. and 4. We will fix the caption of Table 1 and also detail the JSD metric for clarity in the updated paper.
  5. According to Figure 2a, TBG generates 5000 structures, so we used 5000/20 = 250 structures to represent TBG (20x shorter). JAMUN generated 100000 structures.
Comment

The authors successfully highlight the advantage of their method compared to diffusion models in terms of the tradeoff between sampling speed and accurate sampling of the Boltzmann distribution. They further demonstrate the physical validity of the generated samples via a range of different tests and conduct an ablation experiment to illustrate the necessity of the jump step. Those changes resolved most of my concerns; I will raise my score accordingly and recommend acceptance.

Comment

We are very happy to hear that our rebuttal addressed most of your concerns. Thank you again for all of the helpful feedback that has helped improve our paper significantly!

Review (Rating: 5)

The paper presents JAMUN, a walk-jump sampler that couples smoothed Langevin dynamics in a latent space with a denoiser. On short-peptide benchmarks the method delivers 10^1 to 10^2 times acceleration while preserving ensemble fidelity, as verified by Jensen-Shannon distances and MSM state populations. The core idea is simple but powerful, and the empirical study is convincing for small systems. Though several challenges remain, such as scaling to larger proteins, accelerated MD methods for ensemble sampling will likely become increasingly important in the near future.

Strengths and Weaknesses

Strengths

The main contributions are both conceptually neat and useful. By operating in a single-noise-level latent space, the authors avoid the heavy diffusion-model sampling procedure yet still retain a learned score that drives Langevin integration. Experiments demonstrate that JAMUN decorrelates conformations more than 10x faster than classical MD or recent ML baselines while maintaining high ensemble quality, as seen in low Jensen-Shannon distances to reference trajectories and accurate MSM-derived state populations.

Weaknesses

The molecular systems are small; all quantitative results concern peptides of at most five residues. Because the core claim is “length-agnostic generalisation”, the absence of tests on mini-proteins (tens of residues) or multi-chain complexes leaves open whether the sampler remains stable, efficient, and accurate once the conformational landscape becomes vastly higher-dimensional. Until such scale-up validation is provided, the generality of JAMUN’s speed and quality advantages must be regarded as promising but unproven.

Questions

  1. Noise-level transferability. The optimal latent noise σ was tuned on four-residue peptides and fixed at 0.4 Å. Have you investigated whether the same σ remains effective when the system size increases by an order of magnitude, say, for a 30-residue mini-protein, particularly when secondary structures such as α-helices or β-hairpins emerge?

  2. Scaling up with collective variables. JAMUN is demonstrated in an all-atom coordinate space. Scaling to larger proteins often benefits from reduced representations, for example by propagating only selected collective variables while reconstructing atomistic detail on demand. Can the walk-jump framework be adapted to such collective-variable spaces, perhaps by learning the score in CV coordinates and performing the jump in a coarse-grained latent?

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Final Justification

The authors addressed the key points I raised. They conducted additional experiments, especially the application to larger systems, which help to support their claims of scalability. While the results are preliminary, they suggest that JAMUN can generalize beyond the peptide scale without retuning.

Formatting Issues

No formatting issues

Author Response

Common Response for All Reviewers (identical to the common response posted in the first review thread above).

Response to Reviewer iqm5

  1. We believe our preliminary experiments on the fast-folding proteins and ATLAS, where we kept all training and sampling hyperparameters fixed as before, should answer your question. These hyperparameters do seem to be robust.
  2. Yes, the framework is general enough to support the sampling of collective variables (similar to how diffusion models can be learned for such data). Here, we wanted to utilize the connection between walk-jump sampling and MD, which operates in an all-atom space, to show that it might be possible to build transferable yet efficient samplers for molecular conformations.
Comment

Thank you for addressing the key points I raised. I appreciate the additional experiments, especially those on larger systems, which help to support your claims of scalability. While the results are preliminary, they are very encouraging and suggest that JAMUN can generalize beyond the peptide scale without retuning. I would recommend including at least a subset of these findings—perhaps visualized in the style of Figure 2a—in the camera-ready version, as they would strengthen the paper and clarify the model's applicability to more realistic biomolecular systems. It was a pleasure to review such a well-executed and thoughtfully extended work, and I especially enjoyed seeing the promising results on larger biomolecular systems.

Comment

Thank you for the very kind comments, and for the helpful feedback that has helped us improve our paper significantly! We will be sure to add our preliminary results on larger biomolecules in our camera-ready version, if accepted.

Review (Rating: 4)

The authors present JAMUN, a walk-jump sampling model for generating ensembles of molecular conformations, outperforming the state-of-the-art TBG model and competitive with the performance of MDGen, with no protein-specific parametrization. This is an application paper that applies walk-jump sampling to MD data.

Strengths and Weaknesses

Strengths

  1. The idea of connecting MD and score-based learning (more specifically, Langevin dynamics) is interesting.
  2. The experimental results presented in the paper are comprehensive and demonstrate competitive performance.

Weaknesses

  1. The overall presentation of the paper would benefit from improvement. Adding an algorithm flowchart or illustrative diagram summarizing the proposed approach would help readers better grasp the core methodology and overall workflow.
  2. The motivation for using walk-jump sampling needs to be further clarified. It is recommended to supplement the introduction with a discussion of the existing limitations of current diffusion and flow-matching methods to better justify the proposed approach.

Questions

  1. How does the walk-jump process introduce physical priors? Both noise injection and Langevin dynamics are essentially probabilistic sampling techniques, which still differ significantly from real physical processes.
  2. Can JAMUN benefit from MD trajectory data, or is it solely designed for sampling from equilibrium states?
  3. If so, how does MD connect with Langevin dynamics in JAMUN in a physical sense?
  4. Why is a single noise level used? Following the score function will lead to a decreasing noise level; how can one make sure the score estimate remains correct as the noise level decreases?

Limitations

Yes, the authors have discussed the limitations.

Final Justification

The walk-jump process was initially proposed in dWJS (discrete walk-jump sampling for protein generative modeling), and this paper borrows the process for molecular dynamics. The application is appropriate, although with limited innovation in generative modeling. I think that, with very valuable MD process data, the authors could try to learn the true kinetic process to help obtain physics-favored energetic distributions. Meanwhile, many bridge processes have been adopted to simulate the MD process and obtain energetic distributions; the motivation and advantages of WJS over diffusion bridges should be made clearer. In summary, I appreciate the detailed explanations from the authors during the rebuttal, and this paper is a nice application of WJS to MD. It is important to try out different generative models for different tasks, and the authors did a good job.

Formatting Issues

N/A

Author Response

Common Response for All Reviewers (identical to the common response posted in the first review thread above).

Response to Reviewer u7xj

  1. and 3. The walk-jump process corresponds to performing Langevin dynamics in the smoothed Y-space, just as classical MD performs Langevin dynamics in the clean X-space. Langevin dynamics is an approximation of the friction and thermal fluctuations induced by solvent particles: https://en.wikipedia.org/wiki/Langevin_dynamics#Overview.
  2. We usually train on the entire MD trajectory data. Our results on cyclic peptides in Appendix J are on post-processed equilibrium states.
  4. Note that we perform Langevin dynamics to sample Y at a fixed noise level. This is different from the standard diffusion SDE, where the noise level is continuously reduced during sampling.

We would be happy to update Figure 1 for clarity to show the sampling scheme. We would also add our analysis showing that diffusion models spend many sampling steps because they sample from an uninformative Gaussian prior, whereas walk-jump sampling operates from a smoothed, partially noised distribution, which avoids the need for many steps.
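To make this concrete, the two steps can be written as follows (standard walk-jump notation, with p_σ = p ∗ N(0, σ²I) the Gaussian-smoothed data density):

```latex
% Walk: Langevin dynamics whose stationary density is p_sigma (fixed noise level)
\mathrm{d}y_t = \nabla_y \log p_\sigma(y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t
% Jump: Bayes-optimal denoising via Tweedie's formula
\hat{x}(y) = y + \sigma^2\,\nabla_y \log p_\sigma(y)
```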

Comment

We hope that the rebuttal addressed many of your initial concerns. Please do let us know if there is any other information we can provide to help clarify our contributions.

Comment

Thanks to 6mMK for the gentle reminder, and sorry for the late reply. After reading the rebuttal, the other reviews, and the paper again, some of my concerns have been addressed. I have the following follow-up questions:

  1. Shared concern with tyev. The diffusion process defined by the SDE equations could itself correspond to a Langevin dynamics process, so initially I was confused about the difference between the walk process and SDE integration. From my perspective, it is similar to performing several denoising steps (the walk process) and then denoising the samples at the end (the jump process), which is common practice for few-step generation (e.g., consistency models) and speed-ups. I understand the main difference is that JAMUN only performs denoising at a single noise level, but the score function could also be interpreted as a denoising direction, so could we regard the walk process as a denoising process as well?

  2. The usage of the MD trajectory. I am confused about how the MD trajectory is used in training. Standard practice is to treat the MD trajectory as a video and perform video generation, but I think that is not the case for JAMUN. From my understanding, JAMUN may perform denoising on independent frames of an MD trajectory and treat this 'denoiser' as something akin to the force field in classical MD. When sampling, JAMUN iteratively applies this force field to the noisy input and denoises back to obtain the final state. With this design, the temporal information in the MD trajectory, i.e. the true force field, is ignored, and the intermediate peptides may serve as a kind of data augmentation for the final state. And, as I mentioned before, the walk process on the noisy manifold may just be a probabilistic simulation of MD and may not reflect the true MD process itself.

I appreciate the efforts from the authors in conducting extra experiments and addressing my concerns. My comments are mainly based on the perspective of generative models, and I am okay with acceptance of this paper. Sorry again for the late reply.

Comment

Thank you for your response!

1. Walk process vs. SDE denoising. It is true that the jump step is related to diffusion denoising in that both use the same score-function concept (although the Bayes-optimal projection is different from an Euler-Maruyama step). However, the walk step is very different: it is generated by an SDE that samples the stationary distribution of the Gaussian-convolved Hamiltonian, and is not denoising at all. A diffusion trajectory is generated by an entirely different reverse-diffusion SDE, which transports samples from the prior density to the data density by effectively decreasing the noise level at each step. The walk process traverses a smoothed space at a fixed noise level and provides diversity, unlike SDE denoising steps, which refine towards lower noise levels. Also, unlike multiple denoising steps that all lead to a single denoised sample, each walk step can be denoised with a single jump to obtain a clean sample.

In summary: the walk step samples a stationary noised distribution for diversity, while the diffusion steps are a non-equilibrium trajectory towards a single clean sample.

2. How MD trajectories are used. Apart from MDGen, most methods attempting to emulate Boltzmann distributions do not treat this problem as video generation. While some methods do learn "integrators" (for example, Timewarp), even those train on pairs of frames, not trajectories. The vast majority of state-of-the-art methods (e.g., BioEmu, TBG, AlphaFlow) are similar to JAMUN in that each frame is taken separately, and the goal is not to capture kinetics but to capture the right equilibrium distribution. For most (though not all) biological and drug-discovery purposes, capturing the energetics is more important than the kinetics. We absolutely agree that the walk step does not reflect true kinetics in the noised space, and in fact would argue that this is part of the benefit of our method: we do not get stuck in "kinetic traps", but still produce the right distribution. While retaining kinetic information during training and/or learning a map to the correct kinetics is a small part of our current research directions, it is not the goal of this project, nor is it standard practice in the field.

Comment

Sorry for the late reply. I appreciate the detailed explanation from the authors, and my confusions have been resolved. I will raise my score accordingly. Good luck!

Final Decision

(4,5,5,4) This paper introduces JAMUN, a generative framework that accelerates molecular dynamics by performing Langevin dynamics in a smoothed, noised space using walk-jump sampling. Reviewers found the method simple, efficient, and transferable to unseen systems, with experiments showing substantial speedups over MD and competitive performance with ML baselines. While some concerns about baselines and validation remain, the overall assessment was positive, and I recommend acceptance.