PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 4, 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Scalable Equilibrium Sampling with Sequential Boltzmann Generators

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-08-16
TL;DR

We scale up Boltzmann Generators using large all-atom transformers and apply sequential Monte Carlo to improve sampling.

Abstract

Keywords
Normalizing Flows, Boltzmann Generators, Annealed Importance Sampling

Reviews and Discussion

Official Review
Rating: 4

This paper proposes Sequential Boltzmann Generators, consisting of two conceptual ingredients: first, that invertible normalizing flows operating on Cartesian coordinates can scale to molecules as large as hexapeptides by leveraging non-equivariant transformers and the recent TarFlow framework; second, that annealing from the flow likelihood to the target with AIS and SMC can dramatically improve sampling efficiency. The authors show across-the-board improvements over CNFs on ALDP and scale these comparisons up to chignolin.

Questions for Authors

I have no important questions for the authors.

Claims and Evidence

The main claim of the work is that the proposed collection of strategies enables scaling of Boltzmann generators to systems of unprecedented size without the use of coordinate transformations. This claim is well supported by experiments — this is the first demonstration of out-of-the-box Boltzmann generators on hexapeptides or molecules of similar complexity. However, there are significant caveats to some details, discussed below.

Methods and Evaluation Criteria

Yes, the choice of model system and evaluation criteria is reasonable and effective at demonstrating the paper’s key claims.

Theoretical Claims

I did not carefully check the proofs of theoretical claims.

Experimental Design and Analysis

There are deficiencies in some of the experimental designs or analyses:

  • [Severe] For the proposed method, the authors report ESS after SMC resampling, which is a completely meaningless metric as by definition the ESS can be maintained to be arbitrarily high via resampling. The authors must address this point (ideally via the next suggestion); otherwise I will change the recommendation to Reject.

  • [Moderate] The authors should separately report the performance of SBG with and without resampling (i.e., only with AIS) to allow a more direct comparison of I.I.D. proposal quality and ESS.

  • [Moderate] The authors should clarify (or even better, report both) the Wasserstein distances of the proposal w/o reweighting vs the samples with reweighting (which should presumably approach 0 without finite sample effects).

  • [Minor] It would be great if the authors reported more fine-grained Wasserstein metrics, for example in TICA space for the larger molecules.

  • [Minor] It would be great if ESS was also reported for Chignolin.

  • [Minor] The proposal appears to still contain very high-energy structures. It could be interesting to analyze the types of errors exhibited in these structures.

Supplementary Material

Yes, I reviewed parts of the additional discussion in the supplementary material.

Relation to Prior Literature

The paper contributes to the literature on training Boltzmann generators with access to data, where the main technical challenge is in the parameterization of the learned distribution in a way that permits exact likelihoods. The choice adopted by this paper is quite novel compared to previous works, which have generally used continuous flows or flows over internal coordinates. In particular, this work opens up the long-sought possibility of scalable, transferable, exact likelihood flows. The contribution made by this paper should significantly change the course of future work in this area.

Essential References Not Discussed

All essential works were discussed.

Other Strengths and Weaknesses

Since the model permits fast likelihood evaluation, there is a missed opportunity to explore a mix of data-based and energy-based training as done in the original Boltzmann generator paper.

Also, with the freedom from internal coordinates, there is a missed opportunity to explore the training of transferable models.

Other Comments or Suggestions

The definition of T-W2 distance is not clear. Could the authors clarify? Is it a multidimensional W2 distance or an average of one-dimensional W2 distances?

The distinction between 100k vs 10k samples is not clear — could the authors clarify when this downsampling happens?

I would also note that the discussion in Appendix A.1 and Appendix C seems unnecessary and might be confusing. Most readers will understand that AIS cannot be accomplished without fast likelihoods, and it would be strange to propose an additional control so as to exactly cancel out the annealing terms. The Ito filtering paper is also very new and this paper should not feel obliged to guard against misunderstandings of their key result.

Also, the writing of Proposition 1 seems to be obfuscating the fact that the additional term is just a Gaussian on the CoM and that the partition function is a power of $2\pi$. Actually, could the authors clarify where the $\log(\|c\|^2/\sigma^3)$ term comes from in the log density of the $\chi(3)$ distribution?
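
For reference, a standard derivation (a sketch assuming the CoM $c \in \mathbb{R}^3$ has i.i.d. $\mathcal{N}(0,\sigma^2)$ components, which may not match the paper's exact construction) shows where such a term can arise: the norm $r = \|c\|$ then follows a $\chi(3)$ distribution with scale $\sigma$, whose log-density contains exactly that term.

```latex
p(r) = \sqrt{\tfrac{2}{\pi}}\,\frac{r^{2}}{\sigma^{3}}
       \exp\!\Big(-\frac{r^{2}}{2\sigma^{2}}\Big),
\qquad
\log p(r) = \log\frac{\|c\|^{2}}{\sigma^{3}}
          - \frac{\|c\|^{2}}{2\sigma^{2}}
          + \tfrac{1}{2}\log\tfrac{2}{\pi}.
```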

Author Response

# Rebuttal Reviewer S3Vr

We would like to thank the reviewer for their time, feedback, and positive appraisal of our work. We are heartened to hear that the reviewer feels that the “contribution made by this paper should significantly change the course of future work in this area.” We also thank the reviewer for acknowledging that the main claims of our work are “well supported by experiments”. We now address the questions and suggestions raised by the reviewer in order, and note that additional results are included in this link: https://anonymous.4open.science/api/repo/sbg/zip

ESS after SMC and SBG without resampling

The reviewer makes an astute observation that, with SMC and adaptive resampling, the ESS can be maintained at an artificially high value. We agree that such a metric is not meaningful in this case; we had opted to include it to be in line with the broader literature, e.g., in pure sampling NETS reports ESS after SMC. We will update the paper to remove ESS after SMC and, as the reviewer suggested, include SBG without resampling (SBG-AIS), which now appears as another row in Tables 1 and 2 of our rebuttal document. We find that SBG with AIS outperforms the previous SOTA (ECNF) and our introduced ECNF++ on AL3-AL6 across the sample-based metrics $\mathcal{T}\text{-}\mathcal{W}_2$ and $\mathcal{E}\text{-}\mathcal{W}_1$, and is slightly worse than ECNF++ in terms of ESS.
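
For concreteness, the following is a minimal sketch (not the paper's implementation) of how AIS log-weights and the pre-resampling ESS are typically computed with a flow proposal, assuming a geometric annealing path between the flow and target log-densities; `flow_log_prob`, `target_log_density`, and `transition` are hypothetical callables.

```python
import numpy as np

def ais_normalized_ess(x, flow_log_prob, target_log_density, betas, transition):
    """Anneal samples x from the flow density to the target along the path
    log pi_t(x) = (1 - beta_t) * log p_flow(x) + beta_t * log p_target(x)
    and return the normalized ESS of the accumulated AIS weights
    (no resampling, so the ESS remains an informative diagnostic)."""
    def log_pi(y, beta):
        return (1.0 - beta) * flow_log_prob(y) + beta * target_log_density(y)

    log_w = np.zeros(len(x))
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # incremental AIS weight: ratio of successive annealed densities
        log_w += log_pi(x, b_next) - log_pi(x, b_prev)
        # move samples under the new intermediate density (e.g. a few Langevin steps)
        x = transition(x, lambda y: log_pi(y, b_next))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return 1.0 / (len(w) * np.sum(w ** 2))  # normalized ESS in (0, 1]
```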

We thank the reviewer for allowing us to strengthen and clarify our empirical results, and we hope this new SBG-AIS variant alleviates this particular concern.

Wasserstein distances of proposal w/o reweighting

We thank the reviewer for this insightful idea. We have included Table 3 in the rebuttal link to quantify the performance of our proposal without reweighting, with importance sampling (i.e., just a BG), and with SBG. We observe a drastic improvement in $\mathcal{E}\text{-}\mathcal{W}_1$ for IS over the proposal, and an even greater improvement with SMC. We will include these results in our updated draft.

Wasserstein in TICA space

Thank you for the suggestion! We have now included TICA Wasserstein metrics for all AL3+ systems in the rebuttal experiments (Tables 1, 2, and 3). We find that SBG outperforms all baselines (ECNF/ECNF++) on this metric. We will include these results in our updated draft.

High energy structures

That is a very interesting question. Upon further investigation we found that the highest-energy AL6 samples result from steric clashes, as visualized in Fig 4. In Fig 5 we show the histogram and log histogram of the shortest non-bonded interatomic distance for MD (ground truth) and SBG samples. An additional source of high-energy samples is covalent bonds of insufficient length (Fig 6). We will include these results in a new appendix.

Mixing energy-based training

We thank the reviewer for this great suggestion. We have experimented with mixing energy-based training into standard data-based training and find preliminary evidence that this leads to better performance on $\mathcal{T}\text{-}\mathcal{W}_2$ and the newly introduced TICA metric, but marginally worse $\mathcal{E}\text{-}\mathcal{W}_1$ than standard SBG (Tables 1-2). Given these promising initial results, we will run a full set of experiments using this method and report them in the updated paper.

Transferable BG

We appreciate the reviewer's comment that operating in Cartesian coordinates more readily enables training transferable models. In this work, our primary focus was on demonstrating the scalability of SBG, which relies on normalizing flows and inference scaling through SMC. Consequently, we believe that extending this to transferable systems is a natural direction for future work, but out of scope for the current paper.

Other questions

The definition of T-W2 distance is not clear.

This is a multi-dimensional Wasserstein distance on torsion angles, accounting for the torus geometry of the angle space.
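
As an illustration of this definition, a minimal sketch (our reading, not the authors' evaluation code) that computes such a distance with the POT library, using the squared geodesic (wrap-around) metric on each torsion-angle dimension:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def torus_w2(angles_a, angles_b):
    """W2 between two sets of torsion-angle vectors in radians, shape (n, d),
    with the geodesic metric on the d-dimensional torus."""
    diff = angles_a[:, None, :] - angles_b[None, :, :]
    diff = np.angle(np.exp(1j * diff))         # wrap differences to (-pi, pi]
    cost = np.sum(diff ** 2, axis=-1)          # squared geodesic distance
    a = np.full(len(angles_a), 1.0 / len(angles_a))
    b = np.full(len(angles_b), 1.0 / len(angles_b))
    return np.sqrt(ot.emd2(a, b, cost))        # W2 = sqrt of optimal transport cost
```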

100k vs 10k samples

The subsampling is done after the final SMC step at the end of inference.

Proposition 1

To give further empirical credibility to the CoM adjustment, we perform additional ablations in rebuttal Figs. 2 and 3. We first find that the CoM norm $\|C\|$ of the proposal samples does indeed follow an approximate chi distribution, as expected given the training data augmentations. We also find that the CoM adjustment is important for stable IS reweighting (with a large but finite number of samples) and that sample metrics improve when the adjustment is employed in SMC.

We will update the proposition statement and proof to better convey our theoretical result.

Closing comments

We would like to thank the reviewer for their time and effort. We hope all our answers here allow the reviewer to continue to positively endorse our paper, and we would love to have the opportunity to clarify any lingering questions should the reviewer have them.

Reviewer Comment

I appreciate the substantive additional results provided by the authors. I think these new results raise several subtle and interesting points. These should not be construed as changes in my evaluation of the paper but rather suggestions for a deeper analysis and discussion in the camera ready.

It is disappointing to see that there is no widespread evidence that reweighting of any kind is able to improve the NF proposal, and further that SMC does not consistently improve upon AIS in terms of Wasserstein metrics.

I will also complain about the organization of the new results. It would be extremely informative to have a table like the following:

|                   | ESS | E-W1 | T-W2 | TICA-W2 |
|-------------------|-----|------|------|---------|
| ECNF++ proposal   | N/A |      |      |         |
| ECNF++ reweighted |     |      |      |         |
| SBG proposal      | N/A |      |      |         |
| SBG reweighted    |     |      |      |         |
| SBG AIS           |     |      |      |         |
| SBG SMC           | N/A |      |      |         |

I will maintain, however, my high rating on the paper on the basis of the fact that this is the first work of any kind to offer a glimmer of hope that high-quality NFs with fast, tractable likelihoods can be developed. I believe that strategies such as AIS or SMC will be essential to future work in Boltzmann generators and this paper provides important signal that NF architectures exist to realize such strategies.

Author Comment

We thank the reviewer for their further consideration of our work in light of our rebuttal and extended results. We agree with the reviewer's welcome suggestion and intend to perform a deeper analysis and discussion of the rebuttal results in the camera-ready version.

We acknowledge the reviewer's comment that “there is no widespread evidence that reweighting of any kind is able to improve the NF proposal”, but we would like to very politely push back. Specifically, in our rebuttal experiments we found a reduction in our energy metrics after reweighting, as observed empirically in Table 3 of our rebuttal PDF. We do, however, agree with the reviewer that reweighting does not appear to benefit macrostructure in our dihedral-angle and TICA metrics. We hypothesize that there is potentially a trade-off between energy and macrostructure. Furthermore, we do not believe this to be unique to our proposed SBG, and we will add a similar comparison (proposal vs IS) for ECNF++ in updated versions of the paper.

We thank the reviewer for their suggestion to improve the presentation of our results, and we will adopt the reviewer's recommendation in modifying the presentation of the final table results in our camera-ready version.

We are glad the reviewer shares our excitement for this work, and thank them greatly for their supportive comments that allowed us to strengthen the empirical caliber of our work.

Official Review
Rating: 3

This paper improves data-driven, learning-based Boltzmann Generators (BG) with Sequential Monte Carlo (SMC), based specifically on a non-equilibrium transport method (NETS) recently proposed by Albergo & Vanden-Eijnden (2024). Unlike NETS, whose source energy is a pre-defined prior, here the source energy is learned from data with a normalizing flow. Empirical results demonstrate the effectiveness of the proposed method.

Questions for Authors

  • It seems redundant to introduce CNFs in the main paper, given that an NF is used in practice to model $\log p_\theta$. What is the reason for mentioning CNFs in Sec 2?

Claims and Evidence

Y

Methods and Evaluation Criteria

To the best of my understanding, the proposed method represents a specific instantiation of NETS in which a data-driven source energy is employed. Given that NETS is theoretically applicable to any choice of source energy, the proposed approach seems incremental, incorporating a pre-trained normalizing flow (NF) as the source energy.

Theoretical Claims

Y

Experimental Design and Analysis

  • What is the experimental setup for learning $p_\theta$? How many samples are required in $\mathcal{D}$? Since the proposed method is a two-stage approach, training details should be clarified in Sec 4.

Supplementary Material

Y

Relation to Prior Literature

Efficient sampling methods for Boltzmann distributions could benefit the AI4Science area in applications such as drug and material discovery.

Essential References Not Discussed

Y

Other Strengths and Weaknesses

See above sections

Other Comments or Suggestions

N/A

Author Response

# Rebuttal Reviewer HyU2

We thank the reviewer for their time and effort. We are glad that the reviewer found our empirical results to “demonstrate the effectiveness” of our method SBG. We next clarify the main points raised in the review and note that additional results are included in this link: https://anonymous.4open.science/api/repo/sbg/zip

Novelty

We acknowledge the reviewer's concern that SBG can be thought of as an application of NETS-style inference on a learned proposal. We first highlight that our approach differs from NETS in that we do not learn an extra drift term through an auxiliary loss, e.g., a PINN objective. In NETS, such a term is crucial to reduce the variance of the importance weights during a linear interpolation. However, SBG does not need to employ this computationally intense learning objective precisely because of our learned proposal. In contrast to an uninformative proposal, which has very small overlap with the target, a learned proposal mitigates the need to learn a drift term. Thus, we would like to politely push back against the assertion that SBG is a simple instantiation of NETS: it is in fact SMC with a computationally efficient way to perform Langevin dynamics, unlocked by using a normalizing flow rather than an equivariant CNF, which we argue is a novel insight that we exploit in the BG context.

With regards to our framework, we again would like to politely disagree with the reviewer, as our design choices fundamentally challenge the direction of BG research. More precisely, our main technical novelty lies in demonstrating the scalability of non-equivariant classical flows, in contrast to the dominant trend of leveraging equivariant CNFs. In addition, we include new algorithmic novelties such as CoM-adjusted resampling; rebuttal Figs. 2 and 3 highlight the improved stability and performance of IS/SMC when accounting for CoM augmentation. Furthermore, the overall scalability of our method to hexapeptides is a novel result, and is due to the combination of each component in SBG: 1) a non-equivariant classical flow and 2) SMC that leverages the exact energy of a classical flow. We highlight that these findings led Reviewer S3Vr to remark in their review, “The contribution made by this paper should significantly change the course of future work in this area.”

Finally, we highlight that our paper contains several theoretical results that quantify the bias added through various thresholding schemes. Such schemes have routinely been employed in the existing literature without justification or analysis. Our paper is the first to provide an exact characterization of the added bias, allowing practitioners to select a problem-dependent thresholding value.

Training details

We appreciate the reviewer's comment regarding the training details for the proposal flow. While many of the training, inference, and dataset details are included in Appendices D and E, we recognize the importance of including details in the main paper. We will update future versions of the work to include additional details in Section 4, but briefly state that all TarFlow models are trained for 1,000 epochs with learning rate 1e-4 and weight decay 4e-4, using the AdamW optimizer with a cosine learning-rate schedule and EMA with decay 0.999. Furthermore, directly answering the reviewer's question concerning training samples, Appendix E includes a description of dataset construction from MD trajectories for each of the peptide systems. For each system we use a training set of 100k contiguous samples from a single MD trajectory. We understand that such details are important to include directly in the experiments section, and we will revise the paper in the next draft to include the key details in the main body.
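
A minimal PyTorch sketch of this training recipe (the model and data loader are placeholders, and the `log_prob` method is assumed; this is not the authors' code):

```python
import copy
import torch

def train_proposal(model, train_loader, epochs=1000, ema_decay=0.999):
    """Maximum-likelihood training with AdamW (lr 1e-4, weight decay 4e-4),
    a cosine learning-rate schedule, and an EMA copy of the weights."""
    ema_model = copy.deepcopy(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=4e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for batch in train_loader:                 # e.g. 100k contiguous MD samples
            loss = -model.log_prob(batch).mean()   # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():                  # EMA update
                for p_ema, p in zip(ema_model.parameters(), model.parameters()):
                    p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
        scheduler.step()
    return ema_model
```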

Background on CNF

We value the reviewer's feedback that the discussion of CNFs might seem ancillary to a paper that uses classical normalizing flows as the model. The key reason to introduce CNFs is to illustrate that our inference-time scaling through SMC, while theoretically possible with a CNF proposal, faces significant scalability challenges due to the need to simulate the ODE in Equation 4 to compute $\log p_\theta$ for every energy evaluation along the interpolation used in Langevin dynamics. In contrast, our TarFlow requires only one call per Langevin step. However, we understand that this section may not have been tightly integrated into the paper, and we will improve the clarity of presentation in the updated draft.
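
To make the cost argument concrete, here is a sketch (hypothetical `flow_log_prob` and `target_log_density`; not the paper's code) of one unadjusted Langevin step on an interpolated log-density: with an invertible flow the `flow_log_prob` term is a single forward pass per call, whereas with a CNF every call would require integrating an ODE.

```python
import torch

def langevin_step(x, flow_log_prob, target_log_density, beta, step_size):
    """One unadjusted Langevin step targeting
    log pi_beta(x) = (1 - beta) * log p_flow(x) + beta * log p_target(x)."""
    x = x.detach().requires_grad_(True)
    log_pi = (1.0 - beta) * flow_log_prob(x) + beta * target_log_density(x)
    grad = torch.autograd.grad(log_pi.sum(), x)[0]
    noise = torch.randn_like(x)
    return (x + step_size * grad + (2.0 * step_size) ** 0.5 * noise).detach()
```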

Closing comments

We thank the reviewer again for their questions, which allowed us an opportunity to clarify our paper. We hope that our rebuttal responses fully address all the important questions raised by the reviewer, and we kindly ask the reviewer to potentially upgrade their score if the reviewer is satisfied with our responses. We are also more than happy to answer any further questions that arise.

Official Review
Rating: 3

The manuscript presents the Sequential Boltzmann Generators (SBG), a novel extension to the existing Boltzmann generator framework for scalable sampling of molecular states in thermodynamic equilibrium. The framework removes the SE(3)-equivariance and encodes equivariance softly via data augmentations, achieving enhanced computation efficiency. The method leverages inference-time non-equilibrium transport via Sequential Monte Carlo (SMC), progressively refining proposal samples and improving their alignment with the target Boltzmann distribution. SBG employs a Transformer-based normalizing flow for efficient likelihood computation, avoiding the costly integration required by continuous normalizing flows. The experimental results demonstrate the state-of-the-art performance of SBG, successfully scaling equilibrium sampling to larger molecular systems that were previously intractable for standard Boltzmann generators.

Questions for Authors

The evaluation in the paper is primarily focused on peptide systems, whereas it is unclear how well SBG generalizes to chemically diverse systems like small organic molecules or metal-organic frameworks. Therefore, have you considered any test on different chemical systems, and if not, do you anticipate any limitations when applying SBG to those systems?

Claims and Evidence

Most of the claims are supported by clear and convincing evidence. However, the following claim might be problematic to some extent:

  1. The paper claims that the model can generate uncorrelated samples efficiently, given the ESS results. However, while a high ESS and improved sampling efficiency are impressive, they do not directly confirm the independence of samples the way an autocorrelation analysis would.

Methods and Evaluation Criteria

  1. The proposed method aligns well with the problem of scalable equilibrium sampling for molecular systems. Using a Transformer-based normalizing flow coupled with non-equilibrium transport via SMC is a reasonable approach to improve sample quality and importance weighting.

  2. The evaluation metrics, namely Effective Sample Size (ESS), Wasserstein distances for energy distributions and dihedral angles, and Ramachandran plots, are reasonable for the task in the paper and provide a comprehensive assessment of sampling quality. However, to demonstrate that the model can generate uncorrelated samples, autocorrelation plots might be another indicative evaluation method.

  3. The datasets are about different peptide systems (di-, tri-, tetra-, and hexapeptides), which are typical for molecular sampling work. However, additional benchmarks on more chemically diverse molecules could further validate the generalizability of SBG.

Theoretical Claims

I checked propositions 1-3 and the sampling algorithms, and they are theoretically sound and mathematically consistent.

Experimental Design and Analysis

  1. The paper benchmarks SBG against existing SOTA methods, including SE(3)-equivariant coupling flows and equivariant continuous normalizing flows. Therefore, the comparison between SBG and baselines gives indicating evidence of SBG's performance.

  2. The comparison of computational efficiency is insightful, as it highlights the scalability of SBG compared to baseline models.

Supplementary Material

I reviewed the supplementary material, mainly focusing on the proposition proofs and experimental details.

Relation to Prior Literature

The key contributions of the paper contribute to the research in this area in terms of:

  1. It builds upon the Boltzmann generator (Noé'2019) and introduces non-equilibrium transport via Sequential Monte Carlo, improving sample quality and efficiency significantly.

  2. It removes the explicit constraint of equivariance but implicitly learns this via data augmentation, which is also supported by some recent work in this area like AlphaFold3 (Abramson'2024) and MCF (Wang'2024).

Essential References Not Discussed

All the related and essential references are discussed.

Other Strengths and Weaknesses

Strengths:

The paper integrates Sequential Monte Carlo with Boltzmann Generators in a novel way, enabling non-equilibrium transport for improved sample quality and scalability. As a result, the normalizing flow operates directly on all-atom Cartesian coordinates, which was previously intractable for existing methods. In addition, the claims in the paper are well supported by the mathematical derivations and proofs.

Weaknesses:

While the paper claims that the model generates independent samples more efficiently, it does not analyze sample autocorrelation and compare the results to traditional MCMC methods.

Other Comments or Suggestions

See Strengths and Weaknesses.

Author Response

# Rebuttal Reviewer 9RzA

We thank the reviewer for their thoughtful comments and feedback. We value that the reviewer found that most of our claims were supported by “clear and convincing evidence” and that our mathematical claims were “theoretically sound and mathematically consistent”. We are also glad that the reviewer found our use of SMC in BGs to be “novel” and allows for “scalability,” which is supported by our “insightful” comparison on computational efficiency. We next address all the salient points raised in the review and note that additional results are included in this link: https://anonymous.4open.science/api/repo/sbg/zip

(Auto) Correlation of samples

We acknowledge the reviewer's request for the inclusion of an autocorrelation analysis. However, we emphasize that the samples generated by SBG cannot suffer from autocorrelation. This is because only samples at $\tau = 1$ are returned as generated samples, hence there is no time dimension along which autocorrelation could exist. This is in contrast to methods such as MD, in which a single trajectory is returned as generated samples and samples may exhibit autocorrelation over sampling time.

To bolster our study, we have included a new variant of SBG (SBG-AIS) that does not perform any resampling during the annealing process (see Tables 1 and 2). This ablation principally serves to eliminate the correlation that SMC resampling might introduce during inference scaling. We observe that SBG-AIS achieves acceptable ESS and outperforms the other baselines on sample-based metrics. We hope this makes the impact of correlation during inference in SBG clearer.

Benchmarking MCMC

We thank the reviewer for the suggestion of additional baselines. We are, however, unsure exactly which MCMC methods the reviewer believes to be suitable for benchmarking on the evaluation systems. We note that the goal of BGs is to amortize sampling and train a model that is able to quickly draw many uncorrelated samples. This is a fundamentally different approach, with both benefits and drawbacks compared to MCMC methods.

Evaluating SBG beyond peptides

We value the reviewer's feedback, as exploring SBG's generalizability to settings beyond the peptides considered in this paper is interesting. We first note that before SBG, modern BGs utilized equivariant CNFs instead of classical normalizing flows with invertible architectures. This was due to the widely held belief that equivariant CNFs were more expressive and more amenable to larger-scale problems of interest. Despite this promise, these equivariant CNFs, as demonstrated in our ablations, struggle on datasets beyond AL3/4 and are much slower to evaluate due to the need for simulation. Thus, we argue that even demonstrating that a classical normalizing flow paired with an inference-scaling strategy can tackle a larger peptide such as the hexapeptide (AL6) is extremely interesting. These results demonstrate the expressive power of our framework in comparison to previous BGs.

Such results led Reviewer S3Vr to highlight in their review, “The contribution made by this paper should significantly change the course of future work in this area.” As a result, we believe the experimental validation of our introduced SBG approach is well supported by the current peptide datasets; testing on other chemically diverse systems, while extremely interesting, is beyond the scope of the current paper and remains an exciting direction for future work. At this time we do not anticipate any limitations specific to chemically diverse systems such as small organic molecules or metal-organic frameworks, but we chose to evaluate on peptides, as has been the norm in many prior BG works.

Closing comments

We thank the reviewer for their time and effort in reviewing our work, and we hope the reviewer will kindly consider a fresh evaluation of our work, given the main clarifying points outlined above. We are also eager to engage in further discussion if the reviewer has any lingering doubts.

Reviewer Comment

Thank the authors for the comprehensive rebuttal. The rebuttal has addressed most of my concerns, especially the correlation of samples and the scalability of SBG in terms of sampling efficiency and tackling AL6. The only thing that is not fully resolved is how SBG can deliver scientific insight, which is what my question was actually asking about. For example, can the model sample rare events for large peptides, or can it be generalized to other chemical systems besides peptides?

With the above, I'd like to raise my score from 2 to 3.

Author Comment

We thank the reviewer for their time and consideration of our work, and are pleased to have addressed most of their concerns in our rebuttal. We additionally thank the reviewer for their score increase.

The reviewer raises highly intriguing questions concerning scientific applications of SBG, which we intend to explore in future work. We thank the reviewer for their thoughtful suggestions; we do not anticipate any inherent issues with exploring non-peptide systems, and we agree that scaling to larger peptides is of great scientific relevance, including for the sampling of rare events.

Official Review
Rating: 4

This paper introduces Sequential Boltzmann Generators (SBG), an extension to the Boltzmann generator framework. By replacing conventional importance sampling with a non-equilibrium annealing process, the authors aim to transport proposal samples toward the target Boltzmann distribution. The authors also propose a normalizing flow that leverages a Transformer-based, exactly invertible TarFlow trained with soft equivariance penalties rather than equivariant architectures. Experimentally, the paper demonstrates increased performance over BG baselines on a series of molecular systems (ranging from dipeptides up to decapeptides such as Chignolin) in terms of effective sample size (ESS), Wasserstein distances, and inference-time scaling.

Update after rebuttal

My doubts were initially about the scalability of SMC / re-weighting approaches in general being able to go beyond classical force-fields and lack of comparison to alternative diffusion samplers like iDEM (which essentially amortizes force-field evaluation in training time, not required for sampling). However, I want to acknowledge that showing efficacy of invertible NF with exact likelihoods enables some interesting downstream applications as the authors have accomplished here (to my knowledge, not currently possible with iDEM). I think this work highlights some interesting avenues in sampling and is a high-quality paper, so I want to increase my support 3->4.

Questions for Authors


Although there is amortization here from the normalizing flow, it seems that sampling still requires many queries to the potential energy model. How would this technique work in situations where the potential energy is computationally expensive to evaluate?

Claims and Evidence

SBG claims to improve the scalability of equilibrium sampling in two ways: 1) by replacing the importance-sampling reweighting of proposal samples with a target-informed non-equilibrium process, and 2) by increasing the computational efficiency of the proposal model using a non-equivariant invertible transformer architecture, enforcing equivariance softly via data augmentation.

Based on the peptide sampling experiments, the authors have shown evidence of scaling up compared to prior Boltzmann generator techniques and may very well be the first to sample these larger peptides (at least in Cartesian coordinates). However, there are other generative modeling methods which aim to perform scalable amortized Boltzmann sampling without requiring invertible networks or additional importance weighting / SMC. One example, which is only ever mentioned in Table 1, is iDEM, which has been shown to scale well to high-dimensional configurations (a 55-particle Lennard-Jones potential). If the goal is scaling up equilibrium sampling, I find it confusing why the iDEM method is cited in the table but never discussed or benchmarked. Flow Annealed Importance Sampling Bootstrap (FAB) and the path-integral sampler (PIS) may also be competitive here, and can fully make use of equivariant architectures.

If these concerns are clarified I would definitely reconsider my score.

Methods and Evaluation Criteria

I think the proposed methods are described clearly and many design choices are directly related to the goal of sampling molecular configurations. As mentioned above, I find comparing only against Boltzmann generator techniques strange.

Theoretical Claims

The theoretical claims appear correct. Prop 2 is based on a recently established result. The energy thresholding technique suggested by Prop. 3 is interesting.

Experimental Design and Analysis

(mostly reiterated from the claims-and-evidence section) Based on the peptide sampling experiments, the authors have shown thorough evidence of scaling up compared to prior Boltzmann generator techniques and may very well be the first to sample these larger peptides (at least in Cartesian coordinates). However, there are other generative modeling methods which aim to perform scalable amortized Boltzmann sampling without requiring invertible networks or additional importance weighting / SMC. If the goal is scaling up equilibrium sampling, I find it confusing why the iDEM method is cited in the table but never discussed or benchmarked. Flow Annealed Importance Sampling Bootstrap (FAB) and the path-integral sampler (PIS) may also be competitive here, and can fully make use of equivariant architectures.

Supplementary Material

I only reviewed some additional implementation details of the experiments and training run-times.

Relation to Prior Literature

I expect to see more applications of these non-equilibrium processes as a drop-in replacement for self-normalized importance sampling. I think the paper could have addressed other non-BG frameworks for sampling that have been proposed recently.

Essential References Not Discussed

I think there definitely could be more discussion on recent generative modeling techniques for sampling outside of Boltzmann generators. In particular, there are diffusion-based samplers which all have different tradeoffs, but can possibly scale just as well here:

Cited but not discussed: Flow Annealed Importance Sampling Bootstrap (2023), iDEM (Akhound 2024), Transport meets variational inference ( Vargas 2024)

Not cited: Path-integral Sampler (Zhang 2021), Particle Denoising Diffusion Sampler (Phillips 2024), Sequential Controlled Langevin Diffusions (Chen 2024) (concurrent work)

Other Strengths and Weaknesses

Strengths:

The paper offers some interesting alternatives for normalizing flow architectures that do not rely on explicit equivariant parameterizations. The inductive biases / symmetries for molecular sampling problems are still quite strong, but the non-equivariant architectures may be necessary as we scale up to even larger systems and datasets.

The use of NETS in place of self-normalized importance weighting is a generally smart drop-in design choice for Boltzmann samplers.

Prop 3 gives us a principled technique for energy thresholding.

Weaknesses:

For the purpose of scaling up equilibrium sampling, I believe the paper is missing some essential baselines (in particular iDEM) and discussion of related methods listed above. It seems the paper only shows evidence of scaling over previous Boltzmann generator frameworks, but not other amortized samplers.

I was a little bit disappointed in the novelty, since the application of NETS and the architecture choice of the normalizing flows are somewhat orthogonal to each other. It seems they do work well together experimentally, but I believe more baselines are needed to prove its necessity for scalable equilibrium sampling.

As I said before, If these concerns are clarified I would definitely reconsider my score.

Other Comments or Suggestions


The acronym EACF isn't defined until the appendix; it should be defined before the appendix.

Many previous works evaluate on synthetic energy functions based on the Lennard-Jones (LJ) potential. They are not as interesting as peptides, but they might help place the work better with respect to previous work.

Author Response

# Rebuttal Reviewer VNPc

We thank the reviewer for their time, feedback, and nuanced comments. We are glad that the reviewer found the non-equivariant NF an “interesting alternative” which is “necessary as we scale up to even larger systems and datasets”. We also appreciate that the reviewer recognized our use of SMC instead of IS as a “smart drop-in design choice”. Finally, we are heartened to hear that the reviewer agrees that Prop 3 “is a principled technique”. We now address their key questions raised in the review and note that additional results are included in this link: https://anonymous.4open.science/api/repo/sbg/zip

iDEM as a Baseline

We appreciate the reviewer's valuable comments regarding iDEM as an additional baseline. We would like to first politely recall that the setting of iDEM and other diffusion samplers is completely data-free (i.e., there is no training set), in contrast to the BG setting, which includes training on (biased) data. This allows BGs to scale much more easily than data-free amortized samplers. Consequently, we argue that scalability claims in each setting cannot be meaningfully compared. To our knowledge only one sampling method has successfully scaled to any molecular task, namely FAB on ALDP using an intrinsic coordinate system, while molecular tasks are the focus of most works on BGs. To our knowledge, no sampling method has successfully scaled to any molecular task in Cartesian coordinates, the focus of this work.

To investigate this setting, we had private correspondence with the iDEM authors, who told us that iDEM failed to scale to ALDP even with substantial effort; to their knowledge no diffusion-based sampler can scale to molecular tasks. Following the reviewer's suggestion, we trained iDEM on ALDP ourselves. In this case we use vacuum instead of the implicit solvent used in our main results to speed up training (see Fig 1 in our link). We observe that iDEM is unable to sample successfully even in this easier setting, with most modes missing and a poor energy distribution.

Novelty

We value the reviewer's feedback that the application of SMC and a general-purpose normalizing flow may initially appear to have limited technical novelty. We would like to politely push back against this assertion, as our framework and design choices fundamentally challenge the predominant approach in BG research. Prior to SBG, all modern BGs in Cartesian coordinates have resorted to equivariant CNFs; the fact that we can omit exact equivariance and use an exact-likelihood NF is a novel insight. Moreover, our CoM adjustment strategy is also new. We ablate its importance in new plots in the rebuttal link (Figs. 2 and 3), which show that IS reweighting significantly benefits from this adjustment strategy.

In fact, Reviewer S3Vr notes, “The contribution made by this paper should significantly change the course of future work in this area.” We further wish to highlight that the paper contains numerous theoretical results on thresholding schemes, some of which are utilized in existing papers without proper justification. Our theoretical results allow us to quantify the impact of thresholding in this process. We hope that the reviewer may join us in agreeing that the design choices utilized in SBG allow for a fresh approach to building BGs using non-equivariant components with soft penalties that demonstrably scale better to larger peptides in Cartesian coordinates, which remained an open challenge until SBG.

Additional References

We acknowledge the reviewer's comment regarding the inclusion of more non-BG samplers. We will update the paper with a dedicated discussion of non-BG sampling and include the references suggested by the reviewer: FAB, PIS, PDDS, and the concurrent work SCLD.

LJ Potential

The reviewer is correct to highlight that many previous samplers and BG papers also evaluate on LJ potential systems. Such systems are, however, less challenging than even small peptides (e.g., ALDP), making them of limited interest once peptides can be successfully tackled. We thank the reviewer for the suggestion that this may help better place the work, but instead draw attention to the evident failure of iDEM on ALDP in Fig 1 as clear evidence that BG methods are superior in the data-available setting.

Computational expense

We thank the reviewer for inquiring about the efficiency of our method with respect to energy/force evaluations. During sampling, a single force evaluation is required per particle per timestep; hence, if the force is expensive to evaluate, this will present a computational cost.

Closing comments

We thank the reviewer for their valuable feedback and great questions. We hope that our rebuttal fully addresses all the important points raised, and we kindly ask the reviewer to potentially upgrade their score, as they indicated, if they are satisfied with our responses. We are also more than happy to answer any further questions that arise, please do let us know!

Reviewer Comment

I appreciate the authors' thorough response. Regarding diffusion samplers, I don't think there is some limitation that prevents pre-training a diffusion model on biased data, so I'm not sure I agree with the authors' point that BGs are inherently more scalable.

SBG is going further than previous works here in the case of scaling classical force-fields, but as the authors acknowledge, more accurate force-fields (e.g. DFT solvers) will present significant computational and scalability challenges (as with almost any re-weighting or SMC approach). There are even huge efforts right now trying to learn large equivariant GNNs to predict these DFT force-field calculators (MACE-OFF [Kovacs 2023]) in order to achieve more computationally efficient evaluations that are more realistic than classical force-fields. Even still, I imagine these GNN would be quite challenging to incorporate into SBG. This is my reason for pushing back on the focus of scalability claims in the paper.

That being said, I have considered the additional iDEM results and my concerns for lack of diffusion sampling baselines are alleviated (based on the remark of the original iDEM authors, I trust that baseline is representative of iDEMs best performance on this benchmark) and I'm leaning towards accept now 2->3.

[Kovacs 2023] MACE-OFF23: Transferable Machine Learning Force Fields for Organic Molecules

Author Comment

We appreciate the time the reviewer has taken to reconsider their evaluation of our work, and their score increase.

The reviewer is correct to identify that diffusion-based samplers could be pretrained on biased data; however, this is not the standard approach, as most of these works consider the data-free setting. It remains an open (and highly interesting) research question whether diffusion-based samplers pretrained on a dataset similar to a BG's outperform the BGs themselves. However, as it stands, given a dataset from an MD trajectory, BGs remain the most successful method for Boltzmann sampling, with no diffusion-based sampler achieving acceptable performance on ALDP in Cartesian coordinates.

We acknowledge the reviewer's concern regarding the computational cost of more accurate force fields. While SMC requires more force evaluations than IS, methods including diffusion-based samplers also carry this requirement; hence we believe this is not a limitation unique to our work (or to BGs). SBG supports an arbitrary differentiable target energy function, so it would be algorithmically trivial to incorporate learned DFT approximations as the reviewer suggests. Exploring such avenues, and improving the force-evaluation efficiency of SBG, is an exciting direction for future work.

We thank the reviewer for their acknowledgement of our iDEM results and our discussion with the iDEM authors, and are glad this has alleviated this concern they held.

We once again thank the reviewer for their comments and feedback, which has enhanced the empirical quality of our work.

Final Decision

This paper introduces Sequential Boltzmann Generators, merging ideas from sequential Monte Carlo techniques with Boltzmann Generators. The approach aims to align the inference procedure of a surrogate of the Boltzmann distribution with the actual target density using an SMC scheme, instead of only performing importance sampling after inference. The approach is conceptually interesting and practical, providing state-of-the-art results on challenging physical systems. All doubts were resolved during the discussion period and the reviewers unanimously recommended acceptance.