PaperHub
Score: 6.6/10
Poster · 4 reviewers
Ratings: 4, 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We introduce a highly scalable algorithm for learning to sample from only energy functions, the first of its kind in terms of efficiency, which we scale to new benchmarks on conformer generation.

Abstract

Keywords
Sampling, Stochastic Optimal Control

Reviews and Discussion

Review
Rating: 4

The paper proposes Adjoint Sampling, a scalable and effective method for diffusion samplers. Authors build their ideas on top of adjoint matching and propose several advancements for scalable and effective training of diffusion samplers. Experiment results validate that the proposed method outperforms several baselines across various benchmarks.

Questions for Authors

Please see above.

Claims and Evidence

The claim made in the submission is supported by clear and convincing evidence.

Methods and Evaluation Criteria

The authors follow the conventional experiment setting for diffusion samplers.

Theoretical Claims

I checked that the theoretical claims of the paper (e.g., Proposition 3.1) are correct.

Experimental Design and Analysis

N/A

Supplementary Material

I read the appendix of the paper to understand some details of each procedure.

Relation to Existing Literature

Diffusion samplers can be applied to various scientific applications such as physical simulations.

Missing Important References

N/A

Other Strengths and Weaknesses

There are several comments and suggestions listed below.

  • It seems the training procedure can be conducted in an off-policy manner. Is there any reason why the samples are uniformly sampled from the buffer? Are there any other possibilities to improve the sample efficiency by using several off-policy training schemes?

[1] Sendera, Marcin, et al. "Improved off-policy training of diffusion samplers." The Thirty-eighth Annual Conference on Neural Information Processing Systems.

  • It will also be nice to compare the time complexity or performance of naive adjoint sampling (e.g., without reciprocal adjoint matching).

Other Comments or Suggestions

Please see above.

Ethics Review Issues

N/A

Author Response

We thank the reviewer for supporting our paper. Note that we have added additional experiments and figures to provide more insight into our work. See https://sites.google.com/view/adjointsamplingrebuttal. Below we answer the reviewer’s questions in detail.

It seems the training procedure can be conducted in an off-policy manner.

We wish to clarify that our method is not actually off-policy, in the sense that we cannot sample from an arbitrary SDE, which is what other methods often refer to as off-policy. That being said, there are many downsides to off-policy methods, such as requiring full trajectories for optimization; our method is much more scalable because it requires only (X_t, X_1) samples.

Are there any other possibilities to improve the sample efficiency by using several off-policy training schemes?

The existing off-policy methods come with their own drawbacks, such as requiring full trajectories and not using the gradient of the energy for the training loss. We added additional experiments using the log-variance loss [1], but it does not perform well. We find that such off-policy methods only work well when the gradient of the energy is used directly as part of the drift parameterization [2], which is something we need to avoid as we move on to utilizing more computationally costly energy functions.

Is there any reason why the samples are uniformly sampled from the buffer?

This is a good suggestion. We did not specifically look into prioritized replay buffers, but it is possible to use importance weights to design a prioritized replay buffer where higher-weight samples are drawn more frequently [3]. We note that such an approach is difficult to implement for our proposed (amortized) benchmark, which aims at solving 24,000+ sampling sub-problems using a single model, as the priority over samples needs to be computed for each sub-problem separately. In order to train on all sub-problems and be able to generalize, we cannot sample each sub-problem too many times, as that incurs a high computational cost, so we used a simple uniform buffer.
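To make the contrast concrete, here is a minimal sketch of the uniform buffer described above; the class and names are ours for illustration, not from the paper. A prioritized variant would only change the `sample` method.

```python
import random
from collections import deque

class UniformBuffer:
    """Minimal replay buffer holding (condition_id, x1) pairs."""
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)  # oldest entries evicted automatically

    def add(self, condition_id, x1):
        self.data.append((condition_id, x1))

    def sample(self, batch_size):
        # Uniform draw with replacement: every stored sample is equally likely.
        # A prioritized variant would draw with probability proportional to an
        # importance weight, but the weights would have to be maintained and
        # normalized per sub-problem (24,000+ of them here), which is costly.
        return random.choices(self.data, k=batch_size)
```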

It will also be nice to compare the time complexity or performance of naive adjoint sampling (e.g., without reciprocal adjoint matching).

We have included additional ablations without the Reciprocal projection (see Tables 1 & 2 of the link), and also a runtime comparison between different methods (Figure 2 of the link). We note that naive Adjoint Matching is on par with the Discrete Adjoint (PIS) in terms of runtime, while the ablation "Adjoint Sampling w/o Reciprocal" is on par with Adjoint Sampling. However, without the Reciprocal projection, performance deteriorates because the method does not search the sample space as efficiently. With Reciprocal AM, the samples X_t are uncorrelated, whereas without it, the samples across time come from the same trajectory (see Figure 5 of the link). We hypothesize that the improved performance when using Reciprocal AM (in terms of Energy W2 for Table 1 and Recall for Table 2) is due to the ability to see more diverse X_t samples during training.
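To illustrate the difference, here is a sketch of the two ways of obtaining (X_t, X_1) pairs. For concreteness we assume a driftless scaled-Brownian base process pinned at X_0 = 0, whose conditional law given X_1 is a Brownian bridge; the function names and simplifications are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xt_reciprocal(x1, sigma=1.0):
    """Reciprocal-style pair: given a buffered endpoint X_1 (numpy array),
    draw X_t directly from the base-process bridge pinned at X_0 = 0,
    i.e. X_t | X_1 ~ N(t * X_1, sigma^2 * t * (1 - t))."""
    t = rng.uniform()
    std = sigma * np.sqrt(t * (1.0 - t))
    return t, t * x1 + std * rng.standard_normal(x1.shape)

def sample_xt_trajectory(drift, dim, sigma=1.0, n_steps=100):
    """Without the projection: X_t must be read off a simulated SDE path
    (Euler-Maruyama), so samples at different times share one trajectory
    and are therefore correlated."""
    dt = 1.0 / n_steps
    x = np.zeros(dim)
    path = [x.copy()]
    for k in range(n_steps):
        x = x + drift(x, k * dt) * dt + sigma * np.sqrt(dt) * rng.standard_normal(dim)
        path.append(x.copy())
    return path
```

The first function draws a fresh, independent X_t for every buffered X_1; the second ties all times to one simulated path, which is the correlation the Reciprocal projection removes.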

[1] “Improved sampling via learned diffusions”

[2] “No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers”

[3] “Sequential Controlled Langevin Diffusions”

Reviewer Comment

Sorry, I sent an official comment and found it is not visible to the authors...

Thank you for the clarification and additional experiments regarding my concerns. As most of my concerns have been resolved, I keep my positive score.

Review
Rating: 3

The paper proposes an algorithm that uses diffusion models for sampling from unnormalized densities, rooted in stochastic optimal control. The proposed method is based on the adjoint state, and the resulting objective is particularly simple, as it requires only a regression onto the (scaled) gradient of the terminal cost of the SOC problem. Furthermore, the paper explains how various symmetries, such as periodic boundary conditions, are incorporated. Lastly, the authors propose a novel benchmark for sampling molecular conformers.

Questions for Authors

  • Did the authors try the proposed method on Alanine Dipeptide, which is a common problem in the sampling community?
  • I might have missed it, but how high-dimensional are the proposed benchmark problems?
  • Does the proposed objective avoid the mode collapse exhibited by, e.g., relative entropy minimization?

Claims and Evidence

The authors claim:

It is the first of its kind in allowing significantly more gradient updates than the number of energy evaluations

and

However, all of these methods are hindered by their computational requirements, including expensive differentiation through the sampling procedure, computation of higher-order derivatives in constructing the training objectives, or the need for importance sampling (i.e. multiple energy evaluations).

However, using the log-variance loss [1] has similar benefits as adjoint sampling. Can the authors comment on that?

[1] Richter, Lorenz, and Julius Berner. "Improved sampling via learned diffusions." arXiv preprint arXiv:2307.01198 (2023).

Methods and Evaluation Criteria

The reviewer cannot judge whether the proposed benchmark is useful due to a lack of knowledge about conformer prediction. Moreover, the authors do not compare their method to any other sampling method, which is confusing. It would be more convincing if the authors could show the performance of their method on other, more established benchmarks; see e.g. [1] for a recent study.

[1] Blessing, Denis, et al. "Beyond ELBOs: A large-scale evaluation of variational methods for sampling." arXiv preprint arXiv:2406.07423 (2024).

Theoretical Claims

The proofs of the results were skimmed and, to the best of the reviewer's knowledge, appear to be correct.

Experimental Design and Analysis

The reviewer is familiar with the experiments, apart from the new benchmark tasks.

Supplementary Material

I reviewed parts A-C and E of the supplementary material.

Relation to Existing Literature

The paper proposes a novel objective rooted in SOC that has a simple form and allows for the usage of off-policy learning with a replay buffer. If the authors could demonstrate that their method consistently performs well on more established benchmarks, then the method could have a high impact.

Missing Important References

The most relevant references have been discussed, to the best of the reviewer's knowledge.

Other Strengths and Weaknesses

Weaknesses

  • The paper considers only a few sampling methods as baselines
  • The authors do not compare their method to any other sampling method on the novel benchmark

Strengths

  • The authors propose several geometric extensions, some of which are, to the best of the reviewer's knowledge, novel in the context of sampling from unnormalized densities.
  • The authors propose a new benchmark for sampling problems (although the reviewer cannot judge whether this benchmark will have an impact)
  • The simple regression objective suggests numerical stability and scalability

Other Comments or Suggestions

None

Author Response

We understand and agree with the reviewer’s concerns regarding additional baselines and the lack of clarity around the proposed benchmark.

To respond, we (i) have additional experiments, (ii) expand the discussion to related off-policy methods, and (iii) clarify why the proposed benchmark is much harder than existing ones. See: https://sites.google.com/view/adjointsamplingrebuttal

the log-variance loss has similar benefits as adjoint sampling

There are problems with the log-variance loss that inhibit scalability. First, it requires full trajectories of length N, and N network evaluations, per gradient. Partial trajectories can only be used if the time-marginals are either learned [1] or prescribed [2]. In contrast, Adjoint Sampling only uses pairs of (X_t, X_1) samples, and one network evaluation, per gradient.

Second, the log-variance loss does not make use of the energy gradient information. For log-variance to work in high dimensions, the gradient of the energy is used directly as part of the drift parameterization, and it can fail without this [3]; we avoid this since our focus is on computationally expensive energy functions (such as the proposed benchmark). In our tests, the log-variance loss fails to learn on LJ because the potentials have extremely large values.

other sampling methods

We added multiple baselines, including log-variance, DDS, and ablations. Additionally, we have incorporated the suggested reference [4] to estimate ESS and ELBO metrics. Please see Tables 1 & 2 in the above link.

more established benchmarks

We believe the DW and LJ problems are very common and have been used by many prior works. We mainly focus on our proposed benchmark.

how high-dimensional are the proposed benchmark problems?

We agree this was not emphasized well and hope to remedy this. The number of atoms ranges from 3 to 50 (median 39); the sample dimension is 3x the number of atoms (e.g., 117 dimensions for the median molecule), while the conditioning dimension is quadratic in it (bonds). See Figs 3 and 4.

This is an amortized setting for finding the distribution of conformers for over 24,000 molecules (i.e., 24,000 conditional sampling problems). The benchmark is designed to test for generalization to unseen molecules, whereas most existing sampling benchmarks (such as DW, LJ, alanine dipeptide) only test performance on a single & cheap energy function.

Furthermore, each molecule has a number of conformers (representing low energy regions), ranging from a handful to a few hundred (Figs 6,7 in submission). The benchmark is specifically designed to test if a sampling method can find ALL the conformers for every molecule. As such, the metrics of interest are precision and recall against a set of highly-diverse ground truth samples computed using density functional theory (DFT) that took days to obtain even on a large cluster.

Finally, the energy function for this benchmark is a large graph neural network. As such, care must be taken in order to not incur computational cost by evaluating the energy function too many times. See Fig 2 for a runtime plot.

To our knowledge, this is the most difficult sampling benchmark so far, which we believe can aid in directing future research on sampling methods. Performing well on this benchmark has direct consequences in advancing computational chemistry and drug discovery.

other sampling method on the novel benchmark

It is VERY difficult to get reasonable performance on this benchmark. The only other method that we are aware of that can train with only pairs of (X_t, X_1) samples is iDEM, which we now include (see Table 2 of the link). iDEM is biased when few MC samples are used; it typically uses 512 MC samples (i.e., 512 energy evaluations) per gradient, which is prohibitive (see Figure 2). We also added an ablation for the Reciprocal projection (see Tables 1 & 2 of the link); without it, the model performs worse, which we hypothesize is due to the lack of exploration (see Figure 5 in the link).

[...] Alanine Dipeptide, which is a common problem in the sampling community?

We did not specifically test the alanine dipeptide setup (which is 22 atoms and uses a classical energy function). The molecules in our benchmark are much larger, and our energy function is much more expensive as it is a GNN trained to approximate forces from DFT.

Does the proposed objective avoid mode-collapse such as e.g. relative entropy minimization?

Our method is related to the reverse KL rather than the forward KL (i.e., relative entropy). Mode collapse is inherently difficult to avoid, as there is no way to know where modes are without fully exploring the search space. Our method relies on the choice of base process to determine which region to search. Finally, we again note that the new benchmark is specifically designed to test mode coverage, as it requires finding all minima.
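For reference, the two divergences being contrasted here (standard definitions, with q the learned sampler and p the target):

```latex
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{x \sim q}\big[\log q(x) - \log p(x)\big]
  \quad \text{(reverse KL: mode-seeking, needs only model samples)},
\qquad
\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\big[\log p(x) - \log q(x)\big]
  \quad \text{(forward KL: mass-covering, needs target samples)}.
```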

[1] https://arxiv.org/abs/2310.02679

[2] https://arxiv.org/abs/2412.07081

[3] https://arxiv.org/abs/2502.06685

[4] https://arxiv.org/abs/2406.07423

Review
Rating: 3

This paper introduces Adjoint Sampling, a novel framework for efficiently sampling from an unnormalized density function. The framework reformulates the sampling problem as a stochastic optimal control problem. Building on the adjoint matching method, the authors propose the Reciprocal Adjoint Matching method, which allows for multiple gradient updates without requiring evaluations of the energy model. Additionally, the authors provide theoretical results to validate the convergence of the Reciprocal Adjoint Matching method. Empirical evaluations on a molecular conformation sampling task demonstrate the effectiveness of the proposed approach.

Questions for Authors

Please refer to the sections above.

Claims and Evidence

I believe the answer is no. While the authors claim that the adjoint sampling method is more scalable compared to the original adjoint matching method, a thorough analysis and comparison are necessary to evaluate the efficiency of the adjoint sampling method. Furthermore, experiments on large-scale datasets are needed to provide evidence of the method's scalability.

Methods and Evaluation Criteria

The method is well-suited to addressing the problem.

Theoretical Claims

I have checked the proofs for the theorems.

Experimental Design and Analysis

As I have mentioned before, experiments on large-scale datasets are needed.

Supplementary Material

I only read the proofs of the theorems.

Relation to Existing Literature

The method primarily combines the adjoint matching technique with the reciprocal projection approach.

Missing Important References

The related works are essential to understanding the context for key contributions of the paper.

Other Strengths and Weaknesses

Strengths:

  1. The writing is very clear and easy to follow.
  2. The method is well-suited to the problem.

Weaknesses:

  1. Lack of large-scale experiments and efficiency analysis: As mentioned earlier, a thorough analysis and comparison are necessary to assess the efficiency of the adjoint sampling method. The original adjoint matching method would serve as a reasonable baseline. Furthermore, experiments on large-scale datasets are needed to demonstrate the scalability of the proposed method.

  2. Limited experiments on other domains: The paper lacks experiments in domains beyond the one studied. For instance, including experiments on text-to-image generation tasks and comparing the results to the adjoint matching method could further highlight the paper’s contributions and broaden its impact.

Other Comments or Suggestions

Please refer to the sections above.

Author Response

We thank the reviewer for being candid and providing us the opportunity to substantiate our claims, which we believe we can do. The reviewer is concerned with our claims regarding (i) the efficiency and analysis of Adjoint Sampling and (ii) our proposed large-scale sampling benchmark. We agree that we did not emphasize these contributions enough, and hope to clarify them in the following answers.

Please find additional ablations, experiments, and runtime figures here: https://sites.google.com/view/adjointsamplingrebuttal

Analysis and comparison to Adjoint Matching

Efficiency from a simple base process

Firstly, note that the original Adjoint Matching (AM) method is designed for a general base process, and has the same computational cost as the path integral sampler, requiring full trajectory simulation and solving for the lean adjoint state. One of our first key observations is that the lean adjoint state can be solved in closed form when the base process is simple.
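A sketch of this observation in our own notation (b is the base drift, g the terminal cost; this follows the lean adjoint ODE of the Adjoint Matching paper, simplified under the stated assumption):

```latex
\frac{\mathrm{d}\tilde{a}(t)}{\mathrm{d}t}
  = -\nabla_x b(X_t, t)^{\top}\, \tilde{a}(t),
\qquad
\tilde{a}(1) = \nabla g(X_1).
```

If the base drift b is state-independent (e.g., b = 0 for a scaled Brownian motion), then the right-hand side vanishes and the ODE is solved in closed form by the constant function ã(t) = ∇g(X_1); no backward solve is needed, and the regression target depends only on the pair (X_t, X_1).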

Ablation and analysis of the Reciprocal projection

Furthermore, we note that the Reciprocal projection further enhances optimality, allowing us to explore the search space much faster (see Figure 5 of the above link), as it allows training on uncorrelated X_t samples. We performed the requested ablation where we removed the Reciprocal projection (see "Adjoint Sampling w/o Reciprocal Projection" in Tables 1 & 2). We see that Adjoint Sampling with the Reciprocal projection performs better in all settings. On the LJ classical force fields, it avoids high-energy regions much better (better Energy W2 metric), and on our amortized conformer generation task it covers the target distribution much better (better recall and precision metrics). We hypothesize this is because the Reciprocal projection lets the model see far more diverse trajectories across different molecular conditionings. In the amortized benchmark, the number of samples per molecule is extremely small, so having a strong learning signal for each energy evaluation becomes very important. Moreover, we know that the Reciprocal projection is theoretically grounded and preserves the optimal solution as a unique fixed point.

Runtime plot

We have included a run-time breakdown in Figure 2 of the link. It shows that Adjoint Sampling is computationally efficient in terms of time spent evaluating the energy and sampling model. This is what enables us to scale up to both larger architectures and energy models, letting us sample from molecular energy foundation models. Other methods either require full trajectory optimization (such as PIS and log-variance) or an intractable number of energy evaluations (such as iDEM).

Large-scale experiments for sampling

As mentioned above, Adjoint Sampling is specifically designed for sampling from unnormalized distributions where a simple base process can be used. The text-to-image finetuning problem is a more general problem statement where a pre-trained generative model is used as the base process. Since our methodological contributions rely on the choice of a simple base process, we restrict ourselves to the pure sampling setting, where only an unnormalized density is given.

In the literature on sampling from unnormalized densities, our proposed benchmark is actually THE most difficult benchmark to date, requiring highly scalable algorithms. Prior related works have only experimented with classical (synthetic) force fields, such as the Lennard-Jones experiments we have included. In contrast, the conformer generation benchmark uses a large graph neural network as the energy function, so each evaluation incurs a runtime cost. Furthermore, this benchmark is aimed at learning conditional sampling models, amortized over 24,000+ molecules, with the test metrics measuring generalization to unseen molecules. This type of amortized benchmark for sampling is incredibly rare and has not been well explored. Due to this amortization, expensive methods that require full simulations for each gradient update (such as Adjoint Matching) do not scale well; again, note the runtime plot (Figure 2 in the link): Adjoint Matching in its basic form has a computational cost similar to the Discrete Adjoint (PIS).

Finally, the reward fine-tuning benchmarks are aimed at finding one good image sample per text prompt. Here, our benchmark requires the model to find a set of diverse samples per molecule (ALL local minima), and the metrics of interest are precision and recall against a set of highly-diverse ground truth samples from density functional theory (DFT) that took weeks to obtain even on a large cluster. Performing well on this benchmark has direct consequences in advancing computational chemistry and drug discovery.

We hope this answers the reviewer regarding (i) why Adjoint Sampling works only for the sampling task, and (ii) how our proposed new benchmark is precisely encouraging the sampling community to look into harder and larger-scale problems.

Reviewer Comment

I'm sorry that I have sent a comment which is not visible to authors... Thank you for your detailed rebuttal. All of my concerns have been adequately addressed, and I increase my score to 3.

Review
Rating: 4

This paper proposes a novel neural sampling method, Adjoint Sampling, based on stochastic optimal control (SOC) and the recently published adjoint matching method. The proposed method uses reciprocal projections alternating with reciprocal adjoint matching, and allows for incorporating the key symmetries of the considered energies. Adjoint Sampling significantly reduces the number of energy evaluations needed per gradient step. Moreover, the paper introduces novel benchmarks for conformer generation.

Questions for Authors

Questions:

[1] It's unclear to me why lean adjoint matching and adjoint sampling actually produce good gradients, since we are no longer minimizing the KL divergence with lean adjoint matching.

For other questions, please refer to the previous sections.

Claims and Evidence

This paper introduces Adjoint Sampling and claims that it is more efficient, highly scalable, and theoretically grounded. I agree that most of the claims have good evidence. However, I'm not sure the scalability claim was based on a convincing experimental setting, given the lack of other baselines (e.g., iDEM) on the conformer prediction experiments.

Methods and Evaluation Criteria

The proposed evaluation setup is mostly typical for the neural samplers community, but the number of baseline methods is limited. Moreover, the paper introduces novel benchmarks on conformer prediction. However, I strongly encourage the authors to include NLL and ESS metrics for the experiments on LJ potentials and DW-4. The LJ potential experiments should be equipped with histograms of the sampled energies compared against the ground-truth energy histogram.

Theoretical Claims

I agree with the authors that the method is interesting and theoretically grounded. As far as I’ve checked, I haven’t found any obvious flaws in the theoretical parts.

Experimental Design and Analysis

Overall, I think that the experimental setting is reasonable. I would recommend adding other samplers like FAB [1] or DDS [2] as baselines for the experiments, and adding the previously mentioned metrics.

References:

[1] Midgley, Laurence Illing, Vincent Stimper, Gregor NC Simm, Bernhard Schölkopf, and José Miguel Hernández-Lobato. "Flow Annealed Importance Sampling Bootstrap." In The Eleventh International Conference on Learning Representations.

[2] Vargas, Francisco, Will Sussman Grathwohl, and Arnaud Doucet. "Denoising Diffusion Samplers." In The Eleventh International Conference on Learning Representations.

Supplementary Material

I've briefly checked the whole supplementary material, and Section C more deeply.

Relation to Existing Literature

This paper is relevant to the neural samplers community and proposes a novel method based on the very recently presented Adjoint Matching approach. However, I think that paying more attention to recent approaches for scaling the training or abilities of diffusion samplers would be beneficial for this work, e.g., [1] or [2].

References:

[1] Berner, Julius, Lorenz Richter, Marcin Sendera, Jarrid Rector-Brooks, and Nikolay Malkin. "From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training." arXiv preprint arXiv:2501.06148 (2025).

[2] Sanokowski, Sebastian, Wilhelm Berghammer, Martin Ennemoser, Haoyu Peter Wang, Sepp Hochreiter, and Sebastian Lehner. "Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics." arXiv preprint arXiv:2502.08696 (2025).

Missing Important References

Please, refer to the previous sections and already mentioned references.

Other Strengths and Weaknesses

Strengths:

[1] Introducing a novel sampler method with strong theoretical results and good empirical evidence of its properties.

[2] Presenting a novel conformer generation benchmark.

Weaknesses:

[1] Limited evaluation in terms of a low number of baselines and missing metrics.

[2] Missing important references to other works.

Other Comments or Suggestions

For other comments, please refer to the previous sections.

Author Response

We thank the reviewer for their insightful comments. We’ve incorporated new baselines and metrics, and have produced figures to better illustrate our claims.

Additional figures & results: https://sites.google.com/view/adjointsamplingrebuttal

I'm not sure the scalability claim was based on a convincing experimental setting

We agree we did not emphasize this enough. Please see the link above for a runtime plot. Adjoint Sampling is the only method so far that can work with (X_t, X_1) pairs (no full trajectories) and can scale to computationally expensive energy functions. Further details are provided in the answers to the reviewer's other questions.

I strongly encourage the authors to include NLL and ESS metrics for the experiments on LJ potentials and DW-4. The LJ potential experiments should be equipped with histograms of the sampled energies compared against the ground-truth energy histogram.

I would recommend adding other samplers like FAB [1] or DDS [2] as baselines for the experiments

We have added several new baselines, including DDS. We have also included path-based ESS and ELBO estimates, and energy histograms. New results and figures can be found in the link above. For ESS and ELBO, we used the same method as [1,2,3]. However, we were not able to get reasonable estimates for iDEM at this time following the approach of [1], so we have decided not to include them for now.
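For concreteness, these are the standard path-based estimators (our reading of the approach in [1,2,3]): given N sampled trajectories X^(i), with unnormalized target path density ρ̃ and learned path density q,

```latex
w^{(i)} = \exp\!\Big(\log \tilde{\rho}\big(X^{(i)}\big) - \log q\big(X^{(i)}\big)\Big),
\qquad
\widehat{\mathrm{ELBO}} = \frac{1}{N}\sum_{i=1}^{N} \log w^{(i)},
\qquad
\widehat{\mathrm{ESS}} = \frac{\big(\sum_{i=1}^{N} w^{(i)}\big)^{2}}{\sum_{i=1}^{N} \big(w^{(i)}\big)^{2}}.
```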

I think that paying more attention to recent approaches for scaling the training or abilities of diffusion samplers would be beneficial for this work, e.g., [1] or [2].

We completely agree! We will add more discussions to existing methods.

We note that there may be a conflation of "off-policy" and "scalability", which are best decoupled. Off-policy methods, such as the log-variance divergence [4] and trajectory balance [5], do not take the gradient of the energy function (as they do not differentiate through the model sample) and actually rely strongly on parameterizing the drift using the gradient of the energy function, as discussed in [6]. In our setting, the energy function is more expensive than the drift network (see runtime plot), so our experiments do not use the energy function as part of parameterizing the drift. Another major downside is that these methods require the full trajectory, and can only use sub-trajectories if the time-marginals are either learned [5] or prescribed [7]. In contrast, Adjoint Sampling is an on-policy method (resulting in a more direct update to the current model), directly works with the energy gradient rather than the energy values, and only requires (X_t, X_1) pairs.

In our additional experiments, we have added a log-variance baseline from [3] to the synthetic energy experiments, which did not scale beyond DW4. This is because the LJ potentials can have extremely large values, which an importance-sampling-based method like log-variance does not handle well (again, without taking the gradient of the energy into the parameterization).

It's unclear to me why lean adjoint matching and adjoint sampling actually produce good gradients, since we are no longer minimizing the KL divergence with lean adjoint matching.

Instead of optimizing the KL, Adjoint Matching (AM) directly regresses onto the control in the manner of a consistency loss. AM is still a fixed-point iteration method whose unique solution is the optimal control, and it can be interpreted as removing a stochastic term from the KL gradient that has expectation zero at the optimum (as discussed in the original AM paper). In particular, the lean adjoint state does not depend on the learned control, only on the base process.

This last property of the lean adjoint state is incredibly important in the sampling setting, and in the paper we use simple base processes so that the lean adjoint state has a closed-form solution. This allows us to use only (X_t, X_1) samples for training, and the Reciprocal projection further enhances optimality. We plan to provide more details about these derivation steps in the main paper.
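Schematically, the resulting regression has the form below (our paraphrase, not the paper's exact objective; the target is treated as constant with respect to θ, and the closed-form lean adjoint from the simple base process is substituted in):

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{t,\,(X_t,\,X_1)}
    \left\| \, u_\theta(X_t, t) + \sigma(t)\, \tilde{a}(t) \, \right\|^{2},
\qquad
\tilde{a}(t) = \nabla g(X_1).
```

Because (X_t, X_1) are drawn under the current control while the target does not depend on θ, repeated minimization is a fixed-point iteration whose unique fixed point is the optimal control.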

[1] “Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling”

[2] “Path Integral Sampler: a stochastic control approach for sampling”

[3] “Improved sampling via learned diffusions”

[4] “From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training.”

[5] “Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization”

[6] “No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers”

[7] “Sequential Controlled Langevin Diffusions”

Reviewer Comment

I would like to thank the authors for their work on rebuttal. My concerns were addressed, so I will raise my score (3 -> 4).

Final Decision

The authors present Adjoint Sampling, an algorithm based on stochastic optimal control, to learn diffusion processes that sample from unnormalized probability densities, e.g., the Boltzmann distribution, falling into the broad category of neural samplers. Beyond the related work mentioned by the reviewers, the approach bears some conceptual resemblance to neural thermodynamic integration (Mate et al. 2024, J. Phys. Chem. Lett., and prior work from these authors on interpolation of Boltzmann densities) while being motivated from a completely different perspective and providing a more transparent regularization approach. The reviewers have a strongly positive sentiment towards the paper, and following the discussion phase, all doubts and questions initially raised by the reviewers, particularly with regard to the reported metrics, have been resolved.